The Pulotu database (Watts et al. (2015)) contains many variables on the religion, history, society, and natural environment of 116 Austronesian cultures. As stated on the website, the database was specifically designed to analyse religious beliefs and practices, and is therefore a wonderful candidate for some basic analysis and visualization, which I'm leaving for a later date (maybe as a shiny exercise). As the title of this post suggests, my objective here is to explore R's ability to analyse and visualize metadata, focusing on the text notes and the academic paper citations.
The dataset used for this analysis is available from the Pulotu website. The data table starts with variables describing the culture (name, notes, ISO / ABVD codes), followed by a series of column pairs sharing a common prefix (vXX…): one column contains the actual data and the other a citation of the source for that data. The table looks something like:
Culture | Culture Notes | isocode | ABVD Code | v1.Traditional Time Focus | v1.Source | v2.Number of islands inhabited by culture | v2.Source | … |
---|---|---|---|---|---|---|---|---|
Ajie | The indigenous people of … | aji | 1188 | 1825-1850 | Winslow (1991) pp 9 | 1 | Winslow (1991) pp 7 | … |
Ami | The Ami lived… | ami | 350 | 1875-1900 | Lebar (1975) pp 117 | 1 | Lebar (1975) pp 116 | … |
A full data dictionary can be found here.
One of my personal goals in writing this post is to analyse the data using the principles of Tidy Data. I wrote my first piece of R code a while ago (2005) and have mostly been coding in base R, with SQL for data manipulation (via sqldf) and the basic plot function. Adjusting to the new coding paradigm (dplyr semantics, magrittr pipes, etc.) and the "grammar of graphics" behind ggplot2 has been quite a challenge, but also a real eye-opener to new ways of thinking about data, analysis and code. For newcomers to this brave new world I highly recommend starting with this post by Karl Broman.
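To give a flavour of the shift, here is the same toy aggregation written the old way (sqldf) and the new way (dplyr pipes). It uses the built-in mtcars data, so it is purely an illustration and has nothing to do with Pulotu:
# Old habit: SQL on a data frame via sqldf
require(sqldf)
require(dplyr)
sqldf("select cyl, avg(mpg) as mean_mpg from mtcars group by cyl")
# New habit: the same aggregation as a chain of dplyr verbs
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))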
So let's get started by loading all the packages we need and reading the data file:
# Data manipulation packages
require(dplyr)
require(readr)
require(tidyr)
require(reshape2)
require(tidytext)
# Graphics packages
require(ggplot2)
require(ggmap)
# Load & filter data, focusing only on the "pacific" cultures-------------------
read_tsv('c:/Users/yizhart/Downloads/Datasets/Pulotu_Database_4_7_2015.txt') %>%
filter(substring(Culture, 1, 8) != 'Malagasy') ->
d
Let's start with a simple example: the variable "v1.Traditional_Time_Focus" contains a range of years (1XXX - 1XXX) to which the observations about a culture apply. Since the structure is uniform, it is easy to extract the beginning and end of the period and visualize the data:
# adding some columns
d %>%
mutate(
Start = parse_date(paste(substring(v1.Traditional_Time_Focus, first = 1, last = 4), '-01-01', sep = ''), format = '%Y-%m-%d'),
End = parse_date(paste(substring(v1.Traditional_Time_Focus, first = 6), '-01-01', sep = ''), format = '%Y-%m-%d'),
n = row_number()
) ->
d
# Graph1 : time period overlap --------------------------------------------
d %>%
ggplot() +
geom_segment(aes(x=Start, xend=End, y=Culture, yend=Culture, color = Culture), size=1) +
labs(x = 'Years', y = 'Culture') +
theme(legend.position = "none", axis.text.y = element_text(size=4))
Another simple visualization is to take the longitude/latitude included in the file and use them to place some data on a map. As a first step, let's try to find the "centre" of the map so we can make the appropriate request to the map service:
d %>% summarise(
lon_mean = mean(v6.Longitude),
lat_mean = mean(v5.Latitude),
lon_med = median(v6.Longitude),
lat_med = median(v5.Latitude)
)
## # A tibble: 1 x 4
## lon_mean lat_mean lon_med lat_med
## <dbl> <dbl> <dbl> <dbl>
## 1 99.88684 -3.914912 124.95 -7.25
I ended up choosing a coordinate manually (140, -10). The following code is not particularly "tidy", but works with the ggmap package (Kahle and Wickham (2013)) to show the relative population of each culture on a Google map:
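That plotting chunk is not reproduced here; a minimal sketch of what it might look like follows, where pop is a hypothetical stand-in for the actual Pulotu population column (not the real variable name):
# Sketch only - 'pop' is a hypothetical stand-in for the population column
qmap(c(lon = 140, lat = -10), zoom = 3) +
  geom_point(
    data = d,
    aes(x = v6.Longitude, y = v5.Latitude),
    color = 'blue',
    alpha = 0.4,
    # point size scaled by population relative to the largest culture
    size = 1 + 20 * d$pop / max(d$pop, na.rm = TRUE)
  )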
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-10,140&zoom=3&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
So far we've looked at some pretty straightforward visualizations. What if we want to analyse not the data itself but the sources of the data? Below is a piece of code that takes the "vXXX.Source" columns and tries to parse them into author and year of publication. I relied on the fact that the citations typically follow a consistent structure:
Author1 & Author2 & … (YYYY) pp XX-XX
where YYYY stands for a 4 digit year.
As you can see, I rely heavily on "tidy" concepts and packages (dplyr, tidyr, tidytext & reshape2):
# Regexp definition of the year pattern
year_pattern <- '[ ][(][0-9]{4}[A-Z]{0,1}[)][ ]'
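# A quick sanity check of the pattern on a single citation string:
# it should pick out " (1991) " (the year in brackets, padded by spaces)
regmatches('Winslow (1991) pp 9', gregexpr(year_pattern, 'Winslow (1991) pp 9'))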
d %>%
select(ends_with('Source'), Culture) %>%
# transpose each column to a row
melt(id = ('Culture')) %>%
# split multiple sources in a single line
mutate(value = strsplit(value, split = '; ')) %>%
# convert a multi-source line to multiple lines
unnest(value) %>%
# Locate year pattern - creates a column of "regex" objects
mutate(regex = gregexpr(pattern = year_pattern, text = value)) %>%
# In case there's more than one match (still) build a column of DF's and then split again
mutate(regex_values = Map(function(y) {data.frame(start = unlist(y), length = attr(y, "match.length"))}, regex)) %>%
select(-regex) %>%
unnest(regex_values) %>%
mutate(
before = substr(value, 1, start-1),
matched = substr(value, start, start+length-1),
after = substr(value, start+length, stop = nchar(value))
) %>%
# everything before the year is the author name, and the match itself is the "year in brackets"
select(Culture, variable, Author = before, year = matched) %>%
mutate(year = as.numeric(gsub('[^0-9]', '', year))) %>%
# split multiple authors
mutate(Author = strsplit(Author, split = ' & ')) %>%
# convert a multi-Author line to multiple lines
unnest(Author) %>%
# filter author names that are all numbers (some errors occur)
filter(grepl('[a-zA-Z]', Author)) %>%
tbl_df() ->
source_freq
head(source_freq)
## # A tibble: 6 x 4
## Culture variable year Author
## <chr> <fctr> <dbl> <chr>
## 1 Ajie v1.Source 1991 Winslow
## 2 Ami v1.Source 1975 Lebar
## 3 Anuta v1.Source 1991 Feinberg
## 4 Anuta v1.Source 1995 Feinberg
## 5 Arosi v1.Source 2007 Scott
## 6 Ata Tana 'Ai v1.Source 1988 Lewis
My intention was to measure contributions to individual data points in the database, so the new table contains one citation per author per culture per data column. As a result, papers that contributed information about multiple cultures, or that were used (and cited) for multiple variables, appear on multiple rows.
Now that we have the data in a tidy format, let’s visualize it!
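One way to see the contribution pattern is to count citations per author per culture and show the counts as a tile plot (a sketch of one possible approach, not necessarily the original plotting code):
# Sketch: citations per author per culture, shown as a tile plot
source_freq %>%
  count(Culture, Author) %>%
  ggplot(aes(x = Author, y = Culture, fill = n)) +
  geom_tile() +
  theme(axis.text.y = element_text(size = 4),
        axis.text.x = element_text(size = 4, angle = 90, hjust = 1))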
As you can see, some authors contribute multiple papers to a single culture, while others contribute a single paper covering multiple cultures (and are therefore counted 20 times).
Ignoring the special cases (Google maps, DaftLogic, or the very prolific "Source not applicable"), we can see that some of the authors (e.g. Belvins) contributed across the board, while others made a concentrated contribution to a few cultures (e.g. Burrows and Forth).
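For the curious, a quick way to rank the remaining authors by the number of data points they support (the exact spelling of the excluded "sources" below is an assumption on my part):
# Top contributing authors, excluding the non-paper "sources"
source_freq %>%
  filter(!Author %in% c('Google maps', 'DaftLogic', 'Source not applicable')) %>%
  count(Author, sort = TRUE) %>%
  head(10)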
Counting by publication year, we can look at the cumulative number of contributions by decade (as a bar graph) or, in more detail, by year (a continuous line):
source_freq %>%
group_by(year) %>%
summarise(source_count = n()) %>%
arrange(year) %>%
mutate(source_cumsum = cumsum(source_count)) ->
source_freq_year
source_freq %>%
mutate(decade = 10*floor(year/10)) %>%
group_by(decade) %>%
summarise(source_count = n()) %>%
arrange(decade) %>%
mutate(source_cumsum = cumsum(source_count)) ->
source_freq_decade
ggplot() +
geom_line(mapping = aes(x=year, y = source_cumsum), data = source_freq_year, color = 'black', lwd = 1) +
geom_bar( mapping = aes(x=decade, y = source_cumsum), data = source_freq_decade, stat = 'identity', fill = 'blue', alpha = 0.5) +
ylab('Contributions') +
xlab('Year of publication') +
ggtitle('Source accumulation over time')
The widest column in the database (text length ranging from 0 or NA to 1938 characters) is "Culture_Notes", which contains descriptive textual information about each culture.
sapply(d$Culture_Notes, nchar) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.0 322.8 416.5 431.0 492.2 1938.0 30
As this is academic text (and not, say, comments or call transcripts) I did not see much value in diving into sentiment analysis, but it was interesting to look at the frequency of the words used in the notes. Since we are analysing an anthropological database focused on religion and culture, the result should not surprise anyone:
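The chunk that builds the word cloud is not shown above; judging by the warning that follows, it looked roughly like this (a sketch that assumes the wordcloud package and the tidytext stop-word list):
# Sketch of the word-frequency step behind the word cloud
require(wordcloud)
d %>%
  filter(!is.na(Culture_Notes)) %>%
  unnest_tokens(word, Culture_Notes) %>%   # one row per word in the notes
  anti_join(stop_words, by = 'word') %>%   # drop common English stop words
  count(word, sort = TRUE) %>%             # word frequencies across all notes
  with(wordcloud(word, n, colors = blues9[5:9]))  # blues9: the blue palette used in the original call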
## Warning in wordcloud(.$word, .$n, colors = blues9[5:9]): religion could not
## be fit on page. It will not be plotted.
And now we can repeat the same geo-analysis we did before, but on metadata: how "verbose" were the writers of the notes across the different geographic territories?
qmap(c(lon = 140, lat = -10), zoom = 3) +
geom_point(
data = d,
aes(x = v6.Longitude, y = v5.Latitude),
color = 'red',
show.legend = TRUE,
alpha = 0.3,
size = 1 + 40 * nchar(d$Culture_Notes) / max(nchar(d$Culture_Notes), na.rm = TRUE)
) +
geom_text(
data = d,
aes(x = v6.Longitude, y = v5.Latitude, label = Culture),
check_overlap = FALSE,
size = 3 # + 7 * nchar(d$Culture_Notes) / max(nchar(d$Culture_Notes), na.rm = TRUE)
)
Kahle, David, and Hadley Wickham. 2013. "ggmap: Spatial Visualization with ggplot2." The R Journal 5 (1): 144–61. http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf.
Watts, Joseph, Oliver Sheehan, Simon J. Greenhill, Stephanie Gomes-Ng, Quentin D. Atkinson, Joseph Bulbulia, and Russell D. Gray. 2015. "Pulotu: Database of Austronesian Supernatural Beliefs and Practices." PLoS ONE 10 (9): e0136783. doi:10.1371/journal.pone.0136783.