The Pulotu database (Watts et al. (2015)) contains many variables on the religion, history, society, and natural environment of 116 Austronesian cultures. As stated on the website, the database was specifically designed to analyse religious beliefs and practices, which makes it a wonderful candidate for some basic analysis and visualization; I’m leaving that to a later date (maybe as a `shiny` exercise). As the title of this post suggests, my objective here is to explore the ability of R to analyse and visualize metadata, focusing on text notes and academic paper citations.
The dataset used for this analysis is available from the Pulotu website. The data table starts with variables describing the culture (name, notes, ISO / ABVD codes), followed by a series of paired columns that share a prefix (vXX…): one column contains the actual data and the other contains a citation of the source for that data. The table looks something like:
Culture | Culture Notes | isocode | ABVD Code | v1.Traditional Time Focus | v1.Source | v2.Number of islands inhabited by culture | v2.Source | … |
---|---|---|---|---|---|---|---|---|
Ajie | The indigenous people of … | aji | 1188 | 1825-1850 | Winslow (1991) pp 9 | 1 | Winslow (1991) pp 7 | … |
Ami | The Ami lived… | ami | 350 | 1875-1900 | Lebar (1975) pp 117 | 1 | Lebar (1975) pp 116 | … |
A full data dictionary can be found here.
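The paired data/source layout described above can be reshaped into one row per culture and variable, which is closer to the Tidy Data idea. The following is only a minimal base-R sketch on a toy table: the values are copied from the excerpt above, but the column names (`v1.Traditional_Time_Focus`, `v2.Number_of_islands`) and the object names `wide` and `long` are illustrative assumptions, and it assumes every `vXX.Source` column has exactly one matching data column.

```r
# Toy table mimicking the paired "data column + source column" layout.
# Column names are simplified assumptions, not the exact Pulotu headers.
wide <- data.frame(
  Culture                   = c("Ajie", "Ami"),
  v1.Traditional_Time_Focus = c("1825-1850", "1875-1900"),
  v1.Source                 = c("Winslow (1991) pp 9", "Lebar (1975) pp 117"),
  v2.Number_of_islands      = c(1L, 1L),
  v2.Source                 = c("Winslow (1991) pp 7", "Lebar (1975) pp 116"),
  stringsAsFactors = FALSE
)

# Each variable prefix (v1, v2, ...) is recovered from its ".Source" column.
source_cols <- grep("\\.Source$", names(wide), value = TRUE)
prefixes    <- sub("\\.Source$", "", source_cols)

# For every prefix, pair the data column with its source column and stack the results.
long <- do.call(rbind, lapply(prefixes, function(p) {
  src_col <- paste0(p, ".Source")
  # The data column is the other column that starts with "<prefix>."
  val_col <- setdiff(grep(paste0("^", p, "\\."), names(wide), value = TRUE), src_col)
  data.frame(
    Culture  = wide$Culture,
    variable = val_col,
    value    = as.character(wide[[val_col]]),  # values stored as text so types can mix
    source   = wide[[src_col]],
    stringsAsFactors = FALSE
  )
}))

print(long)
```

Storing every value as text loses the column types, but it lets each citation sit next to the value it documents, which is convenient when the citations themselves are the object of analysis.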
One of my personal goals in writing this post is to analyse the data using the principles of Tidy Data. I wrote my first piece of R code a while ago (2005) and have been coding mostly in base R, with SQL for data manipulation (via `sqldf`) and the basic `plot` function. Adjusting to the new coding paradigm (`dplyr` semantics, `magrittr` pipes, etc.) and the “graphic grammar” of `ggplot2` has been quite a challenge, but also a real “eye opener” to new ways of thinking about data, analysis and code. For newcomers to this brave new world I highly recommend starting with this post by Karl Broman.
So let’s get started by loading all the packages we need and reading the data file:
```r
# Data manipulation packages
library(dplyr)
library(readr)
library(tidyr)
library(reshape2)
library(tidytext)

# Graphics packages
library(ggplot2)
library(ggmap)

# Load & filter data, focusing only on the "pacific" cultures -------------------
# (rows whose Culture name starts with 'Malagasy' are dropped)
read_tsv('c:/Users/yizhart/Downloads/Datasets/Pulotu_Database_4_7_2015.txt') %>%
  filter(substring(Culture, 1, 8) != 'Malagasy') ->
  d
```
Let’s start with a simple example: the variable marked as “v1.Traditional_Time_Focus” contains a range of years (1XXX-1XXX) to which the observations about a culture apply. Since the structure is uniform, it’s easy to extract the beginning and the end of the period and visualize the data:
```r
# Adding some columns: the first four characters are the start year,
# everything from the sixth character on is the end year
d %>%
  mutate(
    Start = parse_date(paste(substring(v1.Traditional_Time_Focus, first = 1, last = 4), '-01-01', sep = ''), format = '%Y-%m-%d'),
    End   = parse_date(paste(substring(v1.Traditional_Time_Focus, first = 6), '-01-01', sep = ''), format = '%Y-%m-%d'),
    n     = row_number()
  ) ->
  d

# Graph 1: time period overlap --------------------------------------------
d %>%
  ggplot() +
  geom_segment(aes(x = Start, xend = End, y = Culture, yend = Culture, color = Culture), size = 1) +
  labs(x = 'Years', y = 'Culture') +
  theme(legend.position = "none", axis.text.y = element_text(size = 4))
```
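The fixed-position extraction above relies on every value having exactly the `YYYY-YYYY` shape. A small base-R sketch of how that assumption could be checked before splitting is shown here; the vector `time_focus` and its two malformed entries are made up for illustration and are not taken from the Pulotu file.

```r
# Hypothetical values: the first two follow the expected format,
# the last two are invented edge cases.
time_focus <- c("1825-1850", "1875-1900", "c. 1900", NA)

# Four digits, a hyphen, four digits, and nothing else.
pattern <- "^([0-9]{4})-([0-9]{4})$"
ok <- !is.na(time_focus) & grepl(pattern, time_focus)

# Only convert the entries that match; everything else stays NA.
start_year <- rep(NA_integer_, length(time_focus))
end_year   <- rep(NA_integer_, length(time_focus))
start_year[ok] <- as.integer(sub(pattern, "\\1", time_focus[ok]))
end_year[ok]   <- as.integer(sub(pattern, "\\2", time_focus[ok]))

# Rows that would need manual attention.
print(time_focus[!ok])
print(data.frame(time_focus, start_year, end_year))
```

If no rows are flagged, the simpler `substring` approach is safe; otherwise the flagged values show which records need cleaning first.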
Another simple visualization is to take the longitude/latitude included in the file and use them to place some data on a map. As a first step, let’s try to find the “centre” of the map so we can make the appropriate request to the map service:
```r
d %>% summarise(
  lon_mean = mean(v6.Longitude),
  lat_mean = mean(v5.Latitude),
  lon_med  = median(v6.Longitude),
  lat_med  = median(v5.Latitude)
)
## # A tibble: 1 x 4
##   lon_mean  lat_mean lon_med lat_med
##      <dbl>     <dbl>   <dbl>   <dbl>
## 1 99.88684 -3.914912  124.95   -7.25
```
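The gap between the mean and the median longitude may partly reflect cultures that lie east of the 180° meridian and are therefore recorded with negative longitudes. This is an assumption, not something checked against the Pulotu file. The sketch below uses made-up coordinates to show how shifting longitudes onto a 0-360 scale before averaging changes the result.

```r
# Made-up longitudes: three west of the antimeridian (positive)
# and two east of it (negative), as could happen for Pacific islands.
lon <- c(165.5, 121.4, -155.5, -149.4, 178.0)

# A plain mean treats -155.5 as being far to the west of 165.5.
naive_centre <- mean(lon)

# Shift negative longitudes onto a 0-360 scale so the points become contiguous.
lon_360    <- ifelse(lon < 0, lon + 360, lon)
centre_360 <- mean(lon_360)

# Convert the result back to the usual -180..180 convention.
centre <- ifelse(centre_360 > 180, centre_360 - 360, centre_360)

print(c(naive = naive_centre, shifted = centre))
```

The shift only makes sense when the points cluster around the antimeridian rather than around longitude 0, which is an assumption about the data that would need checking.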
I ended up choosing a coordinate manually (longitude 140, latitude -10). The following code is not particularly “tidy” but works with the `ggmap` package (Kahle and Wickham (2013)) to show the relative population of each culture on a Google map:
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-10,140&zoom=3&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false