The Pulotu database (Watts et al. (2015)) contains many variables on the religion, history, society, and natural environment of 116 Austronesian cultures. As stated on the website, the database was specifically designed for analysing religious beliefs and practices, which makes it a wonderful candidate for some basic analysis and visualization. I’m leaving that to a later date (maybe as a shiny exercise). As the title of this post suggests, my objective here is to explore the ability of R to analyse and visualize metadata, focusing on text notes and academic paper citations.

The database

The dataset used for this analysis is available from the Pulotu website. The data table starts with variables describing the culture (name, notes, ISO / ABVD codes), followed by a series of column pairs sharing a prefix (vXX…): one column contains the actual data and the other a citation of the source for that data. The table looks something like:

Culture | Culture Notes | isocode | ABVD Code | v1.Traditional Time Focus | v1.Source | v2.Number of islands inhabited by culture | v2.Source
Ajie | The indigenous people of … | aji | 1188 | 1825-1850 | Winslow (1991) pp 9 | 1 | Winslow (1991) pp 7
Ami | The Ami lived… | ami | 350 | 1875-1900 | Lebar (1975) pp 117 | 1 | Lebar (1975) pp 116

A full data dictionary can be found here.
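
Because each data column is paired with a source column, the citations can be pulled into a long, tidy table with one row per culture and variable. The following is only a minimal sketch, not the code used in this post: it assumes the tidyr package (pivot_longer) and builds a toy table from the two sample rows above. The names toy and citations are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy table copied from the two sample rows above (not the real file)
toy <- tibble(
  Culture = c("Ajie", "Ami"),
  `v1.Traditional Time Focus` = c("1825-1850", "1875-1900"),
  `v1.Source` = c("Winslow (1991) pp 9", "Lebar (1975) pp 117"),
  `v2.Number of islands inhabited by culture` = c(1, 1),
  `v2.Source` = c("Winslow (1991) pp 7", "Lebar (1975) pp 116")
)

# Keep only the citation columns and stack them,
# using the vXX prefix as the variable identifier
citations <- toy %>%
  select(Culture, ends_with(".Source")) %>%
  pivot_longer(
    cols = ends_with(".Source"),
    names_to = "variable",
    names_pattern = "^(v\\d+)\\.Source$",
    values_to = "citation"
  )

print(citations)
```

Each row of citations holds one culture, one variable prefix (v1, v2) and its citation, a shape that is easier to count or text-mine than the wide table.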


Working in the tidy-verse

One of my personal goals in writing this post is to analyse the data using the principles of Tidy Data. I wrote my first piece of R code a while ago (2005) and have been coding mostly in base R, with SQL for data manipulation (via sqldf) and the basic plot function for graphics. Adjusting to the new coding paradigm (dplyr semantics, magrittr pipes, etc.) and the “grammar of graphics” of ggplot2 has been quite a challenge, but also a real eye-opener to new ways of thinking about data, analysis and code. For newcomers to this brave new world I highly recommend starting with this post by Karl Broman.

So let’s get started by loading all the packages we need and reading the data file:

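# Data manipulation packages
# (library calls reconstructed from the functions used below)
library(readr)   # read_tsv(), parse_date()
library(dplyr)   # filter(), mutate(), summarise(), %>%

# Graphics packages
library(ggplot2)
library(ggmap)

# Load & filter data, focusing only on the "pacific" cultures-------------------
read_tsv('c:/Users/yizhart/Downloads/Datasets/Pulotu_Database_4_7_2015.txt') %>%
    filter(substring(Culture, 1, 8) != 'Malagasy') -> d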

Data visualizations

Let’s start with a simple example: the variable “v1.Traditional_Time_Focus” contains a range of years (1XXX - 1XXX) to which the observations about a culture apply. Since the structure is uniform, it’s easy to extract the beginning and the end of the period and visualize the data:

# adding some columns: period start/end as dates, plus a row index
d %>% 
  mutate(
    Start = parse_date(paste(substring(v1.Traditional_Time_Focus, first = 1, last = 4), '-01-01', sep = ''), format = '%Y-%m-%d'),
    End   = parse_date(paste(substring(v1.Traditional_Time_Focus, first = 6), '-01-01', sep = ''), format = '%Y-%m-%d'),
    n = row_number()
  ) -> d
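
To see what the two substring() calls pick out, here’s a quick check on a single value from the sample table above. It’s a base-R sketch only: as.Date stands in for parse_date, and the variable names are just for illustration.

```r
# One value in the "start-end" format used by v1.Traditional_Time_Focus
period <- "1825-1850"

# Characters 1-4 hold the first year; everything from character 6 on holds the second
start_year <- substring(period, first = 1, last = 4)
end_year   <- substring(period, first = 6)

# Anchor each year to 1 January so it can be treated as a date
start_date <- as.Date(paste0(start_year, "-01-01"))
end_date   <- as.Date(paste0(end_year, "-01-01"))

print(c(start_year, end_year))
print(c(start_date, end_date))
```

Character 5 is the hyphen separating the two years, which is why the second substring() starts at position 6.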

# Graph1 : time period overlap --------------------------------------------
d %>% 
  ggplot() + 
    geom_segment(aes(x=Start, xend=End, y=Culture, yend=Culture, color = Culture), size=1) + 
    labs(x = 'Years', y = 'Culture') +
    theme(legend.position = "none", axis.text.y = element_text(size=4))

Another simple visualization is to take the longitude/latitude included in the file and use them to place some data on a map. As a first step, let’s try to find the “centre” of the map so we can make the appropriate request from the map service:

d %>% summarise(
  lon_mean = mean(v6.Longitude), 
  lat_mean = mean(v5.Latitude),
  lon_med = median(v6.Longitude),
  lat_med = median(v5.Latitude)
)
## # A tibble: 1 x 4
##   lon_mean  lat_mean lon_med lat_med
##      <dbl>     <dbl>   <dbl>   <dbl>
## 1 99.88684 -3.914912  124.95   -7.25
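
One possible reason the mean and median longitudes sit so far apart (this is an assumption, not something verified against the Pulotu file) is that longitudes are stored in the -180 to 180 range, so cultures east of the antimeridian get negative values and drag the mean westward. A toy illustration with made-up longitudes:

```r
# Made-up longitudes: three west of the antimeridian, two east of it
lon <- c(120, 150, 170, -150, -160)

# Mean on the -180..180 scale is pulled far to the west
mean_raw <- mean(lon)

# Re-express on a 0..360 scale (-150 becomes 210) before averaging
lon_360 <- lon %% 360
mean_360 <- mean(lon_360)

print(c(raw = mean_raw, shifted = mean_360))
```

On the shifted scale the average lands among the points themselves, which is closer to what a map “centre” should mean.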

I ended up choosing a coordinate manually (140, -10). The following code is not particularly “tidy” but works with the ggmap package (Kahle and Wickham (2013)) to show the relative population of the culture on a Google map:

## Map from URL :,140&zoom=3&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false