Introduction

This workshop will cover the following facets of working with quantitative data:

  1. What tools to (not) use.
  2. Principles of tidy data.
  3. The Split-Apply-Combine approach to data analysis.
  4. Practical R code littered throughout.

Hopefully this workshop can act as a starting point. No 45-minute workshop, or semester-long course for that matter, can comprehensively teach you everything you need to know to be a quantitative researcher. That requires some self-direction and an entrepreneurial spirit.

What tools to (not) use.

Simply put:

Don’t use Excel.
Do use R and Python.

Why not use Excel?

First, it is too easy to make mistakes and not realize it. To our Excel devotees out there, how many of you have spreadsheets and data that look like this?

[Figure: screenshot of the Excel spreadsheet in question]

The results displayed in the spreadsheet above led the authors to conclude that when countries’ debt-to-GDP ratio approached 90%, their economies would shrink, with an average growth rate of -0.1%.1 This result was widely cited in the fallout of the 2008 financial crisis, especially by politicians supporting austerity measures. Without taking a stance here, it should suffice to say that the policy decisions connected to this spreadsheet are not uncontroversial.

One problem, though, is that there are coding and formula errors in that spreadsheet! When you fix those errors, it turns out that countries with a debt-to-GDP ratio of 90% actually grow at a rate of about 2.2%.2 There is, in fact, a European Spreadsheet Risks Interest Group that meets annually to discuss the risks of spreadsheet errors and to share horror stories.


Second, spreadsheets tend to encourage data formatting that is pleasing to the eye, which is rarely formatting that is useful. For example, here is a screenshot of a spreadsheet containing vowel formant data from four speakers. Each speaker has their own set of columns, with demographic information in one merged cell at the top.

[Figure: screenshot of the poorly formatted spreadsheet, with one block of columns per speaker]

This data formatting is almost worse than useless when it comes to doing your statistical analyses. You might spend more time reformatting the data into a usable format than you will on an analysis.

We’ll touch on tidy data further down, but the way this data ought to be formatted is with all speakers’ data concatenated together lengthwise, with additional columns for the demographic information.

## Source: local data frame [2,000 x 7]
## Groups: file
## 
##                         file age sex plt_vclass  word    F1     F2
## 1  PH06-2-1-AB-Jean_meas.txt  61   f          e  WELL 611.8 1213.4
## 2  PH06-2-1-AB-Jean_meas.txt  61   f        eyF  THEY 546.4 2013.7
## 3  PH06-2-1-AB-Jean_meas.txt  61   f         iy TEACH 430.7 2549.6
## 4  PH06-2-1-AB-Jean_meas.txt  61   f        iyF    ME 448.2 2006.0
## 5  PH06-2-1-AB-Jean_meas.txt  61   f         ae    AT 603.9 1546.7
## 6  PH06-2-1-AB-Jean_meas.txt  61   f         ay  TIME 768.8 1374.2
## 7  PH06-2-1-AB-Jean_meas.txt  61   f         ay    MY 728.7 1344.1
## 8  PH06-2-1-AB-Jean_meas.txt  61   f         oy  BOYS 513.1 1158.7
## 9  PH06-2-1-AB-Jean_meas.txt  61   f         iy TEACH 383.9 2323.7
## 10 PH06-2-1-AB-Jean_meas.txt  61   f        iyF    ME 461.2 1970.9
## ..                       ... ... ...        ...   ...   ...    ...
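
Getting from the spreadsheet layout to this format is mostly a matter of adding the demographic information as ordinary columns and stacking the per-speaker blocks lengthwise. A minimal base R sketch, assuming each speaker's measurements have already been read into their own data frame (the names and demographics here are hypothetical):

  ## Add the demographic information as columns...
  speaker1$age <- 61; speaker1$sex <- "f"
  speaker2$age <- 35; speaker2$sex <- "m"
  ## ...then concatenate the speakers lengthwise
  all_speakers <- rbind(speaker1, speaker2)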

Third, and maybe most importantly, Excel has had long-standing errors in its statistical procedures.3 As Mélard (2014) said about Excel 2010:

Microsoft has fixed the errors in the statistical procedures of Excel neither quickly nor correctly. The recent improvements reported in this paper should not hide the fact that Microsoft is still marketing a product that contains known errors. We didn’t analyze Excel in Office 2013 but, according to Microsoft (2013), where the changes with respect to Office 2010 are collected, there are few changes to Excel and nothing about the statistical aspects is mentioned.

Why use R and Python?

Common Understanding

In equal parts, R and Python are becoming the lingua francas of research and analysis in the social sciences and beyond. That means we all benefit from the collective wisdom of other researchers who are also using these tools. There are large communities of support surrounding them, and ongoing community investment in improving and expanding them.

Free and Open Source

R and Python are both free and open source. This means there is no need for student or institutional licenses to use them. After you leave Edinburgh, you’ll be able to re-run all of your analyses without worrying about your licence expiring.

Reproducibility

R and Python are both programming languages, meaning you’ll need to write code to do your analyses. While this may seem intimidating for those of you who don’t feel computationally inclined, writing, running, and retaining the code is crucial for doing reproducible research. You’ll always be able to precisely reproduce your earlier results, provided you save your scripts. The same can’t be said for spreadsheet programs like Excel.

Career Utility

If you’re doing quantitative research, and you want to pursue an academic career, employers are going to want you to be able to teach their students how to do quantitative research. That means they’ll want you to teach R and/or Python.

Outside of academia, knowing R and/or Python is a salable skill. “Data science” is still a growing sector of employment, and if you take the opportunity to learn these data skills to do your postgraduate research, you may be able to successfully leverage them into a career outside of academia.

Tidy Data

Organizing your data so that it is “tidy” is crucial to efficiently carrying out your analysis. I’ll be adopting the definition of “tidy data” from Wickham (2014).4 But first, let’s talk a little bit about data collection.

General Principles of Data Collection

Over-collect

When collecting data in the first place, over-collect if at all possible. The world is a very complex place, so there is no way you could cram it all into a bottle, but give it your best shot! If, during the course of your data analysis, you find that it would have been really useful to have data on, say, duration as well as formant frequencies, it becomes very costly to re-collect that data, especially if you haven’t laid the proper trail for yourself.

If your data collection involves typing individual observations into a spreadsheet, this recommendation may seem especially onerous. That is why you should try to learn as many computational tricks and time-saving techniques as possible. If you’re working with speech data, this means learning some Praat and Python scripting. If you’re working with textual data, this means learning some Python scripting.

Preserve High Dimensional Info

Let’s say you’re broadly interested in the effect following consonants have on the preceding vowels. The following consonants have some of the following properties:

  • place of articulation
  • degree of closure
  • voicing
  • nasality

All of these properties are usually conveniently encodable in a single character.

encoding  place   closure  voicing    nasality
k         dorsal  stop     voiceless  non-nasal
n         apical  stop     voiced     nasal

We’re calling the coding k “high dimensional” because if you know the following consonant was a /k/, you automatically know a lot of other things about the following context. My recommendation here is twofold. First, in a context like this, you shouldn’t record only that the following segment was “dorsal” while discarding the fact that it was specifically /k/. Preserve the high dimensional coding.

Second, take advantage of the high dimensionality of some encodings when you’re doing your data collection. For example, in the case of seeing what effect following segments have on vowels, in your initial data collection, you could just code for the identity of the following consonant:

dur   fol_seg
0.05  B
0.05  B
0.06  D
0.20  D
0.05  F
0.08  F
0.14  P
0.13  P
0.05  S
0.09  S
0.18  T
0.12  T
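
As an R data frame, that initial coding might look like this (a sketch; the name dur_data is used because it's what the merge() call below expects):

  dur_data <- data.frame(dur = c(0.05, 0.05, 0.06, 0.20, 0.05, 0.08,
                                 0.14, 0.13, 0.05, 0.09, 0.18, 0.12),
                         fol_seg = c("B", "B", "D", "D", "F", "F",
                                     "P", "P", "S", "S", "T", "T"))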

Then, code all of the other information you need via a lookup table.

  features <- data.frame(fol_seg = c("P","T", 
                                     "B","D", 
                                     "F","S"),
                         voicing = c("voiceless", "voiceless",
                                     "voiced", "voiced",
                                     "voiceless","voiceless"),
                         place = c("labial", "apical",
                                   "labial", "apical",
                                   "labial", "apical"))
  features
##   fol_seg   voicing  place
## 1       P voiceless labial
## 2       T voiceless apical
## 3       B    voiced labial
## 4       D    voiced apical
## 5       F voiceless labial
## 6       S voiceless apical
  merge(dur_data, features)
##    fol_seg  dur   voicing  place
## 1        B 0.05    voiced labial
## 2        B 0.05    voiced labial
## 3        D 0.06    voiced apical
## 4        D 0.20    voiced apical
## 5        F 0.05 voiceless labial
## 6        F 0.08 voiceless labial
## 7        P 0.14 voiceless labial
## 8        P 0.13 voiceless labial
## 9        S 0.05 voiceless apical
## 10       S 0.09 voiceless apical
## 11       T 0.18 voiceless apical
## 12       T 0.12 voiceless apical

Leave A Trail of Crumbs

Be sure to answer this question: how can I preserve a record of this observation in such a way that I can quickly return to it and gather more data on it if necessary? If you fail to answer this question successfully, you’ll be lost in the woods if you ever want to revisit the data, and the only way home will be to replicate the study from scratch. For research involving speech data, keep a record of the coding you’re doing in a Praat TextGrid.

Give Meaningful Names

Give meaningful names both to predictor columns and to the labels of nominal observations. Keeping a readme describing the data is still a good idea, but at least now the data is approachable at first glance.
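
For example, compare these two sketches of the same hypothetical observations; the first is opaque without a readme, while the second is readable at a glance:

  ## Cryptic column names and numeric codes
  data.frame(s = c(1, 1, 2),
             v = c(0, 1, 1))
  ## Meaningful names and labels
  data.frame(speaker     = c("PH06-2-1", "PH06-2-1", "PH08-1-3"),
             fol_voicing = c("voiceless", "voiced", "voiced"))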

Storing Data

When we store data, it should be:

  1. Raw. Raw data is the most useful data. It’s impossible to move from a coarser, summarized granularity back down to a finer one. Summary tables and the like are nice for publishing in a paper document, but raw data is what we need for asking novel research questions with old data.

  2. Open formatted. Do not use proprietary database software for long-term storage of your data. I have heard enough stories about interesting data sets that are no longer accessible for research, either because the software they were stored in is defunct, or because current versions are not backwards compatible. At that point, your data is the property of Microsoft, or whoever. Store your data as raw text, delimited in some way (I prefer tabs), as in the sketch after this list.

  3. Consistent. I think this is most important when you have data spread across many separate files. Each file and its headers should be consistently named and formatted. The files should also be consistently delimited and commented. There is nothing worse than erratic comments, labels, headers, or NA characters in a corpus.

  4. Documented. Produce a readme that describes the data, how it was collected and processed, and every variable and its possible values.
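
Here's a minimal sketch of the "raw text, delimited" recommendation in base R (the file name is just an example):

  ## Write the data frame out as plain, tab-delimited text...
  write.table(dur_data, file = "dur_data.tsv",
              sep = "\t", quote = FALSE, row.names = FALSE)
  ## ...and read it back in later with no special software required
  dur_data_again <- read.delim("dur_data.tsv")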

Tidy Data

Wickham (2014) identifies the following properties of tidy data.

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

I’m going to focus on 1 and 2.

Let’s return to our example where we’re trying to explore the relationship between vowel duration and following segments. Here’s a table, like you might see published in a paper, that contains the mean duration of four vowels in six different segmental contexts.

Table 1: Mean vowel duration in ms
plt_vclass    B    D    F    P    S    T
ey          128  133  170  164  133  136
iy           94  116  110  107  112  135
ow           79  132  149   98   95  113
uw          NaN  178   80  128   70  110

The variables in this table are

  • The vowel
  • The following consonant
  • The mean duration.

Each observation is

  • The mean duration of each vowel preceding each consonant

Table 1 violates the principles of tidy data in the following ways:

  • The values of the consonant variable aren’t stored in any column (they’re being used to define columns).
  • The values of the mean duration variable are spread out across multiple columns.
  • There are multiple observations per row.

In order to conform to the tidy data format, we need:

  • 1 column for each variable (vowel, consonant, mean_duration)
  • 1 row for each observation (the mean duration for each vowel before each consonant)

That’s going to look like Table 2.

Table 2: Mean vowel duration in ms
plt_vclass  consonant  mean_dur
ey          B          128
iy          B           94
ow          B           79
uw          B          NaN
ey          D          133
iy          D          116
ow          D          132
uw          D          178

With the data in this format, it’s possible to begin doing visualization & analysis.

  ## melt_durs is assumed here to be the full tidied data (the long format of
  ## Table 2, for every vowel and consonant), with library(ggplot2) loaded
  ggplot(melt_durs, aes(plt_vclass, mean_dur, fill = consonant))+
    geom_bar(position = "dodge", color = "black", stat = "identity")+
    scale_fill_hue(limits = c("B","P","F",
                              "D","T","S"))

Tools for tidying data:

In R, there are two key packages for tidying data:

  • tidyr
  • reshape2

In Python, similar functionality can be found in the pandas library.
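
For example, here's a sketch of how the wide layout of Table 1 could be reshaped into the long layout of Table 2 with tidyr. The wide data frame below is reconstructed by hand from Table 1, and the result corresponds to the melt_durs data frame used in the plot above:

  library(tidyr)

  ## The wide layout of Table 1: one column per following consonant
  wide_durs <- data.frame(plt_vclass = c("ey", "iy", "ow", "uw"),
                          B = c(128,  94,  79, NaN),
                          D = c(133, 116, 132, 178),
                          F = c(170, 110, 149,  80),
                          P = c(164, 107,  98, 128),
                          S = c(133, 112,  95,  70),
                          T = c(136, 135, 113, 110))

  ## gather() stacks the consonant columns into two columns:
  ## the consonant identity and the mean duration
  melt_durs <- gather(wide_durs, key = consonant, value = mean_dur, B:T)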

Split-Apply-Combine

When doing data analysis, you’re going to find yourself doing these following steps a lot:

  1. Splitting the data up into subsets.
  2. Applying some kind of function to those subsets.
  3. Combining the results back together.

Let’s take the tidy data from before:

dur_data <- data.frame(plt_vclass = rep(c("ey", "iy", "ow"), 6), 
                       consonant = rep(c("B","D","F", "P","S","T"), each = 3),
                       mean_dur = c(128, 94, 79, 
                                    133,116, 132, 
                                    170, 110, 149, 
                                    164, 107, 98,  
                                    133, 112, 95,  
                                    136, 135, 113))
plt_vclass  consonant  mean_dur
ey          B          128
iy          B           94
ow          B           79
ey          D          133
iy          D          116
ow          D          132
ey          F          170
iy          F          110
ow          F          149
ey          P          164
iy          P          107
ow          P           98
ey          S          133
iy          S          112
ow          S           95
ey          T          136
iy          T          135
ow          T          113

One thing we might want to calculate is the average duration of each vowel. To do that, we’ll work through the three steps below.

Split the data up

First, split the data up into subsets based on the plt_vclass column:

plt_vclass  consonant  mean_dur
ey          B          128
ey          D          133
ey          F          170
ey          P          164
ey          S          133
ey          T          136

plt_vclass  consonant  mean_dur
iy          B           94
iy          D          116
iy          F          110
iy          P          107
iy          S          112
iy          T          135

plt_vclass  consonant  mean_dur
ow          B           79
ow          D          132
ow          F          149
ow          P           98
ow          S           95
ow          T          113


Apply some function to the data

In each subset, calculate the average duration.

plt_vclass  mean_dur
ey          144

plt_vclass  mean_dur
iy          112.3333

plt_vclass  mean_dur
ow          111


Combine the result

Combine these results into a new table.

plt_vclass  mean_dur
ey          144.0000
iy          112.3333
ow          111.0000
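
Before turning to dplyr, here's what these three steps look like done by hand in base R (a sketch; dplyr, introduced below, wraps this pattern up much more conveniently):

  ## Split: one data frame per vowel class
  by_vowel <- split(dur_data, dur_data$plt_vclass)
  ## Apply: calculate the mean duration within each subset
  means <- sapply(by_vowel, function(d) mean(d$mean_dur))
  ## Combine: put the results back together in a single data frame
  data.frame(plt_vclass = names(means), mean_dur = means)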

Split-Apply-Combine in R

The relatively new dplyr package in R is designed to implement this Split-Apply-Combine workflow in an easy-to-read fashion. Its key functionality derives from

  • The pipe operator: %>%
  • Its data operation “verbs”

  library(dplyr)

The %>% (“pipe”)

%>%

We’ll pronounce %>% as “pipe”.

The way %>% works is that it takes the data frame on its left side and inserts it as the first argument to the function on its right side. For example, the head() function prints the first six rows of a data frame.

  head(dur_data)
##   plt_vclass consonant mean_dur
## 1         ey         B      128
## 2         iy         B       94
## 3         ow         B       79
## 4         ey         D      133
## 5         iy         D      116
## 6         ow         D      132

With %>%, you’d do it like this:

  dur_data %>% head()
##   plt_vclass consonant mean_dur
## 1         ey         B      128
## 2         iy         B       94
## 3         ow         B       79
## 4         ey         D      133
## 5         iy         D      116
## 6         ow         D      132

How useful is that really? Not very until you start chaining them together. If you wanted to get the number of rows in the data frame after you’ve applied head() to it, normally you’d write it out like this:

  nrow(head(dur_data))
## [1] 6

Nested functions are kind of tough to read. You need to read them from the inside out. With dplyr, you can chain each function you want to use with %>%.

  dur_data %>% head() %>% nrow()
## [1] 6

The way to read that is “Take the dur_data data frame, and pipe it into head(). Then take the output of head() and pipe it into nrow().”
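
Because the left-hand side only fills the first argument slot, any other arguments can be passed to the function as usual. A quick sketch:

  dur_data %>% head(n = 3)    ## equivalent to head(dur_data, n = 3)
##   plt_vclass consonant mean_dur
## 1         ey         B      128
## 2         iy         B       94
## 3         ow         B       79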

Verbs

dplyr comes with a few “verbs” specially developed for chaining together.

verb          description
filter()      This works almost exactly like subset()
summarise()   This takes a data frame, and outputs a new data frame based on the summary you asked for
mutate()      This takes a data frame, and adds additional columns based on the formula you give it
select()      This takes a data frame, and returns only the columns you ask for
arrange()     Reorders the rows of the data frame
group_by()    Defines sub-groupings in a data frame

The group_by() function is the crucial one for doing Split-Apply-Combine in dplyr. First, let’s look at how we’ll use the summarise() function.

  dur_data %>%
    summarise(dur = mean(mean_dur))
##        dur
## 1 122.4444

Passing dur_data straight to summarise() creates a new data frame with one column, dur. The value of dur is calculated by applying mean() to mean_dur. It’s possible to create as many columns as you want like this:

  dur_data%>%
    summarise(dur = mean(mean_dur),
              dur_sd = sd(mean_dur),
              n = length(mean_dur),
              arbitrary = "foo")
##        dur   dur_sd  n arbitrary
## 1 122.4444 24.32675 18       foo

The summarise() verb gets more powerful in combination with group_by().

  dur_data%>%
    group_by(plt_vclass)%>%           ## Grouping the data by vowel
    summarise(dur = mean(mean_dur),
              dur_sd = sd(mean_dur),
              n = length(mean_dur),
              arbitrary = "foo")
## Source: local data frame [3 x 5]
## 
##   plt_vclass      dur   dur_sd n arbitrary
## 1         ey 144.0000 18.09972 6       foo
## 2         iy 112.3333 13.39652 6       foo
## 3         ow 111.0000 25.83796 6       foo
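
The other verbs from the table above chain together in just the same way. For example, a quick sketch using the dur_data frame from before:

  dur_data %>%
    filter(consonant %in% c("T", "D")) %>%      ## keep only the /t/ and /d/ contexts
    mutate(mean_dur_s = mean_dur / 1000) %>%    ## add a column with duration in seconds
    arrange(desc(mean_dur)) %>%                 ## sort by duration, longest first
    select(plt_vclass, consonant, mean_dur_s)   ## keep just these three columns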

Strategy for doing Split-Apply-Combine

A general strategy for cracking difficult Split-Apply-Combine nuts would be to first figure out how to solve the problem for a subset of the data, then try to figure out how to generalize it.

Let’s ask the following question: how much do speakers vary with respect to vowel centralization? Shorter vowels tend to be more centralized, as do vowels in more frequent words. We’ll investigate this question using data from the phoneticChange package, which can be installed like so:

  library(devtools)
  install_github("jofrhwld/phoneticChange")
  library(phoneticChange)
  library(magrittr)

First, let’s trim down the data a little bit, just to look at the data we’re interested in.

  ay <- ays %>%
    filter(plt_vclass == "ay",
           !word %in% c("i","my"))%>%    ## I and MY are super frequent pronouns
    select(idstring, sex, age, year, F1_n, F2_n, dur, SUBTLWF)%>%
    mutate(dur_ms = (dur * 1000),
           logdur = log10(dur_ms),
           center_dur = logdur - median(logdur),
           zipf = log10(SUBTLWF) + 3)%>% ## The "Zipf scale", after http://crr.ugent.be/archives/1352
    select(idstring, sex, age, year, F1_n, F2_n, center_dur, zipf)
  
  head(ay)
##    idstring sex age year        F1_n        F2_n  center_dur     zipf
## 1 PH00-1-1-   m  21 2000  1.75932668 -0.50321928 -0.19188553 5.252853
## 2 PH00-1-1-   m  21 2000  0.06850389  0.04812934 -0.44715803 5.290035
## 3 PH00-1-1-   m  21 2000 -0.08123688 -0.85704126 -0.36797679       NA
## 4 PH00-1-1-   m  21 2000  1.46109299 -0.51536493  0.17609126 6.291952
## 5 PH00-1-1-   m  21 2000  1.56341584  0.01488861  0.10914447 5.882302
## 6 PH00-1-1-   m  21 2000  1.81423162 -0.40956992  0.08432089 5.290035
  nrow(ay)
## [1] 22672

Solving the problem for one speaker

First, we’ll pull out the data for one speaker:

  one_speaker <- ay %>% filter(idstring == "PH00-1-1-")

We can estimate the effect of duration on F1 and F2 of /ay/.

  f1_model = lm(F1_n ~ center_dur, data = one_speaker)
  f1_model
## 
## Call:
## lm(formula = F1_n ~ center_dur, data = one_speaker)
## 
## Coefficients:
## (Intercept)   center_dur  
##        1.44         2.03
  f2_model = lm(F2_n ~ center_dur, data = one_speaker)
  f2_model
## 
## Call:
## lm(formula = F2_n ~ center_dur, data = one_speaker)
## 
## Coefficients:
## (Intercept)   center_dur  
##    -0.45371     -0.08705
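
The fitted parameters can be pulled out of a model object with coef(), which is what we'll do for every speaker in the next step:

  coef(f1_model)                  ## named vector: (Intercept) and center_dur
  coef(f1_model)["center_dur"]    ## just the duration slope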

Expanding it for all speakers

  speaker_models <- ay %>%
    group_by(idstring)%>%                             ## split the data by speaker
    filter(n() > 40)%>%                               ## drop speakers with 40 or fewer tokens
    do(f1_model = lm(F1_n ~ center_dur, data = .),    ## fit both models within each speaker;
       f2_model = lm(F2_n ~ center_dur, data = .))    ## "." stands for each speaker's subset
  
  speaker_parameters <- speaker_models %>%
                          rowwise()%>%                ## work one speaker (row) at a time
                          do(data.frame(idstring = .$idstring,
                                        f1_intercept = coef(.$f1_model)[1],
                                        f1_slope = coef(.$f1_model)[2],
                                        f2_intercept = coef(.$f2_model)[1],
                                        f2_slope = coef(.$f2_model)[2]))
## Warning in rbind_all(out[[1]]): Unequal factor levels: coercing to
## character
  ggplot(speaker_parameters, aes(f2_slope, f1_slope))+
      geom_vline(xintercept = 0)+
      geom_hline(yintercept = 0)+
      geom_point(color = 'red')+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()

  library(tidyr)
  
  tidy_params <- speaker_parameters %>%
    gather(formant_param, estimate, f1_intercept:f2_slope)%>%       ## stack the four estimate columns into key/value pairs
    separate(formant_param, c("formant","parameter"), sep = "_")%>% ## split e.g. "f1_intercept" into "f1" + "intercept"
    spread(parameter, value = estimate)                             ## one column each for intercept and slope
  
  tidy_params  
## Source: local data frame [442 x 4]
## 
##     idstring formant  intercept       slope
## 1  PH00-1-1-      f1  1.4398331  2.02999192
## 2  PH00-1-1-      f2 -0.4537147 -0.08704735
## 3  PH00-1-2-      f1  1.3733503  2.03325076
## 4  PH00-1-2-      f2 -0.5786406 -0.58856425
## 5  PH00-1-3-      f1  1.5592401  2.09601622
## 6  PH00-1-3-      f2 -0.3787459  0.18241027
## 7  PH00-1-4-      f1  1.5342921  1.37077420
## 8  PH00-1-4-      f2 -0.5535961 -0.17450608
## 9  PH00-1-5-      f1  1.3693443  1.10219011
## 10 PH00-1-5-      f2 -0.3890248  0.14457321
## ..       ...     ...        ...         ...
  ggplot(tidy_params, aes(intercept, slope))+
      geom_hline(yintercept = 0)+
      geom_point()+
      facet_wrap(~formant, scales = "free")


  1. Reinhart, Carmen M., and Kenneth S. Rogoff. “Growth in a Time of Debt.” American Economic Review 100.2 (2010): 573-578.

  2. Herndon, Thomas, Michael Ash, and Robert Pollin. “Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff.” Cambridge Journal of Economics 38.2 (2014): 257-279.

  3. Mélard, Guy. “On the accuracy of statistical procedures in Microsoft Excel 2010.” Computational Statistics 29.5 (2014): 1095-1128.

  4. Wickham, Hadley. “Tidy Data.” Journal of Statistical Software 59.10 (2014). Retrieved from http://www.jstatsoft.org/