This workshop will cover the following facets of working with quantitative data:
Hopefully this workshop will be able to act as a starting point for some. There is no 45 minute workshop, or semester long course for that matter, that will be able to comprehensively teach you all you need to know to be a quantitative researcher. That requires some self-direction and an entrepreneurial spirit.
First, it is too easy to make mistakes and not realize it. To our Excel devotees out there, how many of you have spreadsheets and data that look like this?
The results displayed in the spreadsheet above led the authors to conclude that when countries’ debt to GDP ratio approached 90%, their economies would shrink at a rate of -0.1%.1 This result was widely cited in the fallout of the 2008 financial crisis, especially by politicians supporting austerity measures. Without taking a stance here, it should suffice to say that the policy decisions connected to this spreadsheet are not uncontroversial.
One problem, though, is that there are coding and formula errors in that spreadsheet! When you fix them, it turns out that countries with a debt to GDP ratio of 90% actually grow at a rate of about 2.2%.2 There is, in fact, a European Spreadsheet Risks Interest Group that meets annually to discuss the risks of spreadsheet errors and to share horror stories.
Second, spreadsheets tend to encourage data formatting that is pleasing to the eye, which is rarely formatting that is useful. For example, here is a screenshot of a spreadsheet containing vowel formant data from 4 speakers. Each speaker has their own set of columns, with demographic information in one merged cell at the top.
This data formatting is almost worse than useless when it comes to doing your statistical analyses. You might spend more time reformatting the data into a usable format than you will on an analysis.
We’ll touch on tidy data further down, but the way this data ought to be formatted is with all speakers’ data concatenated together lengthwise, with additional columns for the demographic data.
## Source: local data frame [2,000 x 7]
## Groups: file
## 
##                         file age sex plt_vclass  word    F1     F2
## 1  PH06-2-1-AB-Jean_meas.txt  61   f          e  WELL 611.8 1213.4
## 2  PH06-2-1-AB-Jean_meas.txt  61   f        eyF  THEY 546.4 2013.7
## 3  PH06-2-1-AB-Jean_meas.txt  61   f         iy TEACH 430.7 2549.6
## 4  PH06-2-1-AB-Jean_meas.txt  61   f        iyF    ME 448.2 2006.0
## 5  PH06-2-1-AB-Jean_meas.txt  61   f         ae    AT 603.9 1546.7
## 6  PH06-2-1-AB-Jean_meas.txt  61   f         ay  TIME 768.8 1374.2
## 7  PH06-2-1-AB-Jean_meas.txt  61   f         ay    MY 728.7 1344.1
## 8  PH06-2-1-AB-Jean_meas.txt  61   f         oy  BOYS 513.1 1158.7
## 9  PH06-2-1-AB-Jean_meas.txt  61   f         iy TEACH 383.9 2323.7
## 10 PH06-2-1-AB-Jean_meas.txt  61   f        iyF    ME 461.2 1970.9
## ..                       ...  ..  ..        ...   ...   ...    ...
Third, and maybe most importantly, Excel has had long-standing errors in its statistical procedures.3 As Mélard (2014) said about Excel 2010:
Microsoft has fixed the errors in the statistical procedures of Excel neither quickly nor correctly. The recent improvements reported in this paper should not hide the fact that Microsoft is still marketing a product that contains known errors. We didn’t analyze Excel in Office 2013 but, according to Microsoft (2013), where the changes with respect to Office 2010 are collected, there are few changes to Excel and nothing about the statistical aspects is mentioned.
In equal parts, R and Python are becoming the lingua francas of research and analysis in the social sciences and beyond. That means we all wind up benefiting from the collective wisdom of other researchers who are also using these tools. There are large communities of support surrounding them, and community investment in improving them and expanding them.
R and Python are both free and open source. This means there is no need for student or institutional licenses to use them. After you leave Edinburgh, you’ll be able to re-run all of your analyses without worrying about your licence expiring.
R and Python are both programming languages, meaning you’ll need to write code to do your analyses. While this may seem intimidating for those of you who don’t feel computationally inclined, writing, running, and retaining the code is crucial for doing reproducible research. You’ll always be able to precisely reproduce your earlier results, provided you save your scripts. The same can’t be said for using spreadsheet programs like Excel.
If you’re doing quantitative research, and you want to pursue an academic career, employers are going to want you to be able to teach their students how to do quantitative research. That means they’ll want you to teach R and/or Python.
Outside of academia, knowing R and/or Python is a salable skill. “Data science” is still a growing sector of employment, and if you take the opportunity to learn these data skills to do your postgraduate research, you may be able to successfully leverage them into a career outside of academia.
Organizing your data so that it is “tidy” is crucial to efficiently carrying out your analysis. I’ll be adopting the definition of “tidy data” from Wickham (2014).4 But first, let’s talk a little bit about data collection.
When collecting data in the first place, over-collect if at all possible. The world is a very complex place, so there is no way you could cram it all into a bottle, but give it your best shot! If during the course of your data analysis, you find that it would have been really useful to have data on, say, duration, as well as formant frequencies, it becomes very costly to recollect that data, especially if you haven’t laid the proper trail for yourself.
If your data collection involves you typing individual observations into a spreadsheet, this recommendation may seem especially onerous. That is why you should try to learn as many computational tricks and time-saving techniques as possible. If you’re working with speech data, this means learning some Praat and Python scripting. If you’re working with textual data, this means learning some Python scripting.
Let’s say you’re broadly interested in the effect following consonants have on the preceding vowels. The following consonants have some of the following properties:
All of these properties are usually conveniently encodable in a single character.
We’re calling a coding like /k/ “high dimensional” because if you know the following consonant was a /k/, you automatically know a lot of other things about the following context. My recommendation here is twofold. First, in a context like this, you shouldn’t record only that the following segment was “dorsal” while keeping no record that it was specifically /k/. Preserve the high dimensional coding.
Second, take advantage of the high dimensionality of some encodings when you’re doing your data collection. For example, in the case of seeing what effect following segments have on vowels, in your initial data collection, you could just code for the identity of the following consonant:
Then, code all of the other information you need via a lookup table.
features <- data.frame(fol_seg = c("P", "T", "B", "D", "F", "S"),
                       voicing = c("voiceless", "voiceless", "voiced",
                                   "voiced", "voiceless", "voiceless"),
                       place = c("labial", "apical", "labial",
                                 "apical", "labial", "apical"))
features
##   fol_seg   voicing  place
## 1       P voiceless labial
## 2       T voiceless apical
## 3       B    voiced labial
## 4       D    voiced apical
## 5       F voiceless labial
## 6       S voiceless apical
##    fol_seg  dur   voicing  place
## 1        B 0.05    voiced labial
## 2        B 0.05    voiced labial
## 3        D 0.06    voiced apical
## 4        D 0.20    voiced apical
## 5        F 0.05 voiceless labial
## 6        F 0.08 voiceless labial
## 7        P 0.14 voiceless labial
## 8        P 0.13 voiceless labial
## 9        S 0.05 voiceless apical
## 10       S 0.09 voiceless apical
## 11       T 0.18 voiceless apical
## 12       T 0.12 voiceless apical
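The merged table above is what you get by joining the raw observations against the lookup table (in R, e.g. with merge()). As a rough sketch of the same idea in plain Python, using only the standard library and a hypothetical handful of observations:

```python
# Lookup table: each segment maps to its features.
features = {
    "P": {"voicing": "voiceless", "place": "labial"},
    "T": {"voicing": "voiceless", "place": "apical"},
    "B": {"voicing": "voiced",    "place": "labial"},
    "D": {"voicing": "voiced",    "place": "apical"},
    "F": {"voicing": "voiceless", "place": "labial"},
    "S": {"voicing": "voiceless", "place": "apical"},
}

# Raw observations: only the segment identity and the measurement.
observations = [("B", 0.05), ("D", 0.06), ("P", 0.14), ("S", 0.09)]

# "Join" by looking each segment up in the feature table.
merged = [
    {"fol_seg": seg, "dur": dur, **features[seg]}
    for seg, dur in observations
]
```

The point is that the feature columns are derived, not hand-typed per row: record only the segment, and the lookup table fills in the rest.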
Be sure to answer this question: How can I preserve a record of this observation in such a way that I can quickly return to it and gather more data on it if necessary? If you fail to successfully answer this question, then you’ll be lost in the woods if you ever want to restudy, and the only way home is to replicate the study from scratch. For research involving speech data, keep a record of the coding you’re doing in a Praat TextGrid.
Give meaningful names both to predictor columns and to the labels of nominal observations. Keeping a readme describing the data is still a good idea, but at least now the data is approachable at first glance.
When we store data, it should be:
Raw Raw data is the most useful data. It’s impossible to move down to smaller granularity from a coarser, summarized granularity. Summary tables etc. are nice for publishing in a paper document, but raw data is what we need for asking novel research questions with old data.
Open formatted Do not use proprietary database software for long term storage of your data. I have heard enough stories about interesting data sets that are no longer accessible for research, either because the software they were stored in is defunct, or because current versions are not backwards compatible. At that point, your data is property of Microsoft, or whoever. Store your data as raw text, delimited in some way (I prefer tabs).
Consistent I think this is most important when you may have data in many separate files. Each file and its headers should be consistently named and formatted. They should also be consistently delimited and commented. There is nothing worse than erratic comments, labels, headers, or NA characters in a corpus.
Documented Produce a readme describing the data, how it was collected and processed, and describe every variable and its possible values.
Wickham (2014) identifies the following properties of tidy data.
I’m going to focus on 1 and 2.
Let’s return to our example where we’re trying to explore the relationship between vowel duration and following segments. Here’s a table, like you might see published in a paper, that contains the mean duration of four vowels in six different segmental contexts.
The variables in this table are
Each observation is
Table 1 violates the principles of tidy data in the following ways:
In order to conform to the tidy data format, we need:
That’s going to look like Table 2.
With the data in this format, it’s possible to begin doing visualization & analysis.
ggplot(melt_durs, aes(plt_vclass, mean_dur, fill = consonant)) +
  geom_bar(position = "dodge", color = "black", stat = "identity") +
  scale_fill_hue(limits = c("B", "P", "F", "D", "T", "S"))
In R, there are two key packages for tidying data: tidyr and dplyr.
In Python, similar functionality can be found in the pandas library.
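To make the idea concrete, here is a sketch of this kind of wide-to-long “gathering” in plain Python, with a made-up two-vowel, two-consonant table (standard library only; real work would use pandas.melt or tidyr::gather):

```python
# A made-up wide-format table: one row per vowel, one column per consonant.
wide = [
    {"vclass": "iy", "B": 94, "D": 116},
    {"vclass": "ow", "B": 79, "D": 132},
]

# "Gather" the consonant columns into key/value pairs, yielding
# one row per (vowel, consonant) observation: long, tidy format.
long_rows = [
    {"vclass": row["vclass"], "consonant": c, "mean_dur": row[c]}
    for row in wide
    for c in ("B", "D")
]
```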
When doing data analysis, you’re going to find yourself doing these following steps a lot:
Let’s take the tidy data from before:
dur_data <- data.frame(plt_vclass = rep(c("ey", "iy", "ow"), 6),
                       consonant = rep(c("B", "D", "F", "P", "S", "T"), each = 3),
                       mean_dur = c(128, 94, 79, 133, 116, 132,
                                    170, 110, 149, 164, 107, 98,
                                    133, 112, 95, 136, 135, 113))
One thing we might want to calculate is the average duration of each vowel. To do that, we’ll apply the mean() function to the mean_dur column within each vowel class. This is a Split-Apply-Combine problem:

First, split the data up into subsets based on the vowel class.

In each subset, calculate the average duration.

Combine these results into a new table.
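The same three steps can be sketched in plain Python (standard library only), using the vowel classes and durations from dur_data above:

```python
from statistics import mean

# The plt_vclass and mean_dur columns of dur_data.
vclass = ["ey", "iy", "ow"] * 6
durs = [128, 94, 79, 133, 116, 132, 170, 110, 149,
        164, 107, 98, 133, 112, 95, 136, 135, 113]

# Split: group the durations by vowel class.
groups = {}
for v, d in zip(vclass, durs):
    groups.setdefault(v, []).append(d)

# Apply + Combine: average each group, collecting the results
# into a new table (here, a dict of vowel class -> mean duration).
mean_durs = {v: mean(ds) for v, ds in groups.items()}
```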
The relatively new dplyr package in R is designed to implement this Split-Apply-Combine workflow in an easy to read fashion. Its key functionality derives from the %>% operator (read %>% as “pipe”).

The way %>% works is that it takes the data frame on its left side, and inserts it as the first argument to the function on its right side. For example, the head() function prints the first 6 rows of a data frame:

head(dur_data)
##   plt_vclass consonant mean_dur
## 1         ey         B      128
## 2         iy         B       94
## 3         ow         B       79
## 4         ey         D      133
## 5         iy         D      116
## 6         ow         D      132
To do the same with %>%, you’d do it like this:
dur_data %>% head()
##   plt_vclass consonant mean_dur
## 1         ey         B      128
## 2         iy         B       94
## 3         ow         B       79
## 4         ey         D      133
## 5         iy         D      116
## 6         ow         D      132
How useful is that really? Not very, until you start chaining functions together. If you wanted to get the number of rows in the data frame after you’ve applied head() to it, normally you’d write it out like this:

nrow(head(dur_data))
##  6
Nested functions are kind of tough to read. You need to read them from the inside out. With dplyr, you can chain each function you want to use with %>%:
dur_data %>% head() %>% nrow()
##  6
The way to read that is “Take the dur_data data frame, and pipe it into head(). Then take the output of head() and pipe it into nrow().”
dplyr comes with a few “verbs” specially developed for chaining together.
| verb | description |
|------|-------------|
| filter() | This works almost exactly like subset() |
| summarise() | This takes a data frame, and outputs a new data frame based on the summary you asked for |
| mutate() | This takes a data frame, and adds additional columns based on the formula you give it |
| select() | This takes a data frame, and returns only the columns you ask for |
| arrange() | Reorders the rows of the data frame |
| group_by() | Defines sub-groupings in a data frame |
The group_by() function is the crucial one for doing Split-Apply-Combine in dplyr. First, let’s look at how we’ll use the summarise() verb:
dur_data %>% summarise(dur = mean(mean_dur))
##        dur
## 1 122.4444
By just passing dur_data to summarise(), it creates a new data frame with one column, dur. The value of dur is calculated by applying mean() to the mean_dur column. It’s possible to create as many columns as you want like this:
dur_data %>%
  summarise(dur = mean(mean_dur),
            dur_sd = sd(mean_dur),
            n = length(mean_dur),
            arbitrary = "foo")
##        dur   dur_sd  n arbitrary
## 1 122.4444 24.32675 18       foo
The summarise() verb gets more powerful in combination with group_by():
dur_data %>%
  group_by(plt_vclass) %>%  ## Grouping the data by vowel
  summarise(dur = mean(mean_dur),
            dur_sd = sd(mean_dur),
            n = length(mean_dur),
            arbitrary = "foo")
## Source: local data frame [3 x 5]
## 
##   plt_vclass      dur   dur_sd n arbitrary
## 1         ey 144.0000 18.09972 6       foo
## 2         iy 112.3333 13.39652 6       foo
## 3         ow 111.0000 25.83796 6       foo
A general strategy for cracking difficult Split-Apply-Combine nuts would be to first figure out how to solve the problem for a subset of the data, then try to figure out how to generalize it.
Let’s ask the following question: how much do speakers vary with respect to vowel centralization? Shorter vowels tend to be more centralized, as do vowels in more frequent words. We’ll investigate this question using data from the phoneticChange package, which can be installed like so:
library(devtools)
install_github("jofrhwld/phoneticChange")
library(phoneticChange)
library(magrittr)
First, let’s trim down the data a little bit, just to look at the data we’re interested in.
ay <- ays %>%
  filter(plt_vclass == "ay",
         !word %in% c("i", "my")) %>%  ## I and MY are super frequent pronouns
  select(idstring, sex, age, year, F1_n, F2_n, dur, SUBTLWF) %>%
  mutate(dur_ms = dur * 1000,
         logdur = log10(dur_ms),
         center_dur = logdur - median(logdur),
         zipf = log10(SUBTLWF) + 3) %>%  ## The "Zipf scale", after http://crr.ugent.be/archives/1352
  select(idstring, sex, age, year, F1_n, F2_n, center_dur, zipf)
head(ay)
##    idstring sex age year        F1_n        F2_n  center_dur     zipf
## 1 PH00-1-1-   m  21 2000  1.75932668 -0.50321928 -0.19188553 5.252853
## 2 PH00-1-1-   m  21 2000  0.06850389  0.04812934 -0.44715803 5.290035
## 3 PH00-1-1-   m  21 2000 -0.08123688 -0.85704126 -0.36797679       NA
## 4 PH00-1-1-   m  21 2000  1.46109299 -0.51536493  0.17609126 6.291952
## 5 PH00-1-1-   m  21 2000  1.56341584  0.01488861  0.10914447 5.882302
## 6 PH00-1-1-   m  21 2000  1.81423162 -0.40956992  0.08432089 5.290035
nrow(ay)

##  22672
First, we’ll take out the data from one speaker:
one_speaker <- ay %>% filter(idstring == "PH00-1-1-")
We can estimate the effect of duration on F1 and F2 of /ay/.
f1_model = lm(F1_n ~ center_dur, data = one_speaker)
f1_model
## 
## Call:
## lm(formula = F1_n ~ center_dur, data = one_speaker)
## 
## Coefficients:
## (Intercept)   center_dur
##        1.44         2.03
f2_model = lm(F2_n ~ center_dur, data = one_speaker)
f2_model
## 
## Call:
## lm(formula = F2_n ~ center_dur, data = one_speaker)
## 
## Coefficients:
## (Intercept)   center_dur
##    -0.45371     -0.08705
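For comparison, the same simple regression can be computed by hand with the closed-form least-squares formulas. A sketch in plain Python (standard library only; the duration and F1 values below are made up for illustration, not data from the phoneticChange package):

```python
from statistics import mean

# Made-up (centered log duration, normalized F1) pairs for one speaker.
x = [-0.2, -0.1, 0.0, 0.1, 0.3]
y = [1.1, 1.3, 1.4, 1.7, 2.0]

# Simple linear regression, y ~ x:
#   slope = cov(x, y) / var(x);  intercept = mean(y) - slope * mean(x)
mx, my = mean(x), mean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
```

This is exactly what lm() estimates for a one-predictor model; fitting it per speaker is then just Split-Apply-Combine over the speaker groupings.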
speaker_models <- ay %>%
  group_by(idstring) %>%
  filter(n() > 40) %>%
  do(f1_model = lm(F1_n ~ center_dur, data = .),
     f2_model = lm(F2_n ~ center_dur, data = .))

speaker_parameters <- speaker_models %>%
  rowwise() %>%
  do(data.frame(idstring = .$idstring,
                f1_intercept = coef(.$f1_model)[1],
                f1_slope = coef(.$f1_model)[2],
                f2_intercept = coef(.$f2_model)[1],
                f2_slope = coef(.$f2_model)[2]))
## Warning in rbind_all(out[]): Unequal factor levels: coercing to
## character
ggplot(speaker_parameters, aes(f2_slope, f1_slope)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(color = 'red') +
  scale_y_reverse() +
  scale_x_reverse() +
  coord_fixed()
library(tidyr)
tidy_params <- speaker_parameters %>%
  gather(formant_param, estimate, f1_intercept:f2_slope) %>%
  separate(formant_param, c("formant", "parameter"), sep = "_") %>%
  spread(parameter, value = estimate)
tidy_params
## Source: local data frame [442 x 4]
## 
##     idstring formant  intercept       slope
## 1  PH00-1-1-      f1  1.4398331  2.02999192
## 2  PH00-1-1-      f2 -0.4537147 -0.08704735
## 3  PH00-1-2-      f1  1.3733503  2.03325076
## 4  PH00-1-2-      f2 -0.5786406 -0.58856425
## 5  PH00-1-3-      f1  1.5592401  2.09601622
## 6  PH00-1-3-      f2 -0.3787459  0.18241027
## 7  PH00-1-4-      f1  1.5342921  1.37077420
## 8  PH00-1-4-      f2 -0.5535961 -0.17450608
## 9  PH00-1-5-      f1  1.3693443  1.10219011
## 10 PH00-1-5-      f2 -0.3890248  0.14457321
## ..       ...     ...        ...         ...
ggplot(tidy_params, aes(intercept, slope)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  facet_wrap(~formant, scales = "free")
Reinhart, Carmen M., and Kenneth S. Rogoff. “Growth in a Time of Debt.” American Economic Review 100.2 (2010): 573-578.↩
Herndon, Thomas, Michael Ash, and Robert Pollin. “Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff.” Cambridge Journal of Economics 38.2 (2014): 257-279.↩
Mélard, Guy. “On the accuracy of statistical procedures in Microsoft Excel 2010.” Computational Statistics 29.5 (2014): 1095-1128.↩