This week is mostly going to be about how you should strive to organize your data. Having a good idea of how the data should be organized before you start collecting will make your eventual data analysis easier.

Data Collection and Storage

General Principles of Data Collection

Over-collect

When collecting data in the first place, over-collect if at all possible. The world is a very complex place, so there is no way you could cram it all into a bottle, but give it your best shot! If during the course of your data analysis, you find that it would have been really useful to have data on, say, duration, as well as formant frequencies, it becomes very costly to recollect that data, especially if you haven’t laid the proper trail for yourself.

Preserve HiD Info

If, for instance, you’re collecting data on the effect of voicing on preceding vowel duration, preserve high dimensional codings, like the lexical item, or the transcription of the following segment. These high dimensional codings probably won’t be too useful for your immediate analysis, but they will allow you to procedurally extract additional features from them at a later time. By preserving your high dimensional information, you’re preserving the data’s usefulness for your own later reanalysis, as well as for future researchers.
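
For example (a purely hypothetical sketch, with made-up object and column names), if you preserved the transcription of the following segment, you could derive a voicing column later without re-collecting anything:

  # hypothetical data frame and column names, just for illustration
  voiced_segments <- c("b", "d", "g", "v", "z")
  vowels$following_voiced <- vowels$following_seg %in% voiced_segments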

Leave A Trail of Crumbs

Be sure to answer this question: How can I preserve a record of this observation in such a way that I can quickly return to it and gather more data on it if necessary? If you fail to answer this question, then you’ll be lost in the woods if you ever want to revisit the data, and the only way home will be to replicate the study from scratch.

Give Meaningful Names

Give meaningful names both to predictor columns and to the labels of nominal observations. Keeping a readme describing the data is still a good idea, but at least this way the data is approachable at first glance.
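
As a small, made-up illustration of the difference:

  # cryptic: you need the readme just to figure out what you're looking at
  df1 <- data.frame(v1 = c("m", "f"), dv = c(1, 0))

  # meaningful: approachable at first glance
  df2 <- data.frame(gender = c("male", "female"), is_ing = c(1, 0))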

Distinguish between 0 and NA

A 0 is a real measurement: you looked, and the value was zero. NA means the observation is missing: you didn’t, or couldn’t, collect it. Coding missing observations as 0 will silently distort your summaries.
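
A quick illustration in R of why the distinction matters:

  fruit_eaten <- c(2, 0, NA)       # 0 = ate none; NA = we don't know
  mean(fruit_eaten)                # NA: the missing value propagates
  mean(fruit_eaten, na.rm = TRUE)  # 1: the mean of the values we actually observed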

Storing Data

When we store data, it should be:

  1. Raw. Raw data is the most useful data. It’s impossible to move down to a smaller granularity from a coarser, summarized granularity. Summary tables etc. are nice for publishing in a paper document, but raw data is what we need for asking novel research questions with old data. Also, it will make Tim Berners-Lee happy.

  2. Open formatted. Do not use proprietary database software for long term storage of your data. I have heard enough stories about interesting data sets that are no longer accessible for research, either because the software they are stored in is defunct, or because current versions are not backwards compatible. At that point, your data is the property of Microsoft, or whoever. Store your data as raw text, delimited in some way (I prefer tabs); see the sketch after this list.

  3. Consistent. I think this is most important when you have data in many separate files. Each file and its headers should be consistently named and formatted, and the files should be consistently delimited and commented. There is nothing worse than erratic comments, labels, headers, or NA characters in a corpus.

  4. Documented. Produce a readme describing the data, how it was collected and processed, and describing every variable and its possible values.
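
Writing and reading plain, tab-delimited text is easy in base R (a minimal sketch; my_data and the file name are placeholders):

  # write a data frame as plain, tab-delimited text
  write.table(my_data, file = "my_data.txt", sep = "\t",
              quote = FALSE, row.names = FALSE)

  # read it back in; read.delim() assumes tab-delimited text by default
  my_data <- read.delim("my_data.txt")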

Structuring Data

A good paper to skim is Hadley Wickham’s paper on tidy data.

For any given data set there will be two kinds of variables.

ID Variables: These variables are identifiers or features of each unique observation. Essentially anything you are testing to see if it has an effect on outcomes will be an ID variable.

Measure Variables: These variables record your measurement of each unique observation.

What counts as an ID Variable or a Measure Variable will depend upon the study. For instance, in most studies the gender of a person will be an ID Variable, and something about the subject’s response will be a Measure Variable. However, if you’re doing a study of whether men or women are more likely to show up to your experiment, the gender of the subject would be a Measure Variable.

You’re going to want to set up your data so that every row is an observation, consisting of a unique combination of ID variables, and then an additional column for each measure variable. For example, if you wanted to study the mismatch between how much fruit people bought versus how much they ate, you’d set up a table like this:

Person Fruit Bought Ate
John Apple 5 1
John Orange 5 3
Mary Apple 3 2
Mary Orange 4 3
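
In R, you could enter this table directly as a data frame (a minimal sketch):

  # one row per unique combination of ID variables (Person, Fruit),
  # and one column per measure variable (Bought, Ate)
  fruit <- data.frame(Person = c("John", "John", "Mary", "Mary"),
                      Fruit  = c("Apple", "Orange", "Apple", "Orange"),
                      Bought = c(5, 5, 3, 4),
                      Ate    = c(1, 3, 2, 3))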

The ID Variables are Person and Fruit, and the measure variables are Bought and Ate. There are, of course, a lot of different ways you could report this data. For example, there is this way of summarizing things:

        Apples       Oranges
Person  Bought  Ate  Bought  Ate
John         5    1       5    3
Mary         3    2       4    3

If you’re still in a very spreadsheety way of thinking, you might be tempted to store one worksheet for every person’s data. Don’t do that.

Summarizing Data

table()

The very simplest function for summarising data in R is the table() function.

  library("devtools")
  install_github("jofrhwld/languageVariationAndChangeData")
  library("languageVariationAndChangeData")
  library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  library("knitr")
  table(buckeye$DepVar)
## 
##   del  glot palat   ret 
## 18272  1665    26  7711
  table(buckeye$DepVar, buckeye$Gram2)
##        
##          and justT mono nochange   nt past semiweak stemchange went
##   del   9442    14 4912       14 2824  702      123         37  204
##   glot    18    11  697        0  840    8       28          0   63
##   palat    0     0    0        0    0   26        0          0    0
##   ret    794    28 4237       10  498 1722      255         34  133

The dplyr package

The dplyr package is the latest and greatest thing to be added to CRAN. It’s set up to work really well with databases, and it operates by chaining functions together with the %>% operator.

The %>% (“pipe”)


We’ll pronounce %>% as “pipe”.

The way %>% works is it takes a data frame on the left side, and inserts it as the first argument to the function on its right side. Normally, you’d look at the first 6 rows of a data frame this way:

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

With %>%, you’d do it like this:

  ing %>% head()
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

How useful is that really? Not very until you start chaining them together. If you wanted to get the number of rows in the data frame after you’ve applied head() to it, normally you’d write it out like this:

  nrow(head(ing))
## [1] 6

Nested functions are kind of tough to read. You need to read them from the inside out. With dplyr, you can chain each function you want to use with %>%.

  ing %>% head() %>% nrow()
## [1] 6

The way to read that is “Take the ing data frame, and pipe it into head(). Then take the output of head() and pipe it into nrow().”

Verbs

dplyr comes with a few “verbs” specially developed for chaining together.

verb description
filter() This works almost exactly like subset()
summarise() This takes a data frame, and outputs a new data frame based on the summary you asked for
mutate() This takes a data frame, and adds additional columns based on the formula you give it
select() This takes a data frame, and returns only the columns you ask for
arrange() Reorders the rows of the data frame
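
Here’s a quick look at select() and arrange() chained together (just a sketch on the ing data):

  # keep just two columns, then sort the rows by Age
  ing %>%
    select(Token, Age) %>%
    arrange(Age) %>%
    head()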

Let’s test this out first by seeing how many gerund tokens in ing were pronounced “-ing” and how many were pronounced “-in”.

  ing %>%
    filter(DepVar == "Ing", GramStatus == "gerund") %>%
    nrow()
## [1] 85
  ing %>%
    filter(DepVar == "In", GramStatus == "gerund") %>%
    nrow()
## [1] 28

The summarise() function is also pretty useful. Let’s use it on the joe_vowels data frame, which is a data set of my own vowel measurements.

  head(joe_vowels)
##   year plt_vclass    word   dur F1_20 F1_35 F1_80  F2_20  F2_35  F2_80
## 1 2006        Tuw      DO 0.099 470.9 451.1 459.3 1388.2 1288.9 1307.2
## 2 2006        Tuw      DO 0.150 455.0 496.2 497.7 1583.0 1334.9 1129.6
## 3 2006        ahr     ARE 0.180 546.5 615.2 557.0  956.8  975.1 1250.2
## 4 2006        *hr WORKING 0.060 538.8 538.0 473.5 1380.4 1380.1 1463.3
## 5 2006          i WORKING 0.050 534.0 534.0 533.8 1967.9 1967.9 1785.6
## 6 2006        owr      OR 0.170 440.8 417.4 499.5  754.5  712.4 1056.4
  joe_vowels%>%
    summarise(mean_dur = mean(dur),
              mean_F1_35 = mean(F1_35),
              mean_F2_35 = mean(F2_35))
##     mean_dur mean_F1_35 mean_F2_35
## 1 0.09844073   529.1255   1447.538

It took in the whole joe_vowels data frame, and returned a data frame with a column for each formula we defined.


Split-Apply-Combine

When doing data analysis, you’re going to find yourself doing the following steps a lot:

  1. Splitting the data up into subsets.
  2. Applying some kind of function to those subsets.
  3. Combining the results back together.

For example, take this small data set:

  cheese <- data_frame(cheese = rep(c("blue","cheddar","brie"), times = c(3,3,2)),
                       turned_right = c(0.7, 0.69, 0.8, 0.9, 0.85, 0.6, 0.65, 0.7))
cheese turned_right
blue 0.70
blue 0.69
blue 0.80
cheddar 0.90
cheddar 0.85
cheddar 0.60
brie 0.65
brie 0.70

One thing we might want to know is what the average turning-right proportion is for each cheese type.

Split the data up

First, split the data up into subsets based on the “cheese” column:

cheese turned_right
blue 0.70
blue 0.69
blue 0.80
cheese turned_right
brie 0.65
brie 0.70
cheese turned_right
cheddar 0.90
cheddar 0.85
cheddar 0.60


Apply some function to the data

In each subset, calculate the average turning rate.

cheese mean_right_turn
blue 0.73
cheese mean_right_turn
brie 0.675
cheese mean_right_turn
cheddar 0.7833333


Combine the result

Combine these results into a new table.

cheese mean_right_turn
blue 0.7300000
brie 0.6750000
cheddar 0.7833333
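
Just as an illustration, these three steps can also be done by hand in base R with split() and sapply():

  # split: break the data frame into a list of subsets, one per cheese type
  cheese_subsets <- split(cheese, cheese$cheese)

  # apply & combine: take the mean of turned_right in each subset;
  # sapply() combines the results into a named vector
  sapply(cheese_subsets, function(d) mean(d$turned_right))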

Split-Apply-Combine in dplyr

The dplyr verbs were constructed exactly for this purpose.

  cheese %>% 
    group_by(cheese) %>%
    summarise(mean_right_turn = mean(turned_right))
## # A tibble: 3 x 2
##    cheese mean_right_turn
##     <chr>           <dbl>
## 1    blue       0.7300000
## 2    brie       0.6750000
## 3 cheddar       0.7833333

  1. Use group_by() to split the data frame up.
  2. Use summarise() to apply a function to every group.
  3. They’re automatically combined back together.

Data Analysis Recipes

Calculating a proportion

Counting up

The (ING) data:

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

Let’s calculate the observed proportion of the “-ing” variant overall. First, we’ll use group_by() and summarise() together to make a table of counts:

  ing %>%
    group_by(DepVar) %>%
    summarise(count = n())
## # A tibble: 2 x 2
##   DepVar count
##   <fctr> <int>
## 1     In   577
## 2    Ing   562

n() is a special function that returns the number of rows in the current group. You could replace it with something like this:

  ing %>%
    group_by(DepVar)%>%
    summarise(count = length(DepVar),
              another_count = length(GramStatus),
              Count3 = n())
## # A tibble: 2 x 4
##   DepVar count another_count Count3
##   <fctr> <int>         <int>  <int>
## 1     In   577           577    577
## 2    Ing   562           562    562

Coming back to our original group_by() %>% summarise() approach, we can create a new column with mutate(), plus a few other columns just to show how mutate() works.

  ing %>%
    group_by(DepVar) %>%
    summarise(count = n())%>%
    mutate(prop = count/sum(count),
           a_percentage = prop * 100,
           all = sum(count),
           the_most = max(count),
           the_least = min(count),
           lower_case = tolower(DepVar),
           upper_case = toupper(DepVar))
## # A tibble: 2 x 9
##   DepVar count      prop a_percentage   all the_most the_least lower_case
##   <fctr> <int>     <dbl>        <dbl> <int>    <int>     <int>      <chr>
## 1     In   577 0.5065847     50.65847  1139      577       562         in
## 2    Ing   562 0.4934153     49.34153  1139      577       562        ing
## # ... with 1 more variables: upper_case <chr>

If we just wanted to know what the percentage of “-ing” was, using this method we need to filter down to just the “-ing” variant.

  ing %>%
    group_by(DepVar) %>%
    summarise(count = n())%>%
    mutate(prop = count/sum(count)) %>%
    filter(DepVar == "Ing")
## # A tibble: 1 x 3
##   DepVar count      prop
##   <fctr> <int>     <dbl>
## 1    Ing   562 0.4934153

Averaging 0 and 1

Let’s flip an unfair coin 100 times.

  flips <- sample(c(0,1), 100, replace = T, prob = c(0.3, 0.7))

We could calculate the proportion heads this way:

  table(flips)
## flips
##  0  1 
## 33 67
  flips_tab <- table(flips)
  flips_tab/sum(flips_tab)
## flips
##    0    1 
## 0.33 0.67
  flip_prop <- flips_tab/sum(flips_tab)
  flip_prop["1"]
##    1 
## 0.67

This is basically the approach we took above, but simply taking the mean of flips gives the same answer.

  mean(flips)
## [1] 0.67

But the ing data frame doesn’t have any column coded as 0 and 1.

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

For this example, let’s just focus on the first 10 tokens.

  ing_snip <- ing$DepVar[1:10]
  ing_snip
##  [1] Ing Ing Ing In  Ing In  In  Ing Ing Ing
## Levels: In Ing

By using an R trick, we can convert this into a vector of 0s and 1s. First, we need to decide what we want to call 1 and what we want to call 0. Let’s go with this coding:

variant code
Ing 1
In 0

First, we’ll create a vector of T and F values.

  ing_snip == "Ing"
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

Logical values like this can be coerced into being 0 and 1 values:

value code
TRUE 1
FALSE 0

  (ing_snip == "Ing") * 1
##  [1] 1 1 1 0 1 0 0 1 1 1

Comparing methods:

  ing_snip_tab <- table(ing_snip)
  (ing_snip_tab/sum(ing_snip_tab))["Ing"]
## Ing 
## 0.7
  mean((ing_snip=="Ing")*1)
## [1] 0.7

We can scale this up easily in dplyr by creating a new column of 0s and 1s with mutate():

  ing %>%
    mutate(is_ing = (DepVar == "Ing")*1)%>%
    head()
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish
##   is_ing
## 1      1
## 2      1
## 3      1
## 4      0
## 5      1
## 6      0

Now, instead of doing group_by(DepVar), we can just take the average of is_ing:

  ing %>%
    mutate(is_ing = (DepVar == "Ing")*1)%>%
    summarise(ing_prop = mean(is_ing))
##    ing_prop
## 1 0.4934153

More interesting group_by()

group_by() gets more interesting when we split the data by a real predictor. Here we group by GramStatus, calculate the “-ing” rate for each grammatical class, and arrange() the result by that rate:

  ing %>%
    mutate(is_ing = (DepVar == "Ing")*1)%>%
    group_by(GramStatus) %>%
    summarise(total = n(),
              ing_prop = mean(is_ing))%>%
    arrange(ing_prop)
## # A tibble: 7 x 3
##    GramStatus total  ing_prop
##        <fctr> <int>     <dbl>
## 1      during     9 0.2222222
## 2 progressive   464 0.3750000
## 3  participle   309 0.4012945
## 4       thing   110 0.5090909
## 5      gerund   113 0.7522124
## 6   adjective    68 0.8970588
## 7        noun    66 0.9090909

Sequential group_by()

We can also group by more than one variable at a time. Each summarise() peels off the last grouping variable, so after counting Tokens within each GramStatus, a second summarise() operates within GramStatus. That lets us see how much of each grammatical class comes from its single most frequent word:

  ing %>%
    group_by(GramStatus, Token)%>%
    summarise(count = n()) %>%
    summarise(most_freq = max(count), 
              least_freq = min(count),
              total = sum(count))%>%
    mutate(prop_of_total = most_freq/total)%>%
    arrange(prop_of_total)
## # A tibble: 7 x 5
##    GramStatus most_freq least_freq total prop_of_total
##        <fctr>     <int>      <int> <int>         <dbl>
## 1      gerund         7          1   113    0.06194690
## 2  participle        21          1   309    0.06796117
## 3 progressive        46          1   464    0.09913793
## 4        noun        11          1    66    0.16666667
## 5   adjective        22          1    68    0.32352941
## 6       thing        87          1   110    0.79090909
## 7      during         9          9     9    1.00000000

For some of these grammatical classes, just one particular word accounts for most of the data for the whole class. There might be something weird about these super-frequent words, and they might skew the overall calculation for the grammatical class. It’s a good idea to “flatten” out the effects of these words by calculating each word’s average first, then calculating the average “-ing” rate for the class from those word averages.

  ing %>%
    mutate(is_ing = DepVar == "Ing")%>%
    group_by(GramStatus, Token)%>%
    summarise(prop_ing = mean(is_ing))
## Source: local data frame [373 x 3]
## Groups: GramStatus [?]
## 
##    GramStatus        Token prop_ing
##        <fctr>       <fctr>    <dbl>
## 1   adjective  aggravating      1.0
## 2   adjective      amazing      0.8
## 3   adjective       boring      1.0
## 4   adjective       caring      1.0
## 5   adjective     charming      1.0
## 6   adjective      closing      1.0
## 7   adjective compromising      0.0
## 8   adjective        dying      1.0
## 9   adjective    easygoing      1.0
## 10  adjective  embarassing      1.0
## # ... with 363 more rows

Notice this:

  1. There is one row for each word in each grammatical class.
  2. After running summarise(), the innermost grouping variable (Token) has been dropped, so the result is now grouped by GramStatus only.

  ing %>%
    mutate(is_ing = DepVar == "Ing")%>%
    group_by(GramStatus, Token)%>%
    summarise(prop_ing = mean(is_ing))%>%
    summarise(prop_ing = mean(prop_ing))%>%
    arrange(prop_ing)
## # A tibble: 7 x 2
##    GramStatus  prop_ing
##        <fctr>     <dbl>
## 1      during 0.2222222
## 2 progressive 0.4607932
## 3  participle 0.4631897
## 4       thing 0.6685824
## 5      gerund 0.7523810
## 6   adjective 0.8531250
## 7        noun 0.9188492

There’s probably something weird about “going”, so let’s exclude it and recalculate:

  ing %>%
    filter(Token != "going")%>%
    mutate(is_ing = DepVar == "Ing")%>%
    group_by(GramStatus, Token)%>%
    summarise(prop_ing = mean(is_ing))%>%
    summarise(prop_ing = mean(prop_ing))%>%
    arrange(prop_ing)
## # A tibble: 7 x 2
##    GramStatus  prop_ing
##        <fctr>     <dbl>
## 1      during 0.2222222
## 2 progressive 0.4628213
## 3  participle 0.4652176
## 4       thing 0.6685824
## 5      gerund 0.7632850
## 6   adjective 0.8483871
## 7        noun 0.9188492
