This week is mostly going to be about how you should strive to organize your data. Having a good idea of how the data should be organized before you start collecting will make your eventual data analysis easier.

# Data Collection and Storage

## General Principles of Data Collection

### Over-collect

When collecting data in the first place, over-collect if at all possible. The world is a very complex place, so there is no way you could cram it all into a bottle, but give it your best shot! If during the course of your data analysis, you find that it would have been really useful to have data on, say, duration, as well as formant frequencies, it becomes very costly to recollect that data, especially if you haven’t laid the proper trail for yourself.

### Preserve HiD Info

If, for instance, you’re collecting data on the effect of voicing on preceding vowel duration, preserve high dimensional data coding, like Lexical Item, or the transcription of the following segment. These high dimensional codings probably won’t be too useful for your immediate analysis, but they will allow you to procedurally extract additional features from them at a later time. By preserving your high dimensional information, you’re preserving the data’s usefulness for your own later reanalysis, as well as for future researchers.
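
As a sketch of what this buys you later (the data frame and column names here are hypothetical), a coarse coding like voicing can always be derived from a preserved transcription, but never the other way around:

```r
# A hypothetical data frame with a preserved high-dimensional column
# (`following_seg`, the transcription of the following segment).
vowels <- data.frame(word = c("bad", "bat", "bag"),
                     following_seg = c("d", "t", "g"),
                     duration = c(0.21, 0.15, 0.22))

# Because the transcription was preserved, a lower-dimensional feature
# like voicing can be extracted procedurally at any later time:
vowels$following_voicing <- ifelse(vowels$following_seg %in% c("b", "d", "g"),
                                   "voiced", "voiceless")
```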

### Leave A Trail of Crumbs

Be sure to answer this question: How can I preserve a record of this observation in such a way that I can quickly return to it and gather more data on it if necessary? If you fail to successfully answer this question, then you’ll be lost in the woods if you ever want to restudy, and the only way home is to replicate the study from scratch.

### Give Meaningful Names

Give meaningful names both to predictor columns and to the labels of nominal observations. Keeping a readme describing the data is still a good idea, but at least now the data is approachable at first glance.
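
A quick sketch of the difference (all names here are made up):

```r
# Cryptic names: impossible to interpret without a trip to the readme.
bad <- data.frame(v1 = c("s01", "s02"), v2 = c("a", "b"), v3 = c(312, 428))

# Meaningful names: the same data, approachable at first glance.
good <- data.frame(subject   = c("s01", "s02"),
                   condition = c("a", "b"),
                   rt_ms     = c(312, 428))
```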

## Storing Data

When we store data, it should be:

1. Raw: Raw data is the most useful data. It’s impossible to move down to a smaller granularity from a coarser, summarized granularity. Summary tables etc. are nice for publishing in a paper document, but raw data is what we need for asking novel research questions with old data. Also, it will make Tim Berners-Lee happy.

2. Open formatted: Do not use proprietary database software for long term storage of your data. I have heard enough stories about interesting data sets that are no longer accessible for research either because the software they are stored in is defunct, or because current versions are not backwards compatible. At that point, your data is the property of Microsoft, or whoever. Store your data as raw text, delimited in some way (I prefer tabs).

3. Consistent: I think this is most important when you have data in many separate files. Each file and its headers should be consistently named and formatted, and the files should be consistently delimited and commented. There is nothing worse than erratic comments, labels, headers, or NA characters in a corpus.

4. Documented: Produce a readme describing the data, how it was collected and processed, and every variable and its possible values.
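
As a sketch of points 1 and 2, raw tab-delimited text is easy to write and easy to read back from any software (the data frame and file name here are made up):

```r
# A small raw data frame, stored as tab-delimited plain text.
vowels <- data.frame(word = c("bad", "bat"),
                     duration = c(0.21, 0.15))

out_file <- file.path(tempdir(), "vowels.txt")
write.table(vowels, file = out_file,
            sep = "\t", row.names = FALSE, quote = FALSE)

# Decades from now, any software can still read this back:
vowels_again <- read.delim(out_file)
```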

# Structuring Data

A good paper to skim is Hadley Wickham’s paper on tidy data.

For any given data set there will be two kinds of variables.

ID Variables: These variables are identifiers or features of each unique observation. Essentially anything you are testing to see if it has an effect on outcomes will be an ID variable.

Measure Variables: These variables record your measurement of each unique observation.

What counts as an ID Variable or a Measure Variable will depend upon the study. For instance, in most studies the gender of a person will be an ID Variable, and something about the subject’s response will be a Measure Variable. However, if you’re doing a study of whether men or women are more likely to show up to your experiment, gender of subject would be a Measure Variable.

You’re going to want to set up your data so that every row is an observation, consisting of a unique combination of ID variables, and then an additional column for each measure variable. For example, if you wanted to study the mismatch between how much fruit people bought versus how much they ate, you’d set up a table like this:

Person Fruit  Bought Ate
John   Apple  5      1
John   Orange 5      3
Mary   Apple  3      2
Mary   Orange 4      3

The ID Variables are Person and Fruit, and the measure variables are Bought and Ate. There are, of course, a lot of different ways you could report this data. For example, there is this way of summarizing things:

        Apples       Oranges
Person  Bought  Ate  Bought  Ate
John    5       1    5       3
Mary    3       2    4       3

If you’re still in a very spreadsheety way of thinking, you might be tempted to store one worksheet for every person’s data. Don’t do that.
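
Here is a sketch of the fruit table above in the one-row-per-observation format, and why it pays off (the numbers come from the table in the text):

```r
# The tidy long format: one row per unique combination of ID variables
# (Person, Fruit), one column per measure variable (Bought, Ate).
fruit <- data.frame(Person = c("John", "John", "Mary", "Mary"),
                    Fruit  = c("Apple", "Orange", "Apple", "Orange"),
                    Bought = c(5, 5, 3, 4),
                    Ate    = c(1, 3, 2, 3))

# With one row per observation, the bought/ate mismatch is a one-liner:
fruit$Uneaten <- fruit$Bought - fruit$Ate
```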

# Summarizing Data

## table()

The very simplest function for summarising data in R is the table() function.

  library("devtools")
  install_github("jofrhwld/languageVariationAndChangeData")
  library("languageVariationAndChangeData")
  library("dplyr")
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
  library("knitr")
  table(buckeye$DepVar)
##
##   del  glot palat   ret
## 18272  1665    26  7711

  table(buckeye$DepVar, buckeye$Gram2)
##
##          and justT mono nochange   nt past semiweak stemchange went
##   del   9442    14 4912       14 2824  702      123         37  204
##   glot    18    11  697        0  840    8       28          0   63
##   palat    0     0    0        0    0   26        0          0    0
##   ret    794    28 4237       10  498 1722      255         34  133

## The dplyr package

The dplyr package is the latest and greatest thing to be added to CRAN. It’s set up to work really well with databases, and operates by chaining together functions with a %>% operator.

### The %>% (“pipe”)

We’ll pronounce %>% as “pipe”. The way %>% works is that it takes a data frame on its left side, and inserts it as the first argument to the function on its right side. Normally, you’d look at the first 6 rows of a data frame this way:

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

With %>%, you’d do it like this:

  ing %>% head()
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

How useful is that, really? Not very, until you start chaining functions together. If you wanted to get the number of rows in the data frame after you’ve applied head() to it, normally you’d write it out like this:

  nrow(head(ing))
## [1] 6

Nested functions are kind of tough to read: you need to read them from the inside out. With dplyr, you can chain each function you want to use with %>%.
  ing %>% head() %>% nrow()
## [1] 6

The way to read that is “Take the ing data frame, and pipe it into head(). Then take the output of head() and pipe it into nrow().”

### Verbs

dplyr comes with a few “verbs” specially developed for chaining together.

verb         description
filter()     This works almost exactly like subset()
summarise()  This takes a data frame, and outputs a new data frame based on the summary you asked for
mutate()     This takes a data frame, and adds additional columns based on the formulas you give it
select()     This takes a data frame, and returns only the columns you ask for
arrange()    This reorders the rows of the data frame

Let’s test this out first by seeing how many gerund tokens in ing were pronounced “-ing” and “-in”:

  ing %>%
    filter(DepVar == "Ing", GramStatus == "gerund") %>%
    nrow()
## [1] 85

  ing %>%
    filter(DepVar == "In", GramStatus == "gerund") %>%
    nrow()
## [1] 28

The summarise() function is also pretty useful. Let’s use it on the joe_vowels data frame, which is a data set of my own vowel measurements.

  head(joe_vowels)
##   year plt_vclass    word   dur F1_20 F1_35 F1_80  F2_20  F2_35  F2_80
## 1 2006        Tuw      DO 0.099 470.9 451.1 459.3 1388.2 1288.9 1307.2
## 2 2006        Tuw      DO 0.150 455.0 496.2 497.7 1583.0 1334.9 1129.6
## 3 2006        ahr     ARE 0.180 546.5 615.2 557.0  956.8  975.1 1250.2
## 4 2006        *hr WORKING 0.060 538.8 538.0 473.5 1380.4 1380.1 1463.3
## 5 2006          i WORKING 0.050 534.0 534.0 533.8 1967.9 1967.9 1785.6
## 6 2006        owr      OR 0.170 440.8 417.4 499.5  754.5  712.4 1056.4

  joe_vowels %>%
    summarise(mean_dur = mean(dur),
              mean_F1_35 = mean(F1_35),
              mean_F2_35 = mean(F2_35))
##     mean_dur mean_F1_35 mean_F2_35
## 1 0.09844073   529.1255   1447.538

It took in the whole joe_vowels data frame, and returned a data frame with a column for each formula we defined.

# Split-Apply-Combine

When doing data analysis, you’re going to find yourself doing the following steps a lot:

1. Splitting the data up into subsets.
2. Applying some kind of function to those subsets.
3. Combining the results back together.

  cheese <- data_frame(cheese = rep(c("blue", "cheddar", "brie"),
                                    times = c(3, 3, 2)),
                       turned_right = c(0.7, 0.69, 0.8,
                                        0.9, 0.85, 0.6,
                                        0.65, 0.7))

cheese   turned_right
blue     0.70
blue     0.69
blue     0.80
cheddar  0.90
cheddar  0.85
cheddar  0.60
brie     0.65
brie     0.70

One thing we might want to know is what the average turning-right proportion is for each cheese type.

## Split the data up

First, split the data up into subsets based on the “cheese” column:

cheese  turned_right
blue    0.70
blue    0.69
blue    0.80

cheese  turned_right
brie    0.65
brie    0.70

cheese   turned_right
cheddar  0.90
cheddar  0.85
cheddar  0.60

## Apply some function to the data

In each subset, calculate the average turning rate.

cheese  mean_right_turn
blue    0.73

cheese  mean_right_turn
brie    0.675

cheese   mean_right_turn
cheddar  0.7833333

## Combine the result

Combine these results into a new table.

cheese   mean_right_turn
blue     0.7300000
brie     0.6750000
cheddar  0.7833333

## Split-Apply-Combine in dplyr

The dplyr verbs were constructed exactly for this purpose.

  cheese %>%
    group_by(cheese) %>%
    summarise(mean_right_turn = mean(turned_right))
## # A tibble: 3 x 2
##    cheese mean_right_turn
##     <chr>           <dbl>
## 1    blue       0.7300000
## 2    brie       0.6750000
## 3 cheddar       0.7833333

1. Use group_by() to split the data frame up.
2. Use summarise() to apply a function to every group.
3. The results are automatically combined back together.

# Data Analysis Recipes

## Calculating a proportion

### Counting up

The (ING) data:

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

Let’s calculate the observed proportion of the “-ing” variant overall.
First, we’ll use group_by() and summarise() together to make a table of counts:

  ing %>%
    group_by(DepVar) %>%
    summarise(count = n())
## # A tibble: 2 x 2
##   DepVar count
##   <fctr> <int>
## 1     In   577
## 2    Ing   562

n() is a special function that returns the number of rows in the current group. You could replace it with something like this:

  ing %>%
    group_by(DepVar) %>%
    summarise(count = length(DepVar),
              another_count = length(GramStatus),
              Count3 = n())
## # A tibble: 2 x 4
##   DepVar count another_count Count3
##   <fctr> <int>         <int>  <int>
## 1     In   577           577    577
## 2    Ing   562           562    562

Coming back to our original group_by() %>% summarise() approach, we can create a new column with mutate(), plus a few other columns just to show how mutate() works.

  ing %>%
    group_by(DepVar) %>%
    summarise(count = n()) %>%
    mutate(prop = count/sum(count),
           a_percentage = prop * 100,
           all = sum(count),
           the_most = max(count),
           the_least = min(count),
           lower_case = tolower(DepVar),
           upper_case = toupper(DepVar))
## # A tibble: 2 x 9
##   DepVar count      prop a_percentage   all the_most the_least lower_case
##   <fctr> <int>     <dbl>        <dbl> <int>    <int>     <int>      <chr>
## 1     In   577 0.5065847     50.65847  1139      577       562         in
## 2    Ing   562 0.4934153     49.34153  1139      577       562        ing
## # ... with 1 more variables: upper_case <chr>

If we just wanted to know the percentage of “-ing”, using this method we need to filter down to just the “-ing” variant.
  ing %>%
    group_by(DepVar) %>%
    summarise(count = n()) %>%
    mutate(prop = count/sum(count)) %>%
    filter(DepVar == "Ing")
## # A tibble: 1 x 3
##   DepVar count      prop
##   <fctr> <int>     <dbl>
## 1    Ing   562 0.4934153

### Averaging 0 and 1

Let’s flip an unfair coin 100 times.

  flips <- sample(c(0, 1), 100, replace = T, prob = c(0.3, 0.7))

We could calculate the proportion of heads this way:

  table(flips)
## flips
##  0  1
## 33 67

  flips_tab <- table(flips)
  flips_tab/sum(flips_tab)
## flips
##    0    1
## 0.33 0.67

  flip_prop <- flips_tab/sum(flips_tab)
  flip_prop["1"]
##    1
## 0.67

This is basically the approach we took above, but if we just took the average of flips, it would be equivalent.

  mean(flips)
## [1] 0.67

But the ing data frame doesn’t have any column coded 0 and 1.

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish

For this example, let’s just focus on the first 10 tokens.

  ing_snip <- ing$DepVar[1:10]
  ing_snip
##  [1] Ing Ing Ing In  Ing In  In  Ing Ing Ing
## Levels: In Ing

By utilizing an R trick, we can convert this into a vector of 0 and 1. First, we need to decide what we want to call 1 and what we want to call 0. Let’s go with this coding:

variant code
Ing     1
In      0

First, we’ll create a vector of T and F values.

  ing_snip == "Ing"
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

Logical values like this can be coerced into being 0 and 1 values:

variant code
T       1
F       0

  (ing_snip == "Ing") * 1
##  [1] 1 1 1 0 1 0 0 1 1 1

Comparing methods:

  ing_snip_tab <- table(ing_snip)
  (ing_snip_tab/sum(ing_snip_tab))["Ing"]
## Ing
## 0.7
  mean((ing_snip=="Ing")*1)
## [1] 0.7

We can scale this up easily in dplyr by creating a new column of 0 and 1 with mutate():

  ing %>%
    mutate(is_ing = (DepVar == "Ing")*1) %>%
    head()
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish
##   is_ing
## 1      1
## 2      1
## 3      1
## 4      0
## 5      1
## 6      0

Now, instead of doing group_by(DepVar), we can just take the average of is_ing:

  ing %>%
    mutate(is_ing = (DepVar == "Ing")*1) %>%
    summarise(ing_prop = mean(is_ing))
##    ing_prop
## 1 0.4934153

## More interesting group_by()

Grouping by GramStatus gives us the rate of “-ing” for each grammatical class:

  ing %>%
    mutate(is_ing = (DepVar == "Ing")*1) %>%
    group_by(GramStatus) %>%
    summarise(total = n(),
              ing_prop = mean(is_ing)) %>%
    arrange(ing_prop)
## # A tibble: 7 x 3
##    GramStatus total  ing_prop
##        <fctr> <int>     <dbl>
## 1      during     9 0.2222222
## 2 progressive   464 0.3750000
## 3  participle   309 0.4012945
## 4       thing   110 0.5090909
## 5      gerund   113 0.7522124
## 6   adjective    68 0.8970588
## 7        noun    66 0.9090909

## Sequential group_by()

Each summarise() peels off the innermost grouping variable, so after a group_by() with two variables, a second summarise() will summarise over the remaining one. Here, we first count how often each word occurs within each grammatical class:

  ing %>%
    group_by(GramStatus, Token) %>%
    summarise(count = n()) %>%
    summarise(most_freq = max(count),
              least_freq = min(count),
              total = sum(count)) %>%
    mutate(prop_of_total = most_freq/total) %>%
    arrange(prop_of_total)
## # A tibble: 7 x 5
##    GramStatus most_freq least_freq total prop_of_total
##        <fctr>     <int>      <int> <int>         <dbl>
## 1      gerund         7          1   113    0.06194690
## 2  participle        21          1   309    0.06796117
## 3 progressive        46          1   464    0.09913793
## 4        noun        11          1    66    0.16666667
## 5   adjective        22          1    68    0.32352941
## 6       thing        87          1   110    0.79090909
## 7      during         9          9     9    1.00000000

For some of these grammatical classes, just one particular word accounts for most of the data for the whole class. There might be something weird about these super-frequent words, and they might skew the overall calculation for the grammatical class. It’s a good idea to “flatten” out the effects of these words by calculating each word’s average first, then calculating the average “-ing” rate from those averages.

  ing %>%
    mutate(is_ing = DepVar == "Ing") %>%
    group_by(GramStatus, Token) %>%
    summarise(prop_ing = mean(is_ing))
## Source: local data frame [373 x 3]
## Groups: GramStatus [?]
##
##    GramStatus        Token prop_ing
##        <fctr>       <fctr>    <dbl>
## 1   adjective  aggravating      1.0
## 2   adjective      amazing      0.8
## 3   adjective       boring      1.0
## 4   adjective       caring      1.0
## 5   adjective     charming      1.0
## 6   adjective      closing      1.0
## 7   adjective compromising      0.0
## 8   adjective        dying      1.0
## 9   adjective    easygoing      1.0
## 10  adjective  embarassing      1.0
## # ... with 363 more rows

Notice this:

1. There is one row for each word in each grammatical class.
2. After running summarise(), the innermost grouping variable (Token) has been dropped as a grouping variable.

  ing %>%
    mutate(is_ing = DepVar == "Ing") %>%
    group_by(GramStatus, Token) %>%
    summarise(prop_ing = mean(is_ing)) %>%
    summarise(prop_ing = mean(prop_ing)) %>%
    arrange(prop_ing)
## # A tibble: 7 x 2
##    GramStatus  prop_ing
##        <fctr>     <dbl>
## 1      during 0.2222222
## 2 progressive 0.4607932
## 3  participle 0.4631897
## 4       thing 0.6685824
## 5      gerund 0.7523810
## 6   adjective 0.8531250
## 7        noun 0.9188492

There’s probably something weird about “going”.

  ing %>%
    filter(Token != "going") %>%
    mutate(is_ing = DepVar == "Ing") %>%
    group_by(GramStatus, Token) %>%
    summarise(prop_ing = mean(is_ing)) %>%
    summarise(prop_ing = mean(prop_ing)) %>%
    arrange(prop_ing)
## # A tibble: 7 x 2
##    GramStatus  prop_ing
##        <fctr>     <dbl>
## 1      during 0.2222222
## 2 progressive 0.4628213
## 3  participle 0.4652176
## 4       thing 0.6685824
## 5      gerund 0.7632850
## 6   adjective 0.8483871
## 7        noun 0.9188492
