Let’s start by updating and loading the languageVariationAndChangeData package.

  library("devtools")
  install_github("jofrhwld/languageVariationAndChangeData")
  library("languageVariationAndChangeData")

Data Frames

So far we have discussed the following types of values in R:

And we’ve discussed the following data structures.

Here, we’ll cover two more data structures: data frames and factors.

Data Frames are the data structure we’ll be using the most in R. When you begin thinking about data frames, a useful starting place is to think of them as spreadsheets, with columns and rows (but we’ll eventually abandon spreadsheet thinking). Let’s start out by creating a very simple data frame using the data.frame() function.

  rat <- data.frame(turned = c("left", "right", "left", "right"),
                    cheese = c("yes", "yes", "no", "no"),
                    trials = c(210, 490, 90, 201))
  rat
##   turned cheese trials
## 1   left    yes    210
## 2  right    yes    490
## 3   left     no     90
## 4  right     no    201

Finding your way around

The rat data frame has four rows and three columns. The rows are just numbered 1 through 4, and the three columns are named turned, cheese and trials. To find out how many rows and columns a data frame has, you can use the nrow() and ncol() functions.

  nrow(rat)
## [1] 4
  ncol(rat)
## [1] 3
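
As a small extension of the example above, base R also has dim(), names() and str() for getting a quick sense of a data frame’s shape. The sketch below rebuilds rat so that it runs on its own; it is only an illustration, not a required step.

```r
# Rebuild the rat data frame so this example is self-contained
rat <- data.frame(turned = c("left", "right", "left", "right"),
                  cheese = c("yes", "yes", "no", "no"),
                  trials = c(210, 490, 90, 201))

# dim() returns the number of rows and columns in a single call
rat_dim <- dim(rat)
rat_dim
## [1] 4 3

# names() returns the column names
rat_names <- names(rat)
rat_names
## [1] "turned" "cheese" "trials"

# str() prints a compact description of each column and its type
str(rat)
```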

Most data frames you’re going to work with have a lot more rows than that. For example, ing is a data frame that is bundled in the languageVariationAndChangeData package.

  nrow(ing)
## [1] 1139

That’s too many rows to look at in the console. One option is to use the head() function, which prints just the first 6 rows.

  head(ing)
##        Token DepVar     Style  GramStatus Following.Seg Sex Age Ethnicity
## 1     eating    Ing   careful  participle         vowel   m   6     Irish
## 2 processing    Ing   tangent      gerund             0   m   6     Irish
## 3 processing    Ing   tangent      gerund             0   m   6     Irish
## 4     saying     In   tangent progressive         vowel   m   6     Irish
## 5     living    Ing   tangent      gerund       palatal   m   6     Irish
## 6    sitting     In narrative progressive         vowel   m   6     Irish
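
If six rows is not the number you want, head() takes an n argument, and tail() does the same job from the bottom of the data frame. Here is a minimal sketch using the small rat data frame (rebuilt here so the example runs without the package).

```r
# Rebuild the rat data frame so this example is self-contained
rat <- data.frame(turned = c("left", "right", "left", "right"),
                  cheese = c("yes", "yes", "no", "no"),
                  trials = c(210, 490, 90, 201))

# The n argument controls how many rows head() returns
first_two <- head(rat, n = 2)
first_two
##   turned cheese trials
## 1   left    yes    210
## 2  right    yes    490

# tail() works the same way, counting from the last row
last_two <- tail(rat, n = 2)
last_two
##   turned cheese trials
## 3   left     no     90
## 4  right     no    201
```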

Another option is to use the summary() function.

  summary(ing)
##        Token     DepVar          Style           GramStatus  Following.Seg
##  something: 92   In :577   careful  :467   adjective  : 68   0      :201  
##  going    : 67   Ing:562   narrative:324   during     :  9   apical :318  
##  doing    : 57             soapbox  :133   gerund     :113   labial :161  
##  saying   : 49             response : 89   noun       : 66   palatal: 42  
##  getting  : 37             tangent  : 88   participle :309   velar  : 37  
##  talking  : 32             group    : 23   progressive:464   vowel  :380  
##  (Other)  :805             (Other)  : 15   thing      :110                
##  Sex          Age          Ethnicity  
##  f:546   Min.   :2.000   Irish  :224  
##  m:593   1st Qu.:3.000   Italian:540  
##          Median :6.000   other  :279  
##          Mean   :4.684   polish : 96  
##          3rd Qu.:6.000                
##          Max.   :6.000                
## 

summary() is a function that works on almost every kind of object.

Indexing Data Frames

Since data frames are two-dimensional (rows are one dimension, columns are another), the way you index them is a little bit more complicated than with vectors. It still uses square brackets, but now the brackets have two positions:

df[row number, column number]

If you give a row number, but leave the column number blank, you’ll get back that row and all columns.

  rat[1,]
##   turned cheese trials
## 1   left    yes    210

Alternatively, if you specify just the column number, but leave the rows blank, you’ll get back all of the values for that column.

  rat[,2]
## [1] yes yes no  no 
## Levels: no yes

When you specify both, you get back the value in the specified row and column.

  rat[1,2]
## [1] yes
## Levels: no yes
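
The same two-position brackets also accept column names and vectors of row numbers. The sketch below is an illustration of that, using the rat data frame rebuilt so it runs on its own.

```r
# Rebuild the rat data frame so this example is self-contained
rat <- data.frame(turned = c("left", "right", "left", "right"),
                  cheese = c("yes", "yes", "no", "no"),
                  trials = c(210, 490, 90, 201))

# A column can be picked out by name instead of by position
one_value <- rat[2, "trials"]
one_value
## [1] 490

# A vector of row numbers returns several rows at once
two_rows <- rat[c(1, 3), ]
two_rows
##   turned cheese trials
## 1   left    yes    210
## 3   left     no     90
```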

However, there is a special indexing operator for data frames that takes advantage of their named columns: $.

df$column_name

  rat$cheese
## [1] yes yes no  no 
## Levels: no yes

After accessing the column of a data frame, you can index it just like it’s a vector.

  rat$cheese[1]
## [1] yes
## Levels: no yes
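
Because $ hands back an ordinary vector, you can pass a column straight to other functions, and assigning to a new name with $ adds a column. In the sketch below, the column name prop is made up for the example.

```r
# Rebuild the rat data frame so this example is self-contained
rat <- data.frame(turned = c("left", "right", "left", "right"),
                  cheese = c("yes", "yes", "no", "no"),
                  trials = c(210, 490, 90, 201))

# Total number of trials across all four rows
total_trials <- sum(rat$trials)
total_trials
## [1] 991

# Assigning to a new name with $ adds a column
# ("prop" is a hypothetical name chosen for this example)
rat$prop <- rat$trials / total_trials
ncol(rat)
## [1] 4
```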

If you really want to, you can do logical indexing of data frames like so:

  rat[rat$cheese == "yes", ]
##   turned cheese trials
## 1   left    yes    210
## 2  right    yes    490

But there’s also a function called subset() that you can use to do the same thing. subset() takes a data frame as its first argument, and then a logical statement referring to one or more of the data frame’s columns.

  subset(rat, cheese == "yes")
##   turned cheese trials
## 1   left    yes    210
## 2  right    yes    490
  subset(rat, cheese == "yes" & turned == "right")
##   turned cheese trials
## 2  right    yes    490
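
The logical statement can also be a numeric comparison, and subset() has a select argument for keeping only some of the columns. A minimal sketch, with rat rebuilt so it runs on its own:

```r
# Rebuild the rat data frame so this example is self-contained
rat <- data.frame(turned = c("left", "right", "left", "right"),
                  cheese = c("yes", "yes", "no", "no"),
                  trials = c(210, 490, 90, 201))

# Keep rows with more than 200 trials, and only two of the columns
busy <- subset(rat, trials > 200, select = c(turned, trials))
busy
##   turned trials
## 1   left    210
## 2  right    490
## 4  right    201
```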

Reading in data

R can easily read comma-separated (.csv) files and tab-delimited files into its memory.1 You can read them in with read.csv() and read.delim(), respectively. If you have

When loading a data file into R, you are just loading it into the R workspace. Any alterations or modifications you make to the data frame will not be reflected in the file in your system, just in the copy in the R workspace.
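
To see that the file and the workspace copy really are separate, here is a small sketch that writes rat to a temporary tab-delimited file, changes the copy in R, and then reads the file again. The temporary file stands in for a real data file; the variable names are made up for the example.

```r
# Rebuild the rat data frame so this example is self-contained
rat <- data.frame(turned = c("left", "right", "left", "right"),
                  cheese = c("yes", "yes", "no", "no"),
                  trials = c(210, 490, 90, 201))

# Write the data to a temporary tab-delimited file, then read it back in
rat_path <- tempfile(fileext = ".txt")
write.table(rat, rat_path, sep = "\t", row.names = FALSE, quote = FALSE)
rat_copy <- read.delim(rat_path)

# Change the copy in the workspace
rat_copy$trials[1] <- 0

# Reading the file again shows it still holds the original value
rat_reread <- read.delim(rat_path)
rat_reread$trials[1]
## [1] 210

# To keep the change, the data frame has to be written out explicitly
write.table(rat_copy, rat_path, sep = "\t", row.names = FALSE, quote = FALSE)
rat_saved <- read.delim(rat_path)
```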

The tricky thing now is that the way that is most natural or normal for you to organize and name your files and folders doesn’t necessarily translate into a good way for R (or any other programming language) to look at them. In order to load a file into R, you need to provide read.csv() or read.delim() with the “path” to the file, which is just a text string. For example, I have a data file saved in a folder called Fieldwork, which is in a folder called 2013, which is in a folder called Empirical Methods, which is in a folder called Courses, which in turn is in my home Documents directory. On my system, that looks like:

  svlr <- read.delim("~/Documents/Courses/Empirical Methods/2013/Fieldwork/LEL2B_full_data.txt")

If you’re not sure what it looks like on your system, use the file.choose() function.

  file.choose()

That’ll launch the default visual file browser for your system. After browsing around and clicking on a file, file.choose() will print the character string that represents the path to that file into the console.
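
If you would rather build a path by hand, file.path() joins folder names into a single string, and file.exists() lets you check the result before trying to read from it. The folder and file names below are hypothetical.

```r
# file.path() joins folder names into one path string
# (the folders and file name here are made up for the example)
data_path <- file.path("~", "Documents", "Courses", "my_data.txt")
data_path
## [1] "~/Documents/Courses/my_data.txt"

# file.exists() checks the path before you try to read from it
if (file.exists(data_path)) {
  my_data <- read.delim(data_path)
} else {
  message("No file found at ", data_path)
}
```

Since file.choose() returns the path as a character string, it should also be possible to pass it directly, as in read.delim(file.choose()), when working interactively.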

One pretty cool thing is that if a data file is up on a website somewhere, you can access it directly by passing the URL to read.csv() or read.delim().2 Here is some sample data on the Donner Party.3

  donner <- read.csv("http://jofrhwld.github.io/data/donner.csv")
  head(donner)
##     NAME GENDER AGE  FATE FAMILY NGENDER NFATE
## 1 Donner      F  45  died      y       0     0
## 2 Donner      F  45  died      y       0     0
## 3 Donner      F  14 lived      y       0     1
## 4 Donner      F  12 lived      y       0     1
## 5 Donner      F   7 lived      y       0     1
## 6 Donner      F   6 lived      y       0     1

Factors

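When it comes to working with data in R, factors are an unfortunate fact of life. In versions of R before 4.0.0, when you read a data frame into R, any columns which are coded with character values are converted into factors by default; from R 4.0.0 onward, they stay as character vectors unless you ask for factors.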
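
Whether character columns become factors is controlled by the stringsAsFactors argument, which data.frame(), read.csv() and read.delim() all accept. Setting it explicitly, as in this sketch, gives the same result whatever the default is in your version of R. The names rat_f and rat_c are made up for the example.

```r
# Ask explicitly for factors
rat_f <- data.frame(cheese = c("yes", "yes", "no", "no"),
                    stringsAsFactors = TRUE)
class(rat_f$cheese)
## [1] "factor"

# Ask explicitly for plain character strings
rat_c <- data.frame(cheese = c("yes", "yes", "no", "no"),
                    stringsAsFactors = FALSE)
class(rat_c$cheese)
## [1] "character"
```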

A good way to grasp the structure of factors is to build one from scratch. Let’s say part of our probability matching experiment with the rats was to see if different cheeses differentially affected the rat’s probability matching. We’ll start off by setting up a numerical code for each kind of cheese. 1 = blue cheese, 2 = cheddar, and 3 = parmesan.

  # more like cheese nom
  cheese_num <- sample(c(1,2,3), size = 10, replace = TRUE)
  cheese_num
##  [1] 2 1 2 3 1 1 1 1 2 2

This isn’t an optimal coding scheme for the cheeses for a few reasons. First, it’s hard to remember. It’d be better to label the kind of cheese directly, instead of having a numerical code. Second, the numerical sequence, (1, 2, 3) seems to imply a kind of order, or directionality to the cheeses. But they’re just three kinds of cheese with no particular order. We can make this better by creating a factor from cheese_num, and labelling the “levels” of the factor with the cheese names.

  cheese_factor <- factor(cheese_num, labels = c("blue", "cheddar", "parmesan"))
  cheese_factor
##  [1] cheddar  blue     cheddar  parmesan blue     blue     blue    
##  [8] blue     cheddar  cheddar 
## Levels: blue cheddar parmesan

The internal structure of the factor is something like this:

Values   Dictionary
2        1 = “blue”
1        2 = “cheddar”
2        3 = “parmesan”
3
1
1
1
1
2
2
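
You can inspect both parts of this structure directly. In the sketch below, cheese_num is typed in by hand to match the sample shown above (rather than drawn again with sample()), so the result is reproducible; as.integer() gives the stored values and levels() gives the dictionary.

```r
# The same codes as in the sample above, entered by hand
cheese_num <- c(2, 1, 2, 3, 1, 1, 1, 1, 2, 2)
cheese_factor <- factor(cheese_num, labels = c("blue", "cheddar", "parmesan"))

# The stored integer codes (the "Values" column)
codes <- as.integer(cheese_factor)
codes
##  [1] 2 1 2 3 1 1 1 1 2 2

# The labels (the "Dictionary" column)
levels(cheese_factor)
## [1] "blue"     "cheddar"  "parmesan"

# Looking the codes up in the dictionary gives back the labels
looked_up <- levels(cheese_factor)[codes]
```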

The way factors work is going to be most important when it comes to making plots and doing statistical analysis. For now, it’s important to know how to change the order of a factor’s levels. By default, the levels are ordered alphabetically when they’re read into R.

  levels(donner$FATE)
## [1] "died"  "lived"

To change the order of a factor’s levels, you can use the function relevel(), which takes the factor as its first argument, and the level that you want to promote to first as the second.

  donner$FATE2 <- relevel(donner$FATE, "lived")
  head(donner)
##     NAME GENDER AGE  FATE FAMILY NGENDER NFATE FATE2
## 1 Donner      F  45  died      y       0     0  died
## 2 Donner      F  45  died      y       0     0  died
## 3 Donner      F  14 lived      y       0     1 lived
## 4 Donner      F  12 lived      y       0     1 lived
## 5 Donner      F   7 lived      y       0     1 lived
## 6 Donner      F   6 lived      y       0     1 lived
  levels(donner$FATE)
## [1] "died"  "lived"
  levels(donner$FATE2)
## [1] "lived" "died"
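
relevel() only moves one level to the front. If you want to spell out the whole order, one option is to call factor() again with an explicit levels argument. The sketch below uses a short hand-made vector, fate, rather than the donner data, so that it runs without a network connection.

```r
# A small stand-in for donner$FATE
fate <- factor(c("died", "died", "lived", "lived"))
levels(fate)
## [1] "died"  "lived"

# relevel() promotes one level to first place
fate2 <- relevel(fate, "lived")

# factor() with an explicit levels argument sets the full order in one step
fate3 <- factor(fate, levels = c("lived", "died"))
levels(fate3)
## [1] "lived" "died"

# Only the order of the levels changes; the values themselves stay the same
same_values <- all(as.character(fate) == as.character(fate3))
same_values
## [1] TRUE
```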

  1. My personal aesthetic preference is for tab-delimited files.

  2. In older versions of R, this doesn’t work if the file is served over an encrypted connection, i.e. if the URL begins with https://.

  3. “The Donner Party (sometimes called the Donner-Reed Party) was a group of American pioneer migrants who set out for California in a wagon train. Delayed by a series of mishaps, they spent the winter of 1846–47 snowbound in the Sierra Nevadas. Some of the migrants resorted to cannibalism to survive, eating those who had succumbed to starvation and sickness.” https://en.wikipedia.org/wiki/Donner_Party

