*May 1, 2016*

Today I went to Hadley Wickham’s talk to the EdinbR users’ group, called Managing Many Models in R. It highlighted some workflows for cleanly fitting many models to subsets of a data set, then comparing across model fits. I thought I’d experiment by applying them to some linguistic data that I’m familiar with.

Very briefly, “TD Deletion”, as I’m calling it, is a process common to every variety of English whereby a word-final /t/ or /d/ in a consonant cluster is deleted. Some example words are:

Word | Deleted
---|---
mist | mis’
kept | kep’
missed | miss’

It’s more common in words where the /t d/ is just part of the word, like *mist*, than in words where it’s part of the past tense marker, like *missed*.
In the data, the *mist* case is labeled `mono` for “monomorpheme” and the *missed* case is labeled `past`.

The TD Deletion data I’m using is derived from the Buckeye Corpus, and is available in the `grammarOfVariationData` package.

```
library(devtools)
if (!require("grammarOfVariationData")) {
  install_github("jofrhwld/grammarOfVariationData")
  library(grammarOfVariationData)
}
```

The functions in these workflows come from what some call the “Hadleyverse”, but he’s not the primary author of two of these packages (`broom` and `magrittr`).
When I’m using functions from these packages, I’ll use the `package::function()` notation so it’s clear which functions come from which package, but in normal practice you probably won’t need to include the `package::` part yourself.

```
library(dplyr)
library(purrr)
library(broom)
library(tidyr)
library(magrittr)
library(ggplot2)
```

I’ll also be fitting some more standard mixed effects models for comparison’s sake.

```
library(lme4)
```

I’ll be looking at two factors that seem to have an influence on the rate of TD Deletion:

- Speech Rate
- Word Frequency

Specifically, I’m going to see if there are any meaningful individual differences between how individual speakers are affected by these factors.

This is a subset of columns from the full Buckeye TD deletion data.

```
buckeye %>%
  dplyr::select(Speaker, Word, seg, DictRate, Gram2, td) %>%
  head()
```

```
## Speaker Word seg DictRate Gram2 td
## 1 s01 lived d 6.036144 past 1
## 2 s01 and d 6.182479 and 0
## 3 s01 raised d 6.431375 past 1
## 4 s01 west t 4.227088 mono 0
## 5 s01 kind d 6.612163 mono 0
## 6 s01 received d 4.600222 past 1
```

`DictRate` corresponds to how many syllables per second there were in an 8 word window surrounding the target word, based on canonical pronunciations.
`td` is the outcome measure, and has a value of `1` if the /t d/ was pronounced and a value of `0` if it wasn’t.
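As a quick sanity check of this coding (a sketch, assuming the `buckeye` data frame is loaded as above), we can tabulate the mean retention rate by grammatical class; the claim earlier that deletion is more common in monomorphemes predicts a higher mean `td` for `past` than for `mono`:

```
library(dplyr)

# Mean /t d/ retention (td == 1) by grammatical class.
# `Gram2` and `td` are the columns shown in the head() output above.
buckeye %>%
  filter(Gram2 %in% c("past", "mono")) %>%
  group_by(Gram2) %>%
  summarise(retention = mean(td),
            n = n())
```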

`DictRate` is heavily skewed, with a long tail of extremely (implausibly) high syllable-per-second rates.
Things look more plausible if we exclude data where the target word was not right or left aligned with the window in which speech rate was estimated.
So, we’ll use this subset for the speech rate modeling.

```
buckeye %>%
  ggplot(aes(DictRate)) +
    geom_density()
```

```
buckeye %>%
  filter(PreWindow > 1 | PostWindow > 1) %>%
  ggplot(aes(DictRate)) +
    geom_density()
```

Step 1 is to create a data frame of data frames. First, let’s select the rows and columns of the full dataset we want to use.

```
rate_to_use <- buckeye %>%
  dplyr::filter(PreWindow > 1 | PostWindow > 1,
                Gram2 %in% c("past", "mono")) %>%
  dplyr::mutate(RateCenter = DictRate - median(DictRate)) %>%
  dplyr::select(Speaker, Word, RateCenter, Gram2, td)
```

I want to fit 1 model for each speaker for each grammatical class (past tense or monomorpheme), so I need to group this data frame by `Speaker` and `Gram2`.
Then, I’ll use `nest()` to create a new column that contains a data frame with the remaining columns.

```
rate_nest <- rate_to_use %>%
  dplyr::group_by(Speaker, Gram2) %>%
  tidyr::nest()
```

Here’s what the data frame of data frames looks like:

```
rate_nest
```

```
## Source: local data frame [76 x 3]
##
## Speaker Gram2 data
## (fctr) (fctr) (chr)
## 1 s01 past <tbl_df [58,3]>
## 2 s01 mono <tbl_df [171,3]>
## 3 s02 mono <tbl_df [218,3]>
## 4 s02 past <tbl_df [46,3]>
## 5 s03 mono <tbl_df [219,3]>
## 6 s03 past <tbl_df [56,3]>
## 7 s04 mono <tbl_df [240,3]>
## 8 s04 past <tbl_df [41,3]>
## 9 s05 mono <tbl_df [184,3]>
## 10 s05 past <tbl_df [38,3]>
## .. ... ... ...
```

Here’s what the first item in the `data` column looks like.

```
rate_nest$data[[1]]
```

```
## Source: local data frame [58 x 3]
##
## Word RateCenter td
## (fctr) (dbl) (int)
## 1 lived 0.8881393 1
## 2 raised 1.2833697 1
## 3 received -0.5477829 1
## 4 sustained 0.4536305 1
## 5 moved -1.1036793 1
## 6 involved 0.3295448 1
## 7 involved 3.3815070 1
## 8 looked -3.4224800 1
## 9 informed -1.5658356 1
## 10 watched 0.7513170 1
## .. ... ... ...
```

The next step is to write a little function that takes one of our small data frames as input, and will produce a model as an output.

```
fit_one_rate_mod <- function(data){
  mod <- glm(td ~ RateCenter, data = data, family = binomial)
  return(mod)
}
```

Here’s what happens when we pass just one of our smaller data frames to `fit_one_rate_mod`.

```
fit_one_rate_mod(rate_nest$data[[1]])
```

```
##
## Call: glm(formula = td ~ RateCenter, family = binomial, data = data)
##
## Coefficients:
## (Intercept) RateCenter
## 1.35381 0.02941
##
## Degrees of Freedom: 57 Total (i.e. Null); 56 Residual
## Null Deviance: 59.14
## Residual Deviance: 59.12 AIC: 63.12
```

But what we want to do is apply `fit_one_rate_mod` to *all* of the nested data frames.
For this, we can use the `purrr::map()` function.
It takes a list as its first argument, and a function as its second argument.
It then applies that function to every item in the list.
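As a toy illustration (not part of the modelling itself), here’s `purrr::map()` applied to a simple named list:

```
library(purrr)

# Apply length() to each element of a named list;
# the result is a list with the same names.
lens <- purrr::map(list(a = 1:3, b = 4:10), length)
lens$a  # 3
lens$b  # 7
```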

```
rate_nest <- rate_nest %>%
  dplyr::mutate(mod = purrr::map(data, fit_one_rate_mod))
rate_nest
```

```
## Source: local data frame [76 x 4]
##
## Speaker Gram2 data mod
## (fctr) (fctr) (chr) (chr)
## 1 s01 past <tbl_df [58,3]> <S3:glm, lm>
## 2 s01 mono <tbl_df [171,3]> <S3:glm, lm>
## 3 s02 mono <tbl_df [218,3]> <S3:glm, lm>
## 4 s02 past <tbl_df [46,3]> <S3:glm, lm>
## 5 s03 mono <tbl_df [219,3]> <S3:glm, lm>
## 6 s03 past <tbl_df [56,3]> <S3:glm, lm>
## 7 s04 mono <tbl_df [240,3]> <S3:glm, lm>
## 8 s04 past <tbl_df [41,3]> <S3:glm, lm>
## 9 s05 mono <tbl_df [184,3]> <S3:glm, lm>
## 10 s05 past <tbl_df [38,3]> <S3:glm, lm>
## .. ... ... ... ...
```

```
rate_nest$mod[[1]]
```

```
##
## Call: glm(formula = td ~ RateCenter, family = binomial, data = data)
##
## Coefficients:
## (Intercept) RateCenter
## 1.35381 0.02941
##
## Degrees of Freedom: 57 Total (i.e. Null); 56 Residual
## Null Deviance: 59.14
## Residual Deviance: 59.12 AIC: 63.12
```

In our nested data frame, we now have a column which contains a logistic regression model for each speaker for each grammatical class.
In order to do anything meaningful now, we need to extract the important information from each model, ideally storing it again in another data frame.
Here’s where `broom::glance()` and `broom::tidy()` come in.
`glance()` will produce a one line summary for the entire model, and `tidy()` will produce a summary of the model coefficients.

```
broom::glance(rate_nest$mod[[1]])
```

```
## null.deviance df.null logLik AIC BIC deviance df.residual
## 1 59.13862 57 -29.5594 63.11879 67.23968 59.11879 56
```

```
broom::tidy(rate_nest$mod[[1]])
```

```
## term estimate std.error statistic p.value
## 1 (Intercept) 1.35381379 0.3329521 4.0660918 0.0000478081
## 2 RateCenter 0.02941017 0.2086622 0.1409463 0.8879123804
```

What we’ll do is apply `broom::tidy()` to each model, again using `purrr::map()`, and extract how many data points each model was fit with.
Then, with `tidyr::unnest()`, we’ll pop this out into a free-standing data frame.

```
rate_nest <- rate_nest %>%
  dplyr::mutate(tidy = purrr::map(mod, broom::tidy),
                n = purrr::map(data, nrow) %>% simplify())
rate_nest
```

```
## Source: local data frame [76 x 6]
##
## Speaker Gram2 data mod tidy n
## (fctr) (fctr) (chr) (chr) (chr) (int)
## 1 s01 past <tbl_df [58,3]> <S3:glm, lm> <data.frame [2,5]> 58
## 2 s01 mono <tbl_df [171,3]> <S3:glm, lm> <data.frame [2,5]> 171
## 3 s02 mono <tbl_df [218,3]> <S3:glm, lm> <data.frame [2,5]> 218
## 4 s02 past <tbl_df [46,3]> <S3:glm, lm> <data.frame [2,5]> 46
## 5 s03 mono <tbl_df [219,3]> <S3:glm, lm> <data.frame [2,5]> 219
## 6 s03 past <tbl_df [56,3]> <S3:glm, lm> <data.frame [2,5]> 56
## 7 s04 mono <tbl_df [240,3]> <S3:glm, lm> <data.frame [2,5]> 240
## 8 s04 past <tbl_df [41,3]> <S3:glm, lm> <data.frame [2,5]> 41
## 9 s05 mono <tbl_df [184,3]> <S3:glm, lm> <data.frame [2,5]> 184
## 10 s05 past <tbl_df [38,3]> <S3:glm, lm> <data.frame [2,5]> 38
## .. ... ... ... ... ... ...
```

```
rate_coefs <- rate_nest %>%
  tidyr::unnest(tidy) %>%
  dplyr::select(-(std.error:p.value)) %>%
  tidyr::spread(term, estimate)
rate_coefs
```

```
## Source: local data frame [76 x 5]
##
## Speaker Gram2 n (Intercept) RateCenter
## (fctr) (fctr) (int) (dbl) (dbl)
## 1 s01 mono 171 0.3109309 -0.08511930
## 2 s01 past 58 1.3538138 0.02941017
## 3 s02 mono 218 0.2834635 -0.30801155
## 4 s02 past 46 0.3795177 -0.04756514
## 5 s03 mono 219 -0.3751245 -0.42651949
## 6 s03 past 56 0.1228607 -0.34137119
## 7 s04 mono 240 -0.7632547 -0.22761177
## 8 s04 past 41 0.8512747 -0.18682255
## 9 s05 mono 184 0.8195723 -0.14171049
## 10 s05 past 38 1.5298129 -0.64609481
## .. ... ... ... ... ...
```

Now it’s just a matter of plotting it out.

```
rate_coefs %>%
  ggplot(aes(`(Intercept)`, RateCenter, color = Gram2)) +
    geom_point(aes(size = n), alpha = 0.6) +
    scale_size_area() +
    coord_fixed()
```

It looks like the predicted effect, that as speech rate increases, /t d/ retention decreases, is held up: most of the points for each speaker are below 0 on the y axis. There is a fair bit of inter-speaker variation in the baseline rate of /t d/ retention (along the x axis), but much less in the rate of speech effect, and there seems to be very little relationship between the two. That is, most speakers likely react in a very similar way to an increasing rate of speech, regardless of whether they have a high or low baseline rate of /t d/ retention.
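We can put rough numbers on that impression (a quick sketch, using the `rate_coefs` data frame built above) by comparing the spread of the per-speaker intercepts to the spread of the per-speaker slopes within each grammatical class:

```
library(dplyr)

# Spread of per-speaker intercepts vs. speech rate slopes, and their
# correlation, within each class. The plot suggests sd_intercept will
# be larger than sd_rate, with a weak correlation between the two.
rate_coefs %>%
  group_by(Gram2) %>%
  summarise(sd_intercept = sd(`(Intercept)`),
            sd_rate = sd(RateCenter),
            cor_int_rate = cor(`(Intercept)`, RateCenter))
```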

Let’s look at how this would all shake out in a normal mixed-effects model:

```
rate_data <- rate_nest %>% unnest(data)
rate_mod <- glmer(td ~ RateCenter * Gram2 + (RateCenter | Speaker) + (1 | Word),
                  data = rate_data,
                  family = binomial)
summary(rate_mod)
```

```
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: td ~ RateCenter * Gram2 + (RateCenter | Speaker) + (1 | Word)
## Data: rate_data
##
## AIC BIC logLik deviance df.resid
## 12846.1 12904.8 -6415.0 12830.1 11359
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -7.4473 -0.6971 0.3012 0.6463 4.0799
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Word (Intercept) 1.044249 1.02189
## Speaker (Intercept) 0.255016 0.50499
## RateCenter 0.002359 0.04857 -0.03
## Number of obs: 11367, groups: Word, 1000; Speaker, 38
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.74630 0.10817 6.900 5.21e-12 ***
## RateCenter -0.17929 0.01742 -10.294 < 2e-16 ***
## Gram2past 0.67300 0.11483 5.861 4.61e-09 ***
## RateCenter:Gram2past 0.08665 0.03441 2.518 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) RtCntr Grm2ps
## RateCenter -0.010
## Gram2past -0.363 0.002
## RtCntr:Grm2 0.001 -0.356 0.002
```

This looks mostly like we’d expect given the speaker-level modelling. The fixed effect of speech rate is negative, meaning a higher rate of speech goes along with a higher rate of deletion. Looking at the random effects, we see a larger standard deviation for speaker-level intercepts than for speaker-level speech rate effects, just like in the figure above, and the two are very weakly correlated.

Let’s do the same thing for frequency effects. I’ll be using word frequency norms from SUBTLEX-US, specifically a centered version of the Zipf Scale.

I’m going to compress this down into as few lines as possible just to show off the power of the workflow.

```
subtlex <- read.delim("../data/subtlex/SUBTLEXus74286wordstextversion-3.txt")

# This is just transforming and centering the frequency data for modelling.
# It's important to center the frequencies over word types in this way.
zipf_scores <- subtlex %>%
  dplyr::mutate(Word = tolower(Word),
                zipf = log10(SUBTLWF)) %>%
  dplyr::select(Word, zipf) %>%
  dplyr::semi_join(buckeye %>%
                     mutate(Word = tolower(Word)) %>%
                     filter(Gram2 %in% c("past", "mono"))) %>%
  dplyr::mutate(zipf_center = zipf - median(zipf))
```

```
## Joining by: "Word"
```

```
# getting the subset of the data to use for analysis
freq_to_use <- buckeye %>%
  dplyr::filter(Gram2 %in% c("past", "mono")) %>%
  dplyr::mutate(Word = tolower(Word)) %>%
  dplyr::left_join(zipf_scores) %>%
  dplyr::select(Speaker, Word, zipf_center, Gram2, td)
```

```
## Joining by: "Word"
```

```
# The model fitting function
fit_one_freq_mod <- function(data){
  mod <- glm(td ~ zipf_center, data = data, family = binomial)
  return(mod)
}

# this block does all of the modelling and summaries of the models
freq_nest <- freq_to_use %>%
  dplyr::group_by(Speaker, Gram2) %>%
  tidyr::nest() %>%
  dplyr::mutate(mod = purrr::map(data, fit_one_freq_mod),
                tidy = purrr::map(mod, broom::tidy),
                n = purrr::map(data, nrow) %>% simplify())

# extracting the tidy summaries for plotting
freq_coefs <- freq_nest %>%
  tidyr::unnest(tidy) %>%
  dplyr::select(-(std.error:p.value)) %>%
  tidyr::spread(term, estimate)
```

```
freq_coefs %>%
  ggplot(aes(`(Intercept)`, zipf_center, color = Gram2)) +
    geom_point(aes(size = n), alpha = 0.6) +
    coord_fixed()
```

Unlike with rate of speech, it looks like there is quite a bit of inter-speaker variation in sensitivity to word frequency, *and* it’s related to speakers’ baseline rate of /t d/ retention.
Speakers with a higher baseline deletion rate have a *weaker* frequency effect.
You can kind of see this in the output of a standard mixed effects model, but it’s rolled up in the random-effects report that we usually don’t pay enough attention to.
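The coefficient data frame makes this relationship easy to check directly (a sketch, using the `freq_coefs` data frame built above); the claim predicts a negative correlation between the per-speaker intercepts and frequency slopes:

```
library(dplyr)

# Correlation between per-speaker baseline retention (intercept)
# and per-speaker frequency sensitivity, within each class.
freq_coefs %>%
  group_by(Gram2) %>%
  summarise(cor_int_freq = cor(`(Intercept)`, zipf_center))
```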

```
freq_data <- freq_nest %>% unnest(data)
freq_mod <- glmer(td ~ zipf_center * Gram2 + (zipf_center | Speaker) + (1 | Word),
                  data = freq_data,
                  family = binomial)
summary(freq_mod)
```

```
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: td ~ zipf_center * Gram2 + (zipf_center | Speaker) + (1 | Word)
## Data: freq_data
##
## AIC BIC logLik deviance df.resid
## 13746.6 13805.9 -6865.3 13730.6 12163
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.1754 -0.6912 0.3073 0.6549 3.9010
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Word (Intercept) 1.05171 1.0255
## Speaker (Intercept) 0.28874 0.5373
## zipf_center 0.07198 0.2683 -0.43
## Number of obs: 12171, groups: Word, 1007; Speaker, 38
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.88120 0.12105 7.279 3.35e-13 ***
## zipf_center -0.13821 0.09446 -1.463 0.143
## Gram2past 0.65529 0.12451 5.263 1.42e-07 ***
## zipf_center:Gram2past -0.07027 0.13884 -0.506 0.613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) zpf_cn Grm2ps
## zipf_center -0.480
## Gram2past -0.432 0.321
## zpf_cntr:G2 0.218 -0.530 -0.328
```

Up there in the random effects summary, you can see that the correlation between speaker-level intercepts and speaker-level frequency effects is -0.43, which means more or less the same thing as the figure above.

I think there’s a tendency when we have a nice big data set in hand to skip straight to fitting one big mixed-effects model to rule them all. But in rushing headlong into the one big model and focusing on the p-values on the fixed effects, we might be missing some important and interesting patterns in our data.

*posted by Joe at 17:56 (more or less).*