Mixed Effects Models

Intro

Setup

library(tidyverse)
library(lsa2017)
library(broom)
library(lme4)
ah <- iy_ah %>% filter(plt_vclass == "ah", 
                       context %in% c("internal", "final"))

The Agenda for Today

  1. Introduce the justification for fitting LMEs
  2. Explore no-pooling models
  3. Fit some LMEs
  4. Discuss when & how to include random intercepts / slopes

The rationale behind LMEs

The original reasoning behind the move to linear mixed effects modelling involves some sophisticated statistics, which I’ll only briefly discuss. Let’s return to our ah data.

ah_tomod <- ah %>% 
              mutate(log2dur_c = log2(dur) - median(log2(dur)))
ah_tomod %>% 
  ggplot(aes(log2dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    stat_smooth(method = 'lm', color = 'red')+
    theme_bw()


You may not even be able to tell that the regression line has a confidence interval, because the interval is narrower than the line itself! That’s because the model has very low uncertainty about the slope:

summary(lm(F1_n ~ log2dur_c, data = ah_tomod))$coef
             Estimate  Std. Error   t value Pr(>|t|)
(Intercept) 1.0605806 0.004278806 247.86834        0
log2dur_c   0.3896566 0.006997420  55.68575        0

But in an important sense, this model is overly confident. It thinks it’s working with 20813 independent observations. As if 20813 different people walked up to a microphone and said ah. But that is exactly not what happened here. Instead, a smaller number of speakers (292) had many tokens of ah, and not always the same number of tokens each.
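We can see the consequence of ignoring that grouping with a quick simulation (toy numbers, not the ah data): the same mean estimated from 1,000 tokens looks far more certain than it does from the 10 speakers who actually produced them.

```r
# Toy simulation of the overconfidence problem: 10 speakers, each with
# their own baseline, contribute 100 tokens apiece.
set.seed(1)
n_sp <- 10
n_tok <- 100
speaker_effect <- rnorm(n_sp, sd = 0.5)

sim <- data.frame(
  speaker = rep(seq_len(n_sp), each = n_tok),
  y = rep(speaker_effect, each = n_tok) + rnorm(n_sp * n_tok, sd = 0.2)
)

# Complete pooling: the standard error is computed as if all 1000 tokens
# were independent.
se_pooled <- summary(lm(y ~ 1, data = sim))$coef[1, "Std. Error"]

# One value per speaker: only 10 independent units, so the standard error
# of the same quantity is much larger.
speaker_means <- tapply(sim$y, sim$speaker, mean)
se_by_speaker <- summary(lm(speaker_means ~ 1))$coef[1, "Std. Error"]

c(pooled = se_pooled, by_speaker = se_by_speaker)
```

The pooled standard error is tiny only because the model wrongly counts every token as a fresh, independent observation.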

To make the example more extreme, let’s imagine we had data from only two speakers, one who produced very many tokens, and one who produced relatively few.

ah_tomod %>%
  mutate(idstring = as.character(idstring))%>%
  group_by(idstring)%>%
  mutate(N = n())%>%
  filter(N > 10)%>%
  ungroup() %>%  
  filter(N == max(N) | N == min(N))%>%
  ggplot(aes(log2dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = 'lm')+
    facet_wrap(~idstring)+
    theme_bw()

The speaker with very little data has a fairly shallow slope for the relationship between duration and F1, while the speaker with a lot of data has a much steeper one. If you fit a single model to just these two speakers’ data, the estimated slope would be dominated by the speaker with a lot of data, with almost no weight given to the speaker with very little. That’s not ideal: we should give at least some weight to the fact that one speaker shows a weaker effect, even if they contributed less data.

An even more perverse situation could arise if there’s a strong correlation between which speaker produced the data and that speaker’s intercept. For example, in these figures it looks like the relationship between duration and F1 is positive, but it’s conceptually possible that the relationship within every individual is actually negative, if each speaker occupied their own region of the duration range.
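Here’s a small simulated illustration of that reversal (toy numbers, not the ah data): within every group the slope is negative, but groups with higher intercepts also sit further along x, so the pooled fit comes out positive.

```r
# Each group's y falls with x (slope -0.5), but higher-numbered groups
# have both higher intercepts (2 * g) and higher x values (mean = g).
set.seed(2)
sim <- do.call(rbind, lapply(1:5, function(g) {
  x <- rnorm(50, mean = g)                     # groups shifted along x
  data.frame(group = g, x = x,
             y = 2 * g - 0.5 * x + rnorm(50, sd = 0.3))
}))

# The pooled slope is positive...
coef(lm(y ~ x, data = sim))["x"]

# ...but every within-group slope is negative, close to -0.5.
sapply(split(sim, sim$group), function(d) coef(lm(y ~ x, data = d))["x"])
```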

Thinking about Pooling

One way to start thinking about mixed effects models is to think about how you are “pooling” the data from your data sources. In this example, we’re talking about speakers, but this could also be study sites, lexical items, etc.

Complete Pooling

What people have been doing for a long time is “complete pooling.” All of the data from all of the sources all just goes into one big data bucket. You then fit a model with all of this data in the bucket, undifferentiated by who contributed what data point.

For the reasons we’ve discussed already, this isn’t a great approach.

  1. It violates some statistical assumptions.
  2. The parameter estimates are over-confident.
  3. The parameter estimates are too skewed towards speakers/words with more data.
  4. It is possible for the structure of the data to give you exactly the wrong estimate.

No Pooling

One way you could try to resolve some of these problems would be to fit one model for every speaker. Here, you don’t pool the data at all. Each speaker just has their own bucket of data.

We’ll actually do this in a second, and then discuss the pros and cons.

Partial Pooling

The final option is to do “partial pooling”. You model all of the data together, but preserve information about which speaker contributed which data points. This approach allows the “models” for each speaker to mutually inform each other.

This is a “mixed effects model”, which we’ll also look at.

Fitting no-pooling models

In this section, we’re going to fit some no-pooling models. Being able to do this is not necessary for fitting mixed effects models, but I think it can help you understand what mixed effects models are doing.

While the no-pooling approach to modelling isn’t the most sophisticated (and you definitely can’t use its p-values), I think it is often a good idea to try it out as an exploratory technique (http://jofrhwld.github.io/blog/2016/05/01/many_models.html). You can check out a talk by Hadley Wickham about it at https://www.youtube.com/watch?v=rz3_FDVt9eg.

It is pretty easy to fit one model for one speaker. We just take a subset of the data and fit the model:

speakers <- unique(ah_tomod$idstring)
sp1 <- ah_tomod %>%
          filter(idstring == speakers[1])
sp1_mod <- lm(F1_n ~ log2dur_c, data = sp1)
summary(sp1_mod)

Call:
lm(formula = F1_n ~ log2dur_c, data = sp1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.66684 -0.37403  0.02055  0.47308  1.74784 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.33411    0.05620  23.739  < 2e-16 ***
log2dur_c    0.48085    0.09345   5.145 7.97e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6659 on 155 degrees of freedom
Multiple R-squared:  0.1459,    Adjusted R-squared:  0.1404 
F-statistic: 26.48 on 1 and 155 DF,  p-value: 7.969e-07

There is even a handy dandy function in the broom package called tidy() that will convert a linear model object (and many others) into a data frame.

tidy(sp1_mod)

But there are 292 speakers in this data set. We can’t go repeating this procedure that many times. Defining the problem this way should start giving you the sense that we should be taking a Split-Apply-Combine approach!
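The brute-force version of split-apply-combine is an explicit loop. A sketch with simulated stand-in data (three hypothetical speakers, not the real ah set):

```r
# Simulated stand-in for ah_tomod: three speakers, 40 tokens each.
set.seed(3)
sim <- data.frame(
  idstring  = rep(c("sp01", "sp02", "sp03"), each = 40),
  log2dur_c = rnorm(120),
  F1_n      = rnorm(120)
)

# Split, apply, and collect by hand with a for-loop.
models <- list()
for (sp in unique(sim$idstring)) {
  sp_dat <- sim[sim$idstring == sp, ]          # split
  models[[sp]] <- lm(F1_n ~ log2dur_c, data = sp_dat)  # apply
}
length(models)   # one fitted model per speaker: 3
```

This works, but it leaves you with a loose list of models. The nest()/map() workflow below keeps the data, the models, and the speaker ids together in one data frame.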

Step 1 - Write a function that will fit the model when given a data frame.

We have unfortunately not been able to cover how to write functions in R, but here’s how this one will work:

fit_ah_mod <- function(df){
  lm(F1_n ~ log2dur_c, data = df)
}

What function() does is create a new function. It says “When passed some kind of data (let’s call it df), I’m going to try to use df to fit a linear model.” Here’s how it works in action:

fit_ah_mod(sp1)

Call:
lm(formula = F1_n ~ log2dur_c, data = df)

Coefficients:
(Intercept)    log2dur_c  
     1.3341       0.4809  

~2 minute break

Go ahead and create the fit_ah_mod function, and run it on the subset for one speaker. All the code to do that again is:

fit_ah_mod <- function(df){
  lm(F1_n ~ log2dur_c, data = df)
}

For fun here, choose some other number between 1 and 292.

speakers <- unique(ah_tomod$idstring)
sp_dat <- ah_tomod %>% filter(idstring == speakers[1])
fit_ah_mod(sp_dat)

Step 2 - Split up the data by speaker

fit_ah_mod() is now the function we want to apply to subsets of the data, but now we need to figure out how to split it up according to speaker. To start approaching that problem, we’re going to use the nest() function.

~2 minute break

Run this code:

ah_nested <- ah_tomod %>%
              group_by(idstring) %>%
              nest()
ah_nested

What happened?

This is a little mind bending, but in ah_nested, there is now one row per speaker (because we told it to group_by() speaker). Each cell in the idstring column has the speaker’s id. Each cell in the data column is an entire data frame of its own.

ah_nested$data[[1]]

And, we can apply our fit_ah_mod() function to each of these data frames:

fit_ah_mod(ah_nested$data[[10]])

Call:
lm(formula = F1_n ~ log2dur_c, data = df)

Coefficients:
(Intercept)    log2dur_c  
     0.8434       0.0897  
fit_ah_mod(ah_nested$data[[210]])

Call:
lm(formula = F1_n ~ log2dur_c, data = df)

Coefficients:
(Intercept)    log2dur_c  
     1.3421       0.6129  

Step 3 - Apply the function to the data

But to do this in a clear and consistent way, we need to use map().

~2 minute break

Run the following code:

ah_nested <- ah_nested %>%
                mutate(model = map(data, fit_ah_mod))
ah_nested

What happened?

In the model column, which we created with mutate(), there is now a linear model for each speaker.

ah_nested$model[[1]]

Call:
lm(formula = F1_n ~ log2dur_c, data = df)

Coefficients:
(Intercept)    log2dur_c  
     1.3341       0.4809  

Step 4 - Combining & Tidying

We now need to convert these linear models into a more useful data structure for plotting and analysis, i.e. a data frame. We can turn individual models into data frames with tidy().

tidy(ah_nested$model[[1]])

We can apply tidy() to each model in the model column with map() again.

~2 minute break

Run this code:

ah_nested <- ah_nested %>%
                mutate(params = map(model, tidy))
ah_nested

What happened?

We now have the model estimates in data frames in the params column. Getting them into a fully usable format is now as easy as unnest().

ah_params <- ah_nested %>%
                unnest(params)
ah_params

I want to make a plot where every speaker’s intercept value is plotted against their slope value, so we need to use just the idstring, term and estimate columns, and use spread().

ah_params %>%
  select(idstring, term, estimate)%>%
  spread(term, estimate)%>%
  ggplot(aes(`(Intercept)`, log2dur_c))+
    geom_hline(yintercept = 0, linetype = 2)+  
    geom_point()+
    theme_minimal()+
    coord_fixed()

Here’s another version with the points scaled by size.

speaker_n <- ah_tomod %>%
                group_by(idstring)%>%
                tally()
ah_nested %>%
  unnest(params) %>%
  select(idstring, term, estimate) %>%
  spread(term, estimate) %>%
  left_join(speaker_n)%>%
  ggplot(aes(`(Intercept)`, log2dur_c))+
    geom_hline(yintercept = 0, linetype = 2)+  
    geom_point(aes(size = n), alpha = 0.5)+
    scale_size_area()+
    theme_minimal()+
    coord_fixed()
Joining, by = "idstring"

This no-pooling approach is very informative, but it also has some serious weaknesses. First of all, we can see that most speakers have the expected slope (on the y-axis) between duration and F1. That is, the slope is positive, meaning as duration increases, F1 increases (lowering the vowel). However, there is a lot of inter-speaker variability. Some speakers are estimated to have a fairly steep slope, and others have slopes closer to 0 or even negative.

Here’s a plot of the actual lines for each speaker.

ah_nested %>%
  unnest(params) %>%
  select(idstring, term, estimate) %>%
  spread(term, estimate) %>%
  ggplot()+
    geom_abline(aes(intercept = `(Intercept)`, 
                    slope = log2dur_c),
                alpha = 0.1)+
    xlim(-1.2, 2.85)+
    xlab("log2dur_c")+
    ylim(-2,4.3)+
    ylab("F1_n")+
    theme_minimal()


Discussion

What are the upsides and the downsides to the no-pooling approach?

Partial Pooling (Mixed Effects)

The biggest downside to the no pooling approach is that there is no shared information between the models. It’s possible that speakers who had a negative slope were only estimated to have a negative slope because they had very little data. If there were some way to incorporate information about the slopes of every other speaker into these speakers’ slope estimates, they might be considerably different.

This is exactly what Mixed Effects Models do. You define “grouping variables” or “pooling variables” or “random effects” in the model to tell the model to estimate effects for each speaker, but not to do so completely independently for each speaker.
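The intuition is shrinkage: each speaker’s estimate gets pulled toward the overall average, and speakers with less data get pulled harder. Here’s a back-of-the-envelope sketch of that idea (this is not lmer’s actual estimator; k is a made-up tuning constant standing in for what lmer derives from the variance components):

```r
# Shrinkage by hand: a speaker's no-pooling mean is pulled toward the
# grand mean, with a weight that grows with that speaker's token count.
shrink <- function(speaker_mean, n_tokens, grand_mean, k = 20) {
  w <- n_tokens / (n_tokens + k)   # more data -> weight closer to 1
  w * speaker_mean + (1 - w) * grand_mean
}

shrink(speaker_mean = 2.0, n_tokens = 200, grand_mean = 1.0)  # about 1.91
shrink(speaker_mean = 2.0, n_tokens = 5,   grand_mean = 1.0)  # 1.2
```

A speaker with 200 tokens keeps an estimate near their own mean; a speaker with 5 tokens is pulled most of the way back to the grand mean.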

~2 Minute Activity

Fit a complete pooling model with lm and a partial pooling model with lmer with the following code.

ah_lm <- lm(F1_n ~ log2dur_c, data = ah_tomod)
ah_lme <- lmer(F1_n ~ log2dur_c + (1 + log2dur_c | idstring), data = ah_tomod)

The big difference between the complete pooling and the partial pooling models is the definition of the “random effects” structure.

(1 + log2dur_c | idstring)

  • Everything to the left of the |
    • Everything that can vary within the grouping variable.
    • 1 refers to the intercept
  • Everything to the right of the |
    • The grouping variable.

With (1 + log2dur_c | idstring), we’re saying

Speakers can have variable intercepts, and variable slopes.
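For reference, here are some common variants of the random-effects term. These are just formula objects as base R sees them; each one could be passed to lmer():

```r
# Random intercept only: speakers vary in baseline, but share one slope.
f1 <- F1_n ~ log2dur_c + (1 | idstring)

# Random intercept and random slope, with a correlation between them
# (the structure we fit above).
f2 <- F1_n ~ log2dur_c + (1 + log2dur_c | idstring)

# Same, but the double-bar syntax forces the intercept-slope
# correlation to zero.
f3 <- F1_n ~ log2dur_c + (1 + log2dur_c || idstring)

# Two grouping factors at once.
f4 <- F1_n ~ log2dur_c + (1 + log2dur_c | idstring) + (1 | word)
```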

Let’s look at the summary of the mixed effects model.

summary(ah_lme)
Linear mixed model fit by REML ['lmerMod']
Formula: F1_n ~ log2dur_c + (log2dur_c | idstring)
   Data: ah_tomod

REML criterion at convergence: 37151.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.7512 -0.5717  0.0248  0.6012  5.6456 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 idstring (Intercept) 0.04204  0.2050       
          log2dur_c   0.01975  0.1405   0.37
 Residual             0.33511  0.5789       
Number of obs: 20813, groups:  idstring, 292

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.04005    0.01307   79.58
log2dur_c    0.40885    0.01141   35.84

Correlation of Fixed Effects:
          (Intr)
log2dur_c 0.249 

The “Random Effects”

By default, the summary doesn’t print out all of the “random effects”, because in this case there are two for each of the 292 speakers (intercept and slope). But it does give us estimates of the variance of these random effects, and their correlation. Here, it looks like speakers are more variable with respect to their baseline F1 than they are with respect to the effect duration has on their F1.
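If you want these variance components in a data frame of their own (for a table or a plot), lme4 provides an as.data.frame() method for VarCorr(). A sketch, assuming the ah_lme model from above is in your workspace:

```r
# Not self-contained: depends on the ah_lme model fit earlier with lme4.
as.data.frame(VarCorr(ah_lme))
# grp        = the grouping factor (idstring, or Residual)
# var1, var2 = which random effect(s) the row describes
# vcov       = the variance (or, when var2 is filled in, the covariance)
# sdcor      = the same quantity as a standard deviation (or correlation)
```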

Fixed Effects

These are the estimated “fixed effects”. They can be interpreted a lot like the coefficients of the models we’ve been fitting already, in the sense that they are part of the formula for drawing a line. Another way is to interpret them as the average intercept and slope across speakers.
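For example, we can plug in the fixed effects printed above. Remember that log2dur_c is log2 duration centered on the median, so a one-unit increase means a doubling of duration:

```r
# Fixed effects copied from the summary above.
b0 <- 1.04005   # (Intercept): predicted F1_n at the median duration
b1 <- 0.40885   # log2dur_c: change in F1_n per doubling of duration

b0 + b1 * 0   # token at the median duration: 1.04005
b0 + b1 * 1   # token twice the median duration: 1.4489
```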

Looking at the random effects

It is possible to extract the random effects using the function ranef().

head(ranef(ah_lme)$idstring)

These are the estimated differences in intercept and slope for each speaker, relative to the fixed effects. That is, if we add the fixed-effect (Intercept) and log2dur_c values to these random effects, we get the estimated intercept and slope for each speaker. Let’s do that:

ah_lme_fixef <- fixef(ah_lme)
ranef(ah_lme)$idstring %>%
  ggplot(aes(`(Intercept)` + ah_lme_fixef[1], 
             log2dur_c + ah_lme_fixef[2]))+
    geom_point()+
    geom_hline(yintercept = 0, linetype = 2)+
    theme_minimal()+
    coord_fixed()

ah_lme_fixef <- fixef(ah_lme)
ranef(ah_lme)$idstring %>%
  ggplot()+
       geom_abline(aes(intercept = `(Intercept)` + ah_lme_fixef[1], 
                    slope = log2dur_c + ah_lme_fixef[2]),
                alpha = 0.1)+
    xlim(-1.2, 2.85)+
    xlab("log2dur_c")+
    ylim(-2,4.3)+
    ylab("F1_n")+
    theme_minimal()

We can get an idea of what the mixed-effects modelling has done by comparing these results to the no-pooling models. First, the clustering of individual speakers’ random effects is tighter. Also, no speaker is estimated to have an individual slope less than 0.

It is also possible to see the positive correlation between the random effects. Speakers whose ah is lower (higher F1) at baseline appear to be more affected by changes in duration.

Multiple random effects

It is also possible to include multiple grouping factors in a mixed effects model, although this is where it gets impossible to make illustrative graphs.

~5 minute break

We’re going to add word as a grouping factor. For now, we’ll treat word as only being able to modulate the baseline F1_n, which means we’ll add it with (1 | word) (a.k.a. a random intercept).

Go ahead and fit this model:

ah_lme_multi <- lmer(F1_n ~ log2dur_c + 
                       (1 + log2dur_c | idstring) + 
                       (1 | word), data = ah_tomod)

What happened?

However, it is possible to include a random slope by word too, so let’s do that as well:

ah_lme_multi2 <- lmer(F1_n ~ log2dur_c + 
                       (1 + log2dur_c | idstring) + 
                       (1 + log2dur_c | word), data = ah_tomod)
summary(ah_lme_multi2)
Linear mixed model fit by REML ['lmerMod']
Formula: 
F1_n ~ log2dur_c + (1 + log2dur_c | idstring) + (1 + log2dur_c |  
    word)
   Data: ah_tomod

REML criterion at convergence: 34226.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.8651 -0.5523  0.0174  0.5813  6.5283 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 word     (Intercept) 0.07667  0.2769        
          log2dur_c   0.01929  0.1389   -0.40
 idstring (Intercept) 0.03487  0.1867        
          log2dur_c   0.01588  0.1260   0.50 
 Residual             0.27992  0.5291        
Number of obs: 20807, groups:  word, 939; idstring, 292

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.14155    0.01810   63.05
log2dur_c    0.28555    0.01689   16.90

Correlation of Fixed Effects:
          (Intr)
log2dur_c -0.078

One thing you might notice now is that the “fixed effect” of duration has gotten a bunch smaller.

coef(ah_lm)
(Intercept)   log2dur_c 
  1.0605806   0.3896566 
fixef(ah_lme_multi2)
(Intercept)   log2dur_c 
  1.1415491   0.2855541 

This may look scary, especially when an effect size gradually vanishes, but this is exactly how mixed effects models are supposed to work. In some sense, in the complete-pooling model, we thought we were modelling the effect of duration, but some of that effect was really due to the peculiarities of which words were spoken and who the speakers were.

When to add random intercepts/slopes?

Barr et al (2012) recommend (command?) fitting models with “maximal” random effects structures. That is, for every grouping factor, include a random slope for every predictor that’s logically possible. I think this is a good starting point, but it isn’t always practical.

Logically Possible & Impossible Random Slopes

In the ah model we just fit, it was possible for us to fit a random slope of duration by speaker and word, because speakers produced tokens with different durations, and words were produced with different durations.

However, if we wanted to include context again (word internal or word final), we could include a random slope of context by speaker, but not by word. Nothing in the formula syntax prohibits it, but it isn’t logically possible, and it might blow up your model. That’s because every word has only one vowel context: NAH only ever has a word-final vowel context, and there’s no way for it to be affected by being word internal.

Similarly, if we wanted to include year of birth in the model, we could include a random slope of dob by word, because many words were spoken by people across many ages. But we couldn’t include a random slope of dob by speaker, because each speaker only has one date of birth.
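A quick way to check whether a slope is logically possible is to ask whether the predictor actually varies within each level of the grouping factor. A sketch with toy data (hypothetical words and speakers, not the real data set):

```r
# Toy data: each word occurs in only one context, but each speaker
# produces both contexts.
toy <- data.frame(
  word    = c("nah", "nah", "father", "father"),
  speaker = c("s1",  "s2",  "s1",     "s2"),
  context = c("final", "final", "internal", "internal")
)

# Distinct context values within each word: all 1, so a by-word
# context slope is not estimable.
tapply(toy$context, toy$word, function(x) length(unique(x)))

# Within each speaker: 2 apiece, so a by-speaker context slope is fine.
tapply(toy$context, toy$speaker, function(x) length(unique(x)))
```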

Necessary Random Effects

Any time you are looking at a between-grouping factor variable, you should include the grouping factor with a random intercept. This is especially true for any kind of research done outside of an experimental lab context.

  • age: (1 | speaker)
  • gender: (1 | speaker)
  • word frequency: (1 | word)
  • sentence grammaticality: (1 | token)

Other necessary random effects would be related to the nature of the data. In this ah model, we haven’t explored any between-grouping factors, but all of these data points were produced by speakers, embedded in specific words, so it’s important to capture that in the random effects.

Pitfalls in mixed effects modelling

The biggest issue people face when fitting a mixed effects model is that it may “fail to converge”. This is because these models aren’t ‘solved’ analytically the way ordinary linear regression is; they are fit by iterative optimization, which can fail to find a stable answer. It is very important to base your inference only on converged models. There are a handful of tweaks that might help your models converge.

  • center and scale all continuous predictors.
  • if possible, choose a different factor level that has more data as your reference level
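The first tweak is easy to do by hand (or with scale()). A sketch with a toy duration vector (made-up numbers):

```r
# Center and scale a continuous predictor: subtract the mean, divide by
# the standard deviation. scale() does the same thing.
dur <- c(50, 80, 120, 200, 310)       # toy durations
log2dur <- log2(dur)
log2dur_z <- (log2dur - mean(log2dur)) / sd(log2dur)

mean(log2dur_z)   # 0 (up to floating point)
sd(log2dur_z)     # 1
```

This changes nothing about the model conceptually, but it keeps all the predictors on comparable scales, which makes the optimizer’s job much easier.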

The next step is to start taking out some of the recommended random slopes, beginning with the ones most likely to be problematic. For example, in the ah model, the random slope of duration by word is likely to be problematic, because many words have only one data point, making it very hard to estimate a random slope for every word.

Mixed Effects Models Controversy?

One of the benefits of mixed-effects models is their ability to ameliorate unmodelled properties of the grouping factors, although that ability is limited.

---
title: "Mixed Effects Models"
output: 
  html_notebook: 
    code_folding: none
    css: ../custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
---



# Intro

<div class = "box break">
<span class = "big-label">Setup</span>

```{r}
library(tidyverse)
library(lsa2017)
library(broom)
library(lme4)


ah <- iy_ah %>% filter(plt_vclass == "ah", 
                       context %in% c("internal", "final"))
```


</div>

## The Agenda For today

1. Introduce the justification for fitting LMEs
2. Explore no-pooling models
3. Fit some LMEs
4. Discuss when & how to include random intercepts / slopes

# The rationale behind LMEs

The original reasoning behind moving to linear mixed effects modelling has a lot to do with sophisticated statistical reasoning which I'll just briefly discuss. Let's return to our `ah` data.

```{r}

ah_tomod <- ah %>% 
              mutate(log2dur_c = log2(dur) - median(log2(dur)))

ah_tomod %>% 
  ggplot(aes(log2dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    stat_smooth(method = 'lm', color = 'red')+
    theme_bw()
    
```

You may not even be able to tell that the regression line has has a confidence interval because it is narrower than the width of the line! That's because the model has low uncertainty about the slope:

```{r}
summary(lm(F1_n ~ log2dur_c, data = ah_tomod))$coef
```

But in an important sense, this model is *overly* confident. It thinks it's working with `r nrow(ah_tomod)` *independent* observations. As if `r nrow(ah_tomod)` different people walked up to a microphone and said `ah`. But that is exactly not what happened here. Instead, a smaller number of speakers (`r length(unique(ah$idstring))`) had many tokens of `ah`, and not always the same number of tokens each.

To make the example more extreme, let's imagine we had data from only two speakers, one who produced very many tokens, and one who produced relatively few.

```{r}
ah_tomod %>%
  mutate(idstring = as.character(idstring))%>%
  group_by(idstring)%>%
  mutate(N = n())%>%
  filter(N > 10)%>%
  ungroup() %>%  
  filter(N == max(N) | N == min(N))%>%
  ggplot(aes(log2dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = 'lm')+
    facet_wrap(~idstring)+
    theme_bw()
```


The speaker with very little data has a pretty shallow slope in the relationship between duration and F1, and the speaker with a lot of data has a much steeper slope. However, if you fit a model with just these two speakers' data in the model, the estimated slope would be be dominated by the speaker with a lot of data, and almost none given to the speaker with very little. However, that's not ideal, since with should give a bit more weight to the fact that at least one speaker has a weaker effect, even if they have less data.


An even more perverse situation could arise if there's a strong correlation between a the speaker and their intercept. For example, in these figures it looks like the relationship between duration and F1 is *positive*, but it's conceptually possible that the relationship within every individual is actually *negative* if they were arranged like this figure:


```{r echo = F}
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y, 
            group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

confounded_dur <- confounded_data_frame(x = ah_tomod$log2dur_c,
                                        y = ah_tomod$F1_n,
                                        m = -0.5,
                                        num_grp = 10)

confounded_dur %>%
  ggplot(aes(x, y, color = group))+
    geom_point(alpha = 0.05)+
    stat_smooth(method = 'lm')+
    xlab("log2dur_c")+
    ylab("F1_n")+
    theme_minimal()

```


# Thinking about Pooling

One way to start thinking about mixed effects models is to think about how you are "pooling" the data from your data sources. In this example, we're talking about speakers, but this could also be study sites, lexical items, etc.


## Complete Pooling
What people have been doing for a long time is "complete pooling." All of the data from all of the sources all just goes into one big data bucket. You then fit a model with all of this data in the bucket, undifferentiated by who contributed what data point.

![](figures/complete_pooling.svg)

For the reasons we've discussed already, this isn't a great approach.

1. It violates some statistical assumptions.
2. The parameter estimates are over-confident.
3. The parameter estimates are too skewed towards speakers/words with more data.
4. It is possible for the structure of the data to give you exactly the wrong estimate.


## No Pooling

One way you could try to resolve some of these problems would be to fit one model for every speaker. Here, you don't pool the data at all. Each speaker just has their own bucket of data.

![](figures/no_pooling.svg)

We'll actually do this in a second, and then discuss the pros and cons.


## Partial Pooling

The final option is to do "partial pooling". You model with all of the data, but preserve information about which speaker contributed which data. This approach allows the "model" for each speaker to mutually inform eachother. 


![](figures/partial_pooling.svg)

This is a "mixed effects model", which we'll also look at.

# Fitting no-pooling models

In this section, we're going to fit some no-pooling models. It is not necessary to be able to do this to be able to fit mixed effects models, but I think this can help you understand what mixed effecs models are doing.

While the no-pooling approach to modelling isn't the most sophisticated (and you definitely can't use their p-values), I think it is often [a good idea to try it out as an exploratory technique](http://jofrhwld.github.io/blog/2016/05/01/many_models.html).  You can check out [a talk by Hadley Wickham about it here](https://www.youtube.com/watch?v=rz3_FDVt9eg). 


It is pretty easy to fit one model for one speaker, we just take a subset of the data and fit the model:

```{r}
speakers <- unique(ah_tomod$idstring)

sp1 <- ah_tomod %>%
          filter(idstring == speakers[1])
sp1_mod <- lm(F1_n ~ log2dur_c, data = sp1)
summary(sp1_mod)
```

There is even a handy dandy function in the `broom` package called `tidy()` that will convert a linear model object (and many others) into a data frame.

```{r}
tidy(sp1_mod)
```

But there are `r length(unique(ah_tomod$idstring))` speakers in this data set.
We can't go repeating this procedure that many times. Defining the problem this way should start giving you the sense that we should be taking a Split-Apply-Combine approach!

## Step 1 - Write a function that will fit the model when given a data frame.

We have unfortunately not been able to cover how to write functions in R, but here's how this one will work:


```{r}
fit_ah_mod <- function(df){
  lm(F1_n ~ log2dur_c, data = df)
}
```

What `function()` does is create a new function. It says "When passed some kind of data (let's call it `df`), I'm going to try to use `df` to fit a linear model." Here's how it works in action:

```{r}
fit_ah_mod(sp1)
```

<div class = "break box">
<span class = 'big-label'>~2 minute break</span>

Go ahead and create the `fit_ah_mod` function, and run it on the subset for one speaker. All the code to do that again is:

```{r}
fit_ah_mod <- function(df){
  lm(F1_n ~ log2dur_c, data = df)
}
```

For fun here, choose some other number between 1 and `r length(unique(ah_tomod$idstring))`.

```{r}
speakers <- unique(ah_tomod$idstring)
sp_dat <- ah_tomod %>% filter(idstring == speakers[1])
```

```{r}
fit_ah_mod(sp_dat)
```

</div>


## Step 2 - Split up the data by speaker

`fit_ah_mod()` is now the function we want to *apply* to subsets of the data, but now we need to figure out how to *split* it up according to speaker. To start approaching that problem, we're going to use the `nest()` function.

<div class = "box break">
<span class = 'big-label'>~2 minute break</span>

Run this code:
```{r}
ah_nested <- ah_tomod %>%
              group_by(idstring) %>%
              nest()
ah_nested
```

What happened?

</div>


This is a little mind bending, but in `ah_nested`, there is now one row per speaker (because we told it to `group_by()` speaker). Each cell in the `idstring` column has the speaker's id. Each cell in the `data` column is a whole nother data frame.

```{r}
ah_nested$data[[1]]
```

And, we can apply our `fit_ah_mod()` function to each of these data frames:

```{r}
fit_ah_mod(ah_nested$data[[10]])
```
```{r}
fit_ah_mod(ah_nested$data[[210]])
```

## Step 3 - Apply the function to the data

But to do this in a clear and consistent way, we need to use `map()`.

<div class = 'box break'>
<span class = 'big-label'>~2 minute break</span>
Run the following codeL

```{r}
ah_nested <- ah_nested %>%
                mutate(model = map(data, fit_ah_mod))
ah_nested
```

What happened?

</div>

In the `model` column, which we created with `mutate()` there is now a linear model for each speaker.

```{r}
ah_nested$model[[1]]
```

## Step 4 - Combining & Tidying

We now need to convert these linear models into a more useful data structure for plotting and analysis, i.e. a data frame. We can turn individual models into data frames with `tidy()`.

```{r}
tidy(ah_nested$model[[1]])
```

We can apply `tidy()` to each model in the `model` column with `map()` again.

<div class = "break box">
<span class = 'big-label'>~2 minute break</span>
Run this code

```{r}
ah_nested <- ah_nested %>%
                mutate(params = map(model, tidy))
ah_nested
```

What happened?

</div>


We now have the model estimates in data frames in the `params` column. Getting them into a fully usable format is now as easy as `unnest()`.

```{r}
ah_params <- ah_nested %>%
                unnest(params)
ah_params
```

I want to make a plot where every speaker's Intercept value is plotted against their slope value, so we need to use just the `idstring`, `term` and `estimate` columns, and use `spread()`

```{r}
ah_params %>%
  select(idstring, term, estimate)%>%
  spread(term, estimate)%>%
  ggplot(aes(`(Intercept)`, log2dur_c))+
    geom_hline(yintercept = 0, linetype = 2)+  
    geom_point()+
    theme_minimal()+
    coord_fixed()
```

Here's another version with the points scaled by size.

```{r}
speaker_n <- ah_tomod %>%
                group_by(idstring)%>%
                tally()

ah_nested %>%
  unnest(params) %>%
  select(idstring, term, estimate) %>%
  spread(term, estimate) %>%
  left_join(speaker_n)%>%
  ggplot(aes(`(Intercept)`, log2dur_c))+
    geom_hline(yintercept = 0, linetype = 2)+  
    geom_point(aes(size = n), alpha = 0.5)+
    scale_size_area()+
    theme_minimal()+
    coord_fixed()
```


This no-pooling approach is both very informative, but also has some serious weaknesses.
First of all, we can see that *most* speakers have the expected slope (on the y-axis) between duration and F1. 
That is, the slope is positive, meaning as duration increases, F1 increases (lowering the vowel).
However, there is a lot if inter-speaker variability.
Some speakers are estimated to have a fairly steep slope, and others have slopes closer to 0 or even negative.

Here's a plot of the actual lines for each speaker.

```{r}
ah_nested %>%
  unnest(params) %>%
  select(idstring, term, estimate) %>%
  spread(term, estimate) %>%
  ggplot()+
    geom_abline(aes(intercept = `(Intercept)`, 
                    slope = log2dur_c),
                alpha = 0.1)+
    xlim(-1.2, 2.85)+
    xlab("log2dur_c")+
    ylim(-2,4.3)+
    ylab("F1_n")+
    theme_minimal()
    
```

<div class = "box break">
<span class = 'big-label'>Discussion</span>

What are the upsides and the downsides to the no-pooling approach?

</div>




# Partial Pooling (Mixed Effects)

The biggest downside to the no-pooling approach is that there is no shared information *between* the models. It's possible that speakers who had a negative slope were only estimated to have a negative slope because they had very little data.
If there were some way to incorporate information about the slopes of every other speaker into these speakers' slope estimates, they might be considerably different.

This is exactly what Mixed Effects Models do. You define "grouping variables" or "pooling variables" or "random effects" in the model to tell the model to estimate effects for each speaker, but not to do so completely independently for each speaker.

<div class = "box break">
<span class = 'big-label'>~2 Minute Activity</span>

Fit a complete pooling model with `lm` and a partial pooling model with `lmer` with the following code.

```{r}
ah_lm <- lm(F1_n ~ log2dur_c, data = ah_tomod)
ah_lme <- lmer(F1_n ~ log2dur_c + (1 + log2dur_c | idstring), data = ah_tomod)
```

</div>

The big difference between the complete pooling and the partial pooling models is the definition of the "random effects" structure. 

<div class = "illustrate">
(<span class = "pop">1</span> + <span class = "pop">log2dur_c</span> | <span class = "pop">idstring</span>)
</div>

- Everything to the left of the `|`
    - Everything that can vary within the grouping variable. 
    - `1` refers to the intercept
- Everything to the right of the `|`
    - The grouping variable.
    
With `(1 + log2dur_c | idstring)`, we're saying

> Speakers can have variable intercepts, and variable slopes.
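
For comparison (these model names are just illustrative, and not used below), here are two other random effects structures you'll often see: a random intercept only, and uncorrelated random intercepts and slopes using lme4's double-bar syntax.

```{r}
# Random intercept only: every speaker gets their own baseline F1,
# but all speakers share the same duration slope
ah_lme_int <- lmer(F1_n ~ log2dur_c + (1 | idstring), data = ah_tomod)

# Double-bar syntax: speakers get their own intercepts *and* slopes,
# but the correlation between them is fixed at 0
ah_lme_uncor <- lmer(F1_n ~ log2dur_c + (1 + log2dur_c || idstring), 
                     data = ah_tomod)
```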

Let's look at the summary of the mixed effects model.

```{r}
summary(ah_lme)
```


## The "Random Effects"

<div class = 'half-img'>

![](figures/ranef.png)
</div>

By default, the summary doesn't print out all of the "random effects", because in this case there are two for each of the 292 speakers (intercept and slope). But it does give us estimates of the variance of these random effects, and their correlation. Here, it looks like speakers are more variable with respect to their baseline F1 than they are with respect to the effect duration has on their F1.
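
If you want just these variance components, they can be extracted directly with `VarCorr()`:

```{r}
VarCorr(ah_lme)
```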


## Fixed Effects

<div class = 'half-img'>
![](figures/fixef.png)
</div>


These are the estimated "fixed effects". They can be interpreted a lot like the coefficients of the models we've been fitting already, in the sense that these are part of the formula to draw a line. Another way is to interpret them as the *average* intercept and slope across speakers.

## Looking at the random effects

It is possible to extract the random effects using the function `ranef()`.

```{r}
head(ranef(ah_lme)$idstring)
```

These are the estimated differences in intercept and slope for each speaker from the fixed effects. That is, if we added the fixed effect values to these random effects, we would get the estimated intercept and slope for each speaker. Let's do that:

```{r}
ah_lme_fixef <- fixef(ah_lme)

ranef(ah_lme)$idstring %>%
  ggplot(aes(`(Intercept)` + ah_lme_fixef[1], 
             log2dur_c + ah_lme_fixef[2]))+
    geom_point()+
    geom_hline(yintercept = 0, linetype = 2)+
    theme_minimal()+
    coord_fixed()
```
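
As a shortcut, `coef()` does this addition for us, returning each speaker's combined (fixed + random) intercept and slope:

```{r}
head(coef(ah_lme)$idstring)
```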


```{r}
ah_lme_fixef <- fixef(ah_lme)

ranef(ah_lme)$idstring %>%
  ggplot()+
       geom_abline(aes(intercept = `(Intercept)` + ah_lme_fixef[1], 
                    slope = log2dur_c + ah_lme_fixef[2]),
                alpha = 0.1)+
    xlim(-1.2, 2.85)+
    xlab("log2dur_c")+
    ylim(-2,4.3)+
    ylab("F1_n")+
    theme_minimal()
```

We can get an idea of what the mixed-effects modelling has done by comparing these results to the no-pooling model. 
First, the clustering of individual speakers' random effects is tighter. 
Also, no speaker is estimated to have an individual slope less than 0. 

It is also possible to see the positive correlation between the random effects.
Speakers who have a lower baseline `ah` appear to be more affected by duration changes.

## Multiple random effects

It is also possible to include multiple grouping factors in a mixed effects model, although this is where it becomes impossible to make illustrative graphs.

<div class = 'break box'>
<span class = 'big-label'>~5 minute break</span>

We're going to add `word` as a grouping factor. For now, we'll treat `word` as only being able to modulate the baseline `F1_n`, which means we'll add it with `(1 | word)` (a.k.a. a random intercept).

Go ahead and fit this model:

```{r}
ah_lme_multi <- lmer(F1_n ~ log2dur_c + 
                       (1 + log2dur_c | idstring) + 
                       (1 | word), data = ah_tomod)
summary(ah_lme_multi)
```

What happened?

</div>

However, it is *possible* to include a random slope by word too, so let's do that as well:

```{r}
ah_lme_multi2 <- lmer(F1_n ~ log2dur_c + 
                       (1 + log2dur_c | idstring) + 
                       (1 + log2dur_c | word), data = ah_tomod)
summary(ah_lme_multi2)
```

One thing you might notice now is that the "fixed effect" of duration has gotten a bunch smaller. 

```{r}
# complete pooling
coef(ah_lm)
```

```{r}
# full ranef structure
fixef(ah_lme_multi2)
```

This may look scary, especially if the effect size gradually shrinks towards zero, but this is exactly how mixed effects models are supposed to work. 
In some sense, in the complete-pooling model, we *thought* we were modelling the effect of duration, but some of that effect was actually due to the peculiarities of the words spoken, and of who the speakers were.


## When to add random intercepts/slopes?

[Barr et al (2013)](http://idiom.ucsd.edu/~rlevy/papers/barr-etal-2013-jml.pdf) recommend (command?) fitting models with "maximal" random effects structures.
That is, for every grouping factor, include a random slope for every fixed effect where it's logically possible.
I think this is a good starting point, but isn't always practical.

### Logically Possible & Impossible Random Slopes

In the `ah` model we just fit, it was possible for us to fit a random slope of duration by speaker and word, because speakers produced tokens with different durations, and words were produced with different durations.

However, if we wanted to include `context` again (word initial or word final), we could include a random slope of context by speaker, but *not* by word.
This isn't a stylistic prohibition; it's just not logically possible, and trying might blow up your model.
That's because every word has only one vowel context. `NAH` only has a word-final vowel context, and there's no way for it to be affected by being word internal.

Similarly, if we wanted to include year of birth in the model, we could include a random slope of dob by word, because many words were spoken by people across many ages. But we *couldn't* include a random slope of dob by speaker, because each speaker only has one date of birth.


### Necessary Random Effects

Any time you are looking at a between-grouping factor variable, you should include the grouping factor with a random intercept. This is *especially* true for any kind of research done outside of an experimental lab context.

- age: `(1 | speaker)`
- gender: `(1 | speaker)`
- word frequency: `(1 | word)`
- sentence grammaticality: `(1 | token)`

Other necessary random effects would be related to the nature of the data. In this `ah` model, we haven't explored any between-grouping factors, but all of these data points were produced by speakers, embedded in specific words, so it's important to capture that in the random effects.
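
As a hypothetical sketch (assuming a centered date-of-birth column `dob_c`, which isn't actually in our data), a model with a between-speaker predictor might look like this:

```{r, eval = FALSE}
# dob_c varies between speakers, so idstring must appear with (at least)
# a random intercept; dob_c can get a random slope by word, but not by speaker
lmer(F1_n ~ dob_c + log2dur_c + 
       (1 + log2dur_c | idstring) + 
       (1 + dob_c | word), 
     data = ah_tomod)
```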

### Recommended Random Effects

It is a good idea to include your biggest effects, or effects of greatest interest, as random slopes across the grouping factors which contain the most data.


# Pitfalls in mixed effects modelling

The biggest issue people face when fitting a mixed effects model is that it may "fail to converge". This is because these models aren't 'solved' in the same way as a regular linear regression is. It is very important to only base your inference on converged models. There are a handful of tweaks that might help your models converge.

- center and scale all continuous predictors.
- if possible, choose a different factor level that has more data as your reference level
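
For example, `scale()` handles the centering and scaling (it returns a one-column matrix, hence the `as.numeric()`), and `lmerControl()` lets you try a different optimizer; `bobyqa` often converges where the default doesn't. The `log2dur_z` name here is just illustrative.

```{r, eval = FALSE}
ah_scaled <- ah_tomod %>%
  mutate(log2dur_z = as.numeric(scale(log2(dur))))

lmer(F1_n ~ log2dur_z + (1 + log2dur_z | idstring) + (1 | word),
     data = ah_scaled,
     control = lmerControl(optimizer = "bobyqa"))
```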


The next step is to start taking out some of the recommended random slopes, or the ones most likely to be problematic. For example, in the `ah` model, the random slope of duration by word is likely to be problematic because many words only have one data point, making it more challenging to estimate a random slope for every word.





# Mixed Effects Models Controversy?


One of the benefits of mixed-effects models is their ability, within limits, to ameliorate unmodelled properties of the grouping factors.

