---
title: "Model Comparison and Bootstrapping"
output: 
  html_notebook: 
    code_folding: none
    css: ../custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
---


# Agenda

- Model comparison via AIC, BIC and Likelihood Ratio Tests
- Bootstrap replication

<div class = "break box">
<span class = "big-label">Setup</span>

Today, we're going to need to install and use a new package called "boot", and re-install the course package because I've added a new data object to it for today.

```{r}
#' Install
install.packages("boot")
library(devtools)
install_github("jofrhwld/lsa2017")
```

```{r}
library(tidyverse)
library(lsa2017)
library(lme4)
library(boot)
library(broom)
```
</div>


# Intro

So far, I have been cagey about the use and interpretation of p-values, and other methods for deciding when to include or exclude predictors from your model. Instead, I've tried to emphasize the importance of focusing on effect size & direction, and to impress on you the importance of having a theory of *which direction* an effect will go, and *about how large* it will be, before you fit the model.

However, sometimes you do want to check whether an effect in a model is reliable, or whether it is worth complexifying the model in order to improve the fit.
The current [recommendations from the folks developing these LME methods are to use *likelihood ratio tests* and *bootstrap confidence intervals*](https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#methods-for-testing-single-parameters).

# Likelihood Ratio Test (and AIC and BIC)

Let's start off by fitting a fairly maximal model for the effects of duration and voicing on the F1 of `ah`.

```{r}
iy_ah_tomod <- iy_ah %>% 
                  filter(context == "internal",
                         fol_seg %in% c("B", "CH", "D",
                                        "DH", "F", "G",
                                        "JH", "K", "P",
                                        "S", "SH", "T",
                                        "TH", "V", "Z",
                                        "ZH"))%>%
                  mutate(log2dur = log2(dur),
                         log2dur_c = log2dur - median(log2dur),
                         is_voiceless = fol_seg %in% c("CH", "F", "K",
                                                       "P", "S", "SH",
                                                       "T", "TH"))

ah_tomod <- iy_ah_tomod %>% filter(plt_vclass == "ah")
```


```{r}
ah_model_full <- lmer(F1_n ~ log2dur_c * is_voiceless + 
                  (1 + log2dur_c + is_voiceless | idstring) +
                  (1 + log2dur_c | word), 
                 data = ah_tomod)
```


```{r}
summary(ah_model_full)
```

Looking just at the t-values, it looks like we have reliable effects of both duration and voicing on F1. The longer the vowel, the higher the F1, and pre-voiceless `ah` has a slightly higher F1, though the voicing effect is weaker than the effect of doubling the vowel's duration.

However, one effect of interest is the *interaction* between duration and voicelessness. 
Looking at the coefficient, it looks kind of like pre-voiceless `ah` is more sensitive to duration changes than pre-voiced `ah`, but the effect size is small, and so is our certainty (t < 2).

One way to see whether it is worth including the interaction to improve the model fit is to fit another model **where everything else is the same** except for the effect of interest (in this case, the interaction between duration and voicelessness).


```{r}
ah_model_2 <- lmer(F1_n ~ log2dur_c + is_voiceless + 
                  (1 + log2dur_c + is_voiceless | idstring) +
                  (1 + log2dur_c | word), 
                 data = ah_tomod)
```

Then, you compare these two models using the function `anova()`. Now, when you use `anova()` on two LMEs, it is *not* doing an ANOVA. 
The function `anova()` has just become semantically bleached to mean "compare two models".^[I imagine this is similar to how the Extended Projection Principle came to mean "Have specifier", and there are still EPP features kicking around even though the theory of phrase structure has completely changed.]

<div class = "break box">
<span class = "big-label">~5 Minute Break</span>

If you haven't done it already, fit the two models and compare them with `anova()`.

```{r}
ah_model_full <- lmer(F1_n ~ log2dur_c * is_voiceless + 
                  (1 + log2dur_c + is_voiceless | idstring) +
                  (1 + log2dur_c | word), 
                 data = ah_tomod)
```

```{r}
ah_model_2 <- lmer(F1_n ~ log2dur_c + is_voiceless + 
                  (1 + log2dur_c + is_voiceless | idstring) +
                  (1 + log2dur_c | word), 
                 data = ah_tomod)
```


```{r}
anova(ah_model_full, ah_model_2)
```

</div>


There's a lot to look at here:

<div class = 'half-img'>
![](figures/aic_bic.png)
</div>

The AIC and the BIC are metrics of model fit. They try to quantify the balance between how well a model fits the data and how complex it is. 
I've mentioned the tradeoff between improving the model fit and trying to avoid overfitting, and these *Information Criteria* are one way of trying to quantify it.
I have the formulas for the AIC and the BIC here, not because the actual math matters for us, but in order to do a contrastive analysis:

$$
AIC = 2k - 2\ln(\hat{L})
$$

$$
BIC = \ln(n)k - 2\ln(\hat{L})
$$

In both formulas, $k$ is the number of estimated parameters, $n$ is the number of observations, and $\hat{L}$ is the model's maximized likelihood. Both reward a higher likelihood and penalize extra parameters; they differ only in how heavily they penalize each additional parameter. This is a classic example of having two competing metrics that are basically the same, but people have their preferences, so we get both.

When you're comparing models (even multiple models) using AIC and BIC, the models with the *smallest* AIC and BIC are the preferred ones. There isn't an established threshold for how big a difference in AIC or BIC counts as "significant".
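If you want these numbers outside of an `anova()` table, the base functions `AIC()` and `BIC()` also work directly on fitted `lmer` models. A quick sketch (assuming `ah_model_full` and `ah_model_2` from above; note that `anova()` refits the models with ML before comparing, so its table can differ slightly from these REML-based values):

```{r}
# AIC/BIC straight from the fitted models (REML-based here).
AIC(ah_model_full, ah_model_2)
BIC(ah_model_full, ah_model_2)
```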

<div class = 'half-img'>
![](figures/lrt.png)
</div>

This block of numbers is dedicated to the Likelihood Ratio Test (or LRT). It tests whether the more complex model provides a significantly better fit to the data (a sketch of the underlying computation follows the notes below). Some notes:

- LRTs should only be carried out comparing "nested" models. That is, one model should have all the same parameters as the other, plus some.
- LRTs should ideally be carried out on minimally contrasting models.
- These are evaluating a "stepping up" of the model complexity.
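Under the hood, the test statistic is just twice the difference in the two models' log likelihoods, compared against a chi-squared distribution with degrees of freedom equal to the difference in their parameter counts. Here's a minimal sketch of that computation for `ah_model_full` vs. `ah_model_2` (assuming both are already fit; `refitML()` is there because `anova()` compares ML fits, not REML fits):

```{r}
# A sketch of the LRT that anova() reports, done by hand.
ml_full <- refitML(ah_model_full)   # refit with ML, as anova() does
ml_2    <- refitML(ah_model_2)

chisq   <- as.numeric(2 * (logLik(ml_full) - logLik(ml_2)))
df_diff <- attr(logLik(ml_full), "df") - attr(logLik(ml_2), "df")

pchisq(chisq, df = df_diff, lower.tail = FALSE)
```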

I would recommend using LRTs to probe the reliability of very specific predictors and their interactions.
In this example, if we were interested more broadly in the effect of voicing, we could try fitting one more model that removes the voicing fixed effect altogether (not touching the random effects though!).


```{r}
ah_model_3 <- lmer(F1_n ~ log2dur_c + 
                  (1 + log2dur_c + is_voiceless | idstring) +
                  (1 + log2dur_c | word), 
                 data = ah_tomod)
```



```{r}
anova(ah_model_full, ah_model_2, ah_model_3)
```

> While the BIC does not distinguish between these three models, the model with just a main effect of voicing is preferred by the AIC. This is also the only model to significantly improve model fit over a simpler model according to the Likelihood Ratio Test. It would seem then that the following segment's voicing has a reliable effect on F1, but it does not reliably interact with the duration of the vowel/vowel reduction.




# The Bootstrap

If you want to get something p-value-like or confidence interval-like for the parameters in a mixed effects model, the people who have developed these techniques recommend *bootstrapping*.
Bootstrapping is a very general purpose technique that sort of looks like it shouldn't work. 

## Bootstrapping in Principle

*Bootstrapping* is a very general method for estimating our uncertainty about any given parameter estimate. For example, let's say that we had 8 randomly^[In this case, not actually random.] sampled heights of linguists at the Institute, and we were interested in how noisy an estimate of the average height of linguists this is.

```{r}
heights = c(a = 69, b = 75, c = 62, d = 62, e = 73, f = 71, g = 61, h = 64)
mean(heights)
```

One method statisticians have developed is the "bootstrap" estimate of uncertainty. The way it works is that you first resample your sample, with replacement.

```{r}
new_heights = sample(heights, replace = T)
new_heights
```

Some people's heights have shown up more than once, and some not at all.
Then, you take the mean:

```{r}
mean(new_heights)
```

And now, you do this hundreds and hundreds of times.

```{r}
replicates <- (1:100000) %>%
                map(~sample(heights, replace = T))%>%
                map(mean)%>%
                simplify()
```

```{r}
data_frame(replicates = replicates)%>%
  ggplot(aes(replicates))+
    stat_density(color = "black", alpha = 0.6)+
    geom_vline(xintercept = mean(heights))+
    theme_minimal()
```

The statisticians have said unto us that this is a sensible way to estimate uncertainty about the mean of our sample. 
As the bootstrap is to the sample, so the sample is to the population.
We can use some very simplistic methods to get a confidence interval out of the bootstraps, say by taking the central 95% interval of the replicates:

```{r}
quantile(replicates, prob = c(0.025, 0.975))
```

Basically, this means we've got a pretty lousy sample for trying to estimate the *average* height of linguists. This 95% CI ranges from just a bit taller than the shortest people in the sample to just a bit shorter than the tallest, meaning that, as far as we know from these 8 people, any one of them could be the *average* height.

```{r}
sort(heights)
```

You can use bootstrap replication for any property of the sample, incidentally. For example, we could try to estimate the standard deviation of linguists' heights using this method.

```{r}
sd_replicates <- (1:100000) %>%
                map(~sample(heights, replace = T))%>%
                map(sd)%>%
                simplify()
```


```{r}
data_frame(sd_replicates = sd_replicates)%>%
  ggplot(aes(sd_replicates))+
    stat_density(color = "black", alpha = 0.6)+
    geom_vline(xintercept = sd(heights))+
    theme_minimal()
```


Here's a better example. Let's look at the ratio of `iy` duration to `ah` duration again.

```{r}
vowel_ratio <- iy_ah %>%
  group_by(idstring, plt_vclass)%>%
  summarise(dur = median(dur))%>%
  spread(plt_vclass, dur)%>%
  mutate(ratio = ah/iy)


vowel_ratio %>%
ggplot(aes(ratio))+
    stat_density(alpha = 0.6, color = 'black')+
    theme_minimal()
```

We can bootstrap the median of this duration ratio in the same way:


```{r}
ratio_bootstrap <- (1:100000)%>%
                      map(~sample(vowel_ratio$ratio, replace = T))%>%
                      map(median)%>%
                      simplify()
```

```{r}
data_frame(ratio_bootstrap = ratio_bootstrap) %>%
  ggplot(aes(ratio_bootstrap))+
    stat_density(alpha = 0.6, color = "black")+
    xlim(0.8, 2)+
    theme_minimal()
```



## Bootstrapping for Regression

**NOTE!**: I give a bunch of code to do bootstrap replication by hand here, but there is a pre-built function to use called `bootMer()`.

Bootstrapping to get confidence intervals for regression parameters works in a similar way, but is implemented differently. 
For clarity's sake, we'll just look at the data from one speaker.


```{r}
sp1 <- ah_tomod %>% 
          filter(idstring == idstring[1]) %>%
          select(F1_n, log2dur_c)
```

```{r}
ggplot(sp1, aes(log2dur_c, F1_n))+
    geom_point()+
    stat_smooth(method = 'lm')
```


```{r}
sp1_mod <- lm(F1_n ~ log2dur_c, data = sp1)
sp1_mod
```

With this model, every data point has three values associated with it:

1. Its x value (in this case, its duration)
2. Its fitted y value (in this case, its predicted F1)
3. Its residual error (the difference between its observed F1 and the predicted F1).

In order to get a bootstrap estimate of the intercept and slope here, you actually sample the residuals with replacement, add them to each point's fitted value, and refit the model.

```{r}
sp1$resid <- resid(sp1_mod)
sp1$fitted <- fitted(sp1_mod)
head(sp1)
```

As a reminder, here's a plot of the observed duration values against the fitted F1 values.

```{r}
ggplot(sp1, aes(log2dur_c, fitted))+
    geom_point()
```

To get back the original data, you need to add each data point's residual error to the fitted value.

```{r}
ggplot(sp1, aes(log2dur_c, fitted + resid))+
    geom_point()
```

In bootstrapping, you actually create a new set of residuals by sampling the existing residuals with replacement. Then you add these to the fitted values to get a new bootstrap replicate.

<div class = "box break">
<span class='big-label'>~2 minute break</span>

Run this code in a chunk a handful of times.

```{r}
ymax <- max(sp1$fitted) + max(sp1$resid)
ymin <- min(sp1$fitted) + min(sp1$resid)


sp1 %>%
  mutate(resid_samp = sample(resid, replace = T))%>%
  ggplot(aes(log2dur_c, fitted + resid_samp))+
    geom_point()+
    ylim(ymin, ymax)+
    stat_smooth(method = 'lm')+
    theme_minimal()
```

</div>

Now, we just need to collect up all of these re-fittings, and plot them.


```{r}
lm_boot <- (1:1000)%>%
                map(~(sp1 %>% mutate(resid_samp = sample(resid, replace = T),
                                     F1_n = fitted + resid_samp)))%>%  # new response = fitted + resampled residuals
                map(~(lm(F1_n ~ log2dur_c, data = .x)))%>%
                map(tidy)%>%
                bind_rows(.id = "id")
```

```{r}
lm_boot %>%
  ggplot(aes(estimate))+
    stat_density(alpha = 0.6, color = "black")+
    facet_wrap(~term, scales = "free")+
    geom_vline(xintercept = 0)+
    theme_bw()
```
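Just like with the heights, we can pull a simple percentile interval out of these hand-rolled replicates. A quick sketch, using the `lm_boot` data frame we just built:

```{r}
# Percentile 95% intervals for the intercept and slope from the replicates.
lm_boot %>%
  group_by(term) %>%
  summarise(lower = quantile(estimate, 0.025),
            upper = quantile(estimate, 0.975))
```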

## Bootstrapping LMEs

The specific way bootstrapping works for `lmer` models is different from the vanilla linear model case (dealing with the random effects structure appropriately is complex), but fortunately we don't need to write that code ourselves. 
The function `bootMer()` will do it for us.

<div class = "illustrate">
bootMer(<span class = "pop">model</span>, <span class = "pop">FUN = fixef</span>, <span class = "pop">nsim</span>, ...)
</div>

- `model`
    - the model you want to do bootstrapping on (can be a linear or other generalized linear mixed effects model).
- `FUN`
    - The function you want to apply to each re-fit model. This will almost always be just `fixef`
- `nsim`
    - The number of bootstrap simulations you want to do
- `...`
    - Additional  optional arguments (e.g. what different kinds of bootstrapping you want to do, whether or not to use parallel processing).
   
<div class = "break box">
<span class = "big-label">~5 minute break</span>

Run this code to do 5 bootstrap replications.


```{r}
ah_bootstraps <- bootMer(ah_model_full, 
                         FUN = fixef, 
                         nsim = 5,
                         verbose = T)
```
</div>

The somewhat annoying thing about bootstrapping is that it takes just as long to fit each replication as it does to fit the original model.
That's why I baked one at home to share in the `lsa2017` package, as the `ah_model_boot` object.
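For reference, it was built with a call along these lines (don't run this in class — it takes a while). The `parallel` and `ncpus` arguments are optional extras, assuming a multicore machine that isn't running Windows; they just farm the replications out to several cores:

```{r}
# Not run in class: the full 1000-replicate bootstrap behind ah_model_boot.
# parallel/ncpus are optional; they just speed things up.
ah_model_boot <- bootMer(ah_model_full,
                         FUN = fixef,
                         nsim = 1000,
                         verbose = T,
                         parallel = "multicore",
                         ncpus = 2)
```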

```{r}
ah_model_boot
```

Each `t*` item corresponds to a fixed effect coefficient. They show up here in the same order they did in the model summary:

```{r}
fixef(ah_model_full)
```

In the bootstrap summary, the `original` column corresponds to the original coefficient estimate from the model.
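The `bias` and `std. error` columns are computed from the matrix of replicates stored in `ah_model_boot$t` (one row per simulation, one column per coefficient). For instance, the `std. error` column is just the standard deviation of each coefficient's replicates — a quick sketch to check:

```{r}
# The bootstrap standard errors are the column-wise SDs of the replicates.
apply(ah_model_boot$t, 2, sd)
```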

To get a confidence interval for these parameters, we need to use the `boot.ci()` function.

<div class = "illustrate">
boot.ci(<span class = "pop">bootstrap</span>, <span class = "pop">index</span>, <span class = "pop">type = perc</span>)
</div>

- `bootstrap`
    - The bootstrap object
- `index`
    - The index of the coefficient you want a confidence interval for
- `type`
    - Unfortunately, there are many different ways of estimating a confidence interval from bootstrapped replicates, and not all of them work for a bootstrapped lme. I recommend `type = "perc"`
    
So to get the bootstrap confidence interval for the intercept, we'd give `index=1`.


```{r}
boot.ci(ah_model_boot, index = 1, type = "perc")
```

This confidence interval excludes 0, so we could call this a reliable effect.

To get the confidence interval for the interaction between voicing and duration, you give it `index = 4`.


```{r}
boot.ci(ah_model_boot, index = 4, type = "perc")
```

In this case, the confidence interval contains 0, so this does not look like a reliable effect.
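Rather than calling `boot.ci()` once per coefficient, you can loop over the indices. A sketch using `purrr::map()`, labelling each interval with the coefficient names from the original model:

```{r}
# Percentile intervals for every fixed effect, named by coefficient.
map(seq_along(fixef(ah_model_full)),
    ~boot.ci(ah_model_boot, index = .x, type = "perc")) %>%
  set_names(names(fixef(ah_model_full)))
```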

So, if you were writing a paper on the effect of vowel duration and voicing on the F1 of `ah`, you could argue, based on convergent evidence from the model comparisons, likelihood ratio tests, and bootstrap confidence intervals, that a following voiceless consonant consistently raises F1 (lowers the vowel), but that this effect does not interact reliably with duration.



```{r}
ah_model_boot$t %>%
  data.frame()%>%
  gather()%>%
  ggplot(aes(value))+
    stat_density(alpha = 0.6, color = "black")+
    geom_vline(xintercept = 0, linetype = 2)+
    facet_wrap(~key, scales = "free")+
    theme_bw()
```


