The Agenda

  1. Follow up from the last meeting.
  2. More about interpreting the model output
  3. Homework review
  4. Pitfalls to look out for, and how to address them
  5. A modelling menu

Setup

library(tidyverse)
library(ggthemes)
library(broom)
library(lsa2017)

ah <- iy_ah %>% filter(plt_vclass == "ah")
iy <- iy_ah %>% filter(plt_vclass == "iy")
ay0 <- ay %>% filter(plt_vclass == "ay0")

And, in preparation of Thursday’s meeting, go ahead and install lme4, so we can catch any problems well in advance.

install.packages("lme4")

Followup

The question came up last time about why it really matters to center and scale our predictors. Here is an example of a data set I’ve looked at, which involves the raising of /ay/ before voiceless consonants. A very strong predictor is the speaker’s year of birth.

ay0_means <- ay0 %>%
              mutate(dob = year-age)%>%
              group_by(idstring,dob, word)%>%
              summarise(F1_n = mean(F1_n))%>%
              summarise(F1_n = mean(F1_n))
ggplot(ay0_means, aes(dob, F1_n))+
    geom_point()+
    scale_y_reverse()+
    theme_minimal()

But if I just fit a linear model with year of birth as a predictor, I get an entirely uninterpretable intercept value that is actually impossible given the data.

ay0_model <- lm(F1_n ~ dob, data = ay0_means)
summary(ay0_model)

Call:
lm(formula = F1_n ~ dob, data = ay0_means)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.03831 -0.16668 -0.01521  0.16445  1.06611 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.2970421  1.1782669   19.77   <2e-16 ***
dob         -0.0116366  0.0006059  -19.20   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2506 on 324 degrees of freedom
Multiple R-squared:  0.5324,    Adjusted R-squared:  0.5309 
F-statistic: 368.8 on 1 and 324 DF,  p-value: < 2.2e-16
ggplot(ay0_means, aes(dob, F1_n))+
    geom_point()+
    geom_vline(xintercept = 0, linetype = 2)+
    geom_hline(yintercept = coef(ay0_model)[1], linetype = 2)+
    geom_abline(intercept = coef(ay0_model)[1],
                slope = coef(ay0_model)[2])+
    theme_minimal()

The slope is also too specific, from a theoretical perspective. We wouldn’t realistically expect to see reliable differences between people born in 1959 and 1960. In my own work, I’ve divided year of birth by 10, resulting in a coefficient readable as a change per decade. Labov (2001) divides year of birth1 by 25, to try to get a slope in terms of a “generation”. Either way works.

ay0_means <- ay0_means %>%
                mutate(dob_c = (dob-1960)/10)
ay0_model2 <- lm(F1_n ~ dob_c, data = ay0_means)
summary(ay0_model2)

Call:
lm(formula = F1_n ~ dob_c, data = ay0_means)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.03831 -0.16668 -0.01521  0.16445  1.06611 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.489214   0.016775   29.16   <2e-16 ***
dob_c       -0.116366   0.006059  -19.20   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2506 on 324 degrees of freedom
Multiple R-squared:  0.5324,    Adjusted R-squared:  0.5309 
F-statistic: 368.8 on 1 and 324 DF,  p-value: < 2.2e-16

Another important factor in choosing where you center your predictors is if there’s a strong interaction with some other predictor. For example, this plot compares pre-voiced to pre-voiceless [ay]. It is possible model this entire graph in one model that has an interaction between voicing and date of birth. But where you choose to center date of birth will have a big effect on the estimated parameters! Compare the models with DOB centered on 1890 vs 1950.

ay_means <- ay %>%
              mutate(dob = year-age)%>%
              group_by(idstring, dob, plt_vclass, word)%>%
              summarise(F1_n = mean(F1_n)) %>%
              summarise(F1_n = mean(F1_n))
ay_means %>%
  ggplot(aes(dob, F1_n, color = plt_vclass))+
    geom_point()+
    stat_smooth(method = 'lm')+
    geom_vline(xintercept = 1890, linetype = 2)+
    geom_vline(xintercept = 1950, linetype = 2)+  
    annotate(geom  = "label",
             x = 1890,
             y = -0.25,
             label = "a")+
    annotate(geom  = "label",
             x = 1950,
             y = -0.25,
             label = "b")+  
    scale_y_reverse()+
    scale_color_colorblind()+
    theme_minimal()

ay_means <- ay_means %>%
              mutate(dob_center_a = (dob - 1890) / 10,
                     dob_center_b = (dob - 1950) / 10)
ay_model_a <- lm(F1_n ~ dob_center_a * plt_vclass, data = ay_means)
ay_model_b <- lm(F1_n ~ dob_center_b * plt_vclass, data = ay_means)
tidy(ay_model_a) %>% select(term, estimate, p.value)
tidy(ay_model_b) %>% select(term, estimate, p.value)

In the model with the 1890 as the intercept, the coefficient for ay0 is small, and not significant! In the model with dob centered at 1950, the coefficient for ay0 is large and very significant. This is simply because they are describing different locations on the plot.

So what is the “right” thing to do?

At this point is usualy when the “So what’s the right thing to do?” question comes up. And the annoying answer is that there isn’t a clear cut answer or decision process. You should guide your decisions to be sure that

  1. You have a sensible intercept.
  2. Your slopes are defined across a reasonable range for the data.

For many data types, the Gelman & Hill (2006) recommendation (subtract the mean, divide by 2\(\times\) the standard deviation) will do the trick. For others, there may be some people making a top-down effort to define a scaling (e.g. the Zipf Scale for word frequencies). For the remaining cases, if someone has done something before and it doesn’t seem obviously wrong, try that for the sake of comparibility. Otherwise, it involves a lot of development of a model-gefühl.

More model interpretation

Let’s come back to modelling the effect of duration on the F1 of ah. Here, we’ll fit the model using the centered log2(duration) like a recommended on the homework.

ah_tomod <- ah %>%
               mutate(log2dur_c = log2(dur)-median(log2(dur)))
ah_tomod %>%
  ggplot(aes(log2dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    xlab("centered log2(duration)")+
    theme_minimal()

ah_model <- lm(F1_n ~ log2dur_c, data = ah_tomod)
summary(ah_model)

Call:
lm(formula = F1_n ~ log2dur_c, data = ah_tomod)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4740 -0.3717  0.0102  0.3827  3.2640 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.061795   0.004260   249.2   <2e-16 ***
log2dur_c   0.388204   0.006957    55.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6185 on 21117 degrees of freedom
Multiple R-squared:  0.1285,    Adjusted R-squared:  0.1285 
F-statistic:  3113 on 1 and 21117 DF,  p-value: < 2.2e-16

The intercept is the estimated F1_n at the median duration, and the slope log2dur_c is the estimated increase in normalized F1 for every doubling of duration. Let’s walk through these results more slowly.

Residuals

This is a so-called “five number summary” of the residuals (the difference between the fitted line and the actual data points). Given the assumptions that the model makes (e.g. the data can be described by a line with some residual error), you’ll want to see these values being fairly symmetrical, with the median just about 0. If this isn’t the case, then you should probably make some more plots of the data to figure out what’s going on.

Standard Error and t-value

These are fairly straightforward measures of the uncertainty about the coefficient estimates that you can reason about safely without having taken a philosophy course. The bigger the standard error, the less certainty there is about the the parameters. The t-value is just the coefficient divided by the standard error, so the bigger it is, the more certainty there is. Roughly speaking, if the t-value is greater than 2, the effect is probably reliable.

\(r^2\)

This is a measure of how much modelling your outcome data as a straight line with your chosen predictors has reduced the variability in the data. This is a helpful enough heuristic, but can’t be taken as an intrinsically informative value. We can roughly visualize this by plotting density distributions of our data (centered on the mean) and the model residuals.

ah_tomod %>%
  ggplot(aes(log2dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    geom_hline(yintercept = mean(ah_tomod$F1_n))+
    geom_abline(intercept = coef(ah_model)[1],
                slope = coef(ah_model)[2],
                color = "red")+
    theme_minimal()

ah_tomod$resid <- resid(ah_model)
ah_tomod %>%
  mutate(F1_n_c = F1_n - mean(F1_n))%>%
  ggplot()+
    geom_density(aes(x = F1_n_c, color = "centered data"))+
    geom_density(aes(x = resid, color = "residuals"))+
    theme_minimal()+
    ggtitle(paste0("r-squared = ", round(summary(ah_model)$r.squared, digits = 2)))+
    scale_color_manual(NULL, values = c("black", "red"))

The distribution of the residuals in red is just slightly narrower than the data, which is why the r-squared is fairly modest.

Sometimes you can get a really large r-squared by including a big and obvious predictor, that doesn’t really contribute to our understanding. For example, for the whole iy_ah data set, the largest number one predictor for the model is the vowel class, since iy and ah are contrastively different along F1.

iy_ah_model <- lm(F1_n ~ plt_vclass, data = iy_ah)
summary(iy_ah_model)

Call:
lm(formula = F1_n ~ plt_vclass, data = iy_ah)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0714 -0.3278 -0.0125  0.3269  4.8163 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.072222   0.003850   278.5   <2e-16 ***
plt_vclassiy -2.495949   0.005294  -471.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5595 on 44816 degrees of freedom
Multiple R-squared:  0.8322,    Adjusted R-squared:  0.8322 
F-statistic: 2.222e+05 on 1 and 44816 DF,  p-value: < 2.2e-16
iy_ah$resid <- resid(iy_ah_model)
iy_ah %>%
  mutate(F1_n_c = F1_n - mean(F1_n))%>%
    ggplot()+
      geom_density(aes(x = F1_n_c, color = "centered data"))+
      geom_density(aes(x = resid, color = "residuals"))+
      theme_minimal()+
      ggtitle(paste0("r-squared = ", round(summary(iy_ah_model)$r.squared, digits = 2)))+
      scale_color_manual(NULL, values = c("black", "red"))

R-squared is probably best used when comparing two models. For an example, see the bar plots on page 12 of this paper.

Interpreting other people’s models

More or less, if you see someone report only the p-value for their effects without the coefficients, especially if they’re working with a lot of data, reach for your wallet. If you are reviewing a paper where this happens, make it a required revision that either they need to include the coefficients, or drop that quantitative component.

Pitfalls in Modelling

Type I and Type II Error

These are poorly named, and I think few people remember which is which correctly the first time. These are errors related to the use of p-values. First it’s important to define what the Null Hypothesis is.

  • The Null Hypothesis is the Dull Hypothesis: The Null Hypothesis is that there is no relationship between the outcome and predictors. It is not what you think the most likely relationship is.

With that in mind, we can define Type I and Type II errors.

  • Type I - You reject the null hypothesis (because p < 0.05), but in reality there isn’t a relationship. That is, you say there is a reliable effect when there isn’t one.
  • Type II - You fail to reject the null hypotehsis when there is an real effect.

Or to put it another way, when you see a statistical result that is significant (p < 0.05), but you don’t think it could possibly be real, you think a Type I error has occurred.

And when you get p > 0.05, but you really think there’s something there so you put “trending towards significance” in your paper, you think a Type II error has occurred.

The best way to ameliorate Type I and Type II errors is to either collect more data, or to move to Bayesian Statistics, which we won’t be able to cover in this course, but I recommend this book.

Type M and Type S errors

These are error types discussed by Gelman and colleagues (and are better named).

  • Type M Errors are gross over-estimation of the magnitude of an effect size.
  • Type S Errors get the sign wrong on an estimate. That is, they predict an effect in the opposite direction of what the true effect is.

Type M might not sound like a very big problem, and Type S might sound a bit outlandish but these can both happen, especially when we are exploring for small effects and filtering results by their p-values. Gelman has proposed that Type M errors are a likely culprit in the so-called “decline effect”, which is a phenomenon some have found that large and significant effects gradually get smaller after their original reports.

Here are some ways to guard youself against Type M and Type S errors.

  • Try to reason how big the effect is likely to be.
  • Try to collect of how other factors effect the outcome you’re investigating, and how big those effects are.
  • Commit to the direction of the effect.

Type M and Type S errors are only going to grow in severity as the size of data sets grow.

Overfitting

Collinearity

“Collinearity” is when two potential predictors are highly correlated with each other. When two correlated predictors are included in the same regression, the results can be wonky at best. Gorman (2010) talks about this at length with respect to socioeconomic measures in the Language Change and Variation study. For example, speakers’ Occupation was highly correlated with the value of their Residence. You shouldn’t include these both, untreated, into your models.

There are at least three different strategies for dealing with this issue

  1. If you must model these factors independently, residualize (Gorman, 2010)
  2. Or, just pick the one that is theoretically motivated (Fruehwald, 2016).
  3. Or give up on treating your measures as having independent reality, realize they are reflections of more abstract dimensions, and try to figure out what those are (i.e. factor analysis, Walker (2016)).

Residualization

Like the name suggests, this involves the residuals from a linear model. If you really want to fit a model like this:

negative_concord ~ occupation + residence

Then what you need to do first is fit this model.

residence ~ occupation

And replace the residence term in the first model with the residuals of the second model. Now, it does matter which order you residualize in. How do you choose? Again, there’s no hard and fast answer, but please don’t try it every possible way and then just choose the model that looks best!

Factor Analysis

Another approach would be to abandon trying to look at all of these predictors independently and to treat them all as rough measures correlated to a more abstract variable, using something like a factor analysis. This is what Walker (2016) did.

The best way to explain a factor analysis is to imagine surveying a bunch of people on the street and asking them the following questions:

  1. Did you have the heat on this morning?
  2. Are you wearing a wooly jumper?
  3. Are you wearing a scarf?
  4. Could you see your breath when you stepped outside?
  5. Did your glasses fog up when you walked inside?
  6. Did you make yourself a big breakfast?
  7. Was your stomach growling when you woke up?

This looks like you’ve asked people 7 different questions, but really, you’ve only effectively asked 2 questions:

  1. Was it cold?
  2. Were you hungry?

If you took the raw data from the 7 question survey and ran it through a factor analysis, it would (ideally) tell you that there are only 2 effective questions in it, based on how responses to questions 1 thru 5 are correlated with each other, and 6&7 are correlated with each other.


Collinearity often comes up as a pitfall of of estimating effects. But it’s also an issue when trying to understand our models. A common, annoying, and usually valid question is:

Couldn’t your effect of A actually be an effect of B?

The best way to deal with this kind of collinearity is to try to kill your effects. Does the same effect hold in the same direction for important subsets of the data? Can you subset the data according to B and fit separate models for A?

Additional Structure

Building up a model

When approaching how to build up a model (which predictors should be included or excluded), I would caution against “kitchen sink” models, or trying to trying to outsource your reasoning to an automated step-wise regression.

I’ll summarise Gelman & Hill’s (2006) recommendations here:

  1. “Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome.”
    • Joe addendum: I think substantive reasons include “I have a good theory” and “Previous work has included these”.
  2. “It is not always necessary to include these inputs as separate predictors-—for example, sometimes several inputs can be averaged or stunmed to create a ‘total score’ that can be used as a single predictor in the model.”
  3. “For inputs that have large effects, consider including their interactions as well.”

Their recommendation for excluding a variable from a model is more nuance. They say:

  1. “If a predictor is not statistically significant and has the expected sign, it is generally fine to keep it in. It may not help predictions dramatically hut is also probably not hurting them.”
  2. “If a predictor is not statistically significant and does not have the expected sign (for example, incumbency having a negative effect on vote share), consider removing it from the model (that is, setting its coefilcient to zero).”
  3. “If a predictor is statistically significant and does not have the expected sign, then think hard if it makes sense.”
  4. “If a predictor is statistically significant and has the expected sign, then by all means keep it in the model.”

Modelling Menu

Curves

Increasingly, people are getting dissatisfied by assuming the relationship between their variables is linear. The growing standard for fitting curvy models are GAMs (generalized additive models), but I won’t try to re-create the excellent tutorial by Jacolien van Rij here.


  1. Actually, the rescaled predictor was “age”, but in that data it had more or less a 1-to-1 relationsip to year of birth.

---
title: "More on interpreting & fitting linear models"
output: 
  html_notebook: 
    code_folding: none
    css: ../custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
---



# The Agenda

1. Follow up from the last meeting.
2. More about interpreting the model output
3. Homework review
3. Pitfalls to look out for, and how to address them
4. A modelling menu


<div class = "box break">
<span class = "big-label">Setup</span>

```{r}
library(tidyverse)
library(ggthemes)
library(broom)
library(lsa2017)

ah <- iy_ah %>% filter(plt_vclass == "ah")
iy <- iy_ah %>% filter(plt_vclass == "iy")
ay0 <- ay %>% filter(plt_vclass == "ay0")
```


And, in preparation of Thursday's meeting, go ahead and install `lme4`, so we can catch any problems well in advance.

```{r}
install.packages("lme4")
```

</div>

# Followup

The question came up last time about why it really matters to center and scale our predictors. Here is an example of a data set I've looked at, which involves the raising of /ay/ before voiceless consonants. A very strong predictor is the speaker's year of birth.


```{r}
ay0_means <- ay0 %>%
              mutate(dob = year-age)%>%
              group_by(idstring,dob, word)%>%
              summarise(F1_n = mean(F1_n))%>%
              summarise(F1_n = mean(F1_n))

ggplot(ay0_means, aes(dob, F1_n))+
    geom_point()+
    scale_y_reverse()+
    theme_minimal()
```


But if I just fit a linear model with year of birth as a predictor, I get an entirely uninterpretable intercept value that is actually *impossible* given the data.

```{r}
ay0_model <- lm(F1_n ~ dob, data = ay0_means)
summary(ay0_model)
```


```{r}
ggplot(ay0_means, aes(dob, F1_n))+
    geom_point()+
    geom_vline(xintercept = 0, linetype = 2)+
    geom_hline(yintercept = coef(ay0_model)[1], linetype = 2)+
    geom_abline(intercept = coef(ay0_model)[1],
                slope = coef(ay0_model)[2])+
    theme_minimal()
```

The slope is also too specific, from a theoretical perspective. We wouldn't realistically expect to see reliable differences between people born in 1959 and 1960. In my own work, I've divided year of birth by 10, resulting in a coefficient readable as a change per decade. Labov (2001) divides year of birth^[Actually, the rescaled predictor was "age", but in that data it had more or less a 1-to-1 relationsip to year of birth.] by 25, to try to get a slope in terms of a "generation". Either way works.

```{r}
ay0_means <- ay0_means %>%
                mutate(dob_c = (dob-1960)/10)
ay0_model2 <- lm(F1_n ~ dob_c, data = ay0_means)
summary(ay0_model2)
```

Another important factor in choosing where you center your predictors is if there's a strong interaction with some other predictor. For example, this plot compares pre-voiced to pre-voiceless [ay]. It is possible model this entire graph in one model that has an interaction between voicing and date of birth. But where you choose to center date of birth will have a big effect on the estimated parameters! Compare the models with DOB centered on 1890 vs 1950.


```{r}
ay_means <- ay %>%
              mutate(dob = year-age)%>%
              group_by(idstring, dob, plt_vclass, word)%>%
              summarise(F1_n = mean(F1_n)) %>%
              summarise(F1_n = mean(F1_n))

ay_means %>%
  ggplot(aes(dob, F1_n, color = plt_vclass))+
    geom_point()+
    stat_smooth(method = 'lm')+
    geom_vline(xintercept = 1890, linetype = 2)+
    geom_vline(xintercept = 1950, linetype = 2)+  
    annotate(geom  = "label",
             x = 1890,
             y = -0.25,
             label = "a")+
    annotate(geom  = "label",
             x = 1950,
             y = -0.25,
             label = "b")+  
    scale_y_reverse()+
    scale_color_colorblind()+
    theme_minimal()
```

```{r}
ay_means <- ay_means %>%
              mutate(dob_center_a = (dob - 1890) / 10,
                     dob_center_b = (dob - 1950) / 10)

ay_model_a <- lm(F1_n ~ dob_center_a * plt_vclass, data = ay_means)
ay_model_b <- lm(F1_n ~ dob_center_b * plt_vclass, data = ay_means)
```

<div style = "float:left; width:100%">
<div style = "float:left; width:49%">

```{r}
tidy(ay_model_a) %>% select(term, estimate, p.value)
```
</div>
<div style = "float:left; width:49%">
```{r}
tidy(ay_model_b) %>% select(term, estimate, p.value)
```
</div>
</div>

In the model with the 1890 as the intercept, the coefficient for `ay0` is small, and not significant! In the model with dob centered at 1950, the coefficient for `ay0` is large and very significant. This is simply because they are describing different locations on the plot.

## So what is the "right" thing to do?

At this point is usualy when the "So what's the right thing to do?" question comes up. And the annoying answer is that there isn't a clear cut answer or decision process. You should guide your decisions to be sure that

1. You have a sensible intercept.
2. Your slopes are defined across a reasonable range for the data.

For many data types, the Gelman & Hill (2006) recommendation (subtract the mean, divide by 2$\times$ the standard deviation) will do the trick. For others, there may be some people making a top-down effort to define a scaling (e.g. the [Zipf Scale for word frequencies](http://crr.ugent.be/archives/1352)). For the remaining cases, if someone has done something before and it doesn't seem obviously wrong, try that for the sake of comparibility. Otherwise, it involves a lot of development of a model-gefühl.


# More model interpretation

Let's come back to modelling the effect of duration on the F1 of `ah`. 
Here, we'll fit the model using the centered log2(duration) like a recommended on the homework.


```{r fig.width = 8/2, fig.height = 5/2, out.width = "50%"}
ah_tomod <- ah %>%
               mutate(log2dur_c = log2(dur)-median(log2(dur)))
ah_tomod %>%
  ggplot(aes(log2dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    xlab("centered log2(duration)")+
    theme_minimal()
```

```{r}
ah_model <- lm(F1_n ~ log2dur_c, data = ah_tomod)
summary(ah_model)
```

The intercept is the estimated F1_n at the median duration, and the slope `log2dur_c` is the estimated increase in normalized F1 for every doubling of duration.
Let's walk through these results more slowly.

## Residuals

<div class = "half-img">
![](figures/resid.png)
</div>

This is a so-called "five number summary" of the residuals (the difference between the fitted line and the actual data points).
Given the assumptions that the model makes (e.g. the data can be described by a line with some residual error), you'll want to see these values being fairly symmetrical, with the median just about 0. If this isn't the case, then you should probably make some more plots of the data to figure out what's going on.

## Standard Error and t-value

<div class = "half-img">

![](figures/stderr.png)

</div>

These are fairly straightforward measures of the uncertainty about the coefficient estimates that you can reason about safely without having taken a philosophy course. The bigger the standard error, the less certainty there is about the the parameters. The t-value is just the coefficient divided by the standard error, so the bigger it is, the more certainty there is. *Roughly* speaking, if the t-value is greater than 2, the effect is probably reliable.


## $r^2$

<div class = "half-img">

![](figures/rsquared.png)

</div>

This is a measure of how much modelling your outcome data as a straight line with your chosen predictors has reduced the variability in the data. This is a helpful enough heuristic, but can't be taken as an intrinsically informative value. We can roughly visualize this by plotting density distributions of our data (centered on the mean) and the model residuals.

```{r}
ah_tomod %>%
  ggplot(aes(log2dur_c, F1_n))+
    geom_point(alpha = 0.05)+
    geom_hline(yintercept = mean(ah_tomod$F1_n))+
    geom_abline(intercept = coef(ah_model)[1],
                slope = coef(ah_model)[2],
                color = "red")+
    theme_minimal()
```




```{r}
ah_tomod$resid <- resid(ah_model)
ah_tomod %>%
  mutate(F1_n_c = F1_n - mean(F1_n))%>%
  ggplot()+
    geom_density(aes(x = F1_n_c, color = "centered data"))+
    geom_density(aes(x = resid, color = "residuals"))+
    theme_minimal()+
    ggtitle(paste0("r-squared = ", round(summary(ah_model)$r.squared, digits = 2)))+
    scale_color_manual(NULL, values = c("black", "red"))
```

The distribution of the residuals in red is just slightly narrower than the data, which is why the r-squared is fairly modest.

Sometimes you can get a really large r-squared by including a big and obvious predictor, that doesn't really contribute to our understanding. For example, for the whole `iy_ah` data set, the largest number one predictor for the model is the vowel class, since `iy` and `ah` are contrastively different along F1. 

```{r}
iy_ah_model <- lm(F1_n ~ plt_vclass, data = iy_ah)
summary(iy_ah_model)
```

```{r}
iy_ah$resid <- resid(iy_ah_model)
iy_ah %>%
  mutate(F1_n_c = F1_n - mean(F1_n))%>%
    ggplot()+
      geom_density(aes(x = F1_n_c, color = "centered data"))+
      geom_density(aes(x = resid, color = "residuals"))+
      theme_minimal()+
      ggtitle(paste0("r-squared = ", round(summary(iy_ah_model)$r.squared, digits = 2)))+
      scale_color_manual(NULL, values = c("black", "red"))
```


R-squared is probably best used when *comparing* two models. For an example, see the bar plots on [page 12 of this paper](http://www.stat.columbia.edu/~gelman/research/unpublished/cohort_voting_20140605.pdf).

## Interpreting other people's models

More or less, if you see someone report *only* the p-value for their effects without the coefficients, especially if they're working with a lot of data, reach for your wallet. If you are reviewing a paper where this happens, make it a required revision that either they need to include the coefficients, or drop that quantitative component.



# Pitfalls in Modelling


## Type I and Type II Error

These are poorly named, and I think few people remember which is which correctly the first time. These are errors related to the use of p-values. First it's important to define what the **Null Hypothesis** is.

- **The Null Hypothesis is the Dull Hypothesis:** The Null Hypothesis is that there is no relationship between the outcome and predictors. It is *not* what you think the most likely relationship is.

With that in mind, we can define Type I and Type II errors.

- **Type I** - You reject the null hypothesis (because p < 0.05), but in reality there *isn't* a relationship. That is, you say there is a reliable effect when there isn't one.
- **Type II** - You fail to reject the null hypotehsis when there *is* an real effect. 

Or to put it another way, when you see a statistical result that is significant (p < 0.05), but you don't think it could possibly be real, you think a Type I error has occurred.

And when you get p > 0.05, but you really think there's something there so you put "trending towards significance" in your paper, you think a Type II error has occurred.

The best way to ameliorate Type I and Type II errors is to either collect more data, or to move to Bayesian Statistics, which we won't be able to cover in this course, [but I recommend this book](https://sites.google.com/site/doingbayesiandataanalysis/).



 
 
## Type M and Type S errors

These are error types [discussed by Gelman and colleagues](http://www.stat.columbia.edu/~gelman/research/published/PPS551642_REV2.pdf) (and are better named).

 - **Type M Errors** are gross over-estimation of the *magnitude* of an effect size.
 - **Type S Errors** get the *sign* wrong on an estimate. That is, they predict an effect in the opposite direction of what the true effect is.
 
Type M might not sound like a very big problem, and Type S might sound a bit outlandish but these can both happen, especially when we are [exploring for small effects and filtering results by their p-values](http://rpubs.com/JoFrhwld/291734). Gelman has proposed that [Type M errors are a likely culprit in the so-called "decline effect"](http://andrewgelman.com/2010/12/13/the_truth_wears/), which is a phenomenon some have found that large and significant effects gradually get smaller after their original reports.

Here are some ways to guard youself against Type M and Type S errors.

- Try to reason how big the effect is *likely* to be.
- Try to collect of how other factors effect the outcome you're investigating, and how big those effects are.
- Commit to the *direction* of the effect.

Type M and Type S errors are only going to grow in severity as [the size of data sets grow](http://jofrhwld.github.io/papers/plc39_2015/#/).


## Overfitting



```{r}
ay0_means %>%
  ggplot(aes(dob, F1_n))+
    #geom_point()+
    stat_smooth(method = 'lm', 
                se = F,
                aes(color = "line"))+
    stat_smooth(method = 'loess', 
                se = F,
                aes(color = "curve"))+
    stat_smooth(method = 'loess', 
                se = F,
                span = 0.1, 
                aes(color = "bumpy curve"))+
    scale_color_brewer("model", palette = "Dark2")+
    scale_y_reverse()+
    theme_minimal()
```




## Collinearity

"Collinearity" is when two potential predictors are highly correlated with each other. When two correlated predictors are included in the same regression, the results can be *wonky* at best. [Gorman (2010)](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1146&context=pwpl) talks about this at length with respect to socioeconomic measures in the Language Change and Variation study. For example, speakers' Occupation was highly correlated with the value of their Residence. You *shouldn't* include these both, untreated, into your models. 


There are at least three different strategies for dealing with this issue

1. If you *must* model these factors independently, residualize (Gorman, 2010)
2. *Or*, just pick the one that is theoretically motivated (Fruehwald, 2016).
3. *Or* give up on treating your measures as having independent reality, realize they are reflections of more abstract dimensions, and try to figure out what those are (i.e. factor analysis, Walker (2016)).


### Residualization

Like the name suggests, this involves the residuals from a linear model. If you really want to fit a model like this:

```
negative_concord ~ occupation + residence
```

Then what you need to do first is fit this model.

```
residence ~ occupation
```

And replace the `residence` term in the first model with the *residuals* of the second model. Now, it does matter which order you residualize in. How do you choose? Again, there's no hard and fast answer, but please don't try it every possible way and then just choose the model that looks best!

### Factor Analysis

Another approach would be to abandon trying to look at all of these predictors independently and to treat them all as rough measures correlated to a more abstract variable, using something like a factor analysis. This is what Walker (2016) did.

The best way to explain a factor analysis is to imagine surveying a bunch of people on the street and asking them the following questions:

1. Did you have the heat on this morning?
2. Are you wearing a wooly jumper?
3. Are you wearing a scarf?
4. Could you see your breath when you stepped outside?
5. Did your glasses fog up when you walked inside?
6. Did you make yourself a big breakfast?
7. Was your stomach growling when you woke up?


This looks like you've asked people 7 different questions, but really, you've only effectively asked 2 questions: 

1. Was it cold?
2. Were you hungry?

If you took the raw data from the 7 question survey and ran it through a factor analysis, it would (ideally) tell you that there are only 2 effective questions in it, based on how responses to questions 1 thru 5 are correlated with each other, and 6&7 are correlated with each other.


---


Collinearity often comes up as a pitfall of of estimating effects. But it's also an issue when trying to understand our models. A common, annoying, and usually valid question is:

> Couldn't your effect of A *actually* be an effect of B?

The best way to deal with this kind of collinearity is to try to kill your effects. Does the same effect hold in the same direction for important subsets of the data? Can you subset the data according to B and fit separate models for A?


## Additional Structure


```{r echo = F}
real_data <- data_frame(x = rnorm(200),
                        y = (-0.5 * x) + 8 + rnorm(200))
ggplot(real_data, aes(x, y))+
    geom_point()+
    stat_smooth(method = 'lm')
```


```{r echo = F} 
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y, 
            group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}


conf_df <- confounded_data_frame(real_data$x, real_data$y, m = 1, num_grp = 3)


ggplot(conf_df, aes(x, y))+
    geom_point(aes(color = group))+
    stat_smooth(method = lm, color = 'black', linetype = 2)+
    stat_smooth(method = lm, aes(color = group))
```


# Building up a model

When approaching how to build up a model (which predictors should be included or excluded), I would caution against "kitchen sink" models, or trying to trying to outsource your reasoning to an automated step-wise regression.

I'll summarise Gelman & Hill's (2006) recommendations here:

1. "Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome."
    - Joe addendum: I think substantive reasons include "I have a good theory" and "Previous work has included these".
2. "It is not always necessary to include these inputs as separate predictors-—for example, sometimes several inputs can be averaged or stunmed to create a 'total score' that can be used as a single predictor in the model."
3. "For inputs that have large effects, consider including their interactions as well."

Their recommendation for excluding a variable from a model is more nuance. They say:

a. "If a predictor is not statistically significant and has the expected sign, it is generally fine to keep it in. It may not help predictions dramatically hut is also probably not hurting them."
b. "If a predictor is not statistically significant and does not have the expected sign (for example, incumbency having a negative effect on vote share), consider removing it from the model (that is, setting its coefilcient to zero)."
c. "If a predictor is statistically significant and does not have the expected sign, then think hard if it makes sense."
d. "If a predictor is statistically significant and has the expected sign, then by all means keep it in the model."





# Modelling Menu


## Different "links"

You may be very likely to have data you want to model which is not like the formant data here. 
You may have categorical data of how often something did or didn't happen.
Or you might only have data on how often something *did* happen.
There are ways to fit statistical models with these kinds of data that for the most part involve the same kind of decisions about predictors and interpretations of the parameters.

### Binary Data

Sociolinguists have been working with binary data for a long time. Examples in clude:

- **TD Deletion**: mist vs mis'
- **ING**: walking vs walkin'
- **Filled Pauses**: um vs uh

These are usually coded as `0` or `1`, and it is inappropriate to fit a linear regression to these values. Instead, fit a **logistic regression**

### Rate Data

Sometimes, it's not possible to describe a linguistic variable in terms of how many times it did and didn't happen. For example Rickford & Price (2013) wanted to explore changes across the lifespan for certain AAVE features, including habitual-be.

- "I be waking up... and I be going." ((1) from Rickford & Price (2013))

A difficulty is that it's probably impossible to tell when someone *could* use habitual-be, but didn't. What they did instead was report how often their speakers used habitual-be *per hour of interview*. 

If your data is best described like this, not as a proportion of occurance, but as a rate, [like X per 1000 words](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1926&context=pwpl), then you can fit a **poisson regression**.

## Curves

Increasingly, people are getting dissatisfied by assuming the relationship between their variables is linear. The growing standard for fitting curvy models are GAMs (generalized additive models), but I won't try to re-create [the excellent tutorial by Jacolien van Rij](http://jacolienvanrij.com/Tutorials/GAMM.html) here.





