Sociolinguistics

You cannot email this data to a colleague. You can’t even download it on your computer. This is data on an unprecedented, impossibly mind-boggling, massive scale. - Kenneth Benoit (2015)

Not sociolinguistics yet.

speakers | duration | wav files | other |
---|---|---|---|
397 | 302.46 hours | 51G | 32G |

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. - Wikipedia

It is data made useful to us for analysis - Hilary Mason

Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more […]. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data. - Andrew Gelman

- **FAVE-extract**: automates formant analysis.
- **DARLA**: Automatic Speech Recognition implementation (Reddy & Stanford)

- Document classification, generally
- Sentiment analysis, specifically
- Topic modeling, etc.

- We need to keep learning about and developing new computational tools.
- We need to push our students to do the same.

My alternate title

Another alternate title

**A Fear:**

- More people doing superficial and theoretically unmotivated work.

**A Problem:**

- It’ll be trivial to find effect sizes \(\neq 0\).

**The Facebook Contagion Experiment**

Adam D. I. Kramer et al. PNAS 2014;111:8788-8790

Headlines about the same effect size, but with different Ns, might be:

**Facebook’s unethical experiment has no apparent effect on users’ emotions.**

**Facebook is using mind control!**
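The split between those two headlines can be pure N. A minimal sketch (the effect size and sample sizes here are hypothetical, not Kramer et al.'s actual numbers) of how the same tiny standardized effect is "nothing" in a small study and overwhelmingly significant in a huge one:

```python
import math

def p_two_sided(d, n_per_group):
    """Normal-approximation two-sided p-value for a two-sample comparison
    with standardized effect size d and n_per_group observations per group."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.02  # the same (tiny, hypothetical) effect size in both "studies"
print(p_two_sided(d, 100))      # p ~ 0.89: "no apparent effect"
print(p_two_sided(d, 300_000))  # p < 1e-10: "mind control!"
```

Nothing about the effect changed between the two calls; only the sample size did.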

Effect Size

- Is the effect size of some predictor big enough to explain the phenomenon under discussion?
- Is it about the size we *expected* it to be?

Expectations about how big an effect *ought* to be can only be provided by an articulated theory.

Theory

Yang (2013)

- There are two different models of undershoot that predict pre-voiceless /ay/ raising. I estimate the rate of change they predict, and compare that to the observed rates of change.
- Using ideas from information theory, I estimate the predicted indexical association between gender and filled pause choice.
- Both analyses are based on data from the Philadelphia Neighborhood Corpus.

- Philadelphia used to have no pre-voiceless /ay/ raising.
- It does now.

As a diphthong, /ay/ has a lot of ground to cover. Its nucleus raises before voiceless consonants because

- /ay/ is shorter before voiceless consonants, giving you less time to get all the way from /a/ to /i/. Therefore, the nucleus raises to [ʌ], so you don’t have to make such a big gesture in such a short amount of time. (Joos 1942)
- The offglide of /ay/ is forced to be peripheral before voiceless, so the nucleus raises to [ʌ] via co-articulation with the glide. (Moreton & Thomas, 2007)

**The rate of change across phonetic contexts ought to be proportional to the phonetic pressure driving the change in that context.**

- Contexts: /ay/ preceding:
- /t/, /d/ \(\rightarrow\) [t], [d]
- /t/, /d/ \(\rightarrow\) [ɾ]

- We know /ay/ rose before {/t/ \(\rightarrow\) [t]}
- We know /ay/ did not rise before {/d/ \(\rightarrow\) [d]}

- Estimate the strength of the glide/duration precursor at the beginning of the change in the contexts:
- {/t/ \(\rightarrow\) [t]}, {/d/ \(\rightarrow\) [d]}
- {/t/ \(\rightarrow\) [ɾ]}, {/d/ \(\rightarrow\) [ɾ]}

- Rescale these effects so {/t/ \(\rightarrow\) [t]} = 1, {/d/ \(\rightarrow\) [d]} = 0
- Resulting *relative* precursors in the flapping contexts should be proportional to the rates of change in those contexts.
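The rescaling step itself is a simple min-max normalization. A sketch with invented precursor strengths (the real values come from the model posterior, not these numbers):

```python
# Hypothetical precursor strengths per context (arbitrary units); the
# actual estimates come from the posterior of the regression model.
raw = {
    "t->[t]":  0.9,
    "d->[d]":  0.2,
    "t->flap": 0.5,
    "d->flap": 0.4,
}

lo, hi = raw["d->[d]"], raw["t->[t]"]

# Rescale so that {t->[t]} = 1 and {d->[d]} = 0; the rescaled values in
# the flapping contexts are the predicted *relative* rates of change there.
relative = {c: (v - lo) / (hi - lo) for c, v in raw.items()}
print(relative)
```

Applied to each of the 10,000 posterior samples, this yields a posterior distribution over the relative precursors rather than a single point estimate.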

Based on 10,000 samples from the posterior of the model

`precursor ~ TD * context * decade + (TD * context | Speaker) + (1 | Word)`

- Fit a linear model, estimating the rate of change for {/t/\(\rightarrow\)[t]} contexts (\(\beta\)).
- Estimate multipliers between 0 and 1 for each remaining context. (\(p_c\))
- Treat the rate of change of the other contexts as the {/t/\(\rightarrow\)[t]} slope times the multiplier (\(p_c\beta\))
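With one observed slope per context, the [0, 1]-constrained estimate of each multiplier reduces to the ratio of that context's slope to \(\beta\), clipped into the unit interval. A sketch with invented slopes (not the paper's estimates):

```python
# Hypothetical observed raising rates per context (e.g. Hz per decade);
# negative = nucleus raising. Invented for illustration.
observed = {"t->[t]": -15.0, "d->[d]": 1.0, "t->flap": -7.0, "d->flap": -2.0}

# beta: the rate of change in the reference {t->[t]} context.
beta = observed["t->[t]"]

# p_c: per-context multiplier constrained to [0, 1]; each context's
# modeled rate of change is then p_c * beta.
p = {c: min(1.0, max(0.0, s / beta))
     for c, s in observed.items() if c != "t->[t]"}
print(p)
```

Note how the constraint handles {/d/\(\rightarrow\)[d]}: its slope points the "wrong" way, so its multiplier is clipped to 0, matching the assumption that no change happened there.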

Neither precursor model accounts for the behavior of *both* t-flaps and d-flaps.

How much does the patterning of the message and signal together reduce uncertainty about either in isolation?

- A big difference in \(P(um|gender)\) doesn’t translate to a big \(P(gender|um)\).
- The socio-indexical association between filled pause use and gender is very weak, despite the large difference in usage rates between men and women.
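That asymmetry is just Bayes' rule, and the weakness of the association can be made precise as mutual information. With invented rates (say men produce "um" at 1% of words and women at 2%, a 2x difference in \(P(um|gender)\)), the posterior \(P(gender|um)\) barely moves and the mutual information is a tiny fraction of a bit:

```python
import math

# Hypothetical usage rates, for illustration only: a 2x difference
# in P(um | gender).
p_um_given = {"men": 0.010, "women": 0.020}
p_gender = {"men": 0.5, "women": 0.5}

p_um = sum(p_um_given[g] * p_gender[g] for g in p_gender)

# Bayes' rule: a 2x rate difference only moves the posterior
# P(gender | um) from 0.5 to 2/3.
p_gender_given_um = {g: p_um_given[g] * p_gender[g] / p_um for g in p_gender}

# Mutual information (bits) between gender and whether one word is "um":
mi = sum(
    p_x_g * p_gender[g] * math.log2(p_x_g / p_x)
    for g in p_gender
    for p_x_g, p_x in [(p_um_given[g], p_um),
                       (1 - p_um_given[g], 1 - p_um)]
)
print(p_gender_given_um, mi)
```

Out of a possible 1 bit, gender and a single "um" share roughly a thousandth of a bit here: a large usage difference, a very weak socio-indexical signal.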

“Big Data” or “Big for Sociolinguistics Data” is going to allow us to investigate, in detail that wasn’t possible before, some phenomena we’ve always been interested in. If we’re creative, we might be able to investigate phenomena that we hadn’t thought were investigable at all.

In the almost total absence of large-scale, questionnaire-supported observations which would have to be extended or repeated over generations of speakers in a community, such a picture can be only guesswork. - Hoenigswald (1960)

It could be observed only by means of an enormous mass of mechanical records, reaching through several generations of speakers. - Bloomfield (1933)

We need to stay on our theory building game. Our theories need to make quantitative predictions about what we’ll observe in our big data. Without that, we risk devolving into a field of superficial and insightless observation.