# Are Sociolinguists Using Big Data?

You cannot email this data to a colleague. You can’t even download it on your computer. This is data on an unprecedented impossibly mind boggling massive scale. - Kenneth Benoit (2015)

Not sociolinguistics yet.

# The Philadelphia Neighborhood Corpus

| Speakers | Duration | WAV files | Other |
|---|---|---|---|
| 397 | 302.46 hours | 51G | 32G |

# Big for Sociolinguistics Data

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. - Wikipedia

# Useful Data

It is data made useful to us for analysis - Hilary Mason

# Big Enough N

Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more […]. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data. - Andrew Gelman

# “Big Data”, whatever it is, is coming to sociolinguistics

• FAVE-extract - automates formant analysis.
• DARLA - Automatic Speech Recognition implementation (Reddy & Stanford)

### Likely coming soon

• Document classification, generally
• Sentiment analysis, specifically
• Topic modeling, etc.

# What it means for us

• We need to keep learning about and developing new computational tools.
• We need to push our students to do the same.

# Big Data-ism

My alternate title

# Big Data-ism

Another alternate title

# Big Data-ism

A Fear:

• More people doing superficial and theoretically unmotivated work.

A Problem

• It’ll be trivial to find effect sizes != 0.

# Effect Size

The Facebook Contagion Experiment

Adam D. I. Kramer et al. PNAS 2014;111:8788-8790

# Alternate Universes

Headlines about the same effect size, but with different Ns, might be:

### N = 1,000

Facebook’s unethical experiment has no apparent effect on users’ emotions.

### N = 689,003

Facebook is using mind control!
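A minimal sketch of why the two headlines diverge: the same standardized effect size yields very different p-values at the two sample sizes. The effect size `D = 0.02` is a hypothetical value chosen for illustration, not a figure taken from the paper, and the calculation assumes two equal-sized groups and a normal approximation.

```python
import math

def two_sided_p(d, total_n):
    """Two-sided p-value for a standardized mean difference d,
    assuming two equal groups and a normal approximation."""
    n_per_group = total_n / 2
    # Standard error of a standardized difference is sqrt(1/n + 1/n).
    z = d * math.sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))

D = 0.02  # hypothetical effect size, for illustration only

p_small = two_sided_p(D, 1_000)
p_large = two_sided_p(D, 689_003)

print(f"N = 1,000:   p = {p_small:.3f}")
print(f"N = 689,003: p = {p_large:.2e}")
```

The effect is identical in both runs; only N changes whether it clears a significance threshold.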

# Let’s Get Serious About Effect Size

• Is the effect size of some predictor big enough to explain the phenomenon under discussion?
• Is it about the size we expected it to be?

Expectations about how big an effect ought to be can only be provided by an articulated theory.

Yang (2013)

# Remainder of the talk

• There are two different models of undershoot that predict pre-voiceless /ay/ raising. I estimate the rate of change they predict, and compare that to the observed rates of change.
• Using ideas from information theory, I estimate the predicted indexical association between gender and filled pause choice.
• Both analyses are based on data from the Philadelphia Neighborhood Corpus.

# /ay/ Raising

• Philadelphia used to have no pre-voiceless /ay/ raising.
• It does now.

# Undershoot Explanation

As a diphthong, /ay/ has a lot of ground to cover. Its nucleus raises before voiceless consonants because:

• /ay/ is shorter before voiceless consonants, giving you less time to get all the way from /a/ to /i/. Therefore, the nucleus raises to [ʌ], so you don’t have to make such a big gesture in such a short amount of time. (Joos 1942)
• The offglide of /ay/ is forced to be peripheral before voiceless, so the nucleus raises to [ʌ] via co-articulation with the glide. (Moreton & Thomas, 2007)

# Theory Prediction

The rate of change across phonetic contexts ought to be proportional to the phonetic pressure driving the change in that context.

• Contexts: /ay/ preceding:
  • /t/, /d/ $$\rightarrow$$ [t], [d]
  • /t/, /d/ $$\rightarrow$$ [ɾ]

# Predicting

### We know

• We know /ay/ raised before {/t/ $$\rightarrow$$ [t]}
• We know /ay/ did not raise before {/d/ $$\rightarrow$$ [d]}

### Procedure

• Estimate the strength of the glide/duration precursor at the beginning of the change in the contexts:
  • {/t/ $$\rightarrow$$ [t]}, {/d/ $$\rightarrow$$ [d]}
  • {/t/ $$\rightarrow$$ [ɾ]}, {/d/ $$\rightarrow$$ [ɾ]}
• Rescale these effects so {/t/ $$\rightarrow$$ [t]} = 1, {/d/ $$\rightarrow$$ [d]} = 0
• The resulting relative precursors in flapping contexts should be proportional to the rate of change in these contexts.
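The rescaling step can be sketched as a linear map that pins the two known contexts to 1 and 0. The context labels and the raw precursor values below are hypothetical placeholders, not estimates from the corpus.

```python
def rescale(precursors, top="t->[t]", bottom="d->[d]"):
    """Linearly rescale precursor strengths so that the `top` context
    maps to 1 and the `bottom` context maps to 0."""
    lo = precursors[bottom]
    hi = precursors[top]
    if hi == lo:
        raise ValueError("top and bottom contexts must differ")
    return {ctx: (value - lo) / (hi - lo) for ctx, value in precursors.items()}

# Hypothetical raw precursor estimates (arbitrary units).
raw = {
    "t->[t]": 0.80,
    "d->[d]": 0.20,
    "t->[flap]": 0.65,
    "d->[flap]": 0.35,
}

relative = rescale(raw)
for ctx, value in relative.items():
    print(f"{ctx:10s} {value:.2f}")
```

Under the undershoot accounts, the rescaled values for the two flapping contexts are the predicted rates of change relative to {/t/ $$\rightarrow$$ [t]}.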

# Relative Precursors:

Based on 10,000 samples from the posterior of the model

precursor ~ TD * context * decade +
(TD * context | Speaker) + (1|Word)

# Modelling

### Estimating relative rates of change

• Fit a linear model, estimating the rate of change for {/t/ $$\rightarrow$$ [t]} contexts ($$\beta$$).
• Estimate multipliers between 0 and 1 for each remaining context. ($$p_c$$)
• Treat the rate of change of the other contexts as the {/t/$$\rightarrow$$[t]} slope times the multiplier ($$p_c\beta$$)
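The parameterization above can be sketched as follows: one reference slope $$\beta$$, and a multiplier $$p_c$$ in [0, 1] for every other context. The function name, context labels, and numeric values are hypothetical; this only shows how the slopes are tied together, not how the model was actually fit.

```python
def predicted_slopes(beta, multipliers, reference="t->[t]"):
    """Return the rate of change per context: beta for the reference
    context, and p_c * beta for every other context."""
    for ctx, p in multipliers.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"multiplier for {ctx} must lie in [0, 1]")
    slopes = {reference: beta}
    slopes.update({ctx: p * beta for ctx, p in multipliers.items()})
    return slopes

# Hypothetical values: beta is the reference slope per decade.
beta = 0.10
multipliers = {"d->[d]": 0.0, "t->[flap]": 0.75, "d->[flap]": 0.25}

slopes = predicted_slopes(beta, multipliers)
for ctx, slope in slopes.items():
    print(f"{ctx:10s} {slope:.3f}")
```

The estimated multipliers can then be compared directly with the rescaled precursors, since both are on the same 0 to 1 scale.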

# Results

Neither precursor model accounts for the behavior of both t-flaps and d-flaps.

# Mutual Information

How much does the patterning of the message and signal together reduce uncertainty about either in isolation?
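As a minimal sketch, mutual information in bits can be computed from a joint distribution using the standard definition, the sum over all pairs of $$p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}$$. The two toy distributions below are illustrative only.

```python
import math

def mutual_information(joint):
    """Mutual information in bits. `joint` maps (x, y) pairs to
    probabilities or raw counts."""
    total = sum(joint.values())
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v / total
        py[y] = py.get(y, 0.0) + v / total
    mi = 0.0
    for (x, y), v in joint.items():
        pxy = v / total
        if pxy > 0:  # zero cells contribute nothing
            mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

# Signal and message independent: knowing one says nothing about the other.
independent = {("a", 0): 25, ("a", 1): 25, ("b", 0): 25, ("b", 1): 25}
# Signal perfectly predicts message: one full bit of shared information.
perfect = {("a", 0): 50, ("b", 1): 50}

mi_independent = mutual_information(independent)
mi_perfect = mutual_information(perfect)
print(mi_independent, mi_perfect)
```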

# Um: Results

• A big difference in $$P(um|gender)$$ doesn’t translate to a big $$P(gender|um)$$.
• The socio-indexical association between filled pause use and gender is very weak, despite the large difference in usage rates between men and women.
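The first point follows from Bayes' rule. In the sketch below, $$P(um|gender)$$ is read as the share of filled pauses realized as "um", and the rates of 0.6 and 0.4 and the equal gender prior are hypothetical numbers, not results from the corpus.

```python
def p_woman_given_um(p_um_given_w, p_um_given_m, p_w=0.5):
    """Bayes' rule: P(woman | um) from the two usage rates and a prior."""
    p_um = p_um_given_w * p_w + p_um_given_m * (1.0 - p_w)
    return p_um_given_w * p_w / p_um

# Hypothetical usage rates with a 20-point gap between groups.
posterior = p_woman_given_um(p_um_given_w=0.6, p_um_given_m=0.4)
print(f"P(woman | um) = {posterior:.2f}")
```

With these assumed numbers, hearing "um" only moves a listener's guess from the 0.5 prior to 0.6, even though the usage rates differ by 20 percentage points.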

# Preparing for the future

“Big Data,” or “Big for Sociolinguistics Data,” is going to allow us to investigate some phenomena we’ve always been interested in at a level of detail that wasn’t possible before. If we’re creative, we might be able to investigate phenomena that we hadn’t thought were investigable.

# The future is now

In the almost total absence of large-scale, questionnaire-supported observations which would have to be extended or repeated over generations of speakers in a community, such a picture can be only guesswork. - Hoenigswald (1960)

It could be observed only by means of an enormous mass of mechanical records, reaching through several generations of speakers. - Bloomfield (1933)

# Preparing for the future

We need to stay on our theory building game. Our theories need to make quantitative predictions about what we’ll observe in our big data. Without that, we risk devolving into a field of superficial and insightless observation.