Big Data and

Josef Fruehwald

March 19, 2015

Big Data

Are Sociolinguists Using Big Data?

You cannot email this data to a colleague. You can’t even download it on your computer. This is data on an unprecedented impossibly mind boggling massive scale. - Kenneth Benoit (2015)

Not sociolinguistics yet.

The Philadelphia Neighborhood Corpus

speakers   duration       wav files   other
397        302.46 hours   51G         32G

Big for Sociolinguistics Data

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. - Wikipedia

“Traditional data processing applications”


Useful Data

It is data made useful to us for analysis - Hilary Mason

Big Enough N

Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more […]. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data. - Andrew Gelman

“Big Data”, whatever it is, is coming to sociolinguistics

Already Here:

  • FAVE-extract - automates formant analysis.
  • DARLA - Automatic Speech Recognition implementation (Reddy & Stanford)

Likely coming soon

  • Document classification, generally
  • Sentiment analysis, specifically
  • Topic modeling, etc.

What it means for us

  • We need to keep learning about and developing new computational tools.
  • We need to push our students to do the same.

Big Data-ism

My alternate title cooling

Another alternate title cooling

A Fear:

  • More people doing superficial and theoretically unmotivated work.

A Problem

  • It’ll be trivial to find effect sizes != 0.

Effect Size

The Facebook Contagion Experiment


Adam D. I. Kramer et al. PNAS 2014;111:8788-8790

Alternate Universes

The headlines about the same effect size, but with different Ns, might be:

N = 1,000

Facebook’s unethical experiment has no apparent effect on users’ emotions.

N = 689,003

Facebook is using mind control!
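
The contrast between the two headlines can be sketched numerically. The snippet below is a minimal illustration, not the analysis from the paper: it assumes a simple two-group comparison with equal group sizes and a normal approximation, and the standardized effect size `d = 0.02` is an illustrative value chosen for the sketch. The function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(d, total_n):
    """Two-sided p-value for a standardized mean difference d.

    Assumes two equal groups of total_n / 2 observations each and a
    normal approximation (both are simplifying assumptions).
    """
    n_per_group = total_n / 2
    # z statistic for a difference in means with unit variance per group
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.02  # illustrative effect size (an assumption for this sketch)
for total_n in (1_000, 689_003):
    print(f"N = {total_n:>7,}: p = {two_sided_p(d, total_n):.3g}")
```

The effect size is identical in both runs; only N changes. With the small sample the difference is nowhere near conventional significance, and with the large sample it is overwhelmingly "significant", which is why significance alone says little about whether the effect matters.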

Let’s Get Serious About Effect Size

  • Is the effect size of some predictor big enough to explain the phenomenon under discussion?
  • Is it about the size we expected it to be?

Expectations about how big an effect ought to be can only be provided by an articulated theory.

Let’s Get Serious about



Yang (2013)

Remainder of the talk

  • There are two different models of undershoot that predict pre-voiceless /ay/ raising. I estimate the rate of change they predict, and compare that to the observed rates of change.
  • Using ideas from information theory, I estimate the predicted indexical association between gender and filled pause choice.
  • Both analyses are based on data from the Philadelphia Neighborhood Corpus.

/ay/ Raising

plot of chunk plot_ays1

  • Philadelphia used to have no pre-voiceless /ay/ raising.
  • It does now.

Undershoot Explanation

As a diphthong, /ay/ has a lot of ground to cover. Its nucleus raises before voiceless consonants because

  • /ay/ is shorter before voiceless consonants, giving you less time to get all the way from /a/ to /i/. Therefore, the nucleus raises to [ʌ], so you don’t have to make such a big gesture in such a short amount of time. (Joos 1942)
  • The offglide of /ay/ is forced to be peripheral before voiceless consonants, so the nucleus raises to [ʌ] via co-articulation with the glide. (Moreton & Thomas, 2007)

/ay/ Trajectories

plot of chunk trajectory_plot

Theory Prediction

The rate of change across phonetic contexts ought to be proportional to the phonetic pressure driving the change in that context.

  • Contexts: /ay/ preceding:
    • /t/, /d/ \(\rightarrow\) [t], [d]
    • /t/, /d/ \(\rightarrow\) [ɾ]

Pressure 1: Nucleus-to-glide distance

plot of chunk glide_plot2

Pressure 2: Duration

plot of chunk unnamed-chunk-4


We know

  • We know /ay/ raised before {/t/ \(\rightarrow\) [t]}
  • We know /ay/ did not raise before {/d/ \(\rightarrow\) [d]}


  • Estimate the strength of the glide/duration precursor at the beginning of the change in the contexts:
    • {/t/ \(\rightarrow\) [t]}, {/d/ \(\rightarrow\) [d]}
    • {/t/ \(\rightarrow\) [ɾ]}, {/d/ \(\rightarrow\) [ɾ]}
  • Rescale these effects so {/t/ \(\rightarrow\) [t]} = 1, {/d/ \(\rightarrow\) [d]} = 0
  • Resulting relative precursors in flapping contexts should be proportional to the rate of change in these contexts.
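
The rescaling step in the list above can be sketched as a simple linear transformation. This is a minimal sketch under stated assumptions: the function name `rescale`, the context labels, and all the numbers are hypothetical and are not estimates from the talk.

```python
def rescale(effects, zero_key, one_key):
    """Linearly rescale effects so effects[zero_key] -> 0 and effects[one_key] -> 1."""
    lo = effects[zero_key]
    hi = effects[one_key]
    if hi == lo:
        raise ValueError("reference contexts must differ to define a scale")
    return {context: (value - lo) / (hi - lo) for context, value in effects.items()}

# Hypothetical precursor estimates in arbitrary units (NOT values from the talk).
effects = {
    "t->[t]": 0.80,
    "d->[d]": 0.20,
    "t->[flap]": 0.50,
    "d->[flap]": 0.35,
}

relative = rescale(effects, zero_key="d->[d]", one_key="t->[t]")
for context, value in relative.items():
    print(f"{context:10s} {value:.2f}")
```

After rescaling, the two flapping contexts sit somewhere on a 0 to 1 scale anchored by the contexts whose behavior is already known, which is what makes them comparable to relative rates of change.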

Relative Precursors:

plot of chunk unnamed-chunk-6

Based on 10,000 samples from the posterior of the model

precursor ~ TD * context * decade +
(TD * context | Speaker) + (1|Word)

Rate of Change

plot of chunk flap_graph


Estimating relative rates of change

  • Fit a linear model, estimating the rate of change for {/t/\(\rightarrow\)[t]} contexts (\(\beta\)).
  • Estimate multipliers between 0 and 1 for each remaining context (\(p_c\)).
  • Treat the rate of change of the other contexts as the {/t/\(\rightarrow\)[t]} slope times the multiplier (\(p_c\beta\)).
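
As a rough illustration of the idea in the steps above, the sketch below estimates a reference slope and then expresses another context's slope as a multiplier of it. It is a deliberate simplification: it uses two separate least-squares fits and a clipped ratio rather than estimating \(\beta\) and \(p_c\) jointly in one model, and the data are synthetic. The function names `ols_slope` and `relative_rate` are hypothetical.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def relative_rate(xs, ys, beta):
    """Multiplier p_c such that the context's slope is roughly p_c * beta, clipped to [0, 1]."""
    return min(1.0, max(0.0, ols_slope(xs, ys) / beta))

# Synthetic data (an assumption for illustration): time in decades, outcome in arbitrary units.
decades = [0, 1, 2, 3, 4, 5]
reference = [-0.10 * d for d in decades]      # stands in for the {/t/ -> [t]} context
flap = [0.20 - 0.04 * d for d in decades]     # stands in for one flapping context

beta = ols_slope(decades, reference)
p_c = relative_rate(decades, flap, beta)
print(f"beta = {beta:.3f}, p_c = {p_c:.2f}")
```

Clipping to the 0 to 1 range mirrors the constraint on the multipliers; a joint model would enforce that constraint during estimation and propagate uncertainty in \(\beta\) into \(p_c\), which this two-step version does not.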


plot of chunk comp_plot


Neither precursor model accounts for the behavior of both t-flaps and d-flaps.


Filled Pauses

plot of chunk um_plot

Is this a signal?

plot of chunk boxplot

Information Theory

Mutual Information

How much does the patterning of the message and signal together reduce uncertainty about either in isolation?
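
For concreteness, mutual information between a message and a signal can be computed from a table of joint counts. The sketch below uses the standard definition in bits; the function name `mutual_information` and the toy tables are assumptions for illustration and are not corpus data.

```python
import math

def mutual_information(joint_counts):
    """Mutual information in bits from a dict {(message, signal): count}."""
    total = sum(joint_counts.values())
    message_totals, signal_totals = {}, {}
    for (message, signal), count in joint_counts.items():
        message_totals[message] = message_totals.get(message, 0) + count
        signal_totals[signal] = signal_totals.get(signal, 0) + count

    mi = 0.0
    for (message, signal), count in joint_counts.items():
        if count == 0:
            continue  # empty cells contribute nothing
        # p(m, s) / (p(m) * p(s)) simplifies to count * total / (row * column)
        ratio = count * total / (message_totals[message] * signal_totals[signal])
        mi += (count / total) * math.log2(ratio)
    return mi

# Toy tables (hypothetical): a signal that is useless vs. one that is fully informative.
independent = {("a", "x"): 25, ("a", "y"): 25, ("b", "x"): 25, ("b", "y"): 25}
deterministic = {("a", "x"): 50, ("b", "y"): 50}

print(mutual_information(independent))
print(mutual_information(deterministic))
```

With two equally likely messages, the value ranges from 0 bits (the signal tells you nothing about the message) to 1 bit (the signal identifies the message exactly), which gives a scale for judging how strong an association is.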

Um: Mutual Information with Gender

plot of chunk unnamed-chunk-10

Comparison: Names

Comparison: Last Letter of Names


plot of chunk unnamed-chunk-14

Um: Results

  • A big difference in \(P(um|gender)\) doesn’t translate to a big \(P(gender|um)\).
  • The socio-indexical association between filled pause use and gender is very weak, despite the large difference in usage rates between men and women.
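
The first point above follows from Bayes' rule, and a small worked example makes it concrete. The usage rates below are hypothetical, chosen only to show the arithmetic; they are not the rates observed in the Philadelphia Neighborhood Corpus, and the function name `posterior` is an assumption.

```python
def posterior(prior, likelihood):
    """P(group | um) from P(group) and P(um | group), via Bayes' rule."""
    joint = {group: prior[group] * likelihood[group] for group in prior}
    evidence = sum(joint.values())
    return {group: joint[group] / evidence for group in joint}

# Hypothetical values: equal numbers of women and men, and women choosing
# "um" twice as often as men do.
prior = {"women": 0.5, "men": 0.5}
p_um_given_gender = {"women": 0.6, "men": 0.3}

post = posterior(prior, p_um_given_gender)
for group, p in post.items():
    print(f"P({group} | um) = {p:.2f}")
```

Even with one group using "um" at twice the rate of the other, hearing "um" only moves the listener from 50/50 to about two to one, so the form is a weak cue to gender on its own.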

Summing Up

Preparing for the future

“Big Data”, or “Big for Sociolinguistics Data”, is going to let us investigate phenomena we’ve always been interested in at a level of detail that wasn’t possible before. If we’re creative, we might also be able to investigate phenomena that we hadn’t thought were investigable.

The future is now

In the almost total absence of large-scale, questionnaire-supported observations which would have to be extended or repeated over generations of speakers in a community, such a picture can be only guesswork. - Hoenigswald (1960)

It could be observed only by means of an enormous mass of mechanical records, reaching through several generations of speakers. - Bloomfield (1933)

Preparing for the future

We need to stay on our theory-building game. Our theories need to make quantitative predictions about what we’ll observe in our big data. Without that, we risk devolving into a field of superficial and insightless observation.