Doing sociophonetics with LAP data

Josef Fruehwald

Introduction

  • Quick intro to the LAP data

  • Overview of the contemporary sociophonetics workflow

  • The unique issues posed by the LAP data

  • Initial approach to addressing the issues

LAP Data

Linguistic Atlas of the North-Central States

LANCS data

Kentucky Data

Using Automatic Speech Recognition on the LAP

A Typical Sociophonetics workflow

Time intensiveness

This first portion of the diagram is the most time-intensive part of the process between the end of fieldwork and the start of analysis.

In the best-case scenario, transcription takes 10 hours for every 1 hour of audio.

Time intensiveness

LANCS audio: ~177 hours
Total transcription time: 1,770 to 2,700 hours

Time to Transcription

(1 RA @ 15 hr/wk)

2.5 to 3.5 years

Cost of Transcription

(@ $15/hr)

$26,500 to $40,500
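The time and cost estimates above are simple multiplication. The sketch below reproduces them so the assumptions can be varied. The function name `transcription_estimate` is hypothetical, and the 52 working weeks per year is an assumption of this sketch, not necessarily of the original estimate. The 2,700-hour upper bound corresponds to a ratio of roughly 15:1 (2,700 / 177 ≈ 15.3).

```python
def transcription_estimate(audio_hours, ratio, hours_per_week=15.0,
                           weeks_per_year=52.0, wage=15.0):
    """Return (transcription hours, years for one RA, cost in dollars).

    ratio: hours of transcription per hour of audio (10 is the best case).
    weeks_per_year: assumption of this sketch (52 working weeks, no breaks).
    """
    hours = audio_hours * ratio          # total transcription hours needed
    weeks = hours / hours_per_week       # weeks of work for a single RA
    years = weeks / weeks_per_year
    cost = hours * wage                  # dollars at the hourly wage
    return hours, years, cost


# ~177 hours of LANCS audio at the best-case 10:1 ratio
hours, years, cost = transcription_estimate(177, 10)
print(f"{hours:.0f} h, {years:.1f} years, ${cost:,.0f}")  # 1770 h, 2.3 years, $26,550
```

Lowering `weeks_per_year` (for example, to allow for semester breaks) stretches the year figures accordingly.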

My Original Plan for LAP data

Replace this

My Original Plan for LAP data

With this

wav2vec 2.0 fine-tuning

Initial experiments fine-tuning a pretrained wav2vec 2.0 model on 3.5 hours of PNC data resulted in:

  • evaluation word error rate (WER) = 0.34

  • evaluation character error rate (CER) = 0.189
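Word error rate and character error rate are both the edit distance (substitutions + deletions + insertions) between a reference transcript and the ASR output, divided by the length of the reference in words or characters. A minimal sketch of that standard definition follows. It is not necessarily how the evaluation figures above were computed: toolkits differ in text normalization (case, punctuation, whether spaces count as characters). The function names are hypothetical, and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn sequence `ref` into sequence `hyp` (Levenshtein)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution (or match)
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate on the raw strings (spaces included)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("you mean monday tuesday", "you mean mundy tuesday"))  # 0.25
```

Here one of four words is substituted, so the WER is 0.25, while the CER is only 2/23 because "monday" and "mundy" differ by two characters. This is why CER is usually lower than WER for the same output.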

However

Audio

All ASR systems are trained on labelled audio. When the properties of the training audio differ greatly from those of the audio a system is applied to, it may perform poorly. This includes

Audio

An example of training audio

which circumstances do not permit him to employ

(source: LibriSpeech (Panayotov et al. 2015))

An example of LANCS audio

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday

Using Automatic Speech Recognition on the LAP

Pre-processing LAP Audio

Consistent issues

To the extent that there are consistent issues across the LANCS data, we can develop pre-processing workflows to address them.

Issue 1: 60 Hz (and harmonics) mains hum

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
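A common way to remove mains hum is a cascade of narrow notch filters centred on 60 Hz and its harmonics. The sketch below is a pure-Python illustration using the biquad notch from Robert Bristow-Johnson's Audio EQ Cookbook; the function names, Q = 30, and the choice of five harmonics are illustrative assumptions, not a description of the actual LANCS pre-processing. In practice `scipy.signal.iirnotch` combined with `scipy.signal.filtfilt` does the same job much faster.

```python
import math


def notch_coefficients(f0, fs, q):
    """Biquad notch at f0 Hz (Audio EQ Cookbook form), normalised so a0 == 1."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b = (1 / a0, -2 * cos_w0 / a0, 1 / a0)        # zeros exactly at f0
    a = (-2 * cos_w0 / a0, (1 - alpha) / a0)      # poles just inside them
    return b, a


def biquad(samples, b, a):
    """Run one biquad section over the samples (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def remove_mains_hum(samples, fs, mains=60.0, n_harmonics=5, q=30.0):
    """Cascade notches at mains, 2*mains, ..., n_harmonics*mains Hz."""
    for k in range(1, n_harmonics + 1):
        f0 = k * mains
        if f0 >= fs / 2:          # stop at the Nyquist frequency
            break
        b, a = notch_coefficients(f0, fs, q)
        samples = biquad(samples, b, a)
    return samples
```

A call such as `remove_mains_hum(samples, fs=16000)` filters a list of float samples. A high Q keeps each notch narrow, so little of the speech energy that shares those frequencies is removed, at the cost of a longer ring-in at the start of the signal.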

Issue 2: Microphone Hits

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide
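Microphone hits show up as brief bursts far louder than the surrounding speech. One simple way to flag them for review or exclusion is to compare each short frame's peak amplitude with the median frame peak. The sketch below is a hypothetical heuristic: the function name, 10 ms frames (160 samples at an assumed 16 kHz sampling rate), and the factor of 5 are illustrative choices, and loud laughter or emphatic speech could also be flagged. A non-empty signal is assumed.

```python
from statistics import median


def find_loud_bursts(samples, frame_len=160, factor=5.0):
    """Return (start, end) sample spans whose frame peak amplitude exceeds
    `factor` times the median frame peak; adjacent flagged frames are merged."""
    peaks = [max(abs(s) for s in samples[i:i + frame_len])
             for i in range(0, len(samples), frame_len)]
    threshold = factor * median(peaks)
    spans = []
    for k, peak in enumerate(peaks):
        if peak > threshold:
            start = k * frame_len
            end = min(start + frame_len, len(samples))
            if spans and spans[-1][1] == start:   # extend the previous span
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans
```

Using the median rather than the mean keeps the threshold from being pulled upward by the very bursts it is meant to find.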

Issue 3: Low signal-to-noise ratio
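Signal-to-noise ratio is usually expressed in decibels, SNR = 10 log10(P_signal / P_noise), where P is mean power. A minimal sketch, assuming stationary additive noise (such as hum or tape hiss) whose power can be measured in a stretch with no speech and subtracted from a stretch with speech; the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Mean of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech_plus_noise, noise_only):
    """SNR in dB, assuming the noise is stationary and additive, so the
    noise power measured in a pause can be subtracted from a speech stretch."""
    p_noise = mean_power(noise_only)
    # Clamp to a tiny positive value so log10 stays defined when the
    # "speech" stretch is no louder than the noise.
    p_speech = max(mean_power(speech_plus_noise) - p_noise, 1e-12)
    return 10 * math.log10(p_speech / p_noise)
```

Each 10 dB step is a tenfold change in power ratio, so a recording at 0 dB has speech and noise at equal power.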