Josef Fruehwald
Quick intro to the LAP data
Overview of the contemporary sociophonetics workflow
The unique issues posed by the LAP data
Initial approach to addressing the issues
This first portion of the diagram is the most time-intensive part of the process, after fieldwork is over and before analysis begins.
In the best-case scenario, transcription takes 10 hours for every 1 hour of audio.
| LANCS audio | ~177 hours |
|---|---|
| Total transcription time | 1,770 to 2,700 hours |
| Time to transcribe (1 RA @ 15 hr/wk) | 2.5 to 3.5 years |
| Cost of transcription (@ $15/hr) | $26,500 to $40,500 |
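The figures in the table follow from simple arithmetic. Below is a minimal sketch that uses the best-case ratio of 10 transcription hours per audio hour for the low end and takes the 2,700-hour upper figure directly from the table. The constant and function names are hypothetical.

```python
AUDIO_HOURS = 177          # approximate size of the LANCS audio collection
RA_HOURS_PER_WEEK = 15     # one research assistant working 15 hours a week
WAGE_PER_HOUR = 15         # dollars per hour of RA time


def estimate(transcription_hours):
    """Return (RA-weeks needed, total cost in dollars) for a transcription workload."""
    weeks = transcription_hours / RA_HOURS_PER_WEEK
    cost = transcription_hours * WAGE_PER_HOUR
    return weeks, cost


low = AUDIO_HOURS * 10     # best case: 10 hours of transcription per 1 hour of audio
high = 2700                # upper figure taken from the table

for hours in (low, high):
    weeks, cost = estimate(hours)
    print(f"{hours:,} h -> {weeks:.0f} RA-weeks, ${cost:,}")
```

This gives 118 to 180 RA-weeks and $26,550 to $40,500 (the table rounds the low cost to $26,500). At 52 weeks a year, that works out to about 2.3 to 3.5 years. This is consistent with the table's range if the figures are rounded to the nearest half year. The source does not say how many working weeks per year it assumes.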
[Figure: "Replace this" (manual transcription) → "With this" (automatic speech recognition)]
Initial experiments fine-tuning a pretrained wav2vec2 model on 3.5 hours of PNC data resulted in:
evaluation word error rate (WER) = 0.34
evaluation character error rate (CER) = 0.189
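Word error rate (WER) and character error rate (CER) are both edit distances (substitutions + deletions + insertions) divided by the length of the reference transcript. WER counts in words and CER counts in characters. The sketch below implements those standard definitions. It is not necessarily the tooling used in the experiments above, and the example ASR output is invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution (or match)
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


ref = "well it was just a i guess about three foot wide"
hyp = "well it was just i guess about three foot wide"   # hypothetical ASR output
print(round(wer(ref, hyp), 3))  # 0.091 (1 deleted word out of 11)
print(round(cer(ref, hyp), 3))  # 0.042 (2 deleted characters out of 48)
```

Scores for an evaluation set are typically computed by summing edits and reference lengths across all utterances, rather than by averaging per-utterance rates.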
All ASR systems are trained on labelled audio. When the properties of the training audio differ greatly from those of the audio the system is used on, it may not perform well. Such differences include:
Different kinds of speech (Tatman & Kasten 2017; Wassink, Gansen & Bartholomew 2022)
Different kinds of recordings
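Some recording differences can be detected from file metadata alone. The sketch below uses Python's standard `wave` module. It assumes WAV input and a target format of 16 kHz mono, which is the format of LibriSpeech and the input rate commonly expected by wav2vec2 models. `recording_mismatches` is a hypothetical helper.

```python
import wave

# Assumed target format: 16 kHz, single channel (LibriSpeech-style audio).
EXPECTED_RATE = 16_000
EXPECTED_CHANNELS = 1


def recording_mismatches(path):
    """List the ways a WAV file's format differs from the assumed training format."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate {rate} Hz (expected {EXPECTED_RATE} Hz)")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels (expected {EXPECTED_CHANNELS})")
    return problems
```

This checks only container-level properties. Differences in background noise, microphone quality, or speaking style do not show up in file metadata.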
An example of training audio (source: LibriSpeech; Panayotov et al. 2015):
which circumstances do not permit him to employ
An example of LANCS audio
well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday
Pre-processing LAP Audio
To the extent there are consistent issues across LANCS data, we can develop pre-processing workflows for them.
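The specific pre-processing steps for the LANCS data are not spelled out here. As an illustration only, the sketch below performs generic format normalization. It assumes NumPy and SciPy are available and that the model expects 16 kHz mono input. It downmixes the channels, resamples with `scipy.signal.resample_poly`, and peak-normalizes the result. `preprocess` and its parameters are hypothetical, and steps such as noise reduction or speaker separation are not covered.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # assumed model input rate (typical for wav2vec2 models)


def preprocess(samples, rate, target_rate=TARGET_RATE, peak=0.9):
    """Downmix to mono, resample, and peak-normalize a waveform.

    samples: array of shape (n,) or (n, channels)
    rate: original sample rate in Hz
    Returns (processed mono waveform, new sample rate).
    """
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=1)  # average all channels into a single channel
    if rate != target_rate:
        g = gcd(rate, target_rate)
        # Polyphase resampling by the rational factor target_rate / rate
        x = resample_poly(x, target_rate // g, rate // g)
    max_abs = np.max(np.abs(x)) if x.size else 0.0
    if max_abs > 0:
        x = x * (peak / max_abs)  # scale so the loudest sample equals `peak`
    return x, target_rate
```

Polyphase resampling by an integer ratio (here 160/441 for 44.1 kHz input) avoids the artifacts of naive sample dropping, and peak normalization puts quiet and loud recordings on a comparable scale before they reach the model.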
A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide