FAVE-Workshop: Part 1

Outline of the morning

Outline

  1. The Benefits of Automation
  2. How FAVE (roughly) Works
    • FAVE-align
    • FAVE-extract
  3. How to use FAVE

Benefits of Automation

Fear, Uncertainty, Doubt

“It’ll make mistakes!”

People make mistakes.

[plot]

P.S., this is data from only 374 speakers out of a possible 442, because I couldn’t figure out how to read the other 68 into R.

F2 = 15543?

##    F1    F2 plt_vclass  Word
## 1 996 15543         ay whine

F2 < F1?

##    F1  F2 plt_vclass Word
## 1 822 761         oh  all

What’s this little hat?

[plot]

It’s all low vowels?

[plot]

Mistakes

  • Hand Measurements ≠ Error Free

“It’s a black box!”

People are black boxes

https://en.wikipedia.org/wiki/File:PhrenologyPix.jpg

FAVE

https://github.com/JoFrhwld/FAVE

“It’s like…”

too much

Well, so is Praat!

praat

Don’t stop looking at and listening to your data!

Positive Benefits

  • Consistency
  • Replicability

When humans format data by hand

Diane Altwasser, 28, Calgary, AB  TS 663


Darcy Janzen (m), 36, Calgary, AB  TS 658


John Kistler, 47, m,ColoradoSprings, CO TS 147

When humans curate data by hand

AB:Calgary:DAltwasser.txt:       text/plain; charset=us-ascii
...
AR:LittleRock:MKemp.pln:         text/x-c++; charset=iso-8859-1
...
AZ:Tucson:JBrunekant.pln:        text/plain; charset=iso-8859-1
...
IL:Chicago:JWojcik.pln:          text/x-c; charset=us-ascii
...
IL:Chicago:KReynen.pln:          text/x-c++; charset=us-ascii
text/plain; charset=iso-8859-1     66
text/plain; charset=us-ascii      368
text/x-c; charset=us-ascii          4
text/x-c++; charset=iso-8859-1      2
text/x-c++; charset=us-ascii        2

Some Undeniable Benefits

  • It’s a bit faster.
  • You get more, and richer data.

Dinkin (2009)

  • 57,464 formant measurements
  • 119 speakers
  • Average of 483 measurements per speaker

Philadelphia Neighborhood Corpus

  • 743,802 formant measurements
  • 397 speakers
  • Average of 1,874 measurements per speaker
  • Average of 0.94 vowels per second.

More Data

/ay/ followed by different /t,d/ contexts.

     faithful   phrase_flap   word_flap
D       1,190           384         245
T       4,024           524         285

More Data

[plot]

More Data

[plot]

Richer Data

transcription

Richer Data

alignment

Richer Data

[plot]

Reproducibility

[plot]

Researcher Heuristics

  • Choosing a measurement point.
  • Adjusting the LPC settings.
  • Choosing vowels to measure or ignore.

Researcher Effects

  • How explicitly defined their heuristics are.
  • How strictly they enforce them.
  • How experienced and skilled they are.
  • Their recent caffeine consumption, and the quality of the previous night’s sleep.

Automation

Knowable, Explicitly Defined, Exceptionless

  • Measurement point selection method.
  • LPC parameter setting method.
  • Decision process for measuring a vowel or not.

Automation

Eliminated

  • Researcher experience and skill.
  • …oops

Reproducibility

formantlog

FAVE-align

What is Forced Alignment?

“Forced alignment” simply time-aligns a transcription to some audio.

Input

Audio:

waveform

Transcription:

this is a test

Dictionary Lookup

dict_lookup

Dictionary Lookup

dict_lookup
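
To make the lookup step concrete, here is a minimal Python sketch, assuming a local plain-text copy of the CMU Pronouncing Dictionary. The file name and parsing details are illustrative, not FAVE’s actual code:

# Illustrative only: look up each word of the transcription in a CMU-style
# dictionary file with one "WORD  PH ON EME S" entry per line.
def load_cmu_dict(path):
    entries = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():   # skip comments and blanks
                continue
            word, phones = line.strip().split(None, 1)
            entries.setdefault(word, []).append(phones.split())
    return entries

cmu = load_cmu_dict("cmudict.txt")        # path is an assumption
for word in "this is a test".upper().split():
    print(word, cmu.get(word, "<not in dictionary>"))

The aligner then has a phone sequence for each word to align against the audio.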

Alignment!

alignment

How do humans align?

human_align

  • An audio-visual processing task.

How does FAVE-align?

A brief aside into digital signal processing.

A Waveform

[plot]

Digital Sampling

[plot]

Fourier Transform

fourier

Spectrum

[plot]

Spectrum

[plot]

Fourier Transform (Again!)

fourier

Cepstrum

[plot]
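
A minimal numpy sketch of the two Fourier steps above, run on a toy signal. The sampling rate, window length, and component frequencies are made up for illustration:

import numpy as np

fs = 16000                                  # sampling rate in Hz (illustrative)
t = np.arange(0, 0.025, 1 / fs)             # one 25 ms analysis window
signal = sum(np.sin(2 * np.pi * f * t) for f in (120, 240, 360))  # toy periodic signal

# First Fourier transform: waveform -> spectrum
spectrum = np.abs(np.fft.rfft(signal * np.hamming(len(signal))))
log_spectrum = np.log(spectrum + 1e-10)

# Second (inverse) Fourier transform of the log spectrum: spectrum -> cepstrum
cepstrum = np.fft.irfft(log_spectrum, n=len(signal))

Low quefrencies in the cepstrum reflect the broad spectral envelope; a peak further out reflects the signal’s fundamental period.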

FAVE-align uses the Cepstrum

  • FAVE uses a variant of cepstral coefficients (Perceptual Linear Prediction coefficients).
  • For each analysis window, it looks at these coefficients, the difference between this window and the previous one, and the difference between the differences (roughly sketched after this list).
  • FAVE’s decision-making process is not readily comparable to your audio-visual decision-making process.
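
A rough sketch of the “difference and difference-of-differences” idea. Real toolkits compute deltas by regression over several neighbouring frames; plain differencing is shown here just for the concept, and the numbers are random stand-ins for PLP coefficients:

import numpy as np

plp = np.random.rand(100, 12)            # stand-in: 100 windows x 12 coefficients

deltas = np.diff(plp, axis=0)            # change from one window to the next
delta_deltas = np.diff(deltas, axis=0)   # change of the change

# Stack coefficients, deltas, and delta-deltas for each (trimmed) window
features = np.hstack([plp[2:], deltas[1:], delta_deltas])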

FAVE-align’s Decision Making

Hidden Markov Models

defeat


The Result

alignment

P2FA

p2fa

Specs

  • 25.5 hours of training data.
  • Monophone model
  • 10 ms granularity

Specs

Accuracy, from Yuan & Liberman (2008)

p2fa_errors

Specs

MacKenzie & Turton compared FAVE to other aligners on British English.

        Median           Mean             Max
        Onset   Offset   Onset   Offset   Onset    Offset
FAVE    0.009   0.009    0.019   0.021    0.583    0.588
PLA     0.015   0.019    0.267   0.252    55.473   55.488
SPPAS   0.150   0.155    0.504   0.480    68.903   67.408

Variation in the dictionary

No

car      walking         both
K AA R   W AO K IH NG    B OW TH
K AA     W AO K IH N     B OW F

Requires special training of the forced aligner.

Maybe?

either     going to
AY DH ER   G OW IH NG T UW
IY DH ER   G AA N AH

Using Forced Alignment

Yuan, J., Liberman, M., “Automatic detection of ‘g-dropping’ in American English using forced alignment,” Proceedings of 2011 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 490-493.

Yuan, J., Liberman, M., “Investigating /l/ variation in English through forced alignment,” Proceedings of Interspeech 2009, pp. 2215-2218.

FAVE-extract

Formant Analysis

  • FAVE-align tells us where in the audio vowels are.
  • FAVE-extract automates the vowel formant analysis.

How do Humans do Formant Analysis?

Humans

s1_t1

s2_t1

s2_t1

Automation

s1_t1

s2_t1

Automating Formant Estimation

  • The bad errors are very very bad.
  • Some differences are small enough that even experts may disagree.


Step 1: A Measurement Point

  • Most Vowels: Measured at 1/3 of the duration.
    • Evanini found this to most closely approximate human annotators’ behavior.
  • Complex Vowels: For vowels with a more complex trajectory, something different is done to avoid measuring in the transition from nucleus to glide.

Step 1: A Measurement Point

Vowel   Measurement point
ay      F1 maximum
ey      F1 maximum
Tuw     Onset
ow      Halfway between onset and F1 maximum
aw      Halfway between onset and F1 maximum
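
As a schematic illustration (not FAVE’s actual code), the per-class logic above could be written like this, given a formant track as arrays of times and F1 values. The function and variable names are mine:

import numpy as np

def measurement_time(times, f1, vowel_class):
    # times and f1: per-frame arrays for one vowel token (illustrative)
    if vowel_class in ("ay", "ey"):
        return times[np.argmax(f1)]                      # F1 maximum
    if vowel_class == "Tuw":
        return times[0]                                  # onset
    if vowel_class in ("ow", "aw"):
        return (times[0] + times[np.argmax(f1)]) / 2     # halfway: onset to F1 max
    return times[0] + (times[-1] - times[0]) / 3         # default: 1/3 of duration

times = np.linspace(0.0, 0.2, 21)                        # a 200 ms vowel, toy values
f1 = 500 + 300 * np.sin(np.pi * np.linspace(0, 1, 21))
print(measurement_time(times, f1, "ay"))                 # time of the F1 maximum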

Step 1: A Measurement Point

  • In addition, padding equal to the window of the LPC analysis is added to the beginning and end of the vowel, to ensure a formant track through the vowel’s full duration.
    • For /ay/, 2x the window is added to the beginning.
  • To account for possible alignment errors, any portions at the beginning and end that fall below 10% of the vowel’s maximum intensity are excluded (see the sketch after this list).
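
The 10% intensity trimming amounts to something like this sketch, given per-frame times and intensities for one vowel (again, not FAVE’s actual code):

import numpy as np

def trim_low_intensity(times, intensity, floor=0.10):
    # Keep only the span between the first and last frame whose intensity
    # reaches at least 10% of the vowel's maximum intensity.
    keep = np.where(intensity >= floor * intensity.max())[0]
    return times[keep[0]:keep[-1] + 1]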

Other Measurement Points

Other supported measurement point selection methods.

Method   Description
third    Measure all vowels at 1/3 of duration
fourth   Measure all vowels at 1/4 of duration
mid      Measure all vowels at 1/2 of duration
maxint   Measure at maximum vowel intensity
lennig   Measure according to Lennig (1978)
anae     Measure according to the Atlas of North American English

Formant Settings

  • For each vowel token, the F1 and F2 estimates you could get from different LPC parameter settings constitute a candidate set.
  • Choose a winner based on its multivariate distance (over F1, F2, log(B1), log(B2)) from the Atlas of North American English’s distribution for that vowel class (sketched after this list).
  • Logic: if there is an LPC setting which produces a measurement close to the ANAE distribution for that vowel class, it’s probably OK.
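
A toy sketch of that selection step. All of the numbers and the diagonal covariance below are made up; FAVE uses the ANAE means and covariances for the vowel class:

import numpy as np

# Candidate measurements from different LPC settings for one token:
# each row is [F1, F2, log(B1), log(B2)].
candidates = np.array([
    [550.0, 1700.0, np.log(90.0),  np.log(160.0)],
    [560.0, 2400.0, np.log(300.0), np.log(400.0)],   # a likely formant-tracking error
    [540.0, 1680.0, np.log(110.0), np.log(150.0)],
])

mean = np.array([555.0, 1690.0, np.log(100.0), np.log(155.0)])   # reference distribution
inv_cov = np.linalg.inv(np.diag([50.0**2, 100.0**2, 0.2**2, 0.2**2]))

def mahalanobis(x):
    d = x - mean
    return float(np.sqrt(d @ inv_cov @ d))

winner = min(candidates, key=mahalanobis)   # keep the candidate closest to the reference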

Formant Settings

  • The ANAE distribution for a vowel class is the prior.
  • The candidate set of potential formant estimates is the likelihood.
  • The winner is the posterior.


  • As with Bayesian reasoning generally, the worry is that the prior might exert too strong an influence on the posterior.
  • Fortunately, the prior’s influence here doesn’t seem to be too strong.

FAVE - Step 1

formants

FAVE - Remeasurement

formants

FAVE - Step 2

formants

FAVE - Results

results

Using FAVE

Transcription

We recommend using ELAN

transcription

elan

Transcription

From Praat

Transcription

The input transcription to FAVE-align should be a tab-delimited file that looks like this:


ID  Name  start (seconds)  end (seconds)  transcription

Example:


JF  Josef Fruehwald 1.07    4.5 this is a test audio file for forced alignment
JF  Josef Fruehwald 5.286   9.678   the word irn bru is probably not in the dictionary
  • Only audio between the start and end times will be considered for alignment.
  • Only audio that is aligned will be analyzed by FAVE-extract.
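
If you build this file programmatically, a plain tab-delimited writer is all you need. A minimal sketch using the example rows above; “speaker.txt” just matches the file name used in the commands below:

import csv

rows = [
    ("JF", "Josef Fruehwald", 1.07, 4.5,
     "this is a test audio file for forced alignment"),
    ("JF", "Josef Fruehwald", 5.286, 9.678,
     "the word irn bru is probably not in the dictionary"),
]

with open("speaker.txt", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)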

FAVE-align Usage

Files

  • In the FAVE-align directory:
    • The audio file
    • The transcription

FAVE-align Usage

Out of Dictionary Words

  • If there are words in the transcription that aren’t in the CMU Dictionary, you will need to provide transcriptions for them.
  • You will also need to provide transcriptions for any partial words in the transcription.
python FAAValign.py -c unknown.txt speaker.txt

FAVE-align Usage

Out of Dictionary Words

python FAAValign.py -c unknown.txt speaker.txt

  • python — run with Python
  • FAAValign.py — the script
  • -c unknown.txt — check for unknown words and put them in unknown.txt
  • speaker.txt — the transcription

FAVE-align Usage

Provide Transcriptions


IRN  AY1 R N
BRU  B R UW1

FAVE-align Usage

Do the alignment

python FAAValign.py -i unknown.txt speaker.wav speaker.txt

  • python — run with Python
  • FAAValign.py — the script
  • -i unknown.txt — add the transcriptions in unknown.txt to the dictionary
  • speaker.wav — the audio
  • speaker.txt — the transcription

FAVE-extract Usage

transcript

FAVE-extract Usage

Settle on your options

My config.txt


--outputFormat
txt
--speechSoftware
Praat
--formantPredictionMethod
mahalanobis
--measurementPointMethod
faav
--nSmoothing
12
--remeasure
--vowelSystem
phila
--onlyMeasureStressed

FAVE-extract Usage

Run FAVE-extract


python bin/extractFormants.py +config.txt test.wav test.TextGrid test_meas

  • python — run with Python
  • bin/extractFormants.py — the script
  • +config.txt — add the config options in config.txt
  • test.wav — the audio
  • test.TextGrid — the TextGrid
  • test_meas — the output file name