Tidy Data, Split-Apply-Combine, and Vowel Data: Part 2

June 17, 2015

This is part two of my blog post on how vowel formant data can be thought about in terms of Tidy Data, and how vowel normalization can be thought about in terms of Split-Apply-Combine procedures. Let’s start out by getting our data in order.

library(downloader)
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
  ## temp file for downloading
  tmp <- tempfile()
  pb_tar <- download("https://www.cs.cmu.edu/Groups/AI/areas/speech/database/pb/pb.tgz", 
                     tmp, quiet = TRUE)
  
  ## decompress the data
  untar(tmp, "PetersonBarney/verified_pb.data")
  
  ## read the data
  pb_data <- read.delim("PetersonBarney/verified_pb.data", header = FALSE)
  
  ## cleanup
  unlink(tmp)
  success <- file.remove("PetersonBarney/verified_pb.data")
  unlink("PetersonBarney", recursive = TRUE)

  pb_data %>%
    set_colnames(c("sex", "speaker", "phonN", "phon", 
                   "F0", "F1", "F2", "F3")) %>%
    mutate(sex = factor(sex, labels = c("C","F","M"),levels = 3:1),
           phon = gsub("\\*", "", phon)) %>%
    select(sex, speaker, phon, F0, F1, F2, F3) %>%
    mutate(id = 1:n())%>%
    gather(formant, hz, F0:F3)%>%
    tbl_df()-> pb_long

We’re going to be treating vowel formant data as having the following variables:

  • The speaker whose vowel was measured.
  • The id of the vowel that was measured.
  • The phonemic identity of the vowel that was measured.
  • The formant that was measured.
  • The value of the measurement in Hz.

This means that each individual vowel token will have 4 rows in the pb_long data frame, one for each formant.

  pb_long %>%
    filter(id == 1)%>%
    kable()
| sex | speaker | phon | id | formant |   hz |
|-----|---------|------|----|---------|------|
| M   | 1       | IY   | 1  | F0      |  160 |
| M   | 1       | IY   | 1  | F1      |  240 |
| M   | 1       | IY   | 1  | F2      | 2280 |
| M   | 1       | IY   | 1  | F3      | 2850 |

Normalization

The need for vowel formant normalization can be neatly summarized in the following plot, which has a boxplot for each “sex” (in Peterson & Barney, that’s C: Child, F: Female, M: Male) for each formant. You can see how:

  • For each formant the overall distribution of values goes C > F > M,
  • The difference between each group appears to get more extreme the higher the formant.
  pb_long %>%
    ggplot(aes(formant, hz, fill = sex))+
      geom_boxplot()


One thing that’s not immediately evident from the figure above is that the variance of formant values also differs between the sexes, again following the pattern C > F > M.

  pb_long %>%
    group_by(speaker, sex, formant) %>%
    summarise(hz_sd = sd(hz))%>%
    ggplot(aes(formant, hz_sd, fill = sex))+
      geom_boxplot()


The usual way of representing these sex effects emphasizes the effect they have on the vowel triangle, which is especially evident in the high front region.

  pb_long %>%
    spread(formant, hz) %>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point()+
      scale_x_reverse()+
      scale_y_reverse()+
      coord_fixed()


These differences in vowel spectra, and the reasons for them, are interesting, but when you’re doing empirical research, not all interesting phenomena are the object of study. Usually, sociolinguists want to factor out gross sex differences involving the location and scale of the formant values. This usually involves transforming formant values by subtraction and division with reference to some other set of values. How you characterize that process is the point of this post.

A literature developing a typology for normalization in sociolinguistics has sprung up, beginning with Adank (2003). Procedures are usually characterized by what subsets of the data are used to estimate normalization parameters. For example:

Speaker:

  • Speaker Intrinsic: Parameters estimated within each individual speaker’s data.
  • Speaker Extrinsic: Parameters estimated across multiple speakers’ data.

Formant:

  • Formant Intrinsic: Each formant is normalized separately.
  • Formant Extrinsic: Multiple formants are normalized simultaneously.

Vowel (token):

  • Vowel Intrinsic: Each vowel token is normalized individually.
  • Vowel Extrinsic: Multiple vowel tokens are normalized simultaneously.

I’ll be looking at 6 normalization procedures in this post, and they break down like so (speaker extrinsic in bold).

|                   | Vowel Token Intrinsic | Vowel Token Extrinsic              |
|-------------------|-----------------------|------------------------------------|
| Formant Intrinsic |                       | Z-score, Nearey1, Watt & Fabricius |
| Formant Extrinsic | Bark Difference       | Nearey2, **ANAE**                  |

To a large degree, this “intrinsic” vs “extrinsic” distinction can be captured by which variables are included in a dplyr::group_by() operation. I’ll try to be pedantic about that fact in the R code below, even if the particular grouping is going to be vacuous with respect to R’s vectorized arithmetic.
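As a toy illustration of that point (with made-up Hz values, not drawn from the Peterson & Barney data), switching between a speaker intrinsic and a speaker extrinsic version of a simple centering operation is just a matter of what you group by. Base R’s ave() stands in for the group_by() + mutate() pattern here:

```r
# Made-up F1/F2 values for two hypothetical speakers
hz      <- c(240, 2280, 300, 2790)
speaker <- c("A", "A", "B", "B")

# Speaker intrinsic: center each value at its own speaker's mean
intrinsic <- hz - ave(hz, speaker, FUN = mean)

# Speaker extrinsic: center every value at the mean pooled across speakers
extrinsic <- hz - mean(hz)

intrinsic
extrinsic
```

The grouped version estimates a separate parameter inside each speaker’s data; the pooled version shares a single parameter across all of them.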

Another thing you might notice looking at the code is that these normalization procedures can be characterized by a few other database operations. For example, some require tidyr::spread() while others don’t, meaning they have different models of what constitutes an observation. Some also require what I’ll call a resetting of the “scope” of analysis. That is, they take the raw data, run through a chain of operations, then need to return to the raw data, merging the results of the first chain of operations back in.

The point of this post isn’t really to compare the effectiveness of these procedures relative to what kinds of grouping and database operations they require, or their models of an observation. Comparisons of normalization procedures have been done to death and are well understood. This is more about exploring the structure of data and database operations in a comfortable domain for sociolinguists.

Bark Difference Metric

First off is the Bark Difference Metric, which is the only vowel intrinsic method I’ll be looking at. It normalizes F1 with respect to F0 for height, and F2 with respect to F3 for backness. It also converts the Hz values to Bark first, which tamps down a bit on the broader spread of the higher formants. A difference of ±1 on the Bark scale is also supposed to be psychoacoustically meaningful.

First, I’ll define a function that takes Hz values and converts them to bark.

  bark <- function(hz){
    (26.81/(1+(1960/hz)))-0.53
  }
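As a quick sanity check, here’s the function (Traunmüller’s Hz-to-Bark conversion) applied to a few values. Note how equal Hz steps shrink on the Bark scale as frequency goes up, which is the “tamping down” mentioned above:

```r
bark <- function(hz){
  (26.81/(1+(1960/hz)))-0.53
}

round(bark(c(500, 1000, 2000)), 2)  # 4.92  8.53 13.01

# the same 500 Hz step is worth fewer Bark higher up the scale
bark(1000) - bark(500)   # ~3.61 Bark
bark(2000) - bark(1500)  # ~1.92 Bark
```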

Here’s the chain of operations that does the Bark difference normalization.

  • Convert hz to bark.
  • Drop the hz column.
  • Create columns for each formant, filled with their values in bark.
  • group_by() vowel token id (vacuous).
  • Normalize backness as F3 - F2.
  • Normalize height as F1 - F0.
  pb_long %>%
    mutate(bark = bark(hz))%>%
    select(-hz)%>%
    spread(formant, bark)%>%
    group_by(id)%>%
    mutate(backness = F3-F2,
           height = F1-F0)->pb_bark

Here’s how that looks.

  pb_bark %>%
    ggplot(aes(backness, height, color = sex))+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      coord_fixed()


Some things to notice about the resulting normalized space:

  • Higher values along the backness dimension correspond to backer vowels, and higher values along the height dimension correspond to lower vowels (hence the reversed y-axis).
  • The relative relationship between F1 and F2 has been eliminated. They both run on scales from about 0 to about 9.
  • The larger variance in backness relative to height is preserved, but is mitigated relative to the raw Hz.

Bark Difference Summary

Model of an observation

  • F0, F1, F2, and F3 for each vowel token for each speaker.

Scope resets

  • None.

ANAE

The Atlas of North American English is a formant extrinsic, speaker extrinsic normalization technique. It involves calculating the grand average log(hz) for all speakers for F1 and F2 combined. Here’s a function to do the actual normalization calculation.

  anae_fun <- function(hz, grand_mean){
    hz * (exp(grand_mean - mean(log(hz))))
  }
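A quick way to see what this function does (with toy values, not the real data): a speaker whose mean log(hz) already equals the grand mean is left untouched, and everyone else is rescaled multiplicatively toward it:

```r
anae_fun <- function(hz, grand_mean){
  hz * (exp(grand_mean - mean(log(hz))))
}

hz <- c(500, 1500)   # made-up F1/F2 values for one speaker
g  <- mean(log(hz))

anae_fun(hz, g)              # unchanged: 500 1500
anae_fun(hz, g + log(1.1))   # everything scaled up by 10%: 550 1650
```

Because the adjustment is a single multiplicative factor per speaker, ratios between a speaker’s formant values are preserved, which is why the relative ordering and spread of F1 and F2 survive this normalization.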

While pb_long already has the data in the correct format to do this easily, the original ANAE approach only looked at F1 and F2, so we need to filter the data down to just the rows corresponding to F1 and F2 measurements.

This dplyr chain does that filtering, groups the data by speaker, estimates the mean log(hz) for each speaker, then the grand mean across all speakers.

  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%
    group_by(speaker)%>%
    summarise(log_mean = mean(log(hz)))%>%
    summarise(log_mean = mean(log_mean))%>%
    use_series("log_mean")-> grand_mean

We now need to return to the whole data set, and use this grand mean to normalize the data within each speaker. We need to group by the speaker, so that we can estimate individual speakers’ mean log(hz).

  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%    
    group_by(speaker)%>%
    mutate(norm_hz = anae_fun(hz, grand_mean))->pb_anae_long

This method preserves both the relative ordering of F1 and F2, as well as the greater spread in F2 relative to F1.

  pb_anae_long %>%
    ggplot(aes(formant, norm_hz, fill = sex))+
      geom_boxplot()


Here’s how it looks in the vowel triangle.

  pb_anae_long %>%
    select(-hz)%>%
    spread(formant, norm_hz)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point(alpha = 0.75)+
      scale_x_reverse()+
      scale_y_reverse()+
      coord_fixed()


ANAE Summary

Model of an observation

  • Hz for each formant for each vowel token for each speaker

Scope resets

  • One.

Nearey

Nearey normalization is essentially identical to the ANAE normalization, except it’s speaker intrinsic. That’s not always clear given the way the ANAE and Nearey formulas are usually presented, so I’ve defined the Nearey formula below in a way that emphasizes this fact. nearey_fun() is identical to anae_fun(), except that it takes a single argument, with grand_mean replaced by 0.

  nearey_fun <- function(hz){
    hz * (exp(0 - mean(log(hz))))
  }
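To verify the claim that Nearey normalization is just the ANAE formula with the grand mean pinned at 0 (toy Hz values again):

```r
anae_fun   <- function(hz, grand_mean) hz * (exp(grand_mean - mean(log(hz))))
nearey_fun <- function(hz) hz * (exp(0 - mean(log(hz))))

hz <- c(240, 2280, 2850)
all.equal(nearey_fun(hz), anae_fun(hz, 0))   # TRUE

# a side effect of centering log(hz) at 0: the normalized
# values have a geometric mean of exactly 1
exp(mean(log(nearey_fun(hz))))               # 1
```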

There are two instantiations of Nearey. Nearey1 is formant intrinsic, and Nearey2 is formant extrinsic.

Nearey is easy peasy.

  • Group the data appropriately.
  • Apply the normalization function.
  pb_long %>%
    group_by(speaker, formant)%>%
    mutate(nearey = nearey_fun(hz)) -> pb_nearey1_long
  
  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%
    group_by(speaker)%>%
    mutate(nearey = nearey_fun(hz)) -> pb_nearey2_long  

The formant intrinsic approach eliminates the relative ordering of the formants, and mitigates the difference in variance, at least between F1 and F2. The formant extrinsic approach looks roughly similar to the ANAE plot.

  pb_nearey1_long %>%
    ggplot(aes(formant, nearey, fill = sex)) + 
      geom_boxplot()+
      ggtitle("Nearey1: Formant intrinsic")


  pb_nearey2_long %>%
    ggplot(aes(formant, nearey, fill = sex)) + 
      geom_boxplot()+
      ggtitle("Nearey2: Formant extrinsic")


Here’s how it looks in the vowel space. For Nearey1, I’ve included horizontal and vertical lines at 1, since these represent the “center” of the vowel space as far as the normalization is concerned.

  pb_nearey1_long %>%
    select(-hz)%>%
    spread(formant, nearey)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_hline(yintercept = 1)+
      geom_vline(xintercept = 1)+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()+
      ggtitle("Nearey1: Formant intrinsic")
  pb_nearey2_long %>%
    select(-hz)%>%
    spread(formant, nearey)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()+
      ggtitle("Nearey2: Formant extrinsic")


Nearey Summary

Model of an observation

  • Hz for each formant for each vowel token for each speaker

Scope resets

  • None

Z-Score (a.k.a. Lobanov)

Z-scores are one of the most widespread forms of normalization across any area of research. For example, Gelman, Jakulin, Pittau & Su recommend that for logistic regressions, all continuous variables be converted to a modified z-score: \(\frac{x-mean(x)}{2sd(x)}\). The ubiquity of the z-score across all domains of research, and the relative obscurity of what the “Lobanov” normalization is, is why I’ll always describe this procedure as “z-score normalization (also known as Lobanov normalization)”.

The z-score function itself is straightforward. R actually has a native z-scoring function, scale(), but I’ll write out a function called zscore() just for explicitness.

  zscore <- function(hz){
    (hz-mean(hz))/sd(hz)
  }
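By construction, the z-scored values within each group come out with a mean of 0 and a standard deviation of 1, whatever the original scale was; the Gelman et al. variant mentioned above just halves that scale. A quick check with made-up Hz values:

```r
zscore <- function(hz){
  (hz-mean(hz))/sd(hz)
}

z <- zscore(c(160, 240, 2280, 2850))
c(mean = mean(z), sd = sd(z))  # mean 0 (up to rounding), sd 1

# the "divide by 2 sd" variant simply halves the spread
sd(z / 2)                      # 0.5
```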

To apply it to the data, it requires just one grouped application.

  pb_long %>%
    group_by(speaker, formant) %>%
    mutate(zscore = zscore(hz)) ->  pb_z_long

Z-scores eliminate the relative ordering of the formants, and mitigate the different variances between them.

  pb_z_long %>%
    ggplot(aes(formant, zscore, fill = sex))+
      geom_boxplot()


Maybe I’m just biased because z-scores are what I always use, but this to me looks like what a vowel space should look like.

  pb_z_long %>%
    select(-hz)%>%
    spread(formant, zscore)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_hline(yintercept = 0)+
      geom_vline(xintercept = 0)+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()

Watt & Fabricius

The Watt & Fabricius method (and ensuing modifications) focuses on normalizing vowels in terms of aligning vowel triangles in F1xF2 space. As such, its model of an observation is implicitly F1 and F2 for each vowel token.

The basic Watt & Fabricius function involves dividing each speaker’s F1 and F2 by “centroid” values, corresponding to the center of the vowel triangle.

  wf_fun <- function(hz, centroid){
    hz/centroid
  }

The method involves two steps. First, estimating the centroid values for F1 and F2 for each speaker. Second, after merging these centroid values back onto the original data, dividing F1 and F2 by them.

# Part 1: Centroid Estimation
  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%
    group_by(speaker, formant, phon)%>%
    summarise(hz = mean(hz))%>%
    spread(formant, hz)%>%
    group_by(speaker)%>%
    summarise(beet1 = F1[phon=="IY"],
              beet2 = F2[phon=="IY"],
              school1 = beet1,
              school2 = beet1,
              bot1 = F1[phon=="AA"],
              bot2 = ((beet2 + school2)/2) + school2) %>%
    mutate(S1 = (beet1 + bot1 + school1)/3,
           S2 = (beet2 + bot2 + school2)/3)%>%
    select(speaker, S1, S2)->speaker_s
  
# Part 2: Merging on centroids and normalizing
  pb_long %>%
    spread(formant, hz)%>%
    left_join(speaker_s)%>%
    group_by(speaker)%>%
    mutate(F1_norm = wf_fun(F1, S1),
           F2_norm = wf_fun(F2, S2)) -> pb_wf
## Joining, by = "speaker"
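To make the centroid arithmetic concrete, here it is worked through for one hypothetical speaker, with made-up mean Hz values for /iy/ and /aa/ (these are the same formulas as in the chain above, just outside of the data frame):

```r
# Made-up mean formant values for one speaker
beet1 <- 270;  beet2 <- 2290   # /iy/ F1, F2
bot1  <- 730                   # /aa/ F1

# the "school" corner is defined entirely off of /iy/'s F1
school1 <- beet1
school2 <- beet1

# the "bot" F2 is derived rather than measured
bot2 <- ((beet2 + school2)/2) + school2

# the centroid is the mean of the three corners
S1 <- (beet1 + bot1 + school1)/3
S2 <- (beet2 + bot2 + school2)/3
c(S1 = S1, S2 = S2)
```

Each of this speaker’s F1 values then gets divided by S1, and each F2 value by S2.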

Here’s the resulting vowel space. I’ll say it does a helluva job normalizing /iy/.

  pb_wf %>%
    ggplot(aes(F2_norm, F1_norm, color = sex))+
      geom_hline(yintercept = 1)+
      geom_vline(xintercept = 1)+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()

Watt & Fabricius Summary

Observation Model

  • F1 and F2 for each vowel token for each speaker

Scope Resets

  • One

Summary

So, here’s a summary of these methods, classified by their observation models, number of scope resets, and their various trinsicities.

|                  | Observation Model             | Scope Resets | Speaker   | Formant   | Vowel     |
|------------------|-------------------------------|--------------|-----------|-----------|-----------|
| Bark Difference  | F0, F1, F2, F3 for each vowel | 0            | intrinsic | extrinsic | intrinsic |
| ANAE             | Hz for each formant           | 1            | extrinsic | extrinsic | extrinsic |
| Nearey1          | Hz for each formant           | 0            | intrinsic | intrinsic | extrinsic |
| Nearey2          | Hz for each formant           | 0            | intrinsic | extrinsic | extrinsic |
| Z-score          | Hz for each formant           | 0            | intrinsic | intrinsic | extrinsic |
| Watt & Fabricius | F1, F2 for each vowel         | 1            | intrinsic | intrinsic | extrinsic |

It’s kind of interesting how relatively orthogonal the observation model, the scope resetting, and the trinsicities are. Only two of the procedures I looked at involved scope resetting (ANAE, and Watt & Fabricius). I believe scope resets are always going to be a feature of speaker extrinsic methods, since you’ll need to estimate some kind of over-arching parameter for all speakers, then go back over each individual speaker’s data. But the fact that Watt & Fabricius also involves a scope reset goes to show that they’re not a property exclusive to speaker extrinsic normalization. It’s also probably the case that any given vowel intrinsic method will have an observation model similar to the Bark Difference metric’s, but again, Watt & Fabricius, a vowel extrinsic method, has a very similar model.

posted by Joe at 16:29 (more or less).
