Individual Differences and Fitting Many Models

-May 1, 2016

Today I went to Hadley Wickham’s talk to the EdinbR users’ group, called Managing Many Models in R. It highlighted some workflows for cleanly fitting many models to subsets of a data set, then comparing across the model fits. I thought I’d experiment by applying them to some linguistic data that I’m familiar with.

TD Deletion

Very briefly, “TD Deletion”, as I’m calling it, is a process common to every variety of English where a word-final /t/ or /d/ in a consonant cluster is variably deleted. Some example words are:

Word     Deleted
mist     mis’
kept     kep’
missed   miss’

It’s more common in words where the /t d/ is just part of the word, like mist, than in words where it’s part of the past tense marker, like missed. In the data, the mist case is labeled mono for “monomorpheme” and the missed case is labeled past.

Setup

The TD Deletion data I’m using is derived from The Buckeye Corpus, and is available in the grammarOfVariationData package.

  library(devtools)
  if(!require("grammarOfVariationData")){
    install_github("jofrhwld/grammarOfVariationData")  
    library(grammarOfVariationData)
  }

The functions in these workflows come from what some call the “Hadleyverse”, although Hadley isn’t the primary author of two of these packages (broom and magrittr). When I’m using functions from these packages, I’ll use the package::function() notation so it’s clear which functions come from which package, but in normal practice, you probably won’t need to include the package:: part yourself.

  library(dplyr)
  library(purrr)
  library(broom)
  library(tidyr)
  library(magrittr)
  library(ggplot2)

I’ll also be fitting some more standard mixed-effects models for comparison’s sake.

  library(lme4)

The Plan

I’ll be looking at two factors that seem to have an influence on the rate of TD Deletion:

  1. Speech Rate
  2. Word Frequency

Specifically, I’m going to see if there are any meaningful individual differences between how individual speakers are affected by these factors.

The Original Data

This is a subset of columns from the full Buckeye TD deletion data.

  buckeye %>%
    dplyr::select(Speaker, Word, seg, DictRate, Gram2, td)%>%
    head()
##   Speaker     Word seg DictRate Gram2 td
## 1     s01    lived   d 6.036144  past  1
## 2     s01      and   d 6.182479   and  0
## 3     s01   raised   d 6.431375  past  1
## 4     s01     west   t 4.227088  mono  0
## 5     s01     kind   d 6.612163  mono  0
## 6     s01 received   d 4.600222  past  1

DictRate corresponds to how many syllables per second there were in an 8 word window surrounding the target word, based on canonical pronunciations. td is the outcome measure, and has a value of 1 if the /t d/ was pronounced and a value of 0 if it wasn’t.
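
Just to orient ourselves with that coding (a quick sketch using the columns shown above), we can tabulate the overall /t d/ retention rate for the two grammatical classes we care about:

  buckeye %>%
    dplyr::filter(Gram2 %in% c("past", "mono")) %>%
    dplyr::group_by(Gram2) %>%
    dplyr::summarise(retention = mean(td))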

Rate of Speech

DictRate is heavily right-skewed, with some extremely (implausibly) high syllable-per-second rates. Things look more plausible if we exclude data where the target word was not right or left aligned with the window in which speech rate was estimated. So, we’ll use this subset for the speech rate modeling.

  buckeye %>%
    ggplot(aes(DictRate))+
      geom_density()
  buckeye %>% filter(PreWindow > 1 | PostWindow > 1 ) %>%
    ggplot(aes(DictRate))+
      geom_density()

Nesting the data

Step 1 is to create a data frame of data frames. First, let’s select the rows and columns of the full dataset we want to use.

  rate_to_use <- buckeye %>%
                  dplyr::filter(PreWindow > 1 | PostWindow > 1,
                                Gram2 %in% c("past", "mono")) %>%
                  dplyr::mutate(RateCenter = DictRate - median(DictRate)) %>%
                  dplyr::select(Speaker, Word, RateCenter, Gram2, td)

I want to fit 1 model for each speaker for each grammatical class (past tense or monomorpheme), so I need to group this data frame by Speaker and Gram2. Then, I’ll use nest() to create a new column that contains a data frame with the remaining columns.

  rate_nest <- rate_to_use %>%
                  dplyr::group_by(Speaker, Gram2) %>%
                  tidyr::nest()

Here’s what the data frame of data frames looks like:

  rate_nest
## Source: local data frame [76 x 3]
## 
##    Speaker  Gram2             data
##     (fctr) (fctr)            (chr)
## 1      s01   past  <tbl_df [58,3]>
## 2      s01   mono <tbl_df [171,3]>
## 3      s02   mono <tbl_df [218,3]>
## 4      s02   past  <tbl_df [46,3]>
## 5      s03   mono <tbl_df [219,3]>
## 6      s03   past  <tbl_df [56,3]>
## 7      s04   mono <tbl_df [240,3]>
## 8      s04   past  <tbl_df [41,3]>
## 9      s05   mono <tbl_df [184,3]>
## 10     s05   past  <tbl_df [38,3]>
## ..     ...    ...              ...

Here’s what the first item in the data column looks like.

  rate_nest$data[[1]]
## Source: local data frame [58 x 3]
## 
##         Word RateCenter    td
##       (fctr)      (dbl) (int)
## 1      lived  0.8881393     1
## 2     raised  1.2833697     1
## 3   received -0.5477829     1
## 4  sustained  0.4536305     1
## 5      moved -1.1036793     1
## 6   involved  0.3295448     1
## 7   involved  3.3815070     1
## 8     looked -3.4224800     1
## 9   informed -1.5658356     1
## 10   watched  0.7513170     1
## ..       ...        ...   ...

Fitting the models

The next step is to write a little function that takes one of our small data frames as input, and will produce a model as an output.

  fit_one_rate_mod <- function(data){
    mod <- glm(td ~ RateCenter , data = data, family = binomial)
    return(mod)
  }

Here’s what happens when we pass just one of our smaller data frames to fit_one_rate_mod.

  fit_one_rate_mod(rate_nest$data[[1]])
## 
## Call:  glm(formula = td ~ RateCenter, family = binomial, data = data)
## 
## Coefficients:
## (Intercept)   RateCenter  
##     1.35381      0.02941  
## 
## Degrees of Freedom: 57 Total (i.e. Null);  56 Residual
## Null Deviance:	    59.14 
## Residual Deviance: 59.12 	AIC: 63.12

But what we want to do is apply fit_one_rate_mod to all of the nested data frames. For this, we can use the purrr::map() function. It takes a list as its first argument, and a function as its second argument. It then applies that function to every item in the list.

  rate_nest <- rate_nest %>%
                 dplyr::mutate(mod = purrr::map(data, fit_one_rate_mod))

  rate_nest
## Source: local data frame [76 x 4]
## 
##    Speaker  Gram2             data          mod
##     (fctr) (fctr)            (chr)        (chr)
## 1      s01   past  <tbl_df [58,3]> <S3:glm, lm>
## 2      s01   mono <tbl_df [171,3]> <S3:glm, lm>
## 3      s02   mono <tbl_df [218,3]> <S3:glm, lm>
## 4      s02   past  <tbl_df [46,3]> <S3:glm, lm>
## 5      s03   mono <tbl_df [219,3]> <S3:glm, lm>
## 6      s03   past  <tbl_df [56,3]> <S3:glm, lm>
## 7      s04   mono <tbl_df [240,3]> <S3:glm, lm>
## 8      s04   past  <tbl_df [41,3]> <S3:glm, lm>
## 9      s05   mono <tbl_df [184,3]> <S3:glm, lm>
## 10     s05   past  <tbl_df [38,3]> <S3:glm, lm>
## ..     ...    ...              ...          ...
  rate_nest$mod[[1]]
## 
## Call:  glm(formula = td ~ RateCenter, family = binomial, data = data)
## 
## Coefficients:
## (Intercept)   RateCenter  
##     1.35381      0.02941  
## 
## Degrees of Freedom: 57 Total (i.e. Null);  56 Residual
## Null Deviance:	    59.14 
## Residual Deviance: 59.12 	AIC: 63.12

Tidy model summaries

In our nested data frame, we now have a column which contains a logistic regression model for each speaker for each grammatical class. In order to do anything meaningful now, we need to extract the important information from each model, ideally storing it again in another data frame. Here’s where broom::glance() and broom::tidy() come in. glance() will produce a one line summary for the entire model, and tidy() will produce a summary of the model coefficients.

  broom::glance(rate_nest$mod[[1]])
##   null.deviance df.null   logLik      AIC      BIC deviance df.residual
## 1      59.13862      57 -29.5594 63.11879 67.23968 59.11879          56
  broom::tidy(rate_nest$mod[[1]])
##          term   estimate std.error statistic      p.value
## 1 (Intercept) 1.35381379 0.3329521 4.0660918 0.0000478081
## 2  RateCenter 0.02941017 0.2086622 0.1409463 0.8879123804

What we’ll do is apply broom::tidy() to each model, again using purrr::map(), and extract how many data points each model was fit with. Then, with tidyr::unnest(), we’ll pop this out into a free-standing data frame.

  rate_nest <- rate_nest %>%
                 dplyr::mutate(tidy = purrr::map(mod, broom::tidy),
                               n = purrr::map(data, nrow) %>% simplify())

  rate_nest
## Source: local data frame [76 x 6]
## 
##    Speaker  Gram2             data          mod               tidy     n
##     (fctr) (fctr)            (chr)        (chr)              (chr) (int)
## 1      s01   past  <tbl_df [58,3]> <S3:glm, lm> <data.frame [2,5]>    58
## 2      s01   mono <tbl_df [171,3]> <S3:glm, lm> <data.frame [2,5]>   171
## 3      s02   mono <tbl_df [218,3]> <S3:glm, lm> <data.frame [2,5]>   218
## 4      s02   past  <tbl_df [46,3]> <S3:glm, lm> <data.frame [2,5]>    46
## 5      s03   mono <tbl_df [219,3]> <S3:glm, lm> <data.frame [2,5]>   219
## 6      s03   past  <tbl_df [56,3]> <S3:glm, lm> <data.frame [2,5]>    56
## 7      s04   mono <tbl_df [240,3]> <S3:glm, lm> <data.frame [2,5]>   240
## 8      s04   past  <tbl_df [41,3]> <S3:glm, lm> <data.frame [2,5]>    41
## 9      s05   mono <tbl_df [184,3]> <S3:glm, lm> <data.frame [2,5]>   184
## 10     s05   past  <tbl_df [38,3]> <S3:glm, lm> <data.frame [2,5]>    38
## ..     ...    ...              ...          ...                ...   ...
  rate_coefs <- rate_nest %>%
                  tidyr::unnest(tidy) %>%
                  dplyr::select(-(std.error:p.value)) %>%
                  tidyr::spread(term, estimate)
  
  rate_coefs
## Source: local data frame [76 x 5]
## 
##    Speaker  Gram2     n (Intercept)  RateCenter
##     (fctr) (fctr) (int)       (dbl)       (dbl)
## 1      s01   mono   171   0.3109309 -0.08511930
## 2      s01   past    58   1.3538138  0.02941017
## 3      s02   mono   218   0.2834635 -0.30801155
## 4      s02   past    46   0.3795177 -0.04756514
## 5      s03   mono   219  -0.3751245 -0.42651949
## 6      s03   past    56   0.1228607 -0.34137119
## 7      s04   mono   240  -0.7632547 -0.22761177
## 8      s04   past    41   0.8512747 -0.18682255
## 9      s05   mono   184   0.8195723 -0.14171049
## 10     s05   past    38   1.5298129 -0.64609481
## ..     ...    ...   ...         ...         ...

The Results

Now it’s just a matter of plotting it out.

  rate_coefs%>%
    ggplot(aes(`(Intercept)`, RateCenter, color = Gram2))+
      geom_point(aes(size = n), alpha = 0.6)+
      scale_size_area()+
      coord_fixed()


It looks like the predicted effect, that /t d/ retention decreases as speech rate increases, holds up: most speakers’ RateCenter coefficients are below 0. There is a fair bit of inter-speaker variation in the baseline rate of /t d/ retention (along the x axis), but much less in the rate of speech effect, and there seems to be very little relationship between the two. That is, most speakers likely react in a very similar way to an increasing rate of speech, regardless of whether they have a high or low baseline rate of /t d/ retention.
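
As a quick check on that impression (just a sketch, using the rate_coefs data frame built above), we can correlate the per-speaker intercepts with the per-speaker RateCenter slopes within each grammatical class:

  rate_coefs %>%
    dplyr::group_by(Gram2) %>%
    dplyr::summarise(slope_cor = cor(`(Intercept)`, RateCenter))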

Let’s look at how this would all shake out in a normal mixed-effects model:

  rate_data <- rate_nest%>%unnest(data)
  
  rate_mod <- glmer(td ~ RateCenter*Gram2 + (RateCenter|Speaker) + (1|Word),
               data = rate_data,
               family = binomial)
  
  summary(rate_mod)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: td ~ RateCenter * Gram2 + (RateCenter | Speaker) + (1 | Word)
##    Data: rate_data
## 
##      AIC      BIC   logLik deviance df.resid 
##  12846.1  12904.8  -6415.0  12830.1    11359 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.4473 -0.6971  0.3012  0.6463  4.0799 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev. Corr 
##  Word    (Intercept) 1.044249 1.02189       
##  Speaker (Intercept) 0.255016 0.50499       
##          RateCenter  0.002359 0.04857  -0.03
## Number of obs: 11367, groups:  Word, 1000; Speaker, 38
## 
## Fixed effects:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.74630    0.10817   6.900 5.21e-12 ***
## RateCenter           -0.17929    0.01742 -10.294  < 2e-16 ***
## Gram2past             0.67300    0.11483   5.861 4.61e-09 ***
## RateCenter:Gram2past  0.08665    0.03441   2.518   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) RtCntr Grm2ps
## RateCenter  -0.010              
## Gram2past   -0.363  0.002       
## RtCntr:Grm2  0.001 -0.356  0.002

This looks mostly like we’d expect given the speaker-level modelling. The fixed effect of speech rate is negative, meaning a higher rate of speech goes along with lower /t d/ retention, i.e. more deletion. Looking at the random effects, we see a larger standard deviation for speaker-level intercepts than for speaker-level speech rate effects, just like in the figure above, and they are very weakly correlated.

Frequency Effects

Let’s do the same thing for frequency effects. I’ll be using word frequency norms from SUBTLEX-US, specifically a centered version of the Zipf Scale.

I’m going to compress this down into as few lines as possible just to show off the power of the workflow.

  subtlex <- read.delim("../data/subtlex/SUBTLEXus74286wordstextversion-3.txt")

  # This is just transforming and centering the frequency data for modelling.
  # It's important to center the frequencies over word types in this way.
  zipf_scores <- subtlex %>%
                  dplyr::mutate(Word = tolower(Word),
                                zipf = log10(SUBTLWF)) %>%
                  dplyr::select(Word, zipf) %>%
                  dplyr::semi_join(buckeye %>% 
                                      mutate(Word = tolower(Word)) %>%
                                      filter(Gram2 %in% c("past", "mono"))) %>%
                  dplyr::mutate(zipf_center = zipf - median(zipf))
## Joining by: "Word"
  # getting the subset of the data to use for analysis
  freq_to_use <- buckeye%>%
                  dplyr::filter(Gram2 %in% c("past", "mono"))%>%
                  dplyr::mutate(Word = tolower(Word))%>%
                  dplyr::left_join(zipf_scores)%>%
                  dplyr::select(Speaker, Word, zipf_center, Gram2, td)
## Joining by: "Word"
  # The model fitting function
  fit_one_freq_mod <- function(data){
    mod <- glm(td ~ zipf_center, data = data, family = binomial)
    return(mod)
  }
    
  # this block does all of the modelling and summaries of the models.
  freq_nest <- freq_to_use %>%
                  dplyr::group_by(Speaker, Gram2) %>%
                  tidyr::nest()%>%
                  dplyr::mutate(mod = purrr::map(data, fit_one_freq_mod),
                                tidy = map(mod, broom::tidy),
                                n = purrr::map(data, nrow)%>%simplify())

  # extracting the tidy summaries for plotting
  freq_coefs <- freq_nest %>%
                    tidyr::unnest(tidy) %>%
                    dplyr::select(-(std.error:p.value)) %>%
                    tidyr::spread(term, estimate)
  freq_coefs%>%
      ggplot(aes(`(Intercept)`, zipf_center, color = Gram2))+
        geom_point(aes(size = n), alpha = 0.6)+
        coord_fixed()


Unlike with the rate of speech, it looks like there is quite a bit of inter-speaker variation in sensitivity to word frequency, and it’s related to speakers’ baseline rate of /t d/ retention: speakers with a higher baseline deletion rate have a weaker frequency effect. You can kind of see this in the output of a standard mixed-effects model, but it’s rolled up in the random-effects report that we usually don’t pay enough attention to.
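
The same kind of check as before (again just a sketch, using the freq_coefs data frame built above) makes that relationship easy to quantify:

  freq_coefs %>%
    dplyr::group_by(Gram2) %>%
    dplyr::summarise(slope_cor = cor(`(Intercept)`, zipf_center))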

  freq_data <- freq_nest%>%unnest(data)
  
  freq_mod <- glmer(td ~ zipf_center*Gram2 + (zipf_center|Speaker) + (1|Word),
                    data = freq_data,
                    family = binomial)
  
  summary(freq_mod)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: td ~ zipf_center * Gram2 + (zipf_center | Speaker) + (1 | Word)
##    Data: freq_data
## 
##      AIC      BIC   logLik deviance df.resid 
##  13746.6  13805.9  -6865.3  13730.6    12163 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1754 -0.6912  0.3073  0.6549  3.9010 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev. Corr 
##  Word    (Intercept) 1.05171  1.0255        
##  Speaker (Intercept) 0.28874  0.5373        
##          zipf_center 0.07198  0.2683   -0.43
## Number of obs: 12171, groups:  Word, 1007; Speaker, 38
## 
## Fixed effects:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.88120    0.12105   7.279 3.35e-13 ***
## zipf_center           -0.13821    0.09446  -1.463    0.143    
## Gram2past              0.65529    0.12451   5.263 1.42e-07 ***
## zipf_center:Gram2past -0.07027    0.13884  -0.506    0.613    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) zpf_cn Grm2ps
## zipf_center -0.480              
## Gram2past   -0.432  0.321       
## zpf_cntr:G2  0.218 -0.530 -0.328

Up there in the random effects summary, you can see that the correlation between speaker-level intercepts and speaker-level frequency effects is -0.43, which means more or less the same thing as the figure above.
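
If you’d rather pull that number out programmatically than read it off the printed summary, lme4’s VarCorr() has a data frame method (a sketch, assuming freq_mod as fit above):

  # grp/var1/var2 identify each variance or correlation term;
  # the sdcor column holds the standard deviations and correlations.
  as.data.frame(VarCorr(freq_mod))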

Takeaway

I think there’s a tendency when we have a nice big data set in hand to skip straight to fitting one big mixed-effects model to rule them all. But in rushing headlong into the one big model and focusing on the p-values on the fixed effects, we might be missing some important and interesting patterns in our data.

by Joe


The Average of Formant Dynamics Doesn't Have the Average Dynamic Properties

-October 9, 2015

I’ve been playing around with an analysis method called “Functional Data Analysis”, which has me looking at the kinds of analyses we do of formant measurements a bit differently. In sociolinguistics, we’ve inherited an intellectual and methodological tradition of characterizing vowels, including diphthongs, using single point measurements. More historiography needs to be done on why this is, but I think it might be a long carry-over from Daniel Jones’ cardinal vowels theory, plus Trager & Bloch’s theory characterizing English as having 6 phonemic vowels. I’ve got a table here of Trager & Bloch’s phonemic theory, ’cause it’s kind of beautiful.

phoneme   V      /V/+j   /V/+w   /V/+h
/i/       pit    pea     dew     idea
/e/       pet    pay     -       yeah
/a/       pat    buy     bough   pa
/o/       pot    boy     -       paw
/ə/       butt   -       beau    huh
/u/       put    -       boo     Kahlua

In any event, a common alternative to characterizing vowels in terms of point measurements of steady states is to take a fixed number of measurements along the full formant track, or to measure the formants at specific points proportional to the vowel’s duration. This latter approach is supported by FAVE, which outputs F1 and F2 at 20%, 35%, 50%, 65% and 80% of the vowel’s duration. My own take is that this approach is better for trying to understand formant dynamics, but has its own shortcomings that I’ll discuss here. I’ll also discuss a potential alternative using functional data analysis.

  # Setting up for this post.
  # NB: For some reason summarise() and mutate() from plyr
  # wind up getting loaded when I compile this post, so I 
  # have to use dplyr::mutate and dplyr::summarise throughout
  library(knitr)

  library(dplyr)
  library(tidyr)
  library(magrittr)
  library(fda)
  library(ggplot2)

  opts_chunk$set(cache = T, message = F, fig.width = 8/1.5, 
                 fig.height = 5/1.5, dev = 'svg', autodep = T,
                 cache.path = "rmd_drafts/prop_problems/")
  dep_auto()

To demonstrate for this post, I’ll be using formant tracks from my own speech as an interviewer from 2006. Since FAVE v1.2, the option has existed to save the full formant tracks of each vowel analyzed with the --tracks flag. I’ll be focusing on pre-voiceless /ay/ since it’s got a nice, dynamic F1 to explore, and it’s my favorite. I’ve put the data online, so you can run all of this code at home.

  tracks <- read.delim("http://jofrhwld.github.io/data/fruehwald.txt")
  ays <- tracks %>%
          filter(plt_vclass %in% c("ay0")) %>%
          filter(!word == "LIKE") %>%
          group_by(id) %>%
          dplyr::mutate(t_rel = t-min(t),
                        t_prop = t_rel/max(t_rel))

As a quick demonstration, here’s a plot of 3 cherry picked F1 tracks, plotted across proportional time, with the 20%, 35%, 50%, 65% and 80% points marked.

  example_tracks <- ays %>% 
                      filter(id %in% c(756, 758, 1088))
  
  example_points <- example_tracks %>%
                      group_by(id) %>%
                      slice(round(c(0.2, 0.35, 0.5, 0.65, 0.8) * n())) %>%
                      mutate(prop = c(0.2, 0.35, 0.5, 0.65, 0.8))
  

  example_tracks %>%
    ggplot(aes(t_prop, F1))+
      geom_line(aes(group = id))+
      geom_point(data = example_points, aes(color = factor(prop)), size = 3)


Looking at these formant tracks, it looks like an important feature is that F1 increases, reaches a maximum, then decreases. But that landmark feature happens at different points in proportional time for each of these formant tracks. For the lowest curve, the 20% time point seems closest to the F1-maximum landmark, for the middle curve, the 35% point is closest, and for the highest curve it’s actually the 50% point. This poses a general problem for averaging across these proportional points in order to explore formant dynamics, since each token is at a different phase of its track at each point. At 50% of the duration, for example, two of the tokens are well into the transition from the nucleus to the glide, while the third is just reaching its maximum. By just averaging at the 50% point, you’re averaging over values that aren’t dynamically equivalent.
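
To make the naive approach concrete, here’s a minimal sketch (using the example_points data frame from above) of what averaging at each proportional time point amounts to; each row pools tokens that may be at quite different phases of their trajectories:

  example_points %>%
    dplyr::group_by(prop) %>%
    dplyr::summarise(mean_F1 = mean(F1))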

There is another approach that still allows for the analysis of formant trajectories, and tries to address the problem of dynamic equivalency. It involves the following steps:

  1. Convert the scalar representation of the formant data to a functional representation.
  2. Identify key landmarks in each function.
  3. Carry out landmark registration.
  4. Calculate averages.

The first step needs some explaining. In the dataframe example_tracks, the data is represented as being just a collection of times and F1s. For each row \(i\), you have a pair of observations, \(\langle x_i, y_i \rangle\), specifically \(\langle \text{time}_i, \text{F1}_i \rangle\). But really, we want to think about F1 as a function of time, or \(y=f(x)\), or \(\text{F1}=f(\text{time})\).

I’ll be making this conversion from the \(\langle x_i, y_i \rangle\) to the \(y=f(x)\) representation for each formant track. There are a few ways to go about this using the fda package, and a few different parameters under my control that will affect the resulting functional representation for each formant track, all of which takes a while to explain. For now, I’ll describe without explaining too much. First, I’ll be using a relatively large b-spline basis, and penalize the roughness of the acceleration of the function (a.k.a. its second derivative).

  # I'm leveraging dplyr's ability to store lists in columns.
  example_tracks %>%
    group_by(id) %>%
    dplyr::mutate(t_rel = t-min(t),
                  t_prop = t_rel/max(t_rel),
                  duration = max(t_rel)) %>% 
    group_by(word, fol_seg, id, plt_vclass, duration, context) %>%
    do(f1 = c(.$F1),
       f2 = c(.$F2),
       t_prop = c(.$t_prop))->examples_vector
  # Defining the b-spline basis.
  # This one has a pretty large number of knots given the 
  # amount of wiggliness we're looking for.
  
  basic_basis <- create.bspline.basis(rangeval = c(0, 1), 
                                      nbasis = (30 + 7)-2, 
                                      norder = 7)

Here’s a demonstration of how the FDA works for one of the formant tracks. The figure plots the original F1 formant track measurements with the open circles, and the bold red line is the new functional representation.

  sample_time <- examples_vector$t_prop[[2]]
  sample_f1 <- examples_vector$f1[[2]]
  
  # Maximum Likelihood fit.
  # With the size of the basis, this is almost completely perfect fit.
  # RMSE = 0.1972364, certainly overfit.
  sample_fit <- smooth.basis(sample_time, sample_f1, basic_basis)
  
  # Smooth with a roughness penalty on the acceleration
  # Lfdobj = the derivative number
  # 0 = original F1, 1 = velocity, 2 = acceleration
  sample_smooth <- smooth.fdPar(sample_fit$fd, Lfdobj = 2, lambda = 1e-5)
  
  plot(sample_time, sample_f1, cex = 0.5)
  x <- plot(sample_smooth, add = T, col = 'red', lwd = 3)


One of the great benefits of converting the scalar representation to the functional representation is that we can also calculate the first and second derivatives of the function, to give us the rate of change in F1, and the rate of acceleration in F1.

  # the assignment here is just to capture an annoying
  # "done" value plot.fd produces
  par(mfrow = c(1,3))
  x <- plot(sample_smooth, Lfdobj = 0)
  title("F1")
  
  x <- plot(sample_smooth, Lfdobj = 1)
  title("F1 velocity")

  x <- plot(sample_smooth, Lfdobj = 2)
  title("F1 acceleration")  


If it’s not clear from eyeballing this figure, where the velocity of F1 crosses zero with a negative slope is where F1 reaches its maximum. I’ve written a little function that will return zero-crossing locations if you give it a functional data object. You can look at it here.

  library(devtools)
  source_gist("https://gist.github.com/JoFrhwld/f75528f7b358148ed9fb", 
              quiet = T)
  sample_zero_cross <- zero_crossings(sample_smooth, Lfdobj = 1, slope = -1)
  
  x <- plot(sample_smooth)
  abline(v = sample_zero_cross, lty = 2)


So our next steps will be to smooth all three of these formant tracks in the same way, and identify the same landmark in all of them. Here’s the dplyr code I’ve used to fit a b-spline curve to each formant track, and smooth it with a roughness penalty on the acceleration.

  example_smoothed <- examples_vector %>%
                        rowwise()%>%
                        do(fdobj = smooth.basis(.$t_prop, 
                                                .$f1, 
                                                basic_basis)$fd %>%
                                    smooth.fdPar(., Lfdobj = 2, lambda = 1e-5)) 
  
  # a list2fd function
  source_gist("https://gist.github.com/JoFrhwld/0b0f69ca1a341f38f275",
         quiet = T)
  
  example_fd <- list2fd(example_smoothed$fdobj, basic_basis)
  par(mfrow = c(1,3))
  x <- plot(example_fd, lty = 1)
  title("F1")
  
  x <- plot(example_fd, Lfdobj = 1, lty = 1)
  title("F1 velocity")
  
  x <- plot(example_fd, Lfdobj = 2, lty = 1)
  title("F1 acceleration")  


Just like it was in the original plot of the raw formant tracks, these functions reach their F1 maximum at different points in proportional time, which is really clear to see if you look at where the velocity crosses zero for each. This next figure plots the F1 function with a vertical line indicating the position of the zero crossing.

  example_zero_cross <- zero_crossings(example_fd, Lfdobj = 1, slope = -1)
  
  x <- plot(example_fd, lty = 1)
  abline(v = example_zero_cross, col = 1:3, lty = 2)


The next step in the functional data analysis is to register these functions with regards to their landmarks. I have to admit that I don’t fully understand how this works, but it involves warping the time domain. The warping can be non-linear, so we’ll be defining a basis function similar to the original basis we used for smoothing the formant tracks. One thing to note is that we’ll be warping the time domain for these functions according to just one landmark, but it is possible to define multiple landmarks for each function.

  wbasis <- create.bspline.basis(rangeval=c(0,1),
                                 norder=4, 
                                 breaks=seq(0, 1, len=10))
  Wfd0   <- fd(matrix(0, wbasis$nbasis, 1), wbasis)
  WfdPar <- fdPar(Wfd0, 1, 1e-4)
  
  example_regfd <- landmarkreg(example_fd, 
                               ximarks = example_zero_cross, 
                               WfdPar = WfdPar)
## Progress:  Each dot is a curve
## ...
  # Plotting a comparison between the original and registered curves
  par(mfrow = c(1,2))
  x <- plot(example_fd, lty = 1)
  abline(v = example_zero_cross, col = 1:3, lty = 2)
  title("original")
  
  x <- plot(example_regfd$regfd, lty = 1)
  abline(v = mean(example_zero_cross), lty = 2)
  title("registered")


Each curve has been warped so that its F1 maximum occurs at the same time. This is going to have the biggest effect on the mean value of each set of curves. By the way, b-splines have some really pleasing properties when it comes to estimating means, and broader analyses: you just take the mean across each b-spline coefficient.
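
As a small illustration of that last point (a sketch, assuming list2fd() returns an ordinary fd object, so that example_fd and basic_basis are as built above), the mean curve can be rebuilt by hand from the averaged coefficients; mean() on an fd object does this for you.

  # Average the b-spline coefficients across the three curves,
  # then wrap them back up in an fd object on the same basis.
  manual_mean_fd <- fd(coef = rowMeans(example_fd$coefs),
                       basisobj = basic_basis)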

  # getting the mean function for the original
  # and registered functions
  mean_fd <- mean(example_fd)
  mean_regfd <- mean(example_regfd$regfd)
  
  # Prediction and plotting
  data_frame(time = seq(0, 1, length = 100),
             original = predict(mean_fd, newdata = time)[,1],
             registered = predict(mean_regfd, newdata = time)[,1])%>%
    gather(type, F1, original:registered) %>%
    ggplot(aes(time, F1, color = type))+
      geom_line()+
      scale_color_brewer(palette = "Dark2")+
      theme_bw()


The mean of the registered functions is different from the mean of the original functions in two ways: the location of the F1 maximum is different, and the magnitude of the F1 maximum is different. I would argue that the mean of the registered function is different in a good direction.

  locations = c(original = zero_crossings(mean_fd, Lfdobj = 1, slope = -1),
                registered = zero_crossings(mean_regfd, Lfdobj = 1, slope = -1))
  
  magnitudes = c(original = eval.fd(locations[1], mean_fd),
                 registered = eval.fd(locations[2], mean_regfd))
  
  rbind(locations, magnitudes)  
##            original registered
## locations    0.3080     0.3310
## magnitudes 777.6664   781.8984

The peak F1 occurs later in the mean of the registered functions than in the mean of the original functions. We previously estimated the peak F1 time for each of the original functions, and if we take the average of those, it’s essentially the same as the peak F1 time for the mean of the registered functions.

  example_zero_cross
## [1] 0.192 0.305 0.499
  mean(example_zero_cross)  
## [1] 0.332

Similarly, the F1 maximum of the mean registered function is higher than that of the mean original function. Again, the mean of the registered functions comes closer to the mean of the original functions’ F1 maxima.

  # getting f1 maximum from each original function
  f1_maxes <- eval.fd(example_zero_cross, example_fd) %>% diag
  f1_maxes
## [1] 735.2567 791.2660 819.1646
  mean(f1_maxes)
## [1] 781.8957

The lesson here is that the simple average of formant dynamics will not have the average dynamic properties of the input.


There are pros and cons to doing landmark registration like this. For example, it seems pretty clear that the duration of the vowel is going to be a predictor of where the F1 maximum occurs in proportional time. If you wanted to explore those properties, you wouldn’t want to do that analysis on registered functions.

by Joe


Tidy Data, Split-Apply-Combine, and Vowel Data: Part 2

-June 17, 2015

This is part two of my blog post on how vowel formant data ought to be thought about in terms of Tidy Data, and Vowel Normalization can be thought about in terms of Split-Apply-Combine procedures. Let’s start out by getting our data in order.

  library(downloader)
  library(magrittr)
  library(dplyr)
  library(tidyr)
  library(ggplot2)
  ## temp file for downloading
  tmp <- tempfile()
  pb_tar <- download("https://www.cs.cmu.edu/Groups/AI/areas/speech/database/pb/pb.tgz", 
                     tmp, quiet = T)
  
  ## decompress the data
  untar(tmp, "PetersonBarney/verified_pb.data")
  
  ## read the data
  pb_data <- read.delim("PetersonBarney/verified_pb.data", header = F)
  
  ## cleanup
  unlink(tmp)
  success <- file.remove("PetersonBarney/verified_pb.data")
  success <- file.remove("PetersonBarney")

  pb_data %>%
    set_colnames(c("sex", "speaker", "phonN", "phon", 
                   "F0", "F1", "F2", "F3")) %>%
    mutate(sex = factor(sex, labels = c("C","F","M"),levels = 3:1),
           phon = gsub("\\*", "", phon)) %>%
    select(sex, speaker, phon, F0, F1, F2, F3) %>%
    mutate(id = 1:n())%>%
    gather(formant, hz, F0:F3)%>%
    tbl_df()-> pb_long

We’re going to be treating vowel formant data as having the following variables:

  • The speaker whose vowel was measured
  • The id of the vowel that was measured.
  • The phonemic identity of the vowel that was measured.
  • The formant that was measured.
  • The value of the measurement in Hz.

This means that each individual vowel token will have 4 rows in the pb_long data frame, one for each formant.

  pb_long %>%
    filter(id == 1)%>%
    kable()
sex   speaker   phon   id   formant   hz
M     1         IY     1    F0        160
M     1         IY     1    F1        240
M     1         IY     1    F2        2280
M     1         IY     1    F3        2850

Normalization

The need for vowel formant normalization can be neatly summarized in the following plot, which has a boxplot for each “sex” (in Peterson & Barney, that’s C: Child, F: Female, M: Male) for each formant. You can see how:

  • For each formant the overall distribution of values goes C > F > M,
  • The difference between each group appears to get more extreme the higher the formant.

  pb_long %>%
    ggplot(aes(formant, hz, fill = sex))+
      geom_boxplot()


One thing that’s not immediately evident from the figure above is that the variance of the formant values also differs between the sexes, again following the pattern C > F > M.

  pb_long %>%
    group_by(speaker, sex, formant) %>%
    summarise(hz_sd = sd(hz))%>%
    ggplot(aes(formant, hz_sd, fill = sex))+
      geom_boxplot()


The usual way of representing these sex effects emphasizes the effect they have on the vowel triangle, which is especially evident in the high front region.

  pb_long %>%
    spread(formant, hz) %>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point()+
      scale_x_reverse()+
      scale_y_reverse()+
      coord_fixed()


The fact that there are these differences in vowel spectra, and the reasons for them, are interesting, but when you’re doing empirical research, not all interesting phenomena are the object of study. Usually, sociolinguists want to factor out gross sex differences involving the location and scale of the formant values. This usually involves transforming formant values by subtraction and division with reference to some other set of values. How you characterize that process is the point of this post.

A literature developing a typology for normalization in sociolinguistics has sprung up, beginning with Adank (2003). Procedures are usually characterized by what subsets of the data are used to estimate normalization parameters. For example:

Speaker

  • Speaker Intrinsic Parameters estimated within each individual speaker’s data.
  • Speaker Extrinsic Parameters estimated across multiple speakers’ data.

Formant

  • Formant Intrinsic Each formant is separately normalized.
  • Formant Extrinsic Multiple formants are normalized simultaneously.

Vowel (token)

  • Vowel Intrinsic Each vowel token is normalized individually.
  • Vowel Extrinsic Multiple vowels are normalized simultaneously.

I’ll be looking at 6 normalization procedures in this post, and they break down like so (of these, only ANAE is speaker extrinsic).

                    Vowel Token Intrinsic   Vowel Token Extrinsic
Formant Intrinsic   -                       Z-score, Nearey1, Watt & Fabricius
Formant Extrinsic   Bark Difference         Nearey2, ANAE

To a large degree, this “intrinsic” vs “extrinsic” distinction can be captured by which variables are included in a dplyr::group_by() operation. I’ll try to be pedantic about that fact in the R code below, even if the particular grouping is going to be vacuous with respect to R’s vectorized arithmetic.
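
Before getting to the real procedures, here’s that idea in miniature (a sketch only; placeholder_norm() is a made-up stand-in, not one of the methods below). The arithmetic is identical in both calls, and only the grouping changes the “trinsicity”:

  # placeholder_norm() is a hypothetical normalization function,
  # just here to show where the grouping does its work.
  placeholder_norm <- function(hz) hz / mean(hz)

  # speaker intrinsic, formant intrinsic: one mean per speaker per formant
  pb_long %>%
    group_by(speaker, formant) %>%
    mutate(norm = placeholder_norm(hz))

  # speaker intrinsic, formant extrinsic: one mean per speaker, across formants
  pb_long %>%
    group_by(speaker) %>%
    mutate(norm = placeholder_norm(hz))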

Another thing you might notice looking at the code is that these normalization procedures can be characterized by a few other database operations. For example, some require tidyr::spread() while others don’t, meaning they have different models of what constitutes an observation. Some also require what I’ll call a resetting of the “scope” of analysis. That is, they take the raw data, run through a chain of operations, then need to return back to the raw data, merging back in the results of the first chain of operations.

The point of this post isn’t really to compare the effectiveness of these procedures relative to what kinds of grouping and database operations they require, or their models of an observation. Comparisons of normalization procedures have been done to death and are well understood. This is more about exploring the structure of data and database operations in a comfortable domain for sociolinguists.

Bark Difference Metric

First off is the Bark Difference Metric, which is the only vowel intrinsic method I’ll be looking at. It tries to normalize F1 and F2 with respect to either F0 or F3. It also converts the Hz to Bark first, which tamps down on the broader spread of the higher formants a bit. ±1 on the Bark scale is also supposed to be psychoacoustically meaningful.

First, I’ll define a function that takes Hz values and converts them to bark.

  bark <- function(hz){
    (26.81/(1+(1960/hz)))-0.53
  }

Here’s the chain of operations that does the Bark difference normalization.

  • Convert hz to bark
  • drop the hz column
  • create columns for each formant, filled with their value in bark
  • group_by() vowel token id (vacuous)
  • normalize backness by F3-F2
  • normalize height by F1-F0

  pb_long %>%
    mutate(bark = bark(hz))%>%
    select(-hz)%>%
    spread(formant, bark)%>%
    group_by(id)%>%
    mutate(backness = F3-F2,
           height = F1-F0)->pb_bark

Here’s how that looks.

  pb_bark %>%
    ggplot(aes(backness, height, color = sex))+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      coord_fixed()


Some things to notice about the resulting normalized space:

  • As the backness dimension increases, so does backness. As the height dimension increases, so does lowness (hence the reversed y-axis).
  • The relative relationship between F1 and F2 has been eliminated. They both run on scales from about 0 to about 9.
  • The larger variance in backness relative to height is preserved, but is mitigated relative to the raw Hz.

Bark Difference Summary

Model of an observation

  • F0, F1, F2, and F3 for each vowel token for each speaker.

Scope resets

  • None.

ANAE

The Atlas of North American English is a formant extrinsic, speaker extrinsic normalization technique. It involves calculating the grand average log(hz) for all speakers for F1 and F2 combined. Here’s a function to do the actual normalization calculation.

  anae_fun <- function(hz, grand_mean){
    hz * (exp(grand_mean - mean(log(hz))))
  }

While pb_long already has the data in the correct format to do this easily, the original ANAE approach only looked at F1 and F2, so we need to filter the data to only have those rows corresponding to F1 and F2 estimates.

This dplyr chain does that filtering, groups the data by speaker, estimates the mean log(hz) for each speaker, then the grand mean across all speakers.

  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%
    group_by(speaker)%>%
    summarise(log_mean = mean(log(hz)))%>%
    summarise(log_mean = mean(log_mean))%>%
    use_series("log_mean")-> grand_mean

We now need to return to the whole data set, and use this grand mean to normalize the data within each speaker. We need to group by the speaker, so that we can estimate individual speakers’ mean log(hz).

  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%    
    group_by(speaker)%>%
    mutate(norm_hz = anae_fun(hz, grand_mean))->pb_anae_long

This method preserves both the relative ordering of F1 and F2, as well as the greater spread in F2 relative to F1.

  pb_anae_long %>%
    ggplot(aes(formant, norm_hz, fill = sex))+
      geom_boxplot()


Here’s how it looks in the vowel triangle.

  pb_anae_long %>%
    select(-hz)%>%
    spread(formant, norm_hz)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point(alpha = 0.75)+
      scale_x_reverse()+
      scale_y_reverse()+
      coord_fixed()


ANAE Summary

Model of an observation

  • Hz for each formant for each vowel token for each speaker

Scope resets

  • One.

Nearey

Nearey normalization is essentially identical to the ANAE normalization, except it’s speaker intrinsic. That’s not always clear given the way the ANAE and Nearey formulas are usually given, so I’ve defined the Nearey formula below in a way that emphasizes this fact. nearey_fun() is identical to anae_fun() except that it takes a single argument, and grand_mean is replaced by 0.

  nearey_fun <- function(hz){
    hz * (exp(0 - mean(log(hz))))
  }

There are two instantiations of Nearey. Nearey1 is formant intrinsic, and Nearey2 is formant extrinsic.

Nearey is easy peasy.

  • group the data appropriately
  • apply the normalization function

  pb_long %>%
    group_by(speaker, formant)%>%
    mutate(nearey = nearey_fun(hz)) -> pb_nearey1_long
  
  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%
    group_by(speaker)%>%
    mutate(nearey = nearey_fun(hz)) -> pb_nearey2_long  

The formant intrinsic approach eliminates the relative ordering of each formant, and mitigates the difference in variance, at least between F1 and F2. The formant extrinsic approach looks roughly similar to the ANAE plot.

  pb_nearey1_long %>%
    ggplot(aes(formant, nearey, fill = sex)) + 
      geom_boxplot()+
      ggtitle("Nearey1: Formant intrinsic")


  pb_nearey2_long %>%
    ggplot(aes(formant, nearey, fill = sex)) + 
      geom_boxplot()+
      ggtitle("Nearey2: Formant extrinsic")


Here’s how it looks in the vowel space. For Nearey1, I’ve included horizontal and vertical lines at 1, since these represent the “center” of the vowel space as far as the normalization is concerned.

  pb_nearey1_long %>%
    select(-hz)%>%
    spread(formant, nearey)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_hline(yintercept = 1)+
      geom_vline(xintercept = 1)+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()+
      ggtitle("Nearey1: Formant intrinsic")
  pb_nearey2_long %>%
    select(-hz)%>%
    spread(formant, nearey)%>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()+
      ggtitle("Nearey2: Formant extrinsic")


Nearey Summary

Model of an observation

  • Hz for each formant for each vowel token for each speaker

Scope resets

  • None

Z-Score (a.k.a. Lobanov)

Z-scores are one of the most widespread forms of normalization across any area of research. For example, Gelman, Jakulin, Pittau & Su recommend that for logistic regressions, all continuous variables be converted to a modified z-score: \(\frac{x-\text{mean}(x)}{2\,\text{sd}(x)}\). The ubiquity of the z-score across all domains of research, and the relative obscurity of what the “Lobanov” normalization is, is why I’ll always describe this procedure as “z-score normalization (also known as Lobanov normalization)”.
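
For reference, that modified z-score is a one-liner (a sketch; gelman_rescale() is just my name for it, not something from a package):

  gelman_rescale <- function(x){
    (x - mean(x))/(2*sd(x))
  }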

The z-score function itself is straightforward. R actually has a native z-scoring function, scale(), but I’ll write out a function called zscore() just for explicitness.

  zscore <- function(hz){
    (hz-mean(hz))/sd(hz)
  }

To apply it to the data, it requires just one grouped application.

  pb_long %>%
    group_by(speaker, formant) %>%
    mutate(zscore = zscore(hz)) ->  pb_z_long

Z-scores eliminate the relative ordering of the formants, and mitigate the different variances between them.

  pb_z_long %>%
    ggplot(aes(formant, zscore, fill = sex))+
      geom_boxplot()


Maybe I’m just biased because z-scores are what I always use, but this to me looks like what a vowel space should look like.

  pb_z_long %>%
    select(-hz)%>%
    spread(formant, zscore)%>%
    ggplot(aes(F2, F1, color= sex))+
      geom_hline(yintercept = 0)+
      geom_vline(xintercept = 0)+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()

Watt & Fabricius

The Watt & Fabricius method (and ensuing modifications) focuses on normalizing vowels in terms of aligning vowel triangles in F1xF2 space. As such, its model of an observation is implicitly F1 and F2 for each vowel token.

The basic Watt & Fabricius function involves dividing each speaker’s F1 and F2 by “centroid” values, corresponding to the center of the vowel triangle.

  wf_fun <- function(hz, centroid){
    hz/centroid
  }

The method involves two steps. First, estimating the centroid values for F1 and F2 for each speaker. Second, after merging these centroid values back onto the original data, dividing F1 and F2 by them.

# Part 1: Centriod Estimation
  pb_long %>%
    filter(formant %in% c("F1","F2"))%>%
    group_by(speaker, formant, phon)%>%
    summarise(hz = mean(hz))%>%
    spread(formant, hz)%>%
    group_by(speaker)%>%
    summarise(beet1 = F1[phon=="IY"],
              beet2 = F2[phon=="IY"],
              school1 = beet1,
              school2 = beet1,
              bot1 = F1[phon=="AA"],
              bot2 = ((beet2 + school2)/2) + school2) %>%
    mutate(S1 = (beet1 + bot1 + school1)/3,
           S2 = (beet2 + bot2 + school2)/3)%>%
    select(speaker, S1, S2)->speaker_s
  
# Step 2: Merging on centroids and normalizing
  pb_long %>%
    spread(formant, hz)%>%
    left_join(speaker_s)%>%
    group_by(speaker)%>%
    mutate(F1_norm = wf_fun(F1, S1),
           F2_norm = wf_fun(F2, S2)) -> pb_wf
## Joining, by = "speaker"

Here’s the resulting vowel space. I’ll say it does a helluva job normalizing /iy/.

  pb_wf %>%
    ggplot(aes(F2_norm, F1_norm, color = sex))+
      geom_hline(yintercept = 1)+
      geom_vline(xintercept = 1)+
      geom_point(alpha = 0.75)+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()

Watt & Fabricius Summary

Observation Model

  • F1 and F2 for each vowel token for each speaker

Scope Resets

  • One

Summary

So, here’s a summary of these methods, classified by their observation models, number of scope resets, and their various trinsicities.

                   Observation Model               Scope Resets   Speaker     Formant     Vowel
Bark Difference    F0, F1, F2, F3 for each vowel   0              intrinsic   extrinsic   intrinsic
ANAE               Hz for each formant             1              extrinsic   extrinsic   extrinsic
Nearey1            Hz for each formant             0              intrinsic   intrinsic   extrinsic
Nearey2            Hz for each formant             0              intrinsic   extrinsic   extrinsic
Z-score            Hz for each formant             0              intrinsic   intrinsic   extrinsic
Watt & Fabricius   F1, F2 for each vowel           1              intrinsic   intrinsic   extrinsic

It’s kind of interesting how relatively orthogonal the observation model, the scope resetting and the trinsicities are. Only two of the procedures I looked at involved scope resetting (ANAE, and Watt & Fabricius). I believe scope resets are always going to be a feature of speaker-extrinsic methods, since you need to estimate some kind of over-arching parameter across all speakers, then go back over each individual speaker’s data. But the fact that Watt & Fabricius also involves a scope reset goes to show that they’re not a property exclusive to speaker extrinsic normalization. It’s also probably the case that any given vowel intrinsic method will have an observation model similar to the Bark Difference metric, but again, Watt & Fabricius, a vowel extrinsic method, has a very similar model.

by Joe


Tidy Data, Split-Apply-Combine, and Vowel Data: Part 1

-June 13, 2015

I’ve been thinking a bit about principles of Tidy Data, the Split-Apply-Combine strategy for data analysis, and how they apply to the kind of data I work with most of the time. I think walking through some of the vowel normalization methods sociolinguists tend to use makes for a good example case for exploring these principles. This will be a two part post, first focusing on Tidy Data.

Tidy Data

The three principles of Tidy Data are:

  1. Every variable has a column
  2. Every row is an observation
  3. Each type of observational unit forms a table.

It’s probably best to discuss these principles in the context of a specific data set. I’ll be using the Peterson & Barney vowels.

  library(downloader)
  library(knitr)
  library(dplyr)
  library(magrittr)
  library(tidyr)
  library(ggplot2)
  ## temp file for downloading
  tmp <- tempfile()
  pb_tar <- download("https://www.cs.cmu.edu/Groups/AI/areas/speech/database/pb/pb.tgz", 
                     tmp, quiet = T)
  
  ## decompress the data
  untar(tmp, "PetersonBarney/verified_pb.data")
  
  ## read the data
  pb_data <- read.delim("PetersonBarney/verified_pb.data", header = F)
  
  ## cleanup
  unlink(tmp)
  success <- file.remove("PetersonBarney/verified_pb.data")
  success <- file.remove("PetersonBarney")

  ## clean up the pb data a bit
  ## - rename columns,
  ## - recode sex to be a factor
  ## - eliminate some cruft
  ## - add a unique id for each vowel
  pb_data %<>%
    set_colnames(c("sex", "speaker", "phonN", "phon", 
                   "F0", "F1", "F2", "F3")) %>%
    mutate(sex = factor(sex, labels = c("M","F","C")),
           phon = gsub("\\*", "", phon)) %>%
    select(sex, speaker, phon, F0, F1, F2, F3) %>%
    mutate(id = 1:n())  

I’ll be using functionality from dplyr, magrittr and tidyr.

tidyr
  • Functions for turning columns to rows and rows to columns.
  • gather(), spread(), and separate() used here.

dplyr
  • Functions for implementing Split-Apply-Combine.
  • group_by(), summarise(), mutate(), filter(), select(), and left_join() used here.

magrittr
  • Additional piping functionality.
  • %>%, set_colnames(), multiply_by(), subtract(), divide_by(), and lambda notation used here.

Vowels and Tidy Data

Looking at the Peterson and Barney data, it looks fairly tidy.

  pb_data %>%
    head()
##   sex speaker phon  F0  F1   F2   F3 id
## 1   M       1   IY 160 240 2280 2850  1
## 2   M       1   IY 186 280 2400 2790  2
## 3   M       1   IH 203 390 2030 2640  3
## 4   M       1   IH 192 310 1980 2550  4
## 5   M       1   EH 161 490 1870 2420  5
## 6   M       1   EH 155 570 1700 2600  6

So let’s convert it to what would unambiguously be considered “messy” data. I’ll estimate mean F1 and F2 values for each vowel, and then create a column for each vowel + formant combination.

  pb_data %>%
    select(sex, speaker, id, phon, F1, F2) %>%
    gather(formant, hz, F1:F2) %>%
    group_by(sex, speaker, phon, formant) %>%
    summarise(hz = mean(hz)) %>%
    mutate(vowel_formant = paste(phon, formant, sep = "_")) %>%
    ungroup() %>%
    select(-formant, -phon)%>%
    spread(vowel_formant, hz)->pb_messy
  
  pb_messy
## # A tibble: 76 x 22
##       sex speaker AA_F1 AA_F2 AE_F1  AE_F2 AH_F1  AH_F2 AO_F1 AO_F2 EH_F1
## *  <fctr>   <int> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1       M       1 770.0  1065 595.0 1760.0 605.0 1275.0 630.0 975.0 530.0
## 2       M       2 660.0  1070 695.0 1650.0 637.5 1070.0 570.0 570.0 505.0
## 3       M       3 680.0  1085 644.0 1732.5 606.0 1180.0 545.0 855.0 545.0
## 4       M       4 775.0  1110 760.0 1595.0 667.5 1170.0 547.5 870.0 562.5
## 5       M       5 740.0  1240 790.0 1705.0 640.0 1235.0 545.0 925.0 565.0
## 6       M       6 705.0  1030 700.5 1678.5 703.0 1210.0 690.0 995.0 585.0
## 7       M       7 650.0   975 757.5 1805.0 630.0 1245.0 470.0 760.0 622.0
## 8       M       8 682.0  1160 533.0 1940.0 632.0 1325.0 584.0 967.5 472.5
## 9       M       9 706.5  1052 591.0 1762.5 633.5 1122.5 607.0 854.0 584.0
## 10      M      10 763.5  1053 642.0 1675.0 661.0 1240.0 582.5 809.5 500.0
## # ... with 66 more rows, and 11 more variables: EH_F2 <dbl>, ER_F1 <dbl>,
## #   ER_F2 <dbl>, IH_F1 <dbl>, IH_F2 <dbl>, IY_F1 <dbl>, IY_F2 <dbl>,
## #   UH_F1 <dbl>, UH_F2 <dbl>, UW_F1 <dbl>, UW_F2 <dbl>

I have seen vowel formant data stored in this kind of format before, and it definitely counts as “messy”. Looking at this data set, there are the following variables:

  • speaker identity
  • speaker sex
  • vowel measured
  • vowel F1
  • vowel F2

And one observation is

  • The F1 and F2 of each vowel for each speaker.

But, there are actually three different variables smushed together along the columns, Vowel + (F1,F2), and there are multiple observations per row. We definitely don’t want to store data or work with data in this format. If you already have data in this format, or have the misfortune of needing to work with data in this format, it’s easy enough to convert this data to a tidier format.

Step 1: “gather” the data into a long format.

  pb_messy %>%
    gather(vowel_formant, hz, AA_F1:UW_F2)->pb_gathered
  
  pb_gathered
## # A tibble: 1,520 x 4
##       sex speaker vowel_formant    hz
##    <fctr>   <int>         <chr> <dbl>
## 1       M       1         AA_F1 770.0
## 2       M       2         AA_F1 660.0
## 3       M       3         AA_F1 680.0
## 4       M       4         AA_F1 775.0
## 5       M       5         AA_F1 740.0
## 6       M       6         AA_F1 705.0
## 7       M       7         AA_F1 650.0
## 8       M       8         AA_F1 682.0
## 9       M       9         AA_F1 706.5
## 10      M      10         AA_F1 763.5
## # ... with 1,510 more rows

Step 2: Separate the vowel + formant column into two separate variables

  pb_gathered %>%
    separate(vowel_formant, 
             into = c("vowel","formant"), 
             sep = "_")->pb_gathered_split
  
  pb_gathered_split
## # A tibble: 1,520 x 5
##       sex speaker vowel formant    hz
## *  <fctr>   <int> <chr>   <chr> <dbl>
## 1       M       1    AA      F1 770.0
## 2       M       2    AA      F1 660.0
## 3       M       3    AA      F1 680.0
## 4       M       4    AA      F1 775.0
## 5       M       5    AA      F1 740.0
## 6       M       6    AA      F1 705.0
## 7       M       7    AA      F1 650.0
## 8       M       8    AA      F1 682.0
## 9       M       9    AA      F1 706.5
## 10      M      10    AA      F1 763.5
## # ... with 1,510 more rows

Step 3: Use the values in the formant column as column names of their own.

  pb_gathered_split%>%
    spread(formant, hz)->pb_spread
  
  pb_spread
## # A tibble: 760 x 5
##       sex speaker vowel    F1    F2
## *  <fctr>   <int> <chr> <dbl> <dbl>
## 1       M       1    AA   770  1065
## 2       M       1    AE   595  1760
## 3       M       1    AH   605  1275
## 4       M       1    AO   630   975
## 5       M       1    EH   530  1785
## 6       M       1    ER   415  1425
## 7       M       1    IH   350  2005
## 8       M       1    IY   260  2340
## 9       M       1    UH   420  1095
## 10      M       1    UW   255   985
## # ... with 750 more rows

And with the magic of %>%, you can do all three in one long chain.

  pb_messy %>%
    gather(vowel_formant, hz, AA_F1:UW_F2)%>%
    separate(vowel_formant, 
             into = c("vowel","formant"), 
             sep = "_")%>%
    spread(formant, hz)
## # A tibble: 760 x 5
##       sex speaker vowel    F1    F2
## *  <fctr>   <int> <chr> <dbl> <dbl>
## 1       M       1    AA   770  1065
## 2       M       1    AE   595  1760
## 3       M       1    AH   605  1275
## 4       M       1    AO   630   975
## 5       M       1    EH   530  1785
## 6       M       1    ER   415  1425
## 7       M       1    IH   350  2005
## 8       M       1    IY   260  2340
## 9       M       1    UH   420  1095
## 10      M       1    UW   255   985
## # ... with 750 more rows

Now, it’s a lot easier to make the classic F2xF1 plot.

  pb_spread %>%
    ggplot(aes(F2, F1, color = sex))+
      geom_point()+
      scale_y_reverse()+
      scale_x_reverse()+
      coord_fixed()


What is an observation?

When I described the variables in the data, and what counted as an observation, I was assuming, like I think most sociolinguists do, that one “observation” is one vowel, and that we have multiple different kinds of measurements per vowel (F1, F2, duration, etc.). However, as I think will be made clearer in the next post on normalization, some techniques actually assume a finer grained “observation” than that. Let’s look at the relatively tidy raw Peterson & Barney data.

  pb_data %>%
    tbl_df()
## # A tibble: 1,520 x 8
##       sex speaker  phon    F0    F1    F2    F3    id
##    <fctr>   <int> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1       M       1    IY   160   240  2280  2850     1
## 2       M       1    IY   186   280  2400  2790     2
## 3       M       1    IH   203   390  2030  2640     3
## 4       M       1    IH   192   310  1980  2550     4
## 5       M       1    EH   161   490  1870  2420     5
## 6       M       1    EH   155   570  1700  2600     6
## 7       M       1    AE   140   560  1820  2660     7
## 8       M       1    AE   180   630  1700  2550     8
## 9       M       1    AH   144   590  1250  2620     9
## 10      M       1    AH   148   620  1300  2530    10
## # ... with 1,510 more rows

We could identify the following variables:

  • speaker identity
  • speaker sex
  • vowel measure id
  • vowel label
  • formant estimated
  • value in hz

And we could define an observation as:

  • One hz value from one formant for one vowel measured from one speaker.

Under this definition of the variables and observations, one of the variables (the formant that was estimated) is spread out along the columns. Let’s apply the tidying approach I described above:

  pb_data %>% 
    tbl_df() %>%
    gather(formant, hz, F0:F3) -> pb_long
  
  pb_long
## # A tibble: 6,080 x 6
##       sex speaker  phon    id formant    hz
##    <fctr>   <int> <chr> <int>   <chr> <dbl>
## 1       M       1    IY     1      F0   160
## 2       M       1    IY     2      F0   186
## 3       M       1    IH     3      F0   203
## 4       M       1    IH     4      F0   192
## 5       M       1    EH     5      F0   161
## 6       M       1    EH     6      F0   155
## 7       M       1    AE     7      F0   140
## 8       M       1    AE     8      F0   180
## 9       M       1    AH     9      F0   144
## 10      M       1    AH    10      F0   148
## # ... with 6,070 more rows

Why would you want to have your data in this kind of weird format? Well, for one, as I’ll demonstrate in the next post, it makes it a bit easier to do some kinds of vowel normalization that require you to estimate the mean of F1 and F2 together (like the ANAE method, or versions of Nearey normalization). Second, this can be a relatively interesting format in its own right.

  pb_long %>%
    filter(phon %in% c("IY","AE","AA"))%>%
    group_by(phon, formant, sex, speaker)%>%
    summarise(hz = mean(hz)) %>%
    summarise(hz = mean(hz)) %>%
    ggplot(aes(formant, hz, color = phon))+
      geom_line(aes(group = phon))+
      facet_wrap(~sex)



Edit August 8, 2016: Spelling

by Joe


Testing Figures

-June 12, 2015

I’m not sure if I’m doing figures right.

  library(ggplot2)
  library(dplyr)
  library(magrittr)

  faithful %<>%
    mutate(clust = kmeans(., centers = 2) %>% extract2("cluster") %>% factor())


  ggplot(faithful, aes(waiting, eruptions, color = clust))+
    geom_point()


Update

So, I had to get knitr to save the figure file to one directory, but call it another in the code. That is, knitr would insert this line into the markdown:

![alt text](//jofrhwld.github.io/blog/figs/figure.png)

And what I needed it to insert was

![alt text]({{ site.baseurl }}/figs/figure.png)

So I just hacked the hell out of it by reading in the output of render_jekyll(), doing a regex substitution, and writing it back out again.

It’s so ugly. You can check it out here: https://github.com/JoFrhwld/blog/blob/gh-pages/knit_post.R#L7

by Joe
