-May 1, 2016
Today I went to Hadley Wickham’s talk to the EdinbR users’ group, called “Managing Many Models in R”. It highlighted some workflows for cleanly fitting many models to subsets of a data set, then comparing across the model fits. I thought I’d experiment by applying them to some linguistic data that I’m familiar with.
Very briefly, “TD Deletion”, as I’m calling it, is a process common to every variety of English whereby a word-final /t/ or /d/ in a consonant cluster is deleted. Some example words are:
| Word | Deleted |
|---|---|
| mist | mis’ |
| kept | kep’ |
| missed | miss’ |
It’s more common in words where the /t d/ is just part of the word, like *mist*, than in words where it’s part of the past tense marker, like *missed*. In the data, the *mist* case is labeled `mono` for “monomorpheme” and the *missed* case is labeled `past`.
The TD Deletion data I’m using is derived from the Buckeye Corpus, and is available in the `grammarOfVariationData` package.
library(devtools)
if (!require("grammarOfVariationData")) {
  install_github("jofrhwld/grammarOfVariationData")
  library(grammarOfVariationData)
}
The functions in these workflows come from what some call the “Hadleyverse”, though Hadley isn’t the primary author of two of these packages (`broom` and `magrittr`). When I’m using functions from these packages, I’ll use the `package::function()` notation so it’s clear which functions come from which package, but in normal practice you probably won’t need to include the `package::` part yourself.
library(dplyr)
library(purrr)
library(broom)
library(tidyr)
library(magrittr)
library(ggplot2)
I’ll also be fitting some more standard mixed-effects models for comparison’s sake.
library(lme4)
I’ll be looking at two factors that seem to influence the rate of TD Deletion: speech rate and word frequency. Specifically, I’m going to see whether there are any meaningful individual differences in how individual speakers are affected by these factors.
This is a subset of columns from the full Buckeye TD deletion data.
buckeye %>%
dplyr::select(Speaker, Word, seg, DictRate, Gram2, td)%>%
head()
## Speaker Word seg DictRate Gram2 td
## 1 s01 lived d 6.036144 past 1
## 2 s01 and d 6.182479 and 0
## 3 s01 raised d 6.431375 past 1
## 4 s01 west t 4.227088 mono 0
## 5 s01 kind d 6.612163 mono 0
## 6 s01 received d 4.600222 past 1
`DictRate` corresponds to how many syllables per second there were in an 8-word window surrounding the target word, based on canonical pronunciations. `td` is the outcome measure, and has a value of `1` if the /t d/ was pronounced and a value of `0` if it wasn’t.
`DictRate` is heavily skewed, with some extremely (implausibly) high syllables-per-second rates. Things look more plausible if we exclude data where the target word was not right- or left-aligned with the window in which speech rate was estimated. So, we’ll use this subset for the speech rate modeling.
buckeye %>%
ggplot(aes(DictRate))+
geom_density()
buckeye %>% filter(PreWindow > 1 | PostWindow > 1 ) %>%
ggplot(aes(DictRate))+
geom_density()
Step 1 is to create a data frame of data frames. First, let’s select the rows and columns of the full dataset we want to use.
rate_to_use <- buckeye %>%
dplyr::filter(PreWindow > 1 | PostWindow > 1,
Gram2 %in% c("past", "mono")) %>%
dplyr::mutate(RateCenter = DictRate - median(DictRate)) %>%
dplyr::select(Speaker, Word, RateCenter, Gram2, td)
I want to fit one model for each speaker for each grammatical class (past tense or monomorpheme), so I need to group this data frame by `Speaker` and `Gram2`. Then, I’ll use `nest()` to create a new column that contains a data frame with the remaining columns.
rate_nest <- rate_to_use %>%
dplyr::group_by(Speaker, Gram2) %>%
tidyr::nest()
Here’s what the data frame of data frames looks like:
rate_nest
## Source: local data frame [76 x 3]
##
## Speaker Gram2 data
## (fctr) (fctr) (chr)
## 1 s01 past <tbl_df [58,3]>
## 2 s01 mono <tbl_df [171,3]>
## 3 s02 mono <tbl_df [218,3]>
## 4 s02 past <tbl_df [46,3]>
## 5 s03 mono <tbl_df [219,3]>
## 6 s03 past <tbl_df [56,3]>
## 7 s04 mono <tbl_df [240,3]>
## 8 s04 past <tbl_df [41,3]>
## 9 s05 mono <tbl_df [184,3]>
## 10 s05 past <tbl_df [38,3]>
## .. ... ... ...
Here’s what the first item in the `data` column looks like.
rate_nest$data[[1]]
## Source: local data frame [58 x 3]
##
## Word RateCenter td
## (fctr) (dbl) (int)
## 1 lived 0.8881393 1
## 2 raised 1.2833697 1
## 3 received -0.5477829 1
## 4 sustained 0.4536305 1
## 5 moved -1.1036793 1
## 6 involved 0.3295448 1
## 7 involved 3.3815070 1
## 8 looked -3.4224800 1
## 9 informed -1.5658356 1
## 10 watched 0.7513170 1
## .. ... ... ...
The next step is to write a little function that takes one of our small data frames as input, and will produce a model as an output.
fit_one_rate_mod <- function(data){
mod <- glm(td ~ RateCenter , data = data, family = binomial)
return(mod)
}
Here’s what happens when we pass just one of our smaller data frames to `fit_one_rate_mod`.
fit_one_rate_mod(rate_nest$data[[1]])
##
## Call: glm(formula = td ~ RateCenter, family = binomial, data = data)
##
## Coefficients:
## (Intercept) RateCenter
## 1.35381 0.02941
##
## Degrees of Freedom: 57 Total (i.e. Null); 56 Residual
## Null Deviance: 59.14
## Residual Deviance: 59.12 AIC: 63.12
But what we want to do is apply `fit_one_rate_mod` to all of the nested data frames. For this, we can use the `purrr::map()` function. It takes a list as its first argument and a function as its second argument, and applies that function to every item in the list.
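Before applying it to the nested data frames, here’s a toy illustration of `purrr::map()` on its own (a made-up example, not the TD data):

```r
library(purrr)

# map() applies a function to each element of a list,
# and returns a list of the results
sizes <- purrr::map(list(a = 1:3, b = 4:10), length)
sizes$a  # 3
sizes$b  # 7
```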
rate_nest <- rate_nest %>%
dplyr::mutate(mod = purrr::map(data, fit_one_rate_mod))
rate_nest
## Source: local data frame [76 x 4]
##
## Speaker Gram2 data mod
## (fctr) (fctr) (chr) (chr)
## 1 s01 past <tbl_df [58,3]> <S3:glm, lm>
## 2 s01 mono <tbl_df [171,3]> <S3:glm, lm>
## 3 s02 mono <tbl_df [218,3]> <S3:glm, lm>
## 4 s02 past <tbl_df [46,3]> <S3:glm, lm>
## 5 s03 mono <tbl_df [219,3]> <S3:glm, lm>
## 6 s03 past <tbl_df [56,3]> <S3:glm, lm>
## 7 s04 mono <tbl_df [240,3]> <S3:glm, lm>
## 8 s04 past <tbl_df [41,3]> <S3:glm, lm>
## 9 s05 mono <tbl_df [184,3]> <S3:glm, lm>
## 10 s05 past <tbl_df [38,3]> <S3:glm, lm>
## .. ... ... ... ...
rate_nest$mod[[1]]
##
## Call: glm(formula = td ~ RateCenter, family = binomial, data = data)
##
## Coefficients:
## (Intercept) RateCenter
## 1.35381 0.02941
##
## Degrees of Freedom: 57 Total (i.e. Null); 56 Residual
## Null Deviance: 59.14
## Residual Deviance: 59.12 AIC: 63.12
In our nested data frame, we now have a column which contains a logistic regression model for each speaker for each grammatical class. In order to do anything meaningful now, we need to extract the important information from each model, ideally storing it again in a data frame. Here’s where `broom::glance()` and `broom::tidy()` come in. `glance()` will produce a one-line summary for the entire model, and `tidy()` will produce a summary of the model coefficients.
broom::glance(rate_nest$mod[[1]])
## null.deviance df.null logLik AIC BIC deviance df.residual
## 1 59.13862 57 -29.5594 63.11879 67.23968 59.11879 56
broom::tidy(rate_nest$mod[[1]])
## term estimate std.error statistic p.value
## 1 (Intercept) 1.35381379 0.3329521 4.0660918 0.0000478081
## 2 RateCenter 0.02941017 0.2086622 0.1409463 0.8879123804
What we’ll do is apply `broom::tidy()` to each model, again using `purrr::map()`, and extract how many data points each model was fit with. Then, with `tidyr::unnest()`, we’ll pop this out into a free-standing data frame.
rate_nest <- rate_nest %>%
dplyr::mutate(tidy = purrr::map(mod, broom::tidy),
n = purrr::map(data, nrow) %>% simplify())
rate_nest
## Source: local data frame [76 x 6]
##
## Speaker Gram2 data mod tidy n
## (fctr) (fctr) (chr) (chr) (chr) (int)
## 1 s01 past <tbl_df [58,3]> <S3:glm, lm> <data.frame [2,5]> 58
## 2 s01 mono <tbl_df [171,3]> <S3:glm, lm> <data.frame [2,5]> 171
## 3 s02 mono <tbl_df [218,3]> <S3:glm, lm> <data.frame [2,5]> 218
## 4 s02 past <tbl_df [46,3]> <S3:glm, lm> <data.frame [2,5]> 46
## 5 s03 mono <tbl_df [219,3]> <S3:glm, lm> <data.frame [2,5]> 219
## 6 s03 past <tbl_df [56,3]> <S3:glm, lm> <data.frame [2,5]> 56
## 7 s04 mono <tbl_df [240,3]> <S3:glm, lm> <data.frame [2,5]> 240
## 8 s04 past <tbl_df [41,3]> <S3:glm, lm> <data.frame [2,5]> 41
## 9 s05 mono <tbl_df [184,3]> <S3:glm, lm> <data.frame [2,5]> 184
## 10 s05 past <tbl_df [38,3]> <S3:glm, lm> <data.frame [2,5]> 38
## .. ... ... ... ... ... ...
rate_coefs <- rate_nest %>%
tidyr::unnest(tidy) %>%
dplyr::select(-(std.error:p.value)) %>%
tidyr::spread(term, estimate)
rate_coefs
## Source: local data frame [76 x 5]
##
## Speaker Gram2 n (Intercept) RateCenter
## (fctr) (fctr) (int) (dbl) (dbl)
## 1 s01 mono 171 0.3109309 -0.08511930
## 2 s01 past 58 1.3538138 0.02941017
## 3 s02 mono 218 0.2834635 -0.30801155
## 4 s02 past 46 0.3795177 -0.04756514
## 5 s03 mono 219 -0.3751245 -0.42651949
## 6 s03 past 56 0.1228607 -0.34137119
## 7 s04 mono 240 -0.7632547 -0.22761177
## 8 s04 past 41 0.8512747 -0.18682255
## 9 s05 mono 184 0.8195723 -0.14171049
## 10 s05 past 38 1.5298129 -0.64609481
## .. ... ... ... ... ...
Now it’s just a matter of plotting it out.
rate_coefs%>%
ggplot(aes(`(Intercept)`, RateCenter, color = Gram2))+
geom_point(aes(size = n), alpha = 0.6)+
scale_size_area()+
coord_fixed()
It looks like the predicted effect, that /t d/ retention decreases as speech rate increases, holds up: most of the points for each speaker are below 0. There is a fair bit of inter-speaker variation in the baseline rate of /t d/ retention (along the x axis), but much less in the rate-of-speech effect, and there seems to be very little relationship between the two. That is, most speakers are likely to react in a very similar way to an increasing rate of speech, regardless of whether they have a high or low baseline rate of /t d/ retention.
Let’s look at how this would all shake out in a normal mixed-effects model:
rate_data <- rate_nest%>%unnest(data)
rate_mod <- glmer(td ~ RateCenter*Gram2 + (RateCenter|Speaker) + (1|Word),
data = rate_data,
family = binomial)
summary(rate_mod)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: td ~ RateCenter * Gram2 + (RateCenter | Speaker) + (1 | Word)
## Data: rate_data
##
## AIC BIC logLik deviance df.resid
## 12846.1 12904.8 -6415.0 12830.1 11359
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -7.4473 -0.6971 0.3012 0.6463 4.0799
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Word (Intercept) 1.044249 1.02189
## Speaker (Intercept) 0.255016 0.50499
## RateCenter 0.002359 0.04857 -0.03
## Number of obs: 11367, groups: Word, 1000; Speaker, 38
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.74630 0.10817 6.900 5.21e-12 ***
## RateCenter -0.17929 0.01742 -10.294 < 2e-16 ***
## Gram2past 0.67300 0.11483 5.861 4.61e-09 ***
## RateCenter:Gram2past 0.08665 0.03441 2.518 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) RtCntr Grm2ps
## RateCenter -0.010
## Gram2past -0.363 0.002
## RtCntr:Grm2 0.001 -0.356 0.002
This looks mostly like we’d expect given the speaker-level modelling. The fixed effect of speech rate is negative, meaning a higher rate of speech goes along with a higher rate of deletion. Looking at the random effects, we see a larger standard deviation for speaker-level intercepts than for speaker-level speech rate effects, just like in the figure above, and the two are very weakly correlated.
Let’s do the same thing for frequency effects. I’ll be using word frequency norms from SUBTLEX-US, specifically a centered version of the Zipf Scale.
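As an aside on the Zipf scale itself (my gloss, not part of the original workflow): it’s log10 frequency per billion words, i.e. log10 of SUBTLEX’s per-million frequency plus 3. Since we median-center the scores anyway, the additive constant washes out, which is why plain `log10(SUBTLWF)` works below.

```r
# Zipf scale: log10 frequency per billion words. SUBTLEX's SUBTLWF
# is frequency per million, so Zipf = log10(SUBTLWF) + 3.
zipf_scale <- function(freq_per_million) log10(freq_per_million) + 3

zipf_scale(1)     # 3: one occurrence per million words
zipf_scale(1000)  # 6: a very high-frequency word

# After median-centering, the +3 constant drops out, so centering
# plain log10 frequencies gives the same centered scores.
f <- c(0.5, 2, 80)
all.equal(zipf_scale(f) - median(zipf_scale(f)),
          log10(f)      - median(log10(f)))     # TRUE
```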
I’m going to compress this down into as few lines as possible just to show off the power of the workflow.
subtlex <- read.delim("../data/subtlex/SUBTLEXus74286wordstextversion-3.txt")
# This is just transforming and centering the frequency data for modelling.
# It's important to center the frequencies over word types in this way.
zipf_scores <- subtlex %>%
dplyr::mutate(Word = tolower(Word),
zipf = log10(SUBTLWF)) %>%
dplyr::select(Word, zipf) %>%
dplyr::semi_join(buckeye %>%
mutate(Word = tolower(Word)) %>%
filter(Gram2 %in% c("past", "mono"))) %>%
dplyr::mutate(zipf_center = zipf - median(zipf))
## Joining by: "Word"
# getting the subset of the data to use for analysis
freq_to_use <- buckeye%>%
dplyr::filter(Gram2 %in% c("past", "mono"))%>%
dplyr::mutate(Word = tolower(Word))%>%
dplyr::left_join(zipf_scores)%>%
dplyr::select(Speaker, Word, zipf_center, Gram2, td)
## Joining by: "Word"
# The model fitting function
fit_one_freq_mod <- function(data){
mod <- glm(td ~ zipf_center, data = data, family = binomial)
return(mod)
}
# this block does all of the modelling and summaries of the models.
freq_nest <- freq_to_use %>%
dplyr::group_by(Speaker, Gram2) %>%
tidyr::nest()%>%
dplyr::mutate(mod = purrr::map(data, fit_one_freq_mod),
tidy = map(mod, broom::tidy),
n = purrr::map(data, nrow)%>%simplify())
# extracting the tidy summaries for plotting
freq_coefs <- freq_nest %>%
tidyr::unnest(tidy) %>%
dplyr::select(-(std.error:p.value)) %>%
tidyr::spread(term, estimate)
freq_coefs%>%
ggplot(aes(`(Intercept)`, zipf_center, color = Gram2))+
geom_point(aes(size = n), alpha = 0.6)+
coord_fixed()
Unlike the rate of speech, it looks like there is quite a bit of interspeaker variation in their sensitivity to word frequency, and it’s related to their baseline rate of /t d/ retention. Speakers with a higher baseline deletion rate have a weaker frequency effect. You can kind of see this in the output of a standard mixed effects model, but it’s rolled up in the random-effects report that we usually don’t pay enough attention to.
freq_data <- freq_nest%>%unnest(data)
freq_mod <- glmer(td ~ zipf_center*Gram2 + (zipf_center|Speaker) + (1|Word),
data = freq_data,
family = binomial)
summary(freq_mod)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: td ~ zipf_center * Gram2 + (zipf_center | Speaker) + (1 | Word)
## Data: freq_data
##
## AIC BIC logLik deviance df.resid
## 13746.6 13805.9 -6865.3 13730.6 12163
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.1754 -0.6912 0.3073 0.6549 3.9010
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Word (Intercept) 1.05171 1.0255
## Speaker (Intercept) 0.28874 0.5373
## zipf_center 0.07198 0.2683 -0.43
## Number of obs: 12171, groups: Word, 1007; Speaker, 38
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.88120 0.12105 7.279 3.35e-13 ***
## zipf_center -0.13821 0.09446 -1.463 0.143
## Gram2past 0.65529 0.12451 5.263 1.42e-07 ***
## zipf_center:Gram2past -0.07027 0.13884 -0.506 0.613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) zpf_cn Grm2ps
## zipf_center -0.480
## Gram2past -0.432 0.321
## zpf_cntr:G2 0.218 -0.530 -0.328
Up there in the random effects summary, you can see that the correlation between speaker-level intercepts and speaker-level frequency effects is -0.43, which means more or less the same thing as the figure above.
I think there’s a tendency when we have a nice big data set in hand to skip straight to fitting one big mixed-effects model to rule them all. But in rushing headlong into the one big model and focusing on the p-values on the fixed effects, we might be missing some important and interesting patterns in our data.
by Joe
-October 9, 2015
I’ve been playing around with an analysis method called “Functional Data Analysis”, which has me looking at the kinds of analyses we do of formant measurements a bit differently. In sociolinguistics, we’ve inherited an intellectual and methodological tradition of characterizing vowels, including diphthongs, using single point measurements. More historiography needs to be done on why this is, but I think it might be a long carry-over from Daniel Jones’ cardinal vowels theory, plus Trager & Bloch’s theory characterizing English as having 6 phonemic vowels. I’ve got a table here of Trager & Bloch’s phonemic theory, ’cause it’s kind of beautiful.
phoneme | V | /V/+j | /V/+w | /V/+h |
---|---|---|---|---|
/i/ | pit | pea | dew | idea |
/e/ | pet | pay | – | yeah |
/a/ | pat | buy | bough | pa |
/o/ | pot | boy | – | paw |
/ə/ | butt | – | beau | huh |
/u/ | put | – | boo | Kahlua |
In any event, a common alternative to characterizing vowels in terms of point measurements of steady states is to take a fixed number of measurements along the full formant track, or to measure the formants at specific points proportional to the vowel’s duration. This latter approach is supported by FAVE, which outputs F1 and F2 at 20%, 35%, 50%, 65% and 80% of the vowel’s duration. My own take is that this approach is better for trying to understand formant dynamics, but has its own shortcomings that I’ll discuss here. I’ll also discuss a potential alternative using functional data analysis.
# Setting up for this post.
# NB: For some reason summarise() and mutate() from plyr
# wind up getting loaded when I compile this post, so I
# have to use dplyr::mutate and dplyr::summarise throughout
library(knitr)
library(dplyr)
library(tidyr)
library(magrittr)
library(fda)
library(ggplot2)
opts_chunk$set(cache = T, message= F, fig.width = 8/1.5,
fig.height = 5/1.5, dev = 'svg', message = F, autodep = T,
cache.path = "rmd_drafts/prop_problems/")
dep_auto()
To demonstrate for this post, I’ll be using formant tracks from my own speech as an interviewer from 2006.
Since FAVE v1.2, the option has existed to save the full formant tracks of each vowel analyzed, using the `--tracks` flag.
I’ll be focusing on pre-voiceless /ay/ since it’s got nice and dynamic F1 to explore, and it’s my favorite.
I’ve put the data online, so you can run all of this code at home.
tracks <- read.delim("http://jofrhwld.github.io/data/fruehwald.txt")
ays <- tracks %>%
filter(plt_vclass %in% c("ay0")) %>%
filter(!word == "LIKE") %>%
group_by(id) %>%
dplyr::mutate(t_rel = t-min(t),
t_prop = t_rel/max(t_rel))
As a quick demonstration, here’s a plot of 3 cherry picked F1 tracks, plotted across proportional time, with the 20%, 35%, 50%, 65% and 80% points marked.
example_tracks <- ays %>%
filter(id %in% c(756, 758, 1088))
example_points <- example_tracks %>%
group_by(id) %>%
slice(round(c(0.2, 0.35, 0.5, 0.65, 0.8) * n())) %>%
mutate(prop = c(0.2, 0.35, 0.5, 0.65, 0.8))
example_tracks %>%
ggplot(aes(t_prop, F1))+
geom_line(aes(group = id))+
geom_point(data = example_points, aes(color = factor(prop)), size = 3)
Looking at these formant tracks, an important feature seems to be that F1 increases, reaches a maximum, then decreases. But that landmark feature happens at different points in proportional time for each of these formant tracks. For the lowest curve, the 20% time point is closest to the F1-maximum landmark; for the middle curve, the 35% point is closest; and for the highest curve it’s actually the 50% point. This poses a general problem for averaging across these proportional points in order to explore formant dynamics, since each token is at a different phase of its track at each point. At 50% of the duration, for example, two of the tokens are well into the transition from the nucleus to the glide, while the third is just reaching its maximum. By just averaging at the 50% point, you’re averaging over values that aren’t dynamically equivalent.
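To make the averaging problem concrete, here’s a toy example (idealized bumps, not the Buckeye data) with two “tracks” peaking at different proportional times:

```r
# Two idealized "F1 tracks" with peaks of equal height (1) but at
# different proportional times (0.3 and 0.5).
t  <- seq(0, 1, length.out = 101)
y1 <- exp(-(t - 0.3)^2 / 0.01)
y2 <- exp(-(t - 0.5)^2 / 0.01)

# Their pointwise average has a lower, broader peak than either input:
# at any fixed proportional time, the two curves are at different
# phases, so the average smears the landmark out.
max((y1 + y2) / 2) < max(y1)  # TRUE
```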
There is another approach that still allows for the analysis of formant trajectories, and tries to address the problem of dynamic equivalency. It involves the following steps:

1. Convert each formant track to a functional representation.
2. Identify the landmark of interest (here, the F1 maximum) in each function.
3. Register the functions so that the landmark lines up across tokens.
The first step needs some explaining. In the dataframe `example_tracks`, the data is represented as just a collection of times and F1s. For each row \(i\), you have a pair of observations, \(\langle x_i, y_i \rangle\), specifically \(\langle \text{time}_i, \text{F1}_i \rangle\). But really, we want to think about F1 as a function of time: \(y=f(x)\), or \(\text{F1}=f(\text{time})\). I’ll be making this conversion from the \(\langle x_i, y_i \rangle\) representation to the \(y=f(x)\) representation for each formant track. There are a few ways to go about this using the `fda` package, and a few different parameters under my control that will affect the resulting functional representation for each formant track, all of which take a while to explain. For now, I’ll describe without explaining too much. First, I’ll be using a relatively large b-spline basis, and penalizing the roughness of the acceleration of the function (a.k.a. its second derivative).
# I'm leveraging dplyr's ability to store lists in columns.
example_tracks %>%
group_by(id) %>%
dplyr::mutate(t_rel = t-min(t),
t_prop = t_rel/max(t_rel),
duration = max(t_rel)) %>%
group_by(word, fol_seg, id, plt_vclass, duration, context) %>%
do(f1 = c(.$F1),
f2 = c(.$F2),
t_prop = c(.$t_prop))->examples_vector
# Defining the b-spline basis.
# This one has a pretty large number of knots given the
# amount of wiggliness we're looking for.
basic_basis <- create.bspline.basis(rangeval = c(0, 1),
nbasis = (30 + 7)-2,
norder = 7)
Here’s a demonstration of how the FDA works for one of the formant tracks. The figure plots the original F1 formant track measurements with the open circles, and the bold red line is the new functional representation.
sample_time <- examples_vector$t_prop[[2]]
sample_f1 <- examples_vector$f1[[2]]
# Maximum Likelihood fit.
# With the size of the basis, this is almost completely perfect fit.
# RMSE = 0.1972364, certainly overfit.
sample_fit <- smooth.basis(sample_time, sample_f1, basic_basis)
# Smooth with a roughness penalty on the acceleration
# Lfdobj = the derivative number
# 0 = original F1, 1 = velocity, 2 = acceleration
sample_smooth <- smooth.fdPar(sample_fit$fd, Lfdobj = 2, lambda = 1e-5)
plot(sample_time, sample_f1, cex = 0.5)
x <- plot(sample_smooth, add = T, col = 'red', lwd = 3)
One of the great benefits of converting the scalar representation to the functional representation is that we can also calculate the first and second derivatives of the function, to give us the rate of change in F1, and the rate of acceleration in F1.
# the assignment here is just to capture an annoying
# "done" value plot.fd produces
par(mfrow = c(1,3))
x <- plot(sample_smooth, Lfdobj = 0)
title("F1")
x <- plot(sample_smooth, Lfdobj = 1)
title("F1 velocity")
x <- plot(sample_smooth, Lfdobj = 2)
title("F1 acceleration")
If it’s not clear from eyeballing this figure, where the velocity of F1 crosses zero with a negative slope is where F1 reaches its maximum. I’ve written a little function that will return zero-crossing locations if you give it a functional data object. You can look at it here.
library(devtools)
source_gist("https://gist.github.com/JoFrhwld/f75528f7b358148ed9fb",
quiet = T)
sample_zero_cross <- zero_crossings(sample_smooth, Lfdobj = 1, slope = -1)
x <- plot(sample_smooth)
abline(v = sample_zero_cross, lty = 2)
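The `zero_crossings()` function itself lives in the gist above, but the core idea is simple enough to sketch in base R (my own illustration, not the gist’s code): evaluate the curve on a fine grid, find where the sign flips in the requested direction, and linearly interpolate.

```r
# Find x-locations where a sampled curve crosses zero.
# slope = -1: downward crossings (e.g. where a velocity curve hits
# zero at a maximum of the original function); slope = 1: upward.
zero_cross_grid <- function(x, y, slope = -1) {
  s <- sign(y)
  i <- which(diff(s) == 2 * slope)
  # linear interpolation between the bracketing grid points
  x[i] - y[i] * (x[i + 1] - x[i]) / (y[i + 1] - y[i])
}

# cos(t) crosses zero going downward at pi/2
t <- seq(0, pi, length.out = 1000)
zero_cross_grid(t, cos(t), slope = -1)  # ~1.5708
```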
So our next steps will be to smooth all three of these formant tracks in the same way, and identify the same landmark in all of them.
Here’s the dplyr
code I’ve used to fit a b-spline curve to each formant track, and smooth it with a roughness penalty on the acceleration.
example_smoothed <- examples_vector %>%
rowwise()%>%
do(fdobj = smooth.basis(.$t_prop,
.$f1,
basic_basis)$fd %>%
smooth.fdPar(., Lfdobj = 2, lambda = 1e-5))
# a list2fd function
source_gist("https://gist.github.com/JoFrhwld/0b0f69ca1a341f38f275",
quiet = T)
example_fd <- list2fd(example_smoothed$fdobj, basic_basis)
par(mfrow = c(1,3))
x <- plot(example_fd, lty = 1)
title("F1")
x <- plot(example_fd, Lfdobj = 1, lty = 1)
title("F1 velocity")
x <- plot(example_fd, Lfdobj = 2, lty = 1)
title("F1 acceleration")
Just like it was in the original plot of the raw formant tracks, these functions reach their F1 maximum at different points in proportional time, which is really clear to see if you look at where the velocity crosses zero for each. This next figure plots the F1 function with a vertical line indicating the position of the zero crossing.
example_zero_cross <- zero_crossings(example_fd, Lfdobj = 1, slope = -1)
x <- plot(example_fd, lty = 1)
abline(v = example_zero_cross, col = 1:3, lty = 2)
The next step in the functional data analysis is to register these functions with regard to their landmarks. I have to admit that I don’t fully understand how this works, but it involves warping the time domain. The warping can be non-linear, so we’ll be defining a basis similar to the original basis we used for smoothing the formant tracks. One thing to note is that we’ll be warping the time domain for these functions according to just one landmark, but it is possible to define multiple landmarks for each function.
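`landmarkreg()` does this properly, with smooth monotone warping functions. Just to fix the intuition, here’s a deliberately crude, purely linear version of warping a single landmark to a target time (my own sketch, not what `fda` does internally):

```r
# Piecewise-linear time warp: map [0, landmark] onto [0, target] and
# [landmark, 1] onto [target, 1], so the landmark lands at `target`
# while the endpoints stay fixed.
warp_time <- function(t, landmark, target) {
  ifelse(t <= landmark,
         t * target / landmark,
         target + (t - landmark) * (1 - target) / (1 - landmark))
}

# The landmark (0.2) moves to the target (0.33); 0 and 1 stay put.
warp_time(c(0, 0.2, 1), landmark = 0.2, target = 0.33)  # 0, 0.33, 1
```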
wbasis <- create.bspline.basis(rangeval=c(0,1),
norder=4,
breaks=seq(0, 1, len=10))
Wfd0 <- fd(matrix(0, wbasis$nbasis, 1), wbasis)
WfdPar <- fdPar(Wfd0, 1, 1e-4)
example_regfd <- landmarkreg(example_fd,
ximarks = example_zero_cross,
WfdPar = WfdPar)
## Progress: Each dot is a curve
## ...
# Plotting a comparison between the original and registered curves
par(mfrow = c(1,2))
x <- plot(example_fd, lty = 1)
abline(v = example_zero_cross, col = 1:3, lty = 2)
title("original")
x <- plot(example_regfd$regfd, lty = 1)
abline(v = mean(example_zero_cross), lty = 2)
title("registered")
Each curve has been warped so that its F1 maximum occurs at the same time. This is going to have the biggest effect on the mean of each set of curves. By the way, b-splines have some really pleasing properties when it comes to estimating means and broader analyses: you just take the mean across each b-spline coefficient.
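That linearity claim is easy to check with base R’s `splines` package (a self-contained check, separate from the `fda` objects above): evaluating a b-spline basis is a matrix product, so averaging curves pointwise is the same as building a curve from the averaged coefficients.

```r
library(splines)

# A small cubic b-spline basis on [0, 1], evaluated on a grid
knots <- c(0, 0, 0, 0, 0.5, 1, 1, 1, 1)
x <- seq(0, 1, length.out = 50)
B <- splineDesign(knots, x, ord = 4)   # 50 x 5 basis matrix

# Three curves, represented as three coefficient vectors
set.seed(1)
coefs <- matrix(rnorm(3 * ncol(B)), ncol = 3)
curves <- B %*% coefs

# Pointwise mean of the curves == curve built from the mean coefficients
all.equal(rowMeans(curves), as.vector(B %*% rowMeans(coefs)))  # TRUE
```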
# getting the mean function for the original
# and registered functions
mean_fd <- mean(example_fd)
mean_regfd <- mean(example_regfd$regfd)
# Prediction and plotting
data_frame(time = seq(0, 1, length = 100),
original = predict(mean_fd, newdata = time)[,1],
registered = predict(mean_regfd, newdata = time)[,1])%>%
gather(type, F1, original:registered) %>%
ggplot(aes(time, F1, color = type))+
geom_line()+
scale_color_brewer(palette = "Dark2")+
theme_bw()
The mean of the registered functions is different from the mean of the original functions in two ways: the location of the F1 maximum is different, and the magnitude of the F1 maximum is different. I would argue that the mean of the registered function is different in a good direction.
locations = c(original = zero_crossings(mean_fd, Lfdobj = 1, slope = -1),
registered = zero_crossings(mean_regfd, Lfdobj = 1, slope = -1))
magnitudes = c(original = eval.fd(locations[1], mean_fd),
registered = eval.fd(locations[2], mean_regfd))
rbind(locations, magnitudes)
## original registered
## locations 0.3080 0.3310
## magnitudes 777.6664 781.8984
The peak F1 occurs later in the mean of the registered functions than in the mean of the original functions. We previously estimated the peak F1 time for each of the original functions, and if we take the average of those, it’s essentially the same as the peak F1 time for the mean of the registered functions.
example_zero_cross
## [1] 0.192 0.305 0.499
mean(example_zero_cross)
## [1] 0.332
Similarly, the mean F1 maximum of the registered functions is higher than that of the original functions. Again, the mean of the registered functions is closer to the mean of the original functions’ F1 maxima.
# getting f1 maximum from each original function
f1_maxes <- eval.fd(example_zero_cross, example_fd) %>% diag
f1_maxes
## [1] 735.2567 791.2660 819.1646
mean(f1_maxes)
## [1] 781.8957
The lesson here is that the simple average of formant dynamics will not have the average dynamic properties of the input.
There are pros and cons to doing landmark registration like this. For example, it seems pretty clear that the duration of the vowel is going to be a predictor of where the F1 maximum occurs in proportional time. If you wanted to explore those properties, you wouldn’t want to do that analysis on registered functions.
by Joe
-June 17, 2015
This is part two of my blog post on how vowel formant data ought to be thought about in terms of Tidy Data, and how Vowel Normalization can be thought about in terms of Split-Apply-Combine procedures. Let’s start out by getting our data in order.
library(downloader)
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
## temp file for downloading
tmp <- tempfile()
pb_tar <- download("https://www.cs.cmu.edu/Groups/AI/areas/speech/database/pb/pb.tgz",
tmp, quiet = T)
## decompress the data
untar(tmp, "PetersonBarney/verified_pb.data")
## read the data
pb_data <- read.delim("PetersonBarney/verified_pb.data", header = F)
## cleanup
unlink(tmp)
success <- file.remove("PetersonBarney/verified_pb.data")
success <- file.remove("PetersonBarney")
pb_data %>%
set_colnames(c("sex", "speaker", "phonN", "phon",
"F0", "F1", "F2", "F3")) %>%
mutate(sex = factor(sex, labels = c("C","F","M"),levels = 3:1),
phon = gsub("\\*", "", phon)) %>%
select(sex, speaker, phon, F0, F1, F2, F3) %>%
mutate(id = 1:n())%>%
gather(formant, hz, F0:F3)%>%
tbl_df()-> pb_long
We’re going to be treating vowel formant data as having the following variables: the speaker’s sex, a speaker id, the vowel class (`phon`), a token `id`, the `formant`, and its value in `hz`. This means that each individual vowel token will have 4 rows in the `pb_long` data frame, one for each formant.
pb_long %>%
filter(id == 1)%>%
kable()
sex | speaker | phon | id | formant | hz |
---|---|---|---|---|---|
M | 1 | IY | 1 | F0 | 160 |
M | 1 | IY | 1 | F1 | 240 |
M | 1 | IY | 1 | F2 | 2280 |
M | 1 | IY | 1 | F3 | 2850 |
The need for vowel formant normalization can be neatly summarized in the following plot, which has a boxplot for each “sex” (in Peterson & Barney, that’s C: Child, F: Female, M: Male) for each formant. You can see how the values for each formant differ by sex, following the pattern C > F > M.
pb_long %>%
ggplot(aes(formant, hz, fill = sex))+
geom_boxplot()
One thing that’s not immediately evident from the figure above is that the variance of formant values also differs between the sexes, again following the pattern C > F > M.
pb_long %>%
group_by(speaker, sex, formant) %>%
summarise(hz_sd = sd(hz))%>%
ggplot(aes(formant, hz_sd, fill = sex))+
geom_boxplot()
The usual way of representing these sex effects emphasizes the effect it has on the vowel triangle, which is especially evident in the high front region.
pb_long %>%
spread(formant, hz) %>%
ggplot(aes(F2, F1, color = sex))+
geom_point()+
scale_x_reverse()+
scale_y_reverse()+
coord_fixed()
The fact that there are these differences in vowel spectra, and the reasons for them, are interesting, but when you’re doing empirical research, not all interesting phenomena are the object of study. Usually, sociolinguists want to factor out gross sex differences involving the location and scale of the formant values. This usually involves transforming formant values by subtraction and division with reference to some other set of values. How you characterize that process is the point of this post.
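The most familiar instance of that “subtraction and division” is probably z-score (Lobanov) normalization, one of the procedures examined below. A toy sketch with made-up values:

```r
# Z-score (Lobanov) normalization: per speaker and per formant,
# subtract the speaker's mean and divide by their standard deviation.
f1 <- c(300, 450, 600, 750)   # made-up F1 values for one speaker
z  <- (f1 - mean(f1)) / sd(f1)

# The normalized values have mean 0 and sd 1 regardless of whether
# the speaker is a child, woman, or man.
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```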
A literature developing a typology for normalization in sociolinguistics has sprung up, beginning with Adank (2003). Procedures are usually characterized by which subsets of the data are used to estimate the normalization parameters: for example, whether the parameters are estimated within a single vowel token or across tokens, within a single formant or across formants, and within a single speaker or across all speakers.
I’ll be looking at 6 normalization procedures in this post, and they break down like so (speaker extrinsic in bold).
| | Vowel Token Intrinsic | Vowel Token Extrinsic |
|---|---|---|
| Formant Intrinsic | | Z-score, Nearey1, Watt & Fabricius |
| Formant Extrinsic | Bark Difference | Nearey2, **ANAE** |
To a large degree, this “intrinsic” vs “extrinsic” distinction can be captured by which variables are included in a dplyr::group_by()
operation.
I’ll try to be pedantic about that fact in the R code below, even if the particular grouping is going to be vacuous with respect to R’s vectorized arithmetic.
Another thing you might notice looking at the code is that these normalization procedures can be characterized by a few other data base operations.
For example some require tidr::spread()
while others don’t, meaning they have different models of what constitutes an observation.
Some also require what I’ll call a resetting of the “scope” of analysis.
That is, they take the raw data, run through a chain of operations, then need to return back to the raw data, merging back in the results of the first chain of operations.
The point of this post isn’t really to compare the effectiveness of these procedures relative to what kinds of grouping and database operations they require, or their models of an observation.
Comparisons of normalization procedures have been done to death and are well understood.
This is more about exploring the structure of data and database operations in a comfortable domain for sociolinguists.
First off is the Bark Difference Metric, which is the only vowel intrinsic method I’ll be looking at. It tries to normalize F1 and F2 with respect to either F0 or F3. It also converts the Hz to Bark first, which tamps down on the broader spread of the higher formants a bit. ±1 on the Bark scale is also supposed to be psychoacoustically meaningful.
First, I’ll define a function that takes Hz values and converts them to bark.
bark <- function(hz){
(26.81/(1+(1960/hz)))-0.53
}
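Just to get a feel for the scale (with made-up illustrative values, not measurements from the data), a fairly typical F1 of 500 Hz comes out at about 4.9 Bark, and an F2 of 1500 Hz at about 11.1:

```r
# bark() as defined above, repeated so this chunk stands alone
bark <- function(hz){
  (26.81/(1+(1960/hz)))-0.53
}

round(bark(500), 2)   # an F1-ish value, about 4.92 Bark
round(bark(1500), 2)  # an F2-ish value, about 11.09 Bark
```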
Here’s the chain of operations that does the Bark difference normalization.
- convert hz to bark
- drop the hz column
- spread the bark values out into one column per formant
- group_by() the vowel token id (vacuous)
- compute backness (F3 − F2) and height (F1 − F0)

pb_long %>%
  mutate(bark = bark(hz))%>%
  select(-hz)%>%
  spread(formant, bark)%>%
  group_by(id)%>%
  mutate(backness = F3-F2,
         height = F1-F0)->pb_bark
Here’s how that looks.
pb_bark %>%
ggplot(aes(backness, height, color = sex))+
geom_point(alpha = 0.75)+
scale_y_reverse()+
coord_fixed()
Some things to notice about the resulting normalized space:
The Atlas of North American English is a formant extrinsic, speaker extrinsic normalization technique. It involves calculating the grand average log(hz) for all speakers for F1 and F2 combined. Here’s a function to do the actual normalization calculation.
anae_fun <- function(hz, grand_mean){
hz * (exp(grand_mean - mean(log(hz))))
}
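A quick way to see how this function behaves (with made-up values, not the actual grand mean): speakers whose mean log(hz) is below the grand mean get scaled up, speakers above it get scaled down, and a speaker sitting exactly at the grand mean is left untouched.

```r
# anae_fun() as defined above, repeated so this chunk stands alone
anae_fun <- function(hz, grand_mean){
  hz * (exp(grand_mean - mean(log(hz))))
}

hz <- c(300, 800, 2000)       # made-up formant values for one speaker
grand_mean <- mean(log(hz))   # pretend this speaker sits exactly at the grand mean

# the scaling factor is exp(0) = 1, so the values come back unchanged
all.equal(anae_fun(hz, grand_mean), hz)
```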
While pb_long already has the data in the correct format to do this easily, the original ANAE approach only looked at F1 and F2, so we need to filter the data to only keep the rows corresponding to F1 and F2 estimates.
This dplyr chain does that filtering, groups the data by speaker, estimates the mean log(hz) for each speaker, then takes the grand mean across all speakers.
pb_long %>%
filter(formant %in% c("F1","F2"))%>%
group_by(speaker)%>%
summarise(log_mean = mean(log(hz)))%>%
summarise(log_mean = mean(log_mean))%>%
use_series("log_mean")-> grand_mean
We now need to return to the whole data set, and use this grand mean to normalize the data within each speaker. We need to group by the speaker, so that we can estimate individual speakers’ mean log(hz).
pb_long %>%
filter(formant %in% c("F1","F2"))%>%
group_by(speaker)%>%
mutate(norm_hz = anae_fun(hz, grand_mean))->pb_anae_long
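One way to convince yourself the normalization did what it should (a sketch with two synthetic “speakers”, not the real data): since log(norm_hz) = log(hz) + grand_mean − mean(log(hz)), every speaker’s mean log(hz) after normalization should come out equal to the grand mean.

```r
# anae_fun() as defined above, repeated so this chunk stands alone
anae_fun <- function(hz, grand_mean){
  hz * (exp(grand_mean - mean(log(hz))))
}

# two synthetic speakers with different overall formant ranges
speaker_a <- c(300, 900, 2200)
speaker_b <- c(400, 1100, 2800)

grand_mean <- mean(c(mean(log(speaker_a)), mean(log(speaker_b))))

norm_a <- anae_fun(speaker_a, grand_mean)
norm_b <- anae_fun(speaker_b, grand_mean)

# both speakers now share the same mean log(hz): the grand mean
all.equal(mean(log(norm_a)), grand_mean)
all.equal(mean(log(norm_b)), grand_mean)
```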
This method preserves both the relative ordering of F1 and F2, as well as the greater spread in F2 relative to F1.
pb_anae_long %>%
ggplot(aes(formant, norm_hz, fill = sex))+
geom_boxplot()
Here’s how it looks in the vowel triangle.
pb_anae_long %>%
select(-hz)%>%
spread(formant, norm_hz)%>%
ggplot(aes(F2, F1, color = sex))+
geom_point(alpha = 0.75)+
scale_x_reverse()+
scale_y_reverse()+
coord_fixed()
Nearey normalization is essentially identical to the ANAE normalization, except it’s speaker intrinsic.
That’s not always clear given the way the ANAE and Nearey formulas are usually presented, so I’ve defined the Nearey formula below in a way that emphasizes this fact: nearey_fun() is identical to anae_fun() except that it takes a single argument, and grand_mean is replaced by 0.
nearey_fun <- function(hz){
hz * (exp(0 - mean(log(hz))))
}
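To drive the equivalence home, here’s a check (with made-up values) that nearey_fun() really is just anae_fun() with grand_mean fixed at 0:

```r
# both functions as defined above, repeated so this chunk stands alone
anae_fun <- function(hz, grand_mean){
  hz * (exp(grand_mean - mean(log(hz))))
}
nearey_fun <- function(hz){
  hz * (exp(0 - mean(log(hz))))
}

hz <- c(250, 700, 1900, 2600)  # made-up formant values
all.equal(nearey_fun(hz), anae_fun(hz, 0))
```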
There are two instantiations of Nearey. Nearey1 is formant intrinsic, and Nearey2 is formant extrinsic.
Nearey is easy peasy.
pb_long %>%
group_by(speaker, formant)%>%
mutate(nearey = nearey_fun(hz)) -> pb_nearey1_long
pb_long %>%
filter(formant %in% c("F1","F2"))%>%
group_by(speaker)%>%
mutate(nearey = nearey_fun(hz)) -> pb_nearey2_long
The formant intrinsic approach eliminates the relative ordering of each formant, and mitigates the difference in variance, at least between F1 and F2. The formant extrinsic approach looks roughly similar to the ANAE plot.
pb_nearey1_long %>%
ggplot(aes(formant, nearey, fill = sex)) +
geom_boxplot()+
ggtitle("Nearey1: Formant intrinsic")
pb_nearey2_long %>%
ggplot(aes(formant, nearey, fill = sex)) +
geom_boxplot()+
ggtitle("Nearey2: Formant extrinsic")
Here’s how it looks in the vowel space. For Nearey1, I’ve included horizontal and vertical lines at 1, since these represent the “center” of the vowel space as far as the normalization is concerned.
pb_nearey1_long %>%
select(-hz)%>%
spread(formant, nearey)%>%
ggplot(aes(F2, F1, color = sex))+
geom_hline(yintercept = 1)+
geom_vline(xintercept = 1)+
geom_point(alpha = 0.75)+
scale_y_reverse()+
scale_x_reverse()+
coord_fixed()+
ggtitle("Nearey1: Formant intrinsic")
pb_nearey2_long %>%
select(-hz)%>%
spread(formant, nearey)%>%
ggplot(aes(F2, F1, color = sex))+
geom_point(alpha = 0.75)+
scale_y_reverse()+
scale_x_reverse()+
coord_fixed()+
ggtitle("Nearey2: Formant extrinsic")
Z-scores are one of the most widespread forms of normalization across any area of research. For example, Gelman, Jakulin, Pittau & Su recommend that for logistic regressions, all continuous variables be converted to a modified z-score: \(\frac{x-mean(x)}{2sd(x)}\). The ubiquity of the z-score across all domains of research, and the relative obscurity of what the “Lobanov” normalization is, is why I’ll always describe this procedure as “z-score normalization (also known as Lobanov normalization)”.
The z-score function itself is straightforward. R actually has a native z-scoring function, scale(), but I’ll write out a function called zscore() just for explicitness.
zscore <- function(hz){
(hz-mean(hz))/sd(hz)
}
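And just to confirm the hand-rolled version matches the built-in (scale() returns a one-column matrix with attached attributes, so it needs flattening for the comparison):

```r
# zscore() as defined above, repeated so this chunk stands alone
zscore <- function(hz){
  (hz-mean(hz))/sd(hz)
}

hz <- c(300, 450, 600, 750, 2100)  # made-up values
all.equal(zscore(hz), as.vector(scale(hz)))

# and, as advertised, the result is centered at 0 with sd 1
mean(zscore(hz))
sd(zscore(hz))
```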
To apply it to the data, it requires just one grouped application.
pb_long %>%
group_by(speaker, formant) %>%
mutate(zscore = zscore(hz)) -> pb_z_long
Z-scores eliminate the relative ordering of the formants, and mitigate the different variances between them.
pb_z_long %>%
ggplot(aes(formant, zscore, fill = sex))+
geom_boxplot()
Maybe I’m just biased because z-scores are what I always use, but this, to me, looks like what a vowel space should look like.
pb_z_long %>%
select(-hz)%>%
spread(formant, zscore)%>%
ggplot(aes(F2, F1, color= sex))+
geom_hline(yintercept = 0)+
geom_vline(xintercept = 0)+
geom_point(alpha = 0.75)+
scale_y_reverse()+
scale_x_reverse()+
coord_fixed()
The Watt & Fabricius method (and ensuing modifications) focuses on normalizing vowels in terms of aligning vowel triangles in F1xF2 space. As such, its model of an observation is implicitly F1 and F2 for each vowel token.
The basic Watt & Fabricius function involves dividing each speaker’s F1 and F2 by “centroid” values, corresponding to the center of the vowel triangle.
wf_fun <- function(hz, centroid){
hz/centroid
}
The method involves two steps. First, estimating the centroid values for F1 and F2 for each speaker. Second, after merging these centroid values back onto the original data, dividing F1 and F2 by them.
# Part 1: Centroid Estimation
pb_long %>%
filter(formant %in% c("F1","F2"))%>%
group_by(speaker, formant, phon)%>%
summarise(hz = mean(hz))%>%
spread(formant, hz)%>%
group_by(speaker)%>%
summarise(beet1 = F1[phon=="IY"],
beet2 = F2[phon=="IY"],
school1 = beet1,
school2 = beet1,
bot1 = F1[phon=="AA"],
bot2 = ((beet2 - school2)/2) + school2) %>% # the midpoint of school2 and beet2
mutate(S1 = (beet1 + bot1 + school1)/3,
S2 = (beet2 + bot2 + school2)/3)%>%
select(speaker, S1, S2)->speaker_s
# Step 2: Merging on centroids and normalizing
pb_long %>%
spread(formant, hz)%>%
left_join(speaker_s)%>%
group_by(speaker)%>%
mutate(F1_norm = wf_fun(F1, S1),
F2_norm = wf_fun(F2, S2)) -> pb_wf
## Joining, by = "speaker"
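To make the centroid arithmetic concrete, here’s a toy worked example with made-up Hz values for a single speaker. It assumes (my assumption, for illustration) that the bot F2 point is meant to be the midpoint between beet2 and school2:

```r
# made-up speaker: /iy/ at (300, 2300) Hz, /aa/ F1 at 750 Hz
beet1 <- 300; beet2 <- 2300
school1 <- beet1              # school point: F1 = F2 = F1 of /iy/
school2 <- beet1
bot1 <- 750
bot2 <- (beet2 + school2)/2   # midpoint, an assumption on my part

S1 <- (beet1 + bot1 + school1)/3   # centroid F1: (300 + 750 + 300)/3 = 450
S2 <- (beet2 + bot2 + school2)/3   # centroid F2: (2300 + 1300 + 300)/3 = 1300

# normalized /iy/ ends up above and in front of the (1, 1) centroid
c(F1 = beet1/S1, F2 = beet2/S2)
```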
Here’s the resulting vowel space. I’ll say it does a helluva job normalizing /iy/.
pb_wf %>%
ggplot(aes(F2_norm, F1_norm, color = sex))+
geom_hline(yintercept = 1)+
geom_vline(xintercept = 1)+
geom_point(alpha = 0.75)+
scale_y_reverse()+
scale_x_reverse()+
coord_fixed()
So, here’s a summary of these methods, classified by their observation models, number of scope resets, and their various trinsicities.
Observation Model | Scope Resets | Speaker | Formant | Vowel | |
---|---|---|---|---|---|
Bark Difference | F0, F1, F2, F3 for each vowel | 0 | intrinsic | extrinsic | intrinsic |
ANAE | Hz for each formant | 1 | ex | ex | ex |
Nearey1 | Hz for each formant | 0 | in | in | ex |
Nearey2 | Hz for each formant | 0 | in | ex | ex |
Z-score | Hz for each formant | 0 | in | in | ex |
Watt & Fabricius | F1, F2 for each vowel | 1 | in | in | ex |
It’s kind of interesting how relatively orthogonal the observation model, the scope resetting and the trinsicities are. Only two of the procedures I looked at involved scope resetting (ANAE, and Watt & Fabricius). I believe scope resets are always going to be a feature of speaker extrinsic models, since you’ll need to estimate some kind of over-arching parameter for all speakers, then go back over each individual speaker’s data. But the fact that Watt & Fabricius also involved a scope reset goes to show that they’re not a property exclusive to speaker extrinsic normalization. It’s also probably the case that any given vowel intrinsic method will have an observation model similar to the Bark Difference metric, but again, Watt & Fabricius, a vowel extrinsic method, has a very similar model.
by Joe
-June 13, 2015
I’ve been thinking a bit about principles of Tidy Data, the Split-Apply-Combine strategy for data analysis, and how they apply to the kind of data I work with most of the time. I think walking through some of the vowel normalization methods sociolinguists tend to use makes for a good example case for exploring these principles. This will be a two part post, first focusing on Tidy Data.
The three principles of Tidy Data are:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
It’s probably best to discuss these principles in the context of a specific data set. I’ll be using the Peterson & Barney vowels.
library(downloader)
library(knitr)
library(dplyr)
library(magrittr)
library(tidyr)
library(ggplot2)
## temp file for downloading
tmp <- tempfile()
pb_tar <- download("https://www.cs.cmu.edu/Groups/AI/areas/speech/database/pb/pb.tgz",
tmp, quiet = T)
## decompress the data
untar(tmp, "PetersonBarney/verified_pb.data")
## read the data
pb_data <- read.delim("PetersonBarney/verified_pb.data", header = F)
## cleanup
unlink(tmp)
success <- file.remove("PetersonBarney/verified_pb.data")
success <- file.remove("PetersonBarney")
## clean up the pb data a bit
## - rename columns,
## - recode sex to be a factor
## - eliminate some cruft
## - add a unique id for each vowel
pb_data %<>%
set_colnames(c("sex", "speaker", "phonN", "phon",
"F0", "F1", "F2", "F3")) %>%
mutate(sex = factor(sex, labels = c("M","F","C")),
phon = gsub("\\*", "", phon)) %>%
select(sex, speaker, phon, F0, F1, F2, F3) %>%
mutate(id = 1:n())
I’ll be using functionality from dplyr, magrittr and tidyr.

tidyr
- Functions for turning columns to rows and rows to columns.
- gather(), spread(), and separate() used here.

dplyr
- Functions for implementing Split-Apply-Combine.
- group_by(), summarise(), mutate(), filter(), select(), and left_join() used here.

magrittr
- Additional piping functionality.
- %>%, set_colnames(), multiply_by(), subtract(), divide_by(), and lambda notation used here.
Looking at the Peterson and Barney data, it looks fairly tidy.
pb_data %>%
head()
## sex speaker phon F0 F1 F2 F3 id
## 1 M 1 IY 160 240 2280 2850 1
## 2 M 1 IY 186 280 2400 2790 2
## 3 M 1 IH 203 390 2030 2640 3
## 4 M 1 IH 192 310 1980 2550 4
## 5 M 1 EH 161 490 1870 2420 5
## 6 M 1 EH 155 570 1700 2600 6
So let’s convert it to what would unambiguously be considered “messy” data. I’ll estimate mean F1 and F2 values for each vowel, and then create a column for each vowel.
pb_data %>%
select(sex, speaker, id, phon, F1, F2) %>%
gather(formant, hz, F1:F2) %>%
group_by(sex, speaker, phon, formant) %>%
summarise(hz = mean(hz)) %>%
mutate(vowel_formant = paste(phon, formant, sep = "_")) %>%
ungroup() %>%
select(-formant, -phon)%>%
spread(vowel_formant, hz)->pb_messy
pb_messy
## # A tibble: 76 x 22
## sex speaker AA_F1 AA_F2 AE_F1 AE_F2 AH_F1 AH_F2 AO_F1 AO_F2 EH_F1
## * <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 1 770.0 1065 595.0 1760.0 605.0 1275.0 630.0 975.0 530.0
## 2 M 2 660.0 1070 695.0 1650.0 637.5 1070.0 570.0 570.0 505.0
## 3 M 3 680.0 1085 644.0 1732.5 606.0 1180.0 545.0 855.0 545.0
## 4 M 4 775.0 1110 760.0 1595.0 667.5 1170.0 547.5 870.0 562.5
## 5 M 5 740.0 1240 790.0 1705.0 640.0 1235.0 545.0 925.0 565.0
## 6 M 6 705.0 1030 700.5 1678.5 703.0 1210.0 690.0 995.0 585.0
## 7 M 7 650.0 975 757.5 1805.0 630.0 1245.0 470.0 760.0 622.0
## 8 M 8 682.0 1160 533.0 1940.0 632.0 1325.0 584.0 967.5 472.5
## 9 M 9 706.5 1052 591.0 1762.5 633.5 1122.5 607.0 854.0 584.0
## 10 M 10 763.5 1053 642.0 1675.0 661.0 1240.0 582.5 809.5 500.0
## # ... with 66 more rows, and 11 more variables: EH_F2 <dbl>, ER_F1 <dbl>,
## # ER_F2 <dbl>, IH_F1 <dbl>, IH_F2 <dbl>, IY_F1 <dbl>, IY_F2 <dbl>,
## # UH_F1 <dbl>, UH_F2 <dbl>, UW_F1 <dbl>, UW_F2 <dbl>
I have seen vowel formant data stored in this kind of format before, and it definitely counts as “messy”. Looking at this data set, there are the following variables:
And one observation is
But, there are actually three different variables smushed together along the columns, Vowel + (F1, F2), and there are multiple observations per row. We definitely don’t want to store or work with data in this format. If you already have data in this format, or have the misfortune of needing to work with data in this format, it’s easy enough to convert it to a tidier format.
Step 1: “gather” the data into a long format.
pb_messy %>%
gather(vowel_formant, hz, AA_F1:UW_F2)->pb_gathered
pb_gathered
## # A tibble: 1,520 x 4
## sex speaker vowel_formant hz
## <fctr> <int> <chr> <dbl>
## 1 M 1 AA_F1 770.0
## 2 M 2 AA_F1 660.0
## 3 M 3 AA_F1 680.0
## 4 M 4 AA_F1 775.0
## 5 M 5 AA_F1 740.0
## 6 M 6 AA_F1 705.0
## 7 M 7 AA_F1 650.0
## 8 M 8 AA_F1 682.0
## 9 M 9 AA_F1 706.5
## 10 M 10 AA_F1 763.5
## # ... with 1,510 more rows
Step 2: Separate the vowel + formant column into two separate variables
pb_gathered %>%
separate(vowel_formant,
into = c("vowel","formant"),
sep = "_")->pb_gathered_split
pb_gathered_split
## # A tibble: 1,520 x 5
## sex speaker vowel formant hz
## * <fctr> <int> <chr> <chr> <dbl>
## 1 M 1 AA F1 770.0
## 2 M 2 AA F1 660.0
## 3 M 3 AA F1 680.0
## 4 M 4 AA F1 775.0
## 5 M 5 AA F1 740.0
## 6 M 6 AA F1 705.0
## 7 M 7 AA F1 650.0
## 8 M 8 AA F1 682.0
## 9 M 9 AA F1 706.5
## 10 M 10 AA F1 763.5
## # ... with 1,510 more rows
Step 3: Use the values in the formant column as column names of their own.
pb_gathered_split%>%
spread(formant, hz)->pb_spread
pb_spread
## # A tibble: 760 x 5
## sex speaker vowel F1 F2
## * <fctr> <int> <chr> <dbl> <dbl>
## 1 M 1 AA 770 1065
## 2 M 1 AE 595 1760
## 3 M 1 AH 605 1275
## 4 M 1 AO 630 975
## 5 M 1 EH 530 1785
## 6 M 1 ER 415 1425
## 7 M 1 IH 350 2005
## 8 M 1 IY 260 2340
## 9 M 1 UH 420 1095
## 10 M 1 UW 255 985
## # ... with 750 more rows
And with the magic of %>%, you can do all three in one long chain.
pb_messy %>%
gather(vowel_formant, hz, AA_F1:UW_F2)%>%
separate(vowel_formant,
into = c("vowel","formant"),
sep = "_")%>%
spread(formant, hz)
## # A tibble: 760 x 5
## sex speaker vowel F1 F2
## * <fctr> <int> <chr> <dbl> <dbl>
## 1 M 1 AA 770 1065
## 2 M 1 AE 595 1760
## 3 M 1 AH 605 1275
## 4 M 1 AO 630 975
## 5 M 1 EH 530 1785
## 6 M 1 ER 415 1425
## 7 M 1 IH 350 2005
## 8 M 1 IY 260 2340
## 9 M 1 UH 420 1095
## 10 M 1 UW 255 985
## # ... with 750 more rows
Now, it’s a lot easier to make the classic F2xF1 plot.
pb_spread %>%
ggplot(aes(F2, F1, color = sex))+
geom_point()+
scale_y_reverse()+
scale_x_reverse()+
coord_fixed()
When I described the variables in the data, and what counted as an observation, I was assuming, like I think most sociolinguists do, that one “observation” is one vowel, and that we have multiple different kinds of measurements per vowel (F1, F2, duration, etc.). However, as I think will be made clearer in the next post on normalization, some techniques actually assume a finer grained “observation” than that. Let’s look at the relatively tidy raw Peterson & Barney data.
pb_data %>%
tbl_df()
## # A tibble: 1,520 x 8
## sex speaker phon F0 F1 F2 F3 id
## <fctr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 M 1 IY 160 240 2280 2850 1
## 2 M 1 IY 186 280 2400 2790 2
## 3 M 1 IH 203 390 2030 2640 3
## 4 M 1 IH 192 310 1980 2550 4
## 5 M 1 EH 161 490 1870 2420 5
## 6 M 1 EH 155 570 1700 2600 6
## 7 M 1 AE 140 560 1820 2660 7
## 8 M 1 AE 180 630 1700 2550 8
## 9 M 1 AH 144 590 1250 2620 9
## 10 M 1 AH 148 620 1300 2530 10
## # ... with 1,510 more rows
We could identify the following variables:
And we could define an observation as:
Under this definition of the variables and observations, one of the variables (the vowel which was measured) is spread out along the columns. Let’s apply the tidying approach I described above:
pb_data %>%
tbl_df() %>%
gather(formant, hz, F0:F3) -> pb_long
pb_long
## # A tibble: 6,080 x 6
## sex speaker phon id formant hz
## <fctr> <int> <chr> <int> <chr> <dbl>
## 1 M 1 IY 1 F0 160
## 2 M 1 IY 2 F0 186
## 3 M 1 IH 3 F0 203
## 4 M 1 IH 4 F0 192
## 5 M 1 EH 5 F0 161
## 6 M 1 EH 6 F0 155
## 7 M 1 AE 7 F0 140
## 8 M 1 AE 8 F0 180
## 9 M 1 AH 9 F0 144
## 10 M 1 AH 10 F0 148
## # ... with 6,070 more rows
Why would you want to have your data in this kind of weird format? Well, for one, as I’ll demonstrate in the next post, it makes it a bit easier to do some kinds of vowel normalization that require you to estimate the mean of F1 and F2 together (like the ANAE method, or versions of Nearey normalization). Second, this can be a relatively interesting format in its own right.
pb_long %>%
filter(phon %in% c("IY","AE","AA"))%>%
group_by(phon, formant, sex, speaker)%>%
summarise(hz = mean(hz)) %>%
summarise(hz = mean(hz)) %>%
ggplot(aes(formant, hz, color = phon))+
geom_line(aes(group = phon))+
facet_wrap(~sex)
Edit August 8, 2016: Spelling
by Joe
-June 12, 2015
I’m not sure if I’m doing figures right.
library(ggplot2)
library(dplyr)
library(magrittr)
faithful %<>%
mutate(clust = kmeans(., centers = 2) %>% extract2("cluster") %>% factor())
ggplot(faithful, aes(waiting, eruptions, color = clust))+
geom_point()
So, I had to get knitr to save the figure file to one directory, but call it another in the code. That is, knitr would insert this line into the markdown:
![alt text](//jofrhwld.github.io/blog/figs/figure.png)
And what I needed it to insert was
![alt text]({{ site.baseurl }}/figs/figure.png)
So I just hacked the hell out of it by reading in the output of render_jekyll(), doing a regex substitution, and writing it back out again.
It’s so ugly. You can check it out here: https://github.com/JoFrhwld/blog/blob/gh-pages/knit_post.R#L7
by Joe