*-July 29, 2014*

Recently, there was a pretty excellent post on the New York Times’ Upshot blog about how your date of birth affects your political orientation. It attributes the results to “a new model” written in “new statistical software.” When I looked at the associated technical writeup, it was fun to see that it was actually a Stan model! I used Stan quite a bit in my dissertation.

The technical writeup was pretty clear, as I’ve found most things Andrew Gelman’s associated with to be. It also utilized a method for doing non-linear modeling in Stan that hadn’t occurred to me before. For my dissertation, I used b-splines together with Stan to do non-linear modeling, but Ghitza & Gelman’s approach is, I think, a bit easier to understand, and a bit more flexible. So, I thought I’d do a little writeup of how to do it.

The one requirement is that your continuous predictor needs to be quantizable. Fortunately for me, most of the stuff I look at involves date of birth. In this example, I’ll be using duration, which isn’t exactly the best candidate, but the duration measurements here are drawn from FAVE-align, which only has a granularity of 10ms, so I’ll be using 10ms increments as my quantized predictor.

More or less what you do is model the change in your outcome variable as you move from one point in your predictor to the next. In this example, I’m looking at how the normalized F1 of the word “I” changes as the vowel gets longer. So, if normalized F1 changes by 0.1 as the vowel goes from being 50ms long to 60ms long, we’d say the δ at 60ms is 0.1. The predicted value of normalized F1 at any particular duration is then the sum of all deltas up to that duration, plus an intercept term.

```
intercept <- 0.785
deltas <- c(0, 0.1, 0.097, 0.096, 0.089)
durations <- c(50, 60, 70, 80, 90)
rbind(durs = durations,
      deltas = deltas,
      cumsum_deltas = cumsum(deltas),
      time_mu = cumsum(deltas) + intercept)
```

```
## [,1] [,2] [,3] [,4] [,5]
## durs 50.000 60.000 70.000 80.000 90.000
## deltas 0.000 0.100 0.097 0.096 0.089
## cumsum_deltas 0.000 0.100 0.197 0.293 0.382
## time_mu 0.785 0.885 0.982 1.078 1.167
```

This is conceptually similar to how normal linear regression works. If you constrain the δ of every duration to be the same, you’re going to end up with just a straight line.

We’d expect that the δ at any particular duration wouldn’t be wildly different from the δ at the previous duration. Putting that in statistical terms, we’ll say that
δ_{i} ~ normal(δ_{i-1}, σ_{δ}). As σ_{δ} gets larger, δ_{i} is allowed to differ more from δ_{i-1}.
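As a quick illustration (this simulation isn’t from the original post, and the names here are my own), here’s how σ_{δ} controls the wobbliness: with a tiny σ_{δ}, successive deltas barely change, so the implied curve is nearly a straight line; with a larger σ_{δ}, the curve can bend much more.

```r
set.seed(10)

# successive deltas form a random walk:
# delta_i ~ normal(delta_[i-1], sigma_delta)
sim_deltas <- function(n, sigma_delta){
  cumsum(rnorm(n, mean = 0, sd = sigma_delta))
}

deltas_stiff  <- sim_deltas(20, sigma_delta = 0.001)
deltas_wobbly <- sim_deltas(20, sigma_delta = 0.1)

# the implied curves are the cumulative sums of the deltas
curve_stiff  <- cumsum(deltas_stiff)
curve_wobbly <- cumsum(deltas_wobbly)

# the stiff curve's deltas are nearly constant (nearly a straight line);
# the wobbly curve's deltas vary a lot more
sd(diff(deltas_stiff))
sd(diff(deltas_wobbly))
```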

That’s more or less the basics of it. Here’s how to implement that in Stan. To play along here, you’ll need to get Stan and rstan appropriately installed. In this example, I’ll also be using plyr, dplyr, and ggplot2.

```
library(plyr)
library(dplyr)
library(rstan)
library(ggplot2)
```

Here’s the data. It’s measurements of one person’s (pseudonymously “Jean”) production of “I” and contractions of “I”.

```
I_jean <- read.delim("https://jofrhwld.github.io/data/I_jean.txt")
head(I_jean)
```

```
## Name Age Sex Word FolSegTrans Dur_msec F1 F2 F1.n F2.n
## 1 Jean 61 f I'M M 130 861.7 1336 1.6609 -0.8855
## 2 Jean 61 f I N 140 1010.4 1349 2.6883 -0.8536
## 3 Jean 61 f I'LL L 110 670.1 1293 0.3370 -0.9873
## 4 Jean 61 f I'M M 180 869.8 1307 1.7168 -0.9536
## 5 Jean 61 f I R 80 743.0 1419 0.8407 -0.6897
## 6 Jean 61 f I'VE V 120 918.2 1581 2.0512 -0.3068
```

Here’s the Stan model code. It consists of four program blocks (data, parameters, transformed parameters, and model), and I’ve included in-line comments to explain bits of it. If you’re more familiar with other Monte Carlo software, you’ll notice I haven’t defined priors for some of the declared parameters. That’s because a declaration like `real<lower=0, upper=100>` effectively defines a uniform prior between 0 and 100.

```
model_code <- '
data{
  int<lower=0> N;          // number of observations
  real y[N];               // the outcome variable
  int<lower=0> max_time;   // the largest time index
  int<lower=0> max_word;   // the largest word index
  int<lower=0> time[N];    // the time explanatory variable
  int<lower=0> word_id[N]; // the word explanatory variable
}
parameters{
  // more or less (1|word) in lmer terms
  vector[max_word] word_effects;

  // scaling parameters for sampling
  real<lower=0, upper=100> word_sigma;
  real<lower=0, upper=100> sigma;

  // Ghitza & Gelman used normal(delta[i-1], 1) for sampling deltas,
  // but in some other work I found this led to overfitting for my data.
  // So, I\'m using this hyperprior.
  real<lower=0, upper=100> delta_sigma;

  // time_deltas is shorter than max_time,
  // because the first delta logically
  // has to be 0.
  vector[max_time-1] time_deltas;
  real intercept;
}
transformed parameters{
  // time_mu will be the expected
  // F1 at each time point
  vector[max_time] time_mu;

  // real_deltas is just time_deltas
  // with 0 concatenated to the front
  vector[max_time] real_deltas;

  real_deltas[1] <- 0.0;
  for(i in 1:(max_time-1)){
    real_deltas[i+1] <- time_deltas[i];
  }

  // The cumulative sum of deltas, plus
  // the initial value (intercept), equals
  // the expected F1 at that time index
  time_mu <- cumulative_sum(real_deltas) + intercept;
}
model{
  // this y_hat variable is to allow
  // for vectorized sampling from normal().
  // Sampling is just quicker this way.
  vector[N] y_hat;

  // The first time_delta should be less constrained
  // than the rest. delta_sigma could be very small,
  // and if so, the subsequent delta values would be
  // constrained to be too close to 0.
  time_deltas[1] ~ normal(0, 100);
  for(i in 2:(max_time-1)){
    time_deltas[i] ~ normal(time_deltas[i-1], delta_sigma);
  }
  intercept ~ normal(0, 100);

  // this is vectorized sampling for all of the
  // word effects.
  word_effects ~ normal(0, word_sigma);

  // This loop creates the expected
  // values of y, from the model
  for(i in 1:N){
    y_hat[i] <- time_mu[time[i]] + word_effects[word_id[i]];
  }

  // this is equivalent to:
  //   y[i] <- time_mu[time[i]] + word_effects[word_id[i]] + epsilon[i];
  //   epsilon[i] ~ normal(0, sigma);
  y ~ normal(y_hat, sigma);
}
'
```

To fit the model, we need to make some adjustments to the data. For example, we need to convert the duration measurements (which currently increment like 50, 60, 70, 80…) into indices starting at 1 and incrementing like 1, 2, 3, 4, etc. We also need to convert the word labels into numeric indices.

```
mod_data <- I_jean %>%
  mutate(Dur1 = round(((Dur_msec - min(Dur_msec))/10) + 1),
         wordN = as.numeric(as.factor(Word)))
```

rstan takes its data input as a list, so here we create the list, and fit the model with 3 chains, 1000 iterations per chain. By default, the first half of the iterations will be discarded as the burn-in.

```
data_list <- list(N = nrow(mod_data),
                  y = mod_data$F1.n,
                  max_time = max(mod_data$Dur1),
                  max_word = max(mod_data$wordN),
                  time = mod_data$Dur1,
                  word_id = mod_data$wordN)
mod <- stan(model_code = model_code, data = data_list, chains = 3, iter = 1000)
```

```
##
## TRANSLATING MODEL 'model_code' FROM Stan CODE TO C++ CODE NOW.
## COMPILING THE C++ CODE FOR MODEL 'model_code' NOW.
##
## SAMPLING FOR MODEL 'model_code' NOW (CHAIN 1).
##
## Iteration: 1 / 1000 [ 0%] (Warmup)
## Iteration: 100 / 1000 [ 10%] (Warmup)
## Iteration: 200 / 1000 [ 20%] (Warmup)
## Iteration: 300 / 1000 [ 30%] (Warmup)
## Iteration: 400 / 1000 [ 40%] (Warmup)
## Iteration: 500 / 1000 [ 50%] (Warmup)
## Iteration: 501 / 1000 [ 50%] (Sampling)
## Iteration: 600 / 1000 [ 60%] (Sampling)
## Iteration: 700 / 1000 [ 70%] (Sampling)
## Iteration: 800 / 1000 [ 80%] (Sampling)
## Iteration: 900 / 1000 [ 90%] (Sampling)
## Iteration: 1000 / 1000 [100%] (Sampling)
## # Elapsed Time: 1.1518 seconds (Warm-up)
## # 0.777117 seconds (Sampling)
## # 1.92892 seconds (Total)
##
##
## SAMPLING FOR MODEL 'model_code' NOW (CHAIN 2).
##
## Iteration: 1 / 1000 [ 0%] (Warmup)
## Iteration: 100 / 1000 [ 10%] (Warmup)
## Iteration: 200 / 1000 [ 20%] (Warmup)
## Iteration: 300 / 1000 [ 30%] (Warmup)
## Iteration: 400 / 1000 [ 40%] (Warmup)
## Iteration: 500 / 1000 [ 50%] (Warmup)
## Iteration: 501 / 1000 [ 50%] (Sampling)
## Iteration: 600 / 1000 [ 60%] (Sampling)
## Iteration: 700 / 1000 [ 70%] (Sampling)
## Iteration: 800 / 1000 [ 80%] (Sampling)
## Iteration: 900 / 1000 [ 90%] (Sampling)
## Iteration: 1000 / 1000 [100%] (Sampling)
## # Elapsed Time: 0.916265 seconds (Warm-up)
## # 0.68692 seconds (Sampling)
## # 1.60318 seconds (Total)
##
##
## SAMPLING FOR MODEL 'model_code' NOW (CHAIN 3).
##
## Iteration: 1 / 1000 [ 0%] (Warmup)
## Iteration: 100 / 1000 [ 10%] (Warmup)
## Iteration: 200 / 1000 [ 20%] (Warmup)
## Iteration: 300 / 1000 [ 30%] (Warmup)
## Iteration: 400 / 1000 [ 40%] (Warmup)
## Iteration: 500 / 1000 [ 50%] (Warmup)
## Iteration: 501 / 1000 [ 50%] (Sampling)
## Iteration: 600 / 1000 [ 60%] (Sampling)
## Iteration: 700 / 1000 [ 70%] (Sampling)
## Iteration: 800 / 1000 [ 80%] (Sampling)
## Iteration: 900 / 1000 [ 90%] (Sampling)
## Iteration: 1000 / 1000 [100%] (Sampling)
## # Elapsed Time: 0.926489 seconds (Warm-up)
## # 0.771759 seconds (Sampling)
## # 1.69825 seconds (Total)
```

rstan has a nice summary function for its models, which includes the Gelman-Rubin convergence diagnostic. It looks like this model has converged well, with all parameters having an Rhat very close to 1.

```
mod_summary <- as.data.frame(summary(mod)$summary)
ggplot(mod_summary, aes(Rhat)) +
  geom_bar()
```

```
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
```

```
## Warning: position_stack requires constant width: output may be incorrect
```

I also think rstan’s `traceplot()` has an attractive color palette. These two traceplots show the posterior samples of `sigma`. The first one includes the burn-in, and the second one excludes it.

`traceplot(mod, "sigma")`

`traceplot(mod, "sigma", inc_warmup = F)`

This is a function that I wrote to extract summaries of specific parameters from the model. I wrote it while I was writing my dissertation, so I can’t actually parse it right now, but it gets the job done nicely.

```
extract_from_summary <- function(summary = NULL, pars){
  library(stringr)
  library(plyr)
  pars <- paste(paste("^", pars, sep = ""), collapse = "|")
  if(class(summary) == "matrix"){
    summary.df <- as.data.frame(summary)[grepl(pars, row.names(summary)), ]
  }else{
    summary.df <- summary[grepl(pars, row.names(summary)), ]
  }
  summary.df$full_pars <- row.names(summary.df)
  summary.df$pars <- gsub("\\[.*\\]", "", summary.df$full_pars)

  # pull the [i,j] indices out of the parameter names,
  # and bind them on as numeric columns (V1, V2, ...)
  dim_cols <- ldply(
    llply(
      llply(
        str_split(
          gsub("\\[|\\]", "",
               str_extract(summary.df$full_pars, "\\[.*\\]")),
          ","),
        as.numeric),
      rbind),
    as.data.frame)

  summary.df <- cbind(summary.df, dim_cols)
  return(summary.df)
}
```

So first, here’s a plot of the `time_delta` variable, with 95% credible intervals. I’ve also included a horizontal line at 0, so we can more easily see when the rate of change isn’t reliably different from 0.

```
extract_from_summary(summary(mod)$summary, "time_delta") %>%
  ggplot(., aes((V1*10) + 50, mean)) +
  geom_hline(y = 0, color = "grey50") +
  geom_line() +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3) +
  xlab("Duration (msec)") +
  scale_y_reverse("time delta")
```

Already, this is pretty cool. I made a fuss in my dissertation about how important the rate of change can be for understanding language change. At the time, I was stuck with the smoothers I understood, which meant pre-specifying how wobbly the δ curve could be. Using this method, it can be arbitrarily wobbly, but still fairly smooth.

Next, here’s `time_mu`, which is the expected normalized F1 at different durations, with 95% credible intervals. I’ve also plotted the original data points on here, so it looks like most plots + smooths out there.

```
extract_from_summary(summary(mod)$summary, "time_mu") %>%
  ggplot(., aes(((V1-1)*10) + 50, mean)) +
  geom_point(data = I_jean, aes(Dur_msec, F1.n), alpha = 0.7) +
  geom_line() +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.2) +
  xlab("Duration (msec)") +
  scale_y_reverse("time mu")
```

Pretty good, huh? Here’s a direct comparison of the smooth from Stan, and a loess smooth.

```
extract_from_summary(summary(mod)$summary, "time_mu") %>%
  ggplot(., aes(((V1-1)*10) + 50, mean)) +
  geom_line(color = "red3") +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3, fill = "red3") +
  stat_smooth(data = I_jean, aes(Dur_msec, F1.n),
              color = "blue3", fill = "blue3", method = "loess") +
  xlab("Duration (msec)") +
  scale_y_reverse("time mu")
```

So, the Stan model has broader intervals than the loess smooth, but that makes a lot of sense, since there is uncertainty in the estimate of each δ, and that uncertainty accumulates across the cumulative sum of all δs. The uncertainty isn’t necessarily greater towards the end of the range of the predictor. It’s just that way here because there isn’t much data at the longer durations. In other models I’ve fit like this, the uncertainty flares out at the beginning and the end of the predictor’s range in the normal way.
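As a back-of-the-envelope check (my own simplification, treating the δ estimates as independent, which the posterior samples aren’t quite), the standard deviation of a sum of i independent estimates grows like √i, so intervals on a cumulative sum should widen as you move along the predictor:

```r
# if each delta has roughly independent uncertainty sd_delta,
# then sd(cumsum of the first i deltas) = sqrt(i) * sd_delta
sd_delta <- 0.05
i <- 1:10
cumsum_sd <- sqrt(i) * sd_delta

# the uncertainty grows monotonically along the cumulative sum
round(cumsum_sd, 3)
```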

And just for fun, here’s a plot of the estimated word effects.

```
word_effects <- extract_from_summary(summary(mod)$summary, "word_effects") %>%
  mutate(word = levels(mod_data$Word)[V1])

ggplot(word_effects, aes(word, mean)) +
  geom_hline(y = 0, color = "grey50") +
  geom_pointrange(aes(ymin = `2.5%`, ymax = `97.5%`)) +
  coord_flip()
```

None of their 95% credible intervals exclude 0.

*by Joe*

*-May 23, 2014*

Like many sociolinguists, I have a lot of vowel formant data which is stored in separate files for each speaker. For example, I’ve organized my copy of the Philadelphia Neighborhood Corpus as follows:

```
Street_1/
Speaker_1/
Speaker_1_meas.txt
Speaker_2/
Speaker_2_meas.txt
Street_2/
Speaker_3/
Speaker_3_meas.txt
Speaker_4/
Speaker_4_meas.txt
```

It’s easy enough to load all of one speaker’s data into R.

`df <- read.delim("Street_1/Speaker_1/Speaker_1_meas.txt")`

It’s also easy enough, using `plyr`, to read in all of the data from all speakers by globbing for the measurement files.

```
library(plyr)
speakers <- Sys.glob("Street*/Speaker*/*_meas.txt")
df <- ldply(speakers, read.delim)
```

But with this plyr approach, I start facing two problems.

- There are just about 1 million measurements in the whole PNC, and reading all of that into R’s memory can start making things a little bit wonky.
- I don’t actually want or need *all* of the data anyway. I frequently only want some vowel data from some speakers, and sometimes the data I want is even narrower than that.

My previous solution to these problems has been pretty hacky and inflexible. But now, I think I’ve got something better going with `sqldf`. Briefly, the `sqldf` package, and its namesake function, `sqldf()`, allow you to read data from a delimited file using SQL queries. I’m totally new to SQL, but its Wikipedia page is surprisingly useful for learning how to use it.

First, you need to assign the file connection to the speaker’s data to a named variable.

```
speaker <- "Street_1/Speaker_1/Speaker_1_meas.txt"
fi <- file(speaker)
```

Now, you can use `sqldf()` to read in specific data from the file connection using SQL queries. Here’s a simple one where I select just the vowel /ow/. The bit that says `plt_vclass` refers to a column called `plt_vclass` which exists in the data.

```
library(sqldf)
df <- sqldf("select * from fi where plt_vclass == 'ow'",
            file.format = list(header = TRUE, sep = "\t"))
# good practice to close the connection
close(fi)
table(df$plt_vclass)
```

```
##
## ow
## 145
```

`summary(df$F2)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 824 1130 1230 1240 1330 1640
```

I could’ve done something fancier, where I select /uw/ and /ow/ with durations greater than 100ms.

```
# re-open connection
fi <- file(speaker)
df <- sqldf("select * from fi where plt_vclass in ('ow', 'uw') and dur > 0.1",
            file.format = list(header = TRUE, sep = "\t"))
close(fi)
table(df$plt_vclass)
```

```
##
## ow uw
## 52 10
```

```
ddply(df, .(plt_vclass),
      summarise,
      mean_F2 = mean(F2),
      min_dur = min(dur))
```

```
## plt_vclass mean_F2 min_dur
## 1 ow 1254.2 0.101
## 2 uw 912.5 0.101
```

The trick now is how to do this over and over again with a list of files, and with any arbitrary conditions. Here’s the function I’ve come up with.
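The function itself didn’t survive into this copy of the post, so here’s a minimal reconstruction based on how it’s called below (the name `sql_load` and its `condition` argument are inferred from the `ldply()` call; the original may have differed). It opens a connection to one speaker’s file, runs the given SQL condition against it with `sqldf()`, closes the connection, and returns the result.

```r
library(sqldf)

# Hypothetical reconstruction: read one speaker's measurement file,
# keeping only the rows that satisfy the SQL "where" condition.
sql_load <- function(speaker, condition = ""){
  fi <- file(speaker)
  df <- sqldf(paste("select * from fi", condition),
              file.format = list(header = TRUE, sep = "\t"))
  close(fi)
  return(df)
}
```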

To use it, first get a list of speakers’ data files by globbing for them.

`speakers <- Sys.glob("Street*/Speaker*/*_meas.txt")`

Now, feed this vector of file names into `ldply()`.

```
df <- ldply(speakers,
            sql_load,
            condition = "where plt_vclass in ('ow', 'uw') and dur > 0.1")
table(df$plt_vclass)
```

```
##
## ow uw
## 11819 2539
```

```
ddply(df, .(plt_vclass),
      summarise,
      mean_F2 = mean(F2),
      min_dur = min(dur))
```

```
## plt_vclass mean_F2 min_dur
## 1 ow 1267 0.101
## 2 uw 1151 0.101
```

*by Joe*

*-February 11, 2014*

R man-god Hadley Wickham is at it again, releasing the `dplyr` package. It’s similar to his previous `plyr` package, but designed specifically for working with data frames, and highly optimized. Just to show off a nice feature of its functionality, here’s how we’d do the calculation of TD-Deletion rates as I originally described here.

First, let’s load the data.

```
library(plyr)
library(RCurl)
b <- read.delim(textConnection(getURL("https://raw.github.com/JoFrhwld/TD-Classifier/master/Data/new_buckeye.txt")))
eval(parse(text = getURL("https://raw.github.com/JoFrhwld/TD-Classifier/master/R/buckCoder.R")))
b <- subset(b, !(FolSeg %in% c("apical")) & PreSeg != "/r/")
psm <- subset(b, Gram2 %in% c("past", "semiweak", "mono"))
psm$Word <- as.character(psm$Word)
psm$Gram2 <- as.factor(psm$Gram2)
psm$Gram2 <- relevel(psm$Gram2, "past")
```

The first key `dplyr` function is `group_by()`, which creates a grouped data frame (it also works with a bunch of other data table types, including `data.table`; see `?group_by()`).

```
library(dplyr)
psm_gr <- group_by(psm, Gram2, Speaker, Word)
psm_gr
```

```
## Source: local data frame [8,962 x 26]
## Groups: Gram2, Speaker, Word
##
## Speaker Recording Word WordBegin WordEnd POS seg SegTrans
## 1 s01 s0101a lived 47.53 47.66 VBN d d
## 3 s01 s0101a raised 51.17 51.24 VBN d d
## 4 s01 s0101a west 51.50 51.69 JJ t s
## 5 s01 s0101a kind 60.12 60.21 NN d nx
## 6 s01 s0101a received 67.05 67.41 VBD d d
## 11 s01 s0101a different 92.08 92.24 JJ t t
## 12 s01 s0101a different 94.10 94.81 JJ t t
## 17 s01 s0101a kind 124.20 124.24 NN d nx
## 19 s01 s0101a kind 126.27 126.36 NN d nx
## 21 s01 s0101a suggest 162.60 163.41 VB t t
## .. ... ... ... ... ... ... ... ...
## Variables not shown: PreSegTrans (fctr), FolSegTrans (fctr), DictNSyl
## (int), NSyl (int), PreWindow (int), PostWindow (int), WindowBegin (dbl),
## WindowEnd (dbl), DictRate (dbl), Rate (dbl), FolWord (fctr), Context
## (fctr), Gram (chr), Gram2 (fctr), PreSeg (fctr), FolSeg (chr), DepVar
## (chr), td (dbl)
```

Now, first things first: there are `8962` rows to this data, and I previously made the stupid move of just printing it all, but the `grouped_df` class is a bit smarter in the way it prints, just showing the first few rows and columns. That’s pretty nice. I’ll also call your attention to the second line of the printed output, where it says `Groups: Gram2, Speaker, Word`.

Next, we can `summarise` the data.

`summarise(psm_gr, td = mean(td))`

```
## Source: local data frame [3,299 x 4]
## Groups: Gram2, Speaker
##
## Gram2 Speaker Word td
## 1 past s40 caused 0.5000
## 2 mono s40 racist 0.5000
## 3 mono s40 adopt 1.0000
## 4 past s40 teased 1.0000
## 5 mono s40 effect 1.0000
## 6 mono s40 fact 1.0000
## 7 semiweak s40 felt 1.0000
## 8 mono s40 against 1.0000
## 9 past s40 legalized 1.0000
## 10 past s40 used 0.3333
## .. ... ... ... ...
```

Here we see the mean rate of TD retention per grammatical class, per speaker, per word. But look at the second line of the output. It started out with `Groups: Gram2, Speaker, Word`, but now it’s `Groups: Gram2, Speaker`. Apparently, every time you `summarise` a grouped data frame, the last layer of grouping (here, `Word`) is stripped away. This is really useful for the kind of nested calculation of proportions that I argued is necessary for descriptive statistics in that Val Systems post. And it’s easy to do with `dplyr`, which uses a custom operator, `%.%`, that allows you to chain up these data manipulation operations. Here’s how you’d do the calculation of the rate of TD retention.

```
td_gram <- psm %.%
  group_by(Gram2, Speaker, Word) %.%
  summarise(td = mean(td)) %.%  # over word
  summarise(td = mean(td)) %.%  # over speaker
  summarise(td = mean(td))      # over grammatical class
td_gram
```

```
## Source: local data frame [3 x 2]
##
## Gram2 td
## 1 past 0.7779
## 2 mono 0.5981
## 3 semiweak 0.7150
```

I’ve been doing something similar for calculating mean F1 and F2 for vowels, which would look like this.

```
vowel_means <- vowels %.%
  group_by(Speaker, Age, Sex, DOB, VowelClass, Word) %.%
  summarise(F1.n = mean(F1.n), F2.n = mean(F2.n)) %.%
  summarise(F1.n = mean(F1.n), F2.n = mean(F2.n))
```

*by Joe*

*-February 8, 2014*

I’m considering blogging at my github domain. Right now, my main github site is hosting my academic web page, so I’ve set up a repository called ‘blog’ to host the blog, so that it’ll show up with the nice url jofrhwld.github.io/blog. It seems like a pretty nice approach, and I wonder how many others use a github ‘blog’ repository in a similar way.

Thematically, I think I’ll write up things here that I feel are a bit too technical for Val Systems, but I’ll probably cross reference between here and there a lot.

Or, I might just abandon this whole thing, who knows.