library(tidyverse)
library(ggdag)
library(broom)
Setup
set.seed(2023-9-25)
source(here::here("_defaults.R"))
Upshot
I get confused thinking about the causality DAGs in terms of models. The arrows and paths in the DAGs don’t directly translate to terms. in the models. An included term in a model won’t necessarily correspond to any simple path from the term to the outcome.
DAG typology
I’ve been reading up a bunch more about causal inference, how these DAGs work, and getting more comfortable with tools like {ggdag}
and dagitty. So here’s my attempt to work through the “Confronting Confounding” section of chapter 6. I’ll try to make up linguistic examples as I go along.
The classic confounder: the fork
The classic confounding variable is one which is a common cause of both the outcome and the predictor of interest. For example, we might be interested in how speech rate affects the voice onset time of voiceless stops.
two variable DAG
dagify(
~ rate,
vot labels = list(rate = "speech rate",
vot = "VOT"),
outcome = "vot",
exposure = "rate",
coords = tribble(
~name, ~x, ~y,
"vot", 1, 0,
"rate",0, 0
)|>
) ggdag_status(use_labels = "label",
text = F)+
::scale_color_bright()+
khroma::scale_fill_bright()+
khromatheme_dag()
But, there are variables (observed or unobserved) that can be a common cause of both speech rate and VOT. For example, speech style.
three variable dag
dagify(
~ rate + style,
vot ~ style,
rate labels = list(rate = "speech rate",
vot = "VOT",
style = "speech style"),
outcome = "vot",
exposure = "rate",
coords = tribble(
~name, ~x, ~y,
"vot", 1, 0,
"rate",0, 0,
"style",0.5, 1
)->
) three_dag
Code
|>
three_dag ggdag_status(use_labels = "label",
text = F)+
::scale_color_bright(na.value = "grey30")+
khroma::scale_fill_bright(na.value = "grey30")+
khromatheme_dag()
You can plot all of the paths between speech rate and VOT with ggdag_paths()
, but to be honest, I still struggle with reading the fork as a path “from” speech rate, “to” VOT because of the directions of the arrows.
tidy_dagitty(three_dag) |>
dag_paths() |>
node_status()|>
filter(!is.na(path)) |>
ggplot(
aes(x = x, y = y, xend = xend, yend = yend)
+
)geom_dag_point(aes(color = status))+
geom_dag_edges()+
geom_dag_label_repel(aes(label = label))+
::scale_color_bright(na.value="grey60", guide = "none")+
khromafacet_wrap(~set)+
theme_dag()
It’s easier for me to see what each of the paths are by using dagitty::paths()
::paths(three_dag)$path dagitty
[1] "rate -> vot" "rate <- style -> vot"
In order to accurately estimate the direct effect of speech rate on VOT, we need to “condition on” or “adjust for” speech style. That is, we need to include it in the model. You can get the adjustment sets from {ggdag}
or {dagitty}
.
ggdag_adjustment_set(three_dag, use_labels = "label", text = F)+
theme_dag()
I also find this graphical representation a little bit confusing, with the adjusted-for variable kind of just floating there? The model we wind up fitting is going to look like
vot ~ rate + style
Meaning we are going to get an estimate of style -> vot
. Maybe one of these mermaid diagrams is the way to go? Here I’ve got the original DAG with the “backdoor” open. Links that aren’t included in the model are in dotted lines.
By including style in the model, you’re blocking its flow backward through rate.
Other DAG configurations
The “elemental” DAG configurations are
The Pipe
The pipe is the important configuration for “post-treatment bias.
dagify(
~ vtl,
schwa ~ height,
vtl outcome = "schwa",
exposure = "height",
labels = list(schwa = "schwa formants",
vtl = "vocal tract length",
height = "height")
->pipe_dag )
Code
|>
pipe_dag ggdag_status(use_labels = "label", text = F)+
theme_dag()+
::scale_color_bright(na.value = "grey60")+
khroma::scale_fill_bright(na.value = "grey60") khroma
::paths(pipe_dag)$paths dagitty
[1] "height -> vtl -> schwa"
There’s no direct causal effect of height on schwa formants, but if you wanted the total effect, you’d want to just use
formants ~ height
Including vocal tract length will block the “flow” from height. Just modeling schwa formants based on height allow the relationship to “flow” through vocal tract length, which wasn’t included in the model.
If VTL is included in the model, it blocks the “flow” from height.
The collider
Colliders are still confusing to me. Let’s imagine a situation where both word frequency and segment markedness both increase the likelihood of lenition, and we want to see if there’s a relationship between segment markedness and word frequency
dagify(
~ freq + markedness,
lenition outcome = "markedness",
exposure = "freq",
labels = list(
lenition = "lenition rate",
freq = "word frequency",
markedness = "segment markedness"
)->
) collide_dag
Code
|>
collide_dag ggdag_status(use_labels = "label",
text = F)+
theme_dag()+
::scale_color_bright(na.value = "grey60")+
khroma::scale_fill_bright(na.value = "grey60") khroma
::paths(collide_dag)$paths dagitty
[1] "freq -> lenition <- markedness"
There’s no direct effect of frequency on markedness, so markedness ~ freq
should have no effect. But by including the lenition rate, you’d open a causal pathway between the two. I’m not even sure how to indicate this in a mermaid diagram. I need to sketch this out “manually”.
Without “lenition” included in the model, there’s no causal pathway open between markedness and frequency.
But, when you add lenition, it “opens up” a causal pathway where before there wasn’t one.
Revisiting Waffle House Divorces
data(WaffleDivorce, package = "rethinking")
head(WaffleDivorce) |> rmarkdown::paged_table()
So, the possible DAG here is
dagify(
~ w + m + a,
d ~ s,
w ~ s + a,
m labels = list(
d = "divorce rate",
w = "waffle houses",
a = "age at marriage",
m = "marriage rate",
s = "south"
),outcome = "d",
exposure = "w"
-> waffle_dag )
|>
waffle_dag ggdag_status(
use_labels = "label",
text = F
+
)theme_dag()+
::scale_color_bright(na.value = "grey60")+
khroma::scale_fill_bright(na.value = "grey60") khroma
We can get the paths from waffle houses to divorce rate:
::paths(waffle_dag)$paths dagitty
[1] "w -> d" "w <- s -> m -> d" "w <- s -> m <- a -> d"
It looks like the most parsimonious thing to do would be to include “south” in the model. Simply including “marriage rate” would actually open up a path, because marriage rate is a collider. According to dagitty (and the book), if we include marriage rate, we also need to include age, to close the path?
library(brms)
library(tidybayes)
|>
WaffleDivorce mutate(
across(
where(is.double),
- mean(x))/sd(x)
\(x) (x
),WaffleHouses = (WaffleHouses-mean(WaffleHouses))/sd(WaffleHouses)
->
) waffle_to_mod
Just the waffle mod
<- brm(
w_d_mod ~ WaffleHouses,
Divorce data = waffle_to_mod,
backend = "cmdstanr",
prior = prior(normal(0, 3), class = b),
file = "w_d"
)
Now, with the South adjustment
<- brm(
w_d_s_mod ~ WaffleHouses + South,
Divorce data = waffle_to_mod,
backend = "cmdstanr",
prior = prior(normal(0, 3), class = b),
file = "w_d_s"
)
With the inappropriate adjustment for just “marriage”.
<- brm(
w_d_m_mod ~ WaffleHouses + Marriage,
Divorce data = waffle_to_mod,
backend = "cmdstanr",
prior = prior(normal(0, 3), class = b),
file = "w_d_m"
)
With the marriage and age adjustment
<- brm(
w_d_m_a_mod ~ WaffleHouses + Marriage + MedianAgeMarriage,
Divorce data = waffle_to_mod,
backend = "cmdstanr",
prior = prior(normal(0, 3), class = b),
file = "w_d_m_a"
)
Now, just to grab estimates for the effect of WaffleHouses for these all.
Code
list(
`~waffle` = w_d_mod,
`~waffle + south` = w_d_s_mod,
`~waffle + marriage` = w_d_m_mod,
`~waffle + marriage + age` = w_d_m_a_mod
|>
) map(
~ spread_draws(.x, b_WaffleHouses)
|>
) list_rbind(
names_to = "model"
|>
) mutate(model = fct_reorder(model, b_WaffleHouses)) |>
ggplot(aes(b_WaffleHouses, model))+
geom_vline(xintercept = 0)+
stat_halfeye()
I wonder whether including a random intercept of location would have a similar effect adjusting for “South”?
<- brm(
w_d_L_mod ~ WaffleHouses + (1|Loc),
Divorce data = waffle_to_mod,
backend = "cmdstanr",
prior = prior(normal(0, 3), class = b),
file = "w_d_l"
)
Code
list(
`~waffle` = w_d_mod,
`~waffle + south` = w_d_s_mod,
`~waffle + marriage` = w_d_m_mod,
`~waffle + marriage + age` = w_d_m_a_mod,
`~waffle + (1|location)` = w_d_L_mod
|>
) map(
~ spread_draws(.x, b_WaffleHouses)
|>
) list_rbind(
names_to = "model"
|>
) mutate(model = fct_reorder(model, b_WaffleHouses)) |>
ggplot(aes(b_WaffleHouses, model))+
geom_vline(xintercept = 0)+
stat_halfeye()
Well, not much!