Colliders

Published: September 6, 2023

Setup

Code
library(tidyverse)
library(tidybayes)
library(brms)
library(gt)
library(gtsummary)
library(patchwork)
library(ggblend)
library(ggdensity)
library(ggforce)
library(marginaleffects)
library(dagitty)
library(ggdag)


source(here::here("_defaults.r"))
set.seed(2023-9-6)

Collider Bias

When A has an effect on Z, and B has an effect on Z, then Z is a “collider” variable between A and B.

graph LR
  a["A"]
  b["B"]
  z["Z"]
  
  a --> z
  b --> z

So, if I understand things right, fitting a model with

A ~ B

would result in no effect of B on A, which we can check with a DAG.

dagify(
  Z ~ A,
  Z ~ B
) |> 
  impliedConditionalIndependencies()
A _||_ B

But, if we fit a model with

A ~ B + Z

now there would suddenly seem to be an effect of B on A. As a simple demo, we can repeat the simulation of newsworthiness and trustworthiness.
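But first, a quicker check (my addition, using dagitty::dseparated(), which asks whether two variables are d-separated given a conditioning set): conditioning on the collider Z should d-connect A and B.

checking d-separation
# the same three-node collider DAG as above
dagify(
  Z ~ A,
  Z ~ B
) ->
  collider_dag

# with an empty conditioning set, A and B are d-separated
# (should return TRUE)
dseparated(collider_dag, "A", "B", list())

# conditioning on the collider Z d-connects them
# (should return FALSE), which is why A ~ B + Z
# makes a spurious effect appear
dseparated(collider_dag, "A", "B", "Z")

With that in mind, on to the simulation.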

simulating selection
tibble(
  trustworthiness = rnorm(200),
  newsworthiness = rnorm(200),
  score = trustworthiness + newsworthiness,
  score_percentile = ecdf(score)(score),
  selected = score_percentile >= 0.9
) ->
  research
Code
research |> 
  ggplot(
    aes(
      newsworthiness,
      trustworthiness,
      color = selected
    )
  ) +
    geom_point()+
    theme(
      aspect.ratio = NULL
    )+
    coord_fixed()

Figure 1: Selection bias

Last time we just fit a model to the subset of selected values. This time, I'll model the whole dataset. First, without the collider.

no collider model
brm(
  trustworthiness ~ newsworthiness,
  data = research,
  backend = 'cmdstanr',
  file = "research1"
)->
  research1_mod

Then, with the collider.

collider model
brm(
  trustworthiness ~ newsworthiness + selected,
  data = research,
  backend = 'cmdstanr',
  file = "research2"
)->
  research2_mod
getting parameters
list(
  `no collider` = research1_mod,
  collider = research2_mod
) |> 
  map(
    ~.x |> 
      gather_draws(
        `b_.*`,
        regex = T
      )
  ) |> 
  list_rbind(
    names_to = "model"
  )->
  collide_comp

Focusing just on the intercept and the newsworthiness effect: they went from (correctly) showing no relationship in the model without the collider to showing pretty reliable effects just from including the selected variable.

Code
collide_comp |> 
  filter(
    str_detect(
      .variable,
      "selected",
      negate = T
    )
  ) |> 
  ggplot(
    aes(
      .value,
      .variable,
      color = model
    )
  ) +
    geom_vline(xintercept = 0) +
    stat_pointinterval(
      position = position_dodge(width = 0.2)
    )+
    theme(legend.position = "top")

Figure 2: Comparing estimates from models with and without the collider
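For a quick numeric version of Figure 2 (my addition), brms::fixef() summarizes the population-level effects of each model directly:

comparing fixed effects
# without the collider: the newsworthiness estimate should
# sit near zero
fixef(research1_mod)

# with the collider: newsworthiness should now look reliably
# non-zero (negative, given how selection was simulated)
fixef(research2_mod)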

Haunting!

Here’s the scary part. The illustration from the book is about inter-generational effects on education. Grandparents will have an effect on their children (the parents), and parents will have an effect on their children. The question is: is there any direct effect of grandparents on their grandchildren?

graph LR
  g[Grandparent]
  p[Parent]
  c[Child]
  
  g --> p
  g -.->|?| c
  p --> c
finding the direct effect
dagify(
  p ~ g,
  c ~ p,
  c ~ g
) |> 
  adjustmentSets(
    exposure = "g", 
    outcome = "c",
    effect = "direct"
  )
{ p }

Ok, but the spooky thing: what if there’s a variable (like neighborhood) that affects both the parent and the child, but not the grandparent, and that we didn’t record?

graph LR
  g[Grandparent]
  p[Parent]
  c[Child]
  n[Neighborhood]
  
  g -.->|?| c
  g --> p
  p --> c
  n --> p
  n --> c
  
  style n stroke-dasharray: 5 5

Parent has apparently become a collider, but I’m still trying to noodle through why.


Ok, having stepped away for a bit, I think my problem was some confusion about how the “paths” work in DAGs.

I realized:
  • The connections from one node to another are directed.

  • But when charting a path from a variable to the outcome, you ignore the directedness.

  • Then, you add back in the directedness to diagnose confounder, mediator, collider etc.

So, ignoring the directedness, we have the following paths from Grandparent to Child.

Undirected Paths
  1. Grandparent — Child
  2. Grandparent — Parent — Child
  3. Grandparent — Parent — Neighborhood — Child

Then, we can add in the directedness

Directed Paths
  1. Grandparent → Child
  2. Grandparent → Parent → Child
  3. Grandparent → Parent ← Neighborhood → Child

Because of path 2 (Grandparent → Parent → Child), in order to get the “direct effect” of Grandparent we need to include Parent. But because of path 3 (Grandparent → Parent ← Neighborhood → Child), Parent is also a collider. If we don’t include Neighborhood in the model (maybe because we didn’t measure it!), the estimate for Grandparent is going to get all screwy!

I’m still developing my intuitions for why and how the estimate will get screwy.

With {dagitty} and {ggdag}, you’re supposed to be able to get the paths automatically, but I can’t get the {ggdag} one to give me the collider path.

making the dag
dagify(
  c ~ g + p + n,
  p ~ g + n
) ->
  haunted_dag
with dagitty
haunted_dag |> 
  paths(
    from = "g",
    to = "c"
  ) |> 
  pluck("paths")
[1] "g -> c"           "g -> p -> c"      "g -> p <- n -> c"
with ggdag
haunted_dag |> 
  ggdag_paths(
    from = "g",
    to = "c",
    directed = F
  )+
    theme_dag()

I think ggdag::ggdag_paths() is calling dagitty::paths() underneath, and just isn’t passing the directed argument correctly as of

Sys.Date()
[1] "2023-09-06"

I can get it to plot colliders, though.

haunted_dag |> 
  ggdag_collider()+
  theme_dag()
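To double-check the path reasoning above (this is my addition, still just using {dagitty}), we can ask which paths are open once we adjust for Parent, and what the adjustment set for the direct effect of Grandparent would be. If I’ve read the DAG right, it should be { n, p }.

checking the adjustment
# paths from g to c after adjusting for p: the $open element
# should now flag the collider path through n as open
haunted_dag |> 
  paths(
    from = "g",
    to = "c",
    Z = "p"
  )

# the adjustment set for the *direct* effect of g on c
haunted_dag |> 
  adjustmentSets(
    exposure = "g",
    outcome = "c",
    effect = "direct"
  )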

Simulating the haunted DAG

I’ll mostly just copy the simulation parameters from the book.

setting the direct effects
# Grandparent on parent
b_GP <- 1

# Parent on Child
b_PC <- b_GP

# Grandparent on Child
b_GC <- 0

# Neighborhood on parent and child
b_N <- 2
The simulation
n <- 200

tibble(
  grandparent = rnorm(n),
  neighborhood = rbinom(
    n, 
    size = 1, 
    prob = 0.5
  ),
  parent = rnorm(
    n,
    mean = b_GP * grandparent + 
           b_N * neighborhood
  ),
  child = rnorm(
    n,
    mean = b_GC * grandparent +
           b_PC * parent + 
           b_N  * neighborhood
  )
) ->
  haunted_sim
Code
haunted_sim |> 
  ggplot(
    aes(
      grandparent,
      child,
      color = factor(neighborhood)
    )
  )+
  geom_point()+
  guides(
    color = "none"
  )+
  theme(
    aspect.ratio = 0.8
  )->
  gc_plot

haunted_sim |> 
  ggplot(
    aes(
      parent,
      child,
      color = factor(neighborhood)
    )
  )+
    geom_point()+
  theme_blank_y()+
  guides(
    color = "none"
  )+  
  theme(
    aspect.ratio = 0.8
  )->  
  pc_plot

haunted_sim |> 
  ggplot(
    aes(
      factor(neighborhood),
      child,
      fill = factor(neighborhood),
      color = factor(neighborhood)
    )
  )+
    geom_dots(
      dotsize = 1.3,
      layout = "hex",
      side = "both"
    )+
  labs(x = "neighborhood")+
  theme_blank_y()+
  guides(
    color = "none",
    fill = "none"
  )+  
  theme(
    aspect.ratio = 0.8
  )->
  nc_plot

gc_plot + pc_plot + nc_plot

Figure 3: Simulated ‘haunted’ data

Now, let’s fit the model with just grandparent and parent (because, in this example, we don’t know about the neighborhood).

the haunted collider model
brm(
  child ~ parent + grandparent,
  data = haunted_sim,
  prior = c(
    prior(normal(0,3), class = b)
  ),
  backend = "cmdstanr",
  cores = 4,
  file = "haunted1"
)->
  haunted1_mod

Now let’s get the parameters and compare them to the true values that we created the simulation with.

true values
tribble(
  ~.variable, ~.value,
  "b_Intercept", 0,
  "b_grandparent", b_GC,
  "b_parent", b_PC
)->
  true_param
haunted params
haunted1_mod |> 
  gather_draws(
    `b_.*`,
    regex = T 
  ) ->
  haunted1_params
Code
haunted1_params |> 
  ggplot(
    aes(
      .value,
      .variable
    )
  ) +
    geom_vline(xintercept = 0)+
    stat_halfeye()+
    geom_point(
      data = true_param,
      aes(color = "true value"),
      size = 3
    )+
    scale_color_manual(
      values = ptol_red,
      name = NULL
    )+
    theme(
      legend.position = "top"
    )

Figure 4: Haunted parameters!

As far as prediction goes, the model will do fine, since the statistical associations it captures are real; it’s the causal interpretation (“Grandparents have a negative effect”) that’s wrong.

posterior predictive check
pp_check(haunted1_mod)

Figure 5: Haunted posterior predictive check
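Just to convince myself the haunted estimates aren’t something about the priors or the Bayesian machinery, here’s a quick added sanity check with plain lm() on the same simulated data. It should show the same pattern, since the bias comes from conditioning on the collider, not from the estimation method.

lm sanity check
# ordinary least squares with the same (collider-conditioning)
# model structure; the grandparent coefficient should again
# come out looking negative
lm(
  child ~ parent + grandparent,
  data = haunted_sim
) |> 
  summary()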

Including the variable haunting the DAG ought to improve things, but in reality that assumes we know what it is, and have some measure of it.

Exorcised model
brm(
  child ~ parent + grandparent + neighborhood,
  data = haunted_sim,
  prior = c(
    prior(normal(0,3), class = b)
  ),
  backend = "cmdstanr",
  cores = 4,
  file = "exorcised1"
)->
  exorcised1_mod
true params
tribble(
  ~.variable, ~.value,
  "b_Intercept", 0,
  "b_grandparent", b_GC,
  "b_parent", b_PC,
  "b_neighborhood", b_N
)->
  true_param
exorcised params
exorcised1_mod |> 
  gather_draws(
    `b_.*`,
    regex = T 
  ) ->
  exorcised1_params
Code
exorcised1_params |> 
  ggplot(
    aes(
      .value,
      .variable
    )
  ) +
    geom_vline(xintercept = 0)+
    stat_halfeye()+
    geom_point(
      data = true_param,
      aes(color = "true value"),
      size = 3
    )+
    scale_color_manual(
      values = ptol_red,
      name = NULL
    )+
    theme(
      legend.position = "top"
    )

Figure 6: Exorcised parameters

I’m not sure why my posterior estimates are so much further off than McElreath’s…
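One thing I might try (my own idea, not something from the book) is re-running the simulation with a much larger n, to see how much of the gap is just sampling noise in a 200-row simulation. Here’s a sketch, using a hypothetical sim_haunted() helper and lm() to keep it fast:

bigger simulation sketch
# hypothetical helper: regenerate the simulation at a given size,
# reusing the b_* effects defined above
sim_haunted <- function(n) {
  tibble(
    grandparent = rnorm(n),
    neighborhood = rbinom(n, size = 1, prob = 0.5),
    parent = rnorm(
      n,
      mean = b_GP * grandparent + b_N * neighborhood
    ),
    child = rnorm(
      n,
      mean = b_GC * grandparent + b_PC * parent + b_N * neighborhood
    )
  )
}

# with n = 10,000 the estimates should mostly reflect the model
# structure rather than simulation noise
lm(
  child ~ parent + grandparent,
  data = sim_haunted(1e4)
) |> 
  coef()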