Haunted DAGs

Published: September 1, 2023


Setup

library(tidyverse)
library(tidybayes)
library(brms)
library(gt)
library(gtsummary)
library(patchwork)
library(ggblend)
library(ggforce)
library(marginaleffects)
library(modelsummary)
library(dagitty)
library(ggdag)

source(here::here("_defaults.r"))
set.seed(2023-9-1)

Newsworthiness & Trustworthiness

The first example in the book is about Berkson’s Paradox, which I believe is a kind of selection bias. The question is “Why do so many research results that are newsworthy seem unreliable?” The idea is that funding (or whatever, maybe “editorial decisions to publish a paper”) is based jointly on a study’s trustworthiness and its newsworthiness.
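Since dagitty and ggdag are already loaded in the setup, here’s a minimal sketch of the DAG I have in mind (the collider structure is from the book; the plotting code is my own):

collider DAG sketch
# selection is a collider: both trustworthiness and
# newsworthiness feed into whether a study is selected
dagify(
  selected ~ trustworthiness + newsworthiness
) |> 
  ggdag() +
  theme_dag()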

tibble(
  trustworthiness = rnorm(200),
  newsworthiness = rnorm(200),
  score = trustworthiness + newsworthiness,
  score_percentile = ecdf(score)(score)
) ->
  research
research |> 
  filter(
    score_percentile >= 0.9
  )->
  selected
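As a quick numeric check (my own addition) before plotting, we can compare the correlation in the full sample to the correlation among just the selected studies:

selection correlation check
# correlation among all simulated studies (should be near 0)
cor(research$trustworthiness, research$newsworthiness)
# correlation among the selected top 10% (should be negative)
cor(selected$trustworthiness, selected$newsworthiness)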

research |> 
  ggplot(
    aes(
      newsworthiness,
      trustworthiness
    )
  )+
    geom_point()+
    geom_mark_hull(
      aes(
        filter = score_percentile >= 0.9,
        color = "selected"
      ),
      fill = "grey"
    )+
    stat_smooth(
      aes(
        color = "all"
      ),
      method = lm
    )+
    stat_smooth(
      data = selected,
      aes(
        color = "selected"
      ),
      method = lm
    )+
    coord_fixed()+
    theme(
      aspect.ratio = 1
    )+
    labs(
      color = NULL
    )

Figure 1: The selection effect on the (non)correlation between newsworthiness and trustworthiness

Multicollinearity

In fact, there is nothing wrong with multicollinearity. The model will work fine for prediction. You will just be frustrated trying to understand it.

When I was starting to get into advanced statistical modelling during my PhD (sometime around 2010?), everyone suddenly learned about multicollinearity and got freaked out about it, so this was genuinely new info to me. See also Vanhove (2021), “Collinearity Isn’t a Disease That Needs Curing.”

The illustration from the book was about the relationship between total body height and leg length.

leg-height simulation
tibble(
  n = 100,
  # height in inches
  height = rnorm(n, mean = 70, sd = 3),
  # legs as a proportion of height
  leg_prop = runif(n, 0.4, 0.5),
  left_leg = height * leg_prop,
  right_leg = left_leg + 
    rnorm(n, sd = 0.02)
)->
  height_legs

The simulated right leg length will be, at most, about ±0.06 inches (roughly 1/16th of an inch) different from the left leg.
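A quick sanity check on that claim (my own addition):

checking the left/right difference
# with sd = 0.02, nearly all simulated differences fall
# within ±3 standard deviations ≈ ±0.06 inches
height_legs |> 
  summarise(
    max_diff = max(abs(right_leg - left_leg))
  )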

height_legs |> 
  ggplot(aes(left_leg, height))+
    geom_point()+
    theme(aspect.ratio = 1)->
  lh

height_legs |> 
  ggplot(aes(right_leg, height))+
    geom_point()+
    theme(aspect.ratio = 1)->
  rh

height_legs |> 
  ggplot(aes(left_leg, right_leg))+
    geom_point()+
    theme(aspect.ratio = 1)->
  lr

lh + rh + lr

Figure 2: Data relationships between leg lengths and height

A model with only the left or the right leg is going to be fine.

left leg model
brm(
  height ~ left_leg,
  data = height_legs,
  prior = c(
    prior(normal(70, 100), class = Intercept),
    prior(normal(0, 10), class = b)
  ),
  backend = "cmdstanr",
  file = "height_left"
)->
  height_left_mod

The right leg model:

right leg model
brm(
  height ~ right_leg,
  data = height_legs,
  prior = c(
    prior(normal(70, 100), class = Intercept),
    prior(normal(0, 10), class = b)
  ),
  backend = "cmdstanr",
  file = "height_right"
)->
  height_right_mod

I’m going to get a little fancy to get the parameter estimates from both models all at once.

getting parameter estimates
list(
  left = height_left_mod,
  right = height_right_mod
) |> 
  map(
    ~ gather_draws(
      .x, 
      `.*_leg`,
      regex = TRUE
    )
  ) |> 
  list_rbind(
    names_to = "model"
  )->
  rl_params
rl_params |> 
  ggplot(
    aes(
      .value, 
      .variable,
    )
  )+
    stat_halfeye(
      aes(
        fill = model
      )
    )+
  expand_limits(
    x = 0
  )+
  ylim(
    "b_right_leg",
    "b_left_leg"
  )

Figure 3: Estimated leg parameters from the single-leg models

But if we include both the left and the right leg, the estimate of each one’s parameter gets weird.

both leg model
brm(
  height ~ left_leg + right_leg,
  data = height_legs,
  prior = c(
    prior(normal(70, 100), class = Intercept),
    prior(normal(0, 10), class = b)
  ),
  backend = "cmdstanr",
  file = "height_both"
)->
  height_both_mod
getting parameter estimates
list(
  left = height_left_mod,
  right = height_right_mod,
  both = height_both_mod
) |> 
  map(
    ~ gather_draws(
      .x, 
      `.*_leg`,
      regex = TRUE
    )
  ) |> 
  list_rbind(
    names_to = "model"
  )->
  all_params
all_params |> 
  mutate(
    model = model |>  
      fct_relevel("both", after = Inf)
  ) |> 
  ggplot(
    aes(
      .value, 
      .variable,
    )
  )+
    stat_pointinterval(
      aes(
        color = model
      ),
      position = "dodge"
    )+
  expand_limits(
    x = 0
  )+
  ylim(
    "b_right_leg",
    "b_left_leg"
  )

Figure 4: Oops, multicollinear!

This is the kind of outcome I was taught to handle by doing things like residualizing one predictor against the other, but McElreath says that while the model we tried to write out was:

\[ \text{height} = \text{Intercept} + \beta_1\text{left} + \beta_2\text{right} \]

because the right leg and the left leg are basically the same, we have something more like

\[ \text{height} = \text{Intercept} + (\beta_1 + \beta_2)\text{leg} \]

And because there’s nothing else in the model to pin down the individual values of \(\beta_1\) and \(\beta_2\), they’re all over the place. But crucially, they ought to add up to a value similar to what we got for just left_leg or right_leg in the first two models. They should also be negatively correlated: when one is large and positive, the other should be large and negative, so that their sum stays around the value we got before.

the multicollinear estimates
height_both_mod |> 
  spread_draws(
    `.*_leg`,
    regex = TRUE
  )->
  leg_ests
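Before plotting, a quick numeric check of that negative correlation (my own addition):

posterior correlation check
# the posterior draws of the two leg coefficients should be
# strongly negatively correlated
cor(leg_ests$b_left_leg, leg_ests$b_right_leg)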
leg_ests |> 
  ggplot(
    aes(
      b_left_leg,
      b_right_leg
    )
  )+
    geom_point(
      alpha = 0.1
    )+
    coord_fixed()+
    theme(
      aspect.ratio = 1
    )

Figure 5: Correlated leg estimates

We can add the two parameters together and compare the sum to the original two models.

adding together multicollinear estimates
leg_ests |> 
  mutate(
    .variable = "b_leg",
    .value = b_left_leg + b_right_leg,
    model = "both"
  ) |> 
  bind_rows(
    rl_params
  ) |> 
  mutate(
    model = fct_relevel(
      model,
      "both",
      after = Inf
    )
  )->
  rl_comp
rl_comp |> 
  ggplot(
    aes(
      .value,
      .variable
    )
  )+
    stat_halfeye(
      aes(
        fill = model
      )
    ) +
  expand_limits(
    x = 0
  ) +
  ylim(
    "b_leg",
    "b_right_leg",
    "b_left_leg"
  )

Figure 6: The summed multicollinear parameters compared to the single-leg estimates

“The model will work fine for prediction.”

Just to hammer home the point about the predictive value of the multicollinear model, we can compare its posterior predictive checks to those of the left and right leg models.

posterior predictive checks
pp_check(height_left_mod)+
  labs(title = "model: left")->
  left_pp

pp_check(height_right_mod)+
  labs(title = "model: right") ->
  right_pp

pp_check(height_both_mod)+
  labs(title = "model: both") ->
  both_pp

left_pp + right_pp + both_pp + plot_layout(guides = "collect")

Figure 7: Posterior predictive checks

We can also compare their predictions for a new leg length.

getting predicted values
tibble(
  left_leg = 32,
  right_leg = left_leg + 0.01
)->
  newleg

list(
  left = height_left_mod,
  right = height_right_mod,
  both = height_both_mod
) |> 
  map(
    ~ predictions(
      .x, 
      newdata = newleg
    ) |> 
      posterior_draws()
  ) |> 
  list_rbind(
    names_to = "model"
  )->
  all_pred
all_pred |> 
  ggplot(
    aes(
      draw,
      model
    )
  )+
    labs(
      x = "predicted height"
    )+
    stat_halfeye()

Figure 8: Predicted heights

So what to do?

The upshot of McElreath’s recommendation for what to do about all this multicollinearity is “have a bad time.” There’s no generic answer. Maybe there’s an acceptable way to specify the model depending on the DAG, but also maybe some questions aren’t well put, like “what are the individual contributions of the left leg and the right leg to total height?”
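For example, if the DAG says both legs are just noisy measures of one underlying leg length, one option (my own sketch, not from the book; the avg_leg name and height_avg file are mine) is to collapse them into a single predictor:

averaged leg model sketch
# my own sketch: treat the two legs as noisy measures of one
# underlying quantity and model their average
height_legs |> 
  mutate(
    avg_leg = (left_leg + right_leg) / 2
  ) ->
  height_legs_avg

brm(
  height ~ avg_leg,
  data = height_legs_avg,
  prior = c(
    prior(normal(70, 100), class = Intercept),
    prior(normal(0, 10), class = b)
  ),
  backend = "cmdstanr",
  file = "height_avg"
) ->
  height_avg_mod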

References

Vanhove, Jan. 2021. “Collinearity Isn’t a Disease That Needs Curing.” Meta-Psychology 5 (April). https://doi.org/10.15626/MP.2021.2548.