Lin611 - Spring 2023 - Multivariate models aren’t just univariate models glued together

Like the title says, multivariate models aren’t just univariate models glued together. In fact, there’s no straightforward way to calculate, ahead of time, what the parameters of a model y ~ x1 + x2 will be if all you have are separate univariate models y ~ x1 and y ~ x2. I’ll demonstrate that with an example modelling the bill depth of penguins, using bill length and species as predictors.

Artwork by @allison_horst

Loading packages

These are the packages we’ll use in this lesson:

library(tidyverse)

library(ggbeeswarm)
library(khroma)
library(patchwork)
theme_set(theme_bw())

library(palmerpenguins)

library(broom)
library(marginaleffects)

1: General data processing
2: Data visualization
3: The penguins data
4: Model investigation

Penguins, first look

Let’s grab just the data columns we’re focusing on.

penguins |>
  select(bill_depth_mm, bill_length_mm, species) |>
  drop_na() ->
  penguin_focus

1: The original penguins data.
2: Selecting just the data columns we’re focused on.
3: Dropping rows that have NA values.
4: The new data frame.

Preparing to model

It’s always a good idea to take a look at how each of the variables (outcome and predictor) are distributed.

penguin_focus |> 
  ggplot(aes(bill_depth_mm)) +
    stat_bin() +
    labs(title = "Bill Depth Histogram (outcome)")

1: Plotting a histogram of bill_depth_mm.

penguin_focus |> 
  ggplot(aes(bill_length_mm)) +
    stat_bin() +
    labs(title = "Bill Length Histogram (predictor)")

1: Plotting a histogram of bill_length_mm

penguin_focus |> 
  ggplot(aes(species, fill = species))+
    stat_count()+
    khroma::scale_fill_bright() +
    labs(title = "Species Counts (predictor)")

1: Plotting counts of each species.
2: Setting a color scheme for species that we’ll reuse later.

Let’s “normalize” the bill_length_mm data for modeling by z-scoring it.

penguin_focus |> 
  mutate(
    bill_length_z = (bill_length_mm -
                       mean(bill_length_mm)) /
                    sd(bill_length_mm)
  ) ->
  penguin_focus

1: mutate() adds new columns.
2: We’re subtracting the mean and dividing by the standard deviation. \(\frac{x-\text{mean}(x)}{\text{sd}(x)}\)
3: Assigning the result back out.

Let’s look at the z-scored histogram

penguin_focus |> 
  ggplot(aes(bill_length_z)) +
    stat_bin()+
    labs(title = "Bill Length (z-scored)")

1: Plotting a histogram of bill_length_z.

Fitting the univariate models

Normally, I’d recommend making a plot of the outcome & predictor variable before fitting a model, but just to avoid repeating too much code here, we’ll fit the models first, then plot their predicted values on the data.

Bill Length Model

We’ll start with a univariate (one variable) model, predicting bill_depth_mm with bill_length_z.

mod_len <- lm(
   bill_depth_mm ~
     bill_length_z,
   data = penguin_focus)

1: Fitting a linear model
2: bill_depth_mm is our outcome variable
3: bill_length_z is our predictor variable
4: Using the penguin_focus data

We can look at the model parameters in a dataframe format using broom::tidy.

tidy(mod_len)

# A tibble: 2 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     17.2       0.104    165.   0        
2 bill_length_z   -0.464     0.104     -4.46 0.0000112

The estimated effect of bill_length_z is -0.464, meaning that for every 1 standard deviation increase in bill length, we expect bill depth to decrease by -0.464 millimeters.

We can get the bill depth values this model predicts with the marginaleffects::predictions() function.

predictions(
  mod_len,
  newdata = datagrid(
    bill_length_z = c(-2, 1, 0, 1, 2)
  )
) |> 
  tibble()->
  pred_len

1: predictions() is from the {marginaleffects} package.
2: This is the model we want to get predictions from.
3: We want to get predicted values at specific points/combinations of predictors, so we use the newdata argument. datagrid() is another function from {marginaleffects} that will create a data frame.
4: We want to get predicted bill_depth_mm values for some representative bill_length_z values.
5: The output of predictions() isn’t quite what we want for making a plot, so we convert it with tibble().

The pred_len dataframe

pred_len |> 
  rmarkdown::paged_table()

We can plot these predicted values over top the original data. One thing to know is that the line we plot here is identical to what we’d get if we just added stat_smooth(method = lm) to the plot.

ggplot(
  penguin_focus,
  aes(bill_length_z, bill_depth_mm)
)+
  geom_point() +
  geom_line(
    data = pred_len,
    aes(y = estimate),
    color = "steelblue"
  )+
  geom_ribbon(
    data = pred_len,
    aes(
      ymin = conf.low,
      ymax = conf.high
    ),
    alpha = 0.3
  )

1: Scatterplot of the full data.
2: Adding a line to represent the predicted values, which are in the pred_len dataframe.
3: In pred_len, the predicted bill depth is in the estimate column.
4: Adding a confidence interval onto the predicted values with geom_ribbon().
5: The width of geom_ribbon() is determined by the ymax and ymin aesthetics, and the high and low points of the confidence interval are in the conf.low and conf.high columns of pred_len.
6: Setting a transparency on the confidence interval.

So, as you can see, the line of predicted values of bill depth slopes downwards as bill length increases when we include only bill_length_z in the model.

Species Model.

Now, let’s fit a model looking at just an effect of Species. I’ve left the alphabetically first penguin species, Adelie, as the reference level.

mod_species <- lm(
  bill_depth_mm ~
    species,
  data = penguin_focus
)

1: We’re fitting a linear model.
2: Our outcome variable is bill_depth_mm.
3: Our predictor variable is species.
4: We’re drawing the data from penguin_focus.

And again, we can look at the model parameters with broom::tidy.

tidy(mod_species)

# A tibble: 3 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       18.3       0.0912   201.    0       
2 speciesChinstrap   0.0742    0.164      0.453 6.50e- 1
3 speciesGentoo     -3.36      0.136    -24.7   7.93e-78

So, the predicted bill depth for our reference level (Adelie) is 18.35 millimeters. Chinstraps are estimated to have deeper bills by 0.07mm (not significant), and Gentoos are estimated to have less deep bills by -3.36mm.

Let’s get the predicted bill depths from our model to plot over our data.

predictions(
  mod_species,
  newdata = datagrid(
    species = c("Adelie", "Chinstrap", "Gentoo")
  )
) |> 
  tibble()->
  pred_species

1: predictions() from {marginaleffects}.
2: We’re getting predictions from our mod_species model.
3: We’re getting predicted values for this newdata generated by datagrid.
4: We want predicted values for these specific species.
5: Conversion to a dataframe.

pred_species

pred_species |> 
  rmarkdown::paged_table()

To plot the original data by species, I’ll create a “beeswarm” plot, and add the predicted values on top with a “pointrange”

ggplot(
  penguin_focus,
  aes(
    species,
    bill_depth_mm,
    color = species
  )
)+
  geom_beeswarm(
    size = 0.5,
    shape = 1
  )+
  geom_pointrange(
    data = pred_species,
    aes(
      y = estimate,
      ymin = conf.low,
      ymax = conf.high
    )
  )+ 
  khroma::scale_color_bright()

1: Setting up the plot of the original data.
2: Strictly speaking, coloring the data by species is redundant for this plot, but we’re sticking to a common color scheme across all plots.
3: ggbeeswarm::geom_beeswarm() is one way to try to deal with plotting a continuous variable against a categorical variable.
4: I’m making the data points small and open circles so that they’ll be visually distinct from the points we use to plot the predicted values.
5: geom_pointrange() plot a point to represent the estimated value, and a “whisker” to represent the confidence interval.
6: Our predicted values are in pred_species.
7: The predicted value is in the estimate column, and the high and low points for the confidence intervals are in conf.high and conf.low.
8: Re-using our consistent color scheme.

Univariate model recap.

Just to recap what we’ve seen from our two separate univariate models:

When we modelled bill_depth_mm with bill_length_z, we got an estimated slope of -0.464 .
When we predicted bill_depth_mm with species, we estimated that Chinstraps have a bill depth 0.07mm more than Adelies, and Gentoos have a bill depth -3.36 less than Adelies.

Fitting the multivariate model

Now, let’s see what we get when we fit a multivatiate model, including both bill length and species in the model.

mod_len_species <- lm(
  bill_depth_mm ~
    bill_length_z +
    species,
  data = penguin_focus
)

1: We’re fitting a linear model
2: Our outcome variable is bill_depth_mm
3: Our predictor variables are bill_length_z and species
4: We’re drawing the data from the penguin_focus dataframe.

Let’s look at our model parameters now!

tidy(mod_len_species)

# A tibble: 4 × 5
  term             estimate std.error statistic   p.value
  <chr>               <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)         19.4     0.119     163.   2   e-323
2 bill_length_z        1.09    0.0955     11.4  8.66e- 26
3 speciesChinstrap    -1.93    0.224      -8.62 2.55e- 16
4 speciesGentoo       -5.11    0.191     -26.7  3.65e- 85

The estimated effects of bill length and species in this multivariate model are really different from each separate univariate model. Let’s look at them right next to each other, comparing the multivariate model to the “glued together” univariate model estimates.

term	multivariate	glued togther univariate
bill_length_z	1.09	−0.46
speciesChinstrap	−1.93	0.07
speciesGentoo	−5.11	−3.36

The effect of bill_length_z has gone from negative in the univariate model to positive in the multivariate model, meaning we’re now predicting that as bill length gets longer, bill depth gets deeper.
Chinstraps have gone from having a predicted bill depth that’s just about the same as Adelies in the univariate model to a predicted shorter bill depth.
Gentoos predicted to have a less deep bill than Adelies in both models, but the difference is predicted to be larger in the multivariate model.

Why are the estimates so different between the multivariate and univariate models? Let’s get predicted values to plot over the data.

predictions(
  mod_len_species,
  newdata = datagrid(
    bill_length_z = c(-2, 1, 0, 1, 2),
    species = c("Adelie", "Chinstrap", "Gentoo")
  )
) |> 
  tibble() ->
  pred_len_species

1: predictions() from {marginaleffects}.
2: We’re getting predictions from our multivariate mod_len_species model.
3: We’re going to get predictions for the newdata generated by datagrid(). This time, because we’re creating two columns, datagrid() is going to create a row for every combination of values in bill_length_z and species.
4: We want predictions for representative values of bill_length_z and species.
5: Conversion to a dataframe.

pred_len_species

pred_len_species |> 
  rmarkdown::paged_table()

Let’s first plot these predicted values over the bill length x bill depth scatterplot.

ggplot(
  penguin_focus,
  aes(
    bill_length_z,
    bill_depth_mm,
    color = species
    )
)+
  geom_point()+
  geom_line(
    data = pred_len_species,
    aes(y = estimate)
  )+
  geom_ribbon(
    data = pred_len_species,
    aes(
      ymin = conf.low,
      ymax = conf.high,
      fill = species
    ),
    color = NA,
    alpha = 0.3
  ) +
  scale_color_bright()+
  scale_fill_bright()

1: Setting up the scatter plot, this time coloring points according to penguin species.
2: Adding on the predicted values from pred_len_species.
3: Adding on confidence intervals.
4: The plot looks nicer with these settings.
5: Using our color scheme for both the color of the points & lines and the fill of the ribbons.

What we can see now is that when looking at the data overall, the relationship between bill length and bill depth might be negative, but when looking within each species, the relationship is positive. For Adelies, penguins with longer bills have deeper bills, and the same goes for Chinstraps and Gentoos.

The original slope in the univariate model was negative because each species has a different baseline bill depth and different ranges of bill lengths. When fitting a univariate model that didn’t take species into account, this results in a negative slope! This mismatch between the effect within groups vs combining groups is known as Simpson’s Paradox.

Remaining Questions:

Could we have predicted what the slopes were going to be in the multivariate model?

When we were learning about centering and scaling predictors for univariate models, it was possible to calculate what the new parameters were going to be after (linearly) transforming them. Could we have mathematically calculated what the slopes were going to be in the multivariate model based on the two univariate models?

No. The negative slope in the bill_depth_mm ~ bill_length_z, for example, was estimated with information about species missing (i.e. bill_depth_mm’s relationship to species and bill_length_z’s relationship to species).

Did the univariate models miscalculate the relationship between bill length, species, and bill depth?

No, not really. As it turns out, the univariate models aren’t very good models for predicting bill depth. If we compare their model fits as well as \(R^2\) and goodness of fit heuristics, AIC and BIC, the multivariate model is preferred.

model	r.squared	AIC	BIC
~len	0.06	1,422	1,433
~species	0.68	1,054	1,069
~len + species	0.77	944	963

But, just because the univariate models aren’t ideal, that doesn’t mean they miscalculated anything. When trying to estimate the effect of bill length on bill depth in the absence of species information, the slope is negative.

How do I visualize fitted multivariate models?

One important takeaway is that in order to visualize the results of a multivariate model, you can’t just plot the fit of a univariate model. This plot doesn’t show a line with a slope of 1.09.

ggplot(
  penguin_focus, 
  aes(
    bill_length_z, 
    bill_depth_mm
    )
)+
  geom_point()+
  stat_smooth(method = lm)

Instead, you need to get predicted values from the multivariate model. The code above used marginaleffects::predictions() to do that, but there are actually lots of ways to go about it. For example, here’s some base R code to do it.

pred <- expand.grid(
  bill_length_z = c(-2, 1, 0, 1, 2),
  species = c("Adelie", "Chinstrap", "Gentoo")
)

pred$estimate <- predict(mod_len_species, newdata = pred)

ggplot(
  penguin_focus,
  aes(
    bill_length_z,
    bill_depth_mm,
    color = species
  )
)+
  geom_point() +
  geom_line(
    data = pred,
    aes(y = estimate)
  )+
  scale_color_bright()

Reuse

CC-BY-SA 4.0