Data Viz notes

Block 2
data-viz
Author: Josef Fruehwald

Published: January 18, 2023

Why do data visualization?

Here’s a classic example called Anscombe’s quartet.

code for making the plot
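library(tidyverse)  # dplyr, tidyr, stringr, ggplot2
library(khroma)     # scale_color_bright()

# Reshape the built-in `anscombe` data (columns x1..x4, y1..y4) into
# one row per observation per series, with columns rowid, series, x, y
anscombe |>
  mutate(rowid = row_number()) |>
  pivot_longer(-rowid) |>
  mutate(
    dim = str_extract(name, "[xy]"),
    series = str_extract(name, "\\d")
  ) |>
  select(-name) |>
  pivot_wider(
    names_from = dim,
    values_from = value
  ) -> anscomb_long

# One facet per series, with a square aspect ratio in every panel
anscomb_long |>
  ggplot(aes(x, y)) +
    geom_point(
      aes(color = series),
      size = 3
    ) +
    scale_color_bright(
      guide = "none"
    ) +
    facet_wrap(~series, labeller = label_both) +
    theme(aspect.ratio = 1)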

A scatter plot with 4 facets plotting Anscombe's quartet. The 4 data series are very different in their overall shape and distributions.

Figure 1: Anscombe’s quartet

But a simple correlation test within each series results in nearly identical values.

series  estimate  statistic  p.value  parameter  conf.low  conf.high
1       0.816     4.241      0.002    9          0.424     0.951
2       0.816     4.239      0.002    9          0.424     0.951
3       0.816     4.239      0.002    9          0.424     0.951
4       0.817     4.243      0.002    9          0.425     0.951
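
The notes do not show the code behind this table. A minimal sketch of one way to get comparable numbers is below. It is written in base R so it runs without extra packages, and the object name `cor_by_series` is made up for illustration. It assumes the default Pearson test from `cor.test()`.

```r
# Pearson correlation test for each x/y pair in the built-in `anscombe`
# data frame (columns x1..x4 and y1..y4).
cor_by_series <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  test <- cor.test(x, y)  # defaults to Pearson's r with a 95% interval
  data.frame(
    series    = i,
    estimate  = unname(test$estimate),
    statistic = unname(test$statistic),
    p.value   = test$p.value,
    parameter = unname(test$parameter),  # degrees of freedom
    conf.low  = test$conf.int[1],
    conf.high = test$conf.int[2]
  )
}))

print(round(cor_by_series, 3))
```

Each series has 11 observations, which is why the `parameter` column (degrees of freedom) is 9 throughout.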

The same goes for fitting linear regressions to each series.

series  term         estimate  std.error  statistic  p.value
1       (Intercept)  3.000     1.125      2.667      0.026
1       x            0.500     0.118      4.241      0.002
2       (Intercept)  3.001     1.125      2.667      0.026
2       x            0.500     0.118      4.239      0.002
3       (Intercept)  3.002     1.124      2.670      0.026
3       x            0.500     0.118      4.239      0.002
4       (Intercept)  3.002     1.124      2.671      0.026
4       x            0.500     0.118      4.243      0.002
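
As with the correlations, the fitting code is not shown in the notes. The following is a rough base R sketch, assuming a simple `y ~ x` model per series. The names `fits` and `coef_by_series` are illustrative, not the author's.

```r
# Fit y_i ~ x_i separately for each of the four Anscombe series.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect the coefficient tables into one data frame, one row per term.
coef_by_series <- do.call(rbind, lapply(seq_along(fits), function(i) {
  est <- coef(summary(fits[[i]]))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(
    series    = i,
    term      = c("(Intercept)", "x"),
    estimate  = est[, "Estimate"],
    std.error = est[, "Std. Error"],
    statistic = est[, "t value"],
    p.value   = est[, "Pr(>|t|)"],
    row.names = NULL
  )
}))

print(coef_by_series, digits = 3)
```

The t statistic on the slope matches the statistic from the correlation test, since both test the same linear association in a one-predictor model.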

A more recent and fun example of extremely different underlying data which have (nearly) identical parametric summaries is the “datasaurus dozen” (Matejka and Fitzmaurice 2017; Davies et al. 2022).

animation code
library(tidyverse)
library(datasauRus)  # datasaurus_dozen
library(gganimate)   # transition_states(), ease_aes()
library(khroma)      # scale_color_buda()

datasaurus_dozen |>
  # numeric code per dataset, used only to color the points
  mutate(dataset_n = as.numeric(as.factor(dataset))) |>
  group_by(dataset) |>
  # index points within each dataset so point i in one state
  # is tweened to point i in the next
  mutate(id = row_number()) |>
  ggplot(aes(x, y, color = dataset_n)) +
    geom_point(aes(group = id)) +
    scale_color_buda(guide = "none") +
    ggdark::dark_theme_gray(base_size = 16) +
    theme(
      text = element_text(family = "sans"),
      plot.background = element_rect(fill = "#20374c"),
      strip.background = element_rect(fill = "#31465a"),
      legend.background = element_rect(fill = "#20374c"),
      panel.background = element_blank(),
      panel.grid.major = element_line(color = "#8595A8", linewidth = 0.2),
      panel.grid.minor = element_line(color = "#536477", linewidth = 0.2),
      axis.ticks = element_blank()
    ) +
    # the title shows the name of the dataset currently on screen
    labs(title = "{closest_state}") +
    transition_states(dataset, transition_length = 3, state_length = 2) +
    ease_aes(default = "cubic-in-out")

An animated gif cycling between thirteen different scatter plots of data distributions. One of them is a dinosaur.

Figure 2: The datasaurus dozen (Matejka and Fitzmaurice 2017; Davies et al. 2022)

Again, each separate data series here has nearly identical parametric summaries:

metrics code
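library(gt)  # gt(), fmt_number()

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    # mean and sd of x and y; the columns are named explicitly because
    # calling across() without .cols is deprecated as of dplyr 1.1.0
    across(c(x, y), list(mean = mean, sd = sd)),
    xy_cor = cor(x, y)
  ) |>
  gt() |>
  fmt_number(
    columns = -dataset,
    decimals = 3
  )

dataset x_mean x_sd y_mean y_sd xy_cor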
away 54.266 16.770 47.835 26.940 −0.064
bullseye 54.269 16.769 47.831 26.936 −0.069
circle 54.267 16.760 47.838 26.930 −0.068
dino 54.263 16.765 47.832 26.935 −0.064
dots 54.260 16.768 47.840 26.930 −0.060
h_lines 54.261 16.766 47.830 26.940 −0.062
high_lines 54.269 16.767 47.835 26.940 −0.069
slant_down 54.268 16.767 47.836 26.936 −0.069
slant_up 54.266 16.769 47.831 26.939 −0.069
star 54.267 16.769 47.840 26.930 −0.063
v_lines 54.270 16.770 47.837 26.938 −0.069
wide_lines 54.267 16.770 47.832 26.938 −0.067
x_shape 54.260 16.770 47.840 26.930 −0.066

Mapping

Whatever counts as “more” in the data should also be “more” in the visual, spatial metaphor:

  • More should be up

  • Less-to-more should probably run from left to right

  • More should be more distinct from background color

    • Darker for the default “white” page

    • Lighter for dark mode

  • More should be larger or thicker, less should be smaller or thinner
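
The mappings above can be sketched in ggplot2 as follows. The data frame `toy` and its columns are hypothetical, and the specific colors and size range are arbitrary choices rather than recommendations from these notes.

```r
library(ggplot2)

# Hypothetical toy data: `amount` grows steadily with `step`
toy <- data.frame(step = 1:10, amount = seq(5, 50, by = 5))

p <- ggplot(toy, aes(x = step, y = amount)) +
  # more is up (y axis), and less-to-more runs left to right (x axis)
  geom_point(aes(color = amount, size = amount)) +
  # light-to-dark ramp: larger amounts stand out more against a white page
  scale_color_gradient(low = "grey80", high = "black") +
  # larger amounts get larger points
  scale_size_continuous(range = c(1, 6)) +
  theme_bw()

p
```

For a dark-mode figure the ramp would be reversed, so that larger amounts are lighter than the background.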

Colors should adhere to, or at least not cut across, the conventions of the surrounding visual culture:

  • green = go, red = stop

  • red = hot, blue = cold
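
As one illustration of the red = hot, blue = cold convention, here is a small ggplot2 sketch. The `temps` data frame is invented for the example, and centering the scale on 0 °C is an assumption made only to give the diverging palette a midpoint.

```r
library(ggplot2)

# Hypothetical temperature readings in degrees Celsius
temps <- data.frame(
  day    = 1:7,
  temp_c = c(-6, -3, 0, 2, 5, 9, 12)
)

p_temp <- ggplot(temps, aes(x = day, y = temp_c, color = temp_c)) +
  geom_point(size = 4) +
  # diverging scale centered on freezing: blue for colder, red for warmer
  scale_color_gradient2(
    low = "blue", mid = "grey80", high = "red", midpoint = 0
  ) +
  theme_bw()

p_temp
```

Swapping `low` and `high` would still produce a valid plot, but it would cut across the convention and make the figure harder to read.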

References

Davies, Rhian, Steph Locke, Alberto Cairo, Justin Matejka, George Fitzmaurice, Lucy D’Agostino McGowan, Richard Cotton, Tim Book, and Jumping Rivers. 2022. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.
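Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–94. Denver Colorado USA: ACM. https://doi.org/10.1145/3025453.3025912.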