A&S500/Lin517 - Data Viz notes

Why do data visualization?

Here’s a classic example called Anscombe’s quartet.

Loading libraries

library(tidyverse)
library(khroma)
library(gt)
library(colorspace)
library(datasauRus)
library(gganimate)

code for making the plot

# Reshape anscombe (columns x1..x4, y1..y4) into one row per point per series
anscombe_long <- anscombe |> 
  mutate(rowid = row_number()) |> 
  pivot_longer(-rowid) |> 
  mutate(
    dim = str_extract(name, "[xy]"),     # "x" or "y"
    series = str_extract(name, "\\d")    # "1" through "4"
  ) |> 
  select(-name) |> 
  pivot_wider(
    names_from = dim,
    values_from = value
  )

# One facet per series, colored with khroma's "bright" scheme
anscombe_long |> 
  ggplot(aes(x, y)) +
    geom_point(
      aes(color = series),
      size = 3
    ) +
    scale_color_bright(
      guide = "none"
    ) +
    facet_wrap(~series, labeller = label_both) +
    theme(aspect.ratio = 1)

Figure 1: Anscombe’s quartet. A scatter plot with 4 facets, one per data series. The 4 series are very different in their overall shape and distribution.

But a simple correlation test within each series results in nearly identical values.

series	estimate	statistic	p.value	parameter	conf.low	conf.high
1	0.816	4.241	0.002	9	0.424	0.951
2	0.816	4.239	0.002	9	0.424	0.951
3	0.816	4.239	0.002	9	0.424	0.951
4	0.817	4.243	0.002	9	0.425	0.951
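
The code behind this table is not shown in these notes. A minimal sketch of one way to get comparable numbers, assuming the broom package is installed (it is not among the libraries loaded above), and rebuilding the long-format data with a compact `pivot_longer()` call so the block runs on its own:

```r
library(tidyverse)
library(broom)  # assumption: broom is not loaded elsewhere in these notes

# Reshape anscombe (columns x1..x4, y1..y4) to one row per point per series
anscombe_long <- anscombe |>
  pivot_longer(
    everything(),
    names_to = c(".value", "series"),
    names_pattern = "([xy])(\\d)"
  )

# One Pearson correlation test per series, tidied into a single row each
cor_results <- anscombe_long |>
  group_by(series) |>
  summarise(tidy(cor.test(x, y))) |>
  select(-method, -alternative)

cor_results
```

Here `parameter` is the degrees of freedom of the test (n - 2 = 9 for the 11 points in each series).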

The same goes for fitting linear regressions to each series.

series	term	estimate	std.error	statistic	p.value
1	(Intercept)	3.000	1.125	2.667	0.026
1	x	0.500	0.118	4.241	0.002
2	(Intercept)	3.001	1.125	2.667	0.026
2	x	0.500	0.118	4.239	0.002
3	(Intercept)	3.002	1.124	2.670	0.026
3	x	0.500	0.118	4.239	0.002
4	(Intercept)	3.002	1.124	2.671	0.026
4	x	0.500	0.118	4.243	0.002
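
The regression code is likewise not shown here. One possible sketch, again assuming broom is available and rebuilding the long-format data so the block stands alone, fits `y ~ x` separately within each series and stacks the coefficient tables:

```r
library(tidyverse)
library(broom)  # assumption: broom is not loaded elsewhere in these notes

# Reshape anscombe (columns x1..x4, y1..y4) to one row per point per series
anscombe_long <- anscombe |>
  pivot_longer(
    everything(),
    names_to = c(".value", "series"),
    names_pattern = "([xy])(\\d)"
  )

# Nest the points of each series, fit a linear model to each nested
# data frame, and unnest the tidied coefficient tables
lm_results <- anscombe_long |>
  nest(data = -series) |>
  mutate(coefs = map(data, \(d) tidy(lm(y ~ x, data = d)))) |>
  select(series, coefs) |>
  unnest(coefs)

lm_results
```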

A more recent and fun example of extremely different underlying data which have (nearly) identical parametric summaries is the “datasaurus dozen” (Matejka and Fitzmaurice 2017; Davies et al. 2022).

animation code

datasaurus_dozen |> 
  # numeric dataset index, used only to drive the continuous color scale
  mutate(dataset_n = as.numeric(as.factor(dataset))) |> 
  group_by(dataset) |> 
  # per-dataset point id, so point i in one state is tweened to point i in the next
  mutate(id = row_number()) |> 
  ggplot(aes(x, y, color = dataset_n)) +
    geom_point(aes(group = id)) +
    scale_color_buda(guide = "none") +
    # ggdark is not attached above, so it is called through its namespace
    ggdark::dark_theme_gray(base_size = 16) + 
    theme(
      text = element_text(family = "sans"),
      plot.background = element_rect(fill = "#20374c"),
      strip.background = element_rect(fill = "#31465a"),
      legend.background = element_rect(fill = "#20374c"),
      panel.background = element_blank(),
      panel.grid.major = element_line(color = "#8595A8", linewidth = 0.2),
      panel.grid.minor = element_line(color = "#536477", linewidth = 0.2),
      axis.ticks = element_blank()
    ) +
    # the title shows the name of the dataset currently on screen
    labs(title = "{closest_state}") +
    transition_states(dataset, transition_length = 3, state_length = 2) +
    ease_aes(default = "cubic-in-out")

Figure 2: The datasaurus dozen (Matejka and Fitzmaurice 2017; Davies et al. 2022). An animated GIF cycling between thirteen scatter plots of different data distributions, one of which is a dinosaur.

Again, each separate data series here has nearly identical parametric summaries.

metrics code

datasaurus_dozen |> 
  group_by(dataset) |> 
  summarise(
    # name the columns explicitly; calling across() without .cols is deprecated
    across(c(x, y), list(mean = mean, sd = sd)),
    xy_cor = cor(x, y)
  ) |> 
  gt() |> 
  fmt_number(
    columns = -dataset,
    decimals = 3
  )

dataset	x_mean	x_sd	y_mean	y_sd	xy_cor
away	54.266	16.770	47.835	26.940	−0.064
bullseye	54.269	16.769	47.831	26.936	−0.069
circle	54.267	16.760	47.838	26.930	−0.068
dino	54.263	16.765	47.832	26.935	−0.064
dots	54.260	16.768	47.840	26.930	−0.060
h_lines	54.261	16.766	47.830	26.940	−0.062
high_lines	54.269	16.767	47.835	26.940	−0.069
slant_down	54.268	16.767	47.836	26.936	−0.069
slant_up	54.266	16.769	47.831	26.939	−0.069
star	54.267	16.769	47.840	26.930	−0.063
v_lines	54.270	16.770	47.837	26.938	−0.069
wide_lines	54.267	16.770	47.832	26.938	−0.067
x_shape	54.260	16.770	47.840	26.930	−0.066

Mapping

More should be more in the spatial metaphor

- More should be up
- Less-to-more should probably move from left to right
- More should be more distinct from the background color
  - Darker for the default “white” page
  - Lighter for dark mode
- More should be larger or thicker, less should be smaller or thinner
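
As a rough illustration of these mappings (not code from the original notes), the sketch below uses the built-in `mtcars` data and treats horsepower (`hp`) as the quantity of interest. Higher values sit higher on the y axis, are drawn darker against a white background, and get larger points. The dataset and the specific colors and size range are arbitrary choices for the example.

```r
library(ggplot2)

# mtcars is a built-in dataset; hp (horsepower) stands in for "more or less"
p_mapping <- ggplot(mtcars, aes(x = wt, y = hp)) +
  # more hp = higher on the y axis (position), darker (color), larger (size)
  geom_point(aes(color = hp, size = hp)) +
  # light grey for less, black for more: suited to a white page
  scale_color_gradient(low = "grey80", high = "black") +
  scale_size(range = c(1, 5)) +
  theme_minimal()

p_mapping
```

For a dark background, the same idea would swap the `low` and `high` colors so that more is lighter.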

Colors should adhere to, or at least not cross-cut, the visual culture:

- green = go, red = stop
- red = hot, blue = cold
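
A small sketch of the red = hot, blue = cold convention, using the built-in `airquality` data (daily temperatures in degrees Fahrenheit, New York, May to September 1973). Centering the scale on the median temperature is an arbitrary choice for this example, not a recommendation from the notes.

```r
library(ggplot2)

# airquality is a built-in dataset; Temp is the daily temperature in °F
p_temp <- ggplot(airquality, aes(x = Day, y = factor(Month), fill = Temp)) +
  geom_tile() +
  # diverging scale: blue for colder days, red for hotter days
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = median(airquality$Temp)
  ) +
  labs(y = "Month", fill = "Temp (°F)")

p_temp
```

Reversing `low` and `high` would still encode the same data, but it would cut against what most readers expect from a temperature map.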

References

Davies, Rhian, Steph Locke, Alberto Cairo, Justin Matejka, George Fitzmaurice, Lucy D’Agostino McGowan, Richard Cotton, Tim Book, and Jumping Rivers. 2022. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.

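Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–94. Denver, Colorado, USA: ACM. https://doi.org/10.1145/3025453.3025912.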