R Package Exploration (Jan 2023)

Categories: R, R package exploration
Author: Josef Fruehwald
Published: January 27, 2023

As I scroll through my feeds, I often come across a really cool looking package, or a new feature of a package, and then I forget to go back and really kick the tires to see how it works. So I’ve decided to try to set up a workflow where I send the docs or pkgdown pages for the package to a Trello board, and then come back maybe once a month and experiment with them in a blog post.

{ggforce}, {ggdensity} and {geomtextpath}

The packages I want to mess around with today are all extensions to ggplot2, so I’ll load up the palmerpenguins dataset for experimentation.
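The setup chunk isn’t shown in the rendered post, so here is a rough sketch of what it would need to contain. The exact set of library() calls is an assumption, inferred from the functions used below; library(tidyverse) would cover ggplot2 and tidyr in one go, and the |> pipe needs R 4.1 or later.

```r
# Assumed setup: the original chunk isn't shown, so this is a guess
library(ggplot2)         # plotting
library(tidyr)           # drop_na()
library(palmerpenguins)  # the penguins data
library(ggforce)         # geom_mark_hull()
library(ggdensity)       # stat_hdr(), stat_hdr_rug(), stat_hdr_lines()
library(geomtextpath)    # geom = "textpath"

# A quick look at the columns used throughout
str(penguins[, c("species", "sex", "bill_length_mm", "bill_depth_mm")])
```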

{ggforce} and convex hulls

The ggforce package has the option to add a convex hull over your data (ggforce::geom_mark_hull()), kind of indicating where the data clusters are. Here’s my base plot.

plot1 <- 
  penguins |> 
  drop_na() |> 
  ggplot(aes(bill_length_mm, bill_depth_mm, color = species))+
    geom_point()+
    scale_color_brewer(palette = "Dark2")+
    scale_fill_brewer(palette = "Dark2")
plot1
A scatter plot of penguins' bill length and bill depth. There are three species plotted: Adelie, Chinstrap, Gentoo
Figure 1: The base penguins scatterplot

I’ll throw on the default convex hull.

plot1 +
  geom_mark_hull()
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
A scatter plot with convex hulls plotted around each of the three penguin species' data points. The convex hulls are somewhat jagged.
Figure 2: Basic convex hull

Default is ok, but for this data set, the hulls are a bit jagged. That can be adjusted with the concavity argument. I’ll also throw in a fill color.

plot1 +
  geom_mark_hull(
    concavity = 5,
    aes(
      fill = species
    )
  )
The same scatterplot, but now the convex hulls are smoother and are filled by species.
Figure 3: Smoothed out and filled convex hulls

That’s better. It also comes with mappable label and description aesthetics. These seem a bit more touchy.

plot1 +
  geom_mark_hull(
    concavity = 5,
    aes(fill = species,
        label = species,
    ),
    label.family = "Fira Sans"
  )
The scatter plot with the same convex hull, but no labels actually present.
Figure 4: Attempted labelling of convex hulls

The labels actually appear in the RStudio IDE for me, but not in the rendered page here because it wants more headroom around the plot. I’ll add that in by setting the expand arguments of ggplot2::scale_y_continuous() and ggplot2::scale_x_continuous(), and I’ll drop the legend while I’m at it.

plot1 +
  geom_mark_hull(
    concavity = 5,
    aes(fill = species,
        label = species,
    ),
    label.family = "Fira Sans"
  )+
  scale_y_continuous(
    expand = expansion(
      mult = c(0.25, 0.25)
    )
  )+
  scale_x_continuous(
    expand = expansion(
      mult = c(0.25, 0.25)
    )
  ) +
  guides(
    color = "none",
    fill = "none"
  )
The same scatter plot, now with more space around the x and y axes, with species labels drawn pointing to each convex hull
Figure 5: Labelled convex hulls
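To get a feel for how much headroom that is, here’s a small sketch of the arithmetic, assuming the y scale is trained on the same complete-case rows as plot1. A multiplicative expansion of 0.25 pads each end of the axis by a quarter of the data range. The variable names are just for illustration.

```r
library(palmerpenguins)

# Same rows as plot1: drop any penguin with a missing value
complete <- penguins[complete.cases(penguins), ]

# What expansion(mult = c(0.25, 0.25)) should do to the y axis:
# pad each end by 25% of the data range
depth_range <- range(complete$bill_depth_mm)
pad <- 0.25 * diff(depth_range)
padded_limits <- c(depth_range[1] - pad, depth_range[2] + pad)
padded_limits
```

So the padded axis ends up one and a half times as long as the data range, which is a lot of blank space to buy room for three labels.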

Thoughts

I like the convex hulls as a presentational aid. They probably shouldn’t be taken as a statistical statement about, for example, the degree of overlap between these three species, but they are useful for outlining data points of interest.

I kind of wish this was separated out into a few different, more conventional, ggplot2 layers. It’s called a geom_ but the convex hulls are definitely stat_s. The convex hull statistic layer isn’t exposed to users, so you can’t mix-and-match convex hull estimation and the geom used to draw it. On the other hand, I can see that it’s much more souped up than a typical geom. For example, you can filter the data within the aes() mapping.

plot1 +
  geom_mark_hull(
    concavity = 5,
    aes(
      filter = sex == "female"
    )
  )
The scatterplot, now with the convex hulls for each species only containing a portion of the data.
Figure 6: Filtered convex hulls
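On the mix-and-match point, here is a minimal sketch of what computing the hull up front could look like, using base R’s chull() and handing the result to an ordinary geom_polygon(). This is not how ggforce does it: chull() gives a strict, straight-edged convex hull with no concavity setting, and hull_rows() is a hypothetical helper written just for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Complete cases only, as a plain data frame
complete <- as.data.frame(penguins[complete.cases(penguins), ])

# Hull vertices for one group, via base R's chull().
# hull_rows() is a hypothetical helper, not part of ggforce or ggplot2.
hull_rows <- function(df) {
  idx <- chull(df$bill_length_mm, df$bill_depth_mm)
  df[idx, ]
}

# One set of hull vertices per species
hulls <- do.call(rbind, lapply(split(complete, complete$species), hull_rows))

# With the hull computed up front, any geom can draw it
hull_plot <-
  ggplot(complete, aes(bill_length_mm, bill_depth_mm, color = species)) +
    geom_point() +
    geom_polygon(data = hulls, aes(fill = species), alpha = 0.2) +
    scale_color_brewer(palette = "Dark2") +
    scale_fill_brewer(palette = "Dark2")
hull_plot
```

The trade-off is that you lose the labels, the rounded corners, and the filter aesthetic, which is presumably why geom_mark_hull() bundles everything together.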

{ggdensity}

As pointed out on the ggdensity readme, there’s already a stat+geom in ggplot2 to visualize 2d density plots.

plot2 <- 
  penguins |> 
  drop_na() |> 
  ggplot(aes(bill_length_mm, bill_depth_mm))

plot2 +
  stat_density_2d_filled()
A density contour plot showing three peaks in the density distribution
Figure 7: Density contour plot

Those levels are a little hard to follow, though, which is what ggdensity::stat_hdr() is for. It will plot polygons/contours for given probability levels of the data distribution.

plot2 +
  stat_hdr()
A density contour plot showing three peaks in the density distribution at 50%, 80%, 95% and 99%
Figure 8: Highest density region contour plot
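It helped me to think about what a highest density region is in one dimension: the smallest stretch (or stretches) of the axis that holds a given share of the probability mass. Here’s a base R sketch of that idea on simulated two-peaked data. It is only an illustration of the concept, not ggdensity’s implementation, and every name in it is made up for the example.

```r
# Simulated stand-in data with two peaks, not the penguins
set.seed(2023)
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 6))

# Kernel density estimate on an evenly spaced grid
dens <- density(x, n = 512)

# Sort grid cells from highest to lowest density and accumulate
# their share of the total mass
ord <- order(dens$y, decreasing = TRUE)
cum_mass <- cumsum(dens$y[ord]) / sum(dens$y)

# The density height at which the top cells first cover 80% of the mass
threshold <- dens$y[ord][which(cum_mass >= 0.8)[1]]

# Grid points inside the 80% highest density region
in_hdr <- dens$y >= threshold

# Runs of TRUE are the separate pieces of the region
rle(in_hdr)
```

Because the region is defined by a density cutoff rather than by quantiles, it can come out in more than one piece when the data has more than one peak, which is the same reason the 2d version draws separate blobs.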

The probabilities are mapped to transparency by default, so you can map the fill color to a different dimension.

plot2 +
  stat_hdr(aes(fill = species))+
  scale_fill_brewer(palette = "Dark2")
A density contour plot showing three peaks in the density distribution. There are three sets of contours for each penguin species
Figure 9: Highest density region contour plot, filled by species

The package also has a ggdensity::stat_hdr_rug() to add density distribution rugs to plots.

plot2 +
  geom_point()+
  stat_hdr_rug(fill = "grey90")
A scatter plot with a data distribution rug indicating the density of the data distribution.
Figure 10: HDR rug

{geomtextpath}

I’ve actually been messing around with this for a bit, but geomtextpath allows you to place text along lines. There are standalone geom_textpath() and geom_labelpath() functions, but just to stick with the penguins data, I’m going to match the textpath geom with a different stat.

plot3 <-
  penguins |> 
  drop_na() |> 
  ggplot(aes(bill_length_mm, bill_depth_mm, color = species))+
    scale_color_brewer(palette = "Dark2")

plot3 +
  stat_smooth(
    geom = "textpath", 
    # you have to map a label aesthetic
    aes(label = species)
  ) +
  guides(color = "none")
A plot with one trend line for each species with the species name written along the line.
Figure 11: Trendlines with text written along them

You can move the location of the text on the path back and forth by either setting or mapping hjust to a number between 0 and 1, and you can lift the text off the line with vjust.

plot3 +
  stat_smooth(
    geom = "textpath", 
    # you have to map a label aesthetic
    aes(label = species),
    hjust = 0.1,
    vjust = -1
  ) +
  guides(color = "none")
A plot with one trend line for each species with the species name written just above the line.
Figure 12: Trendlines with text written along them

Mixing and matching statistics and these direct labels could get pretty powerful. For example, here’s the name of each species written around data ellipses.

plot3 +
  stat_ellipse(
    geom = "textpath", 
    # you have to map a label aesthetic
    aes(label = species),
    hjust = 0.1  
  ) +
  guides(color = "none")
A plot with one data ellipse drawn for each penguin species, with the species name written across it.
Figure 13: Data ellipses with text written along them
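For comparison, here’s a minimal sketch of the standalone geom_textpath() mentioned at the top of this section, used without any stat. The curve is made up just to have a path to write along, and the label is a constant rather than a column in the data.

```r
library(ggplot2)
library(geomtextpath)

# Made-up curve just to have a path to write along
wave <- data.frame(x = seq(0, 2 * pi, length.out = 200))
wave$y <- sin(wave$x)

# geom_textpath() draws the line and writes the label along it
wave_plot <-
  ggplot(wave, aes(x, y)) +
    geom_textpath(aes(label = "text riding along a sine wave"))
wave_plot
```

Used this way it behaves a lot like geom_line() with a label aesthetic, which is what makes it easy to swap in as the geom for other stats.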

Combo {ggdensity} and {geomtextpath}

Since the ggdensity statistics are ordinary stat_s, we can also combine them with textpaths to label the probability levels directly.

plot2 +
  stat_hdr_lines(
    aes(label = after_stat(probs)),
    color = "grey90",
    geom = "textpath"
  ) +
  guides(alpha = "none")
A highest density region plot with the probability levels written along the contours
Figure 14: Highest density region plot with direct labels

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2023,
  author = {Fruehwald, Josef},
  title = {R {Package} {Exploration} {(Jan} 2023)},
  series = {Væl Space},
  date = {2023-01-27},
  url = {https://jofrhwld.github.io/blog/posts/2023/01/2023-01-27_jan-rpackages},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2023. “R Package Exploration (Jan 2023).” Væl Space. January 27, 2023. https://jofrhwld.github.io/blog/posts/2023/01/2023-01-27_jan-rpackages.