What is R?

R
Author

Josef Fruehwald

Published

December 17, 2022

In the Spring 2023 semester, I’m going to be teaching two R intensive courses: a statistics for linguists course, and an R for the Arts and Sciences course. For both, I’m going to have to do a “What is R” discussion during week 1, and given the breadth of tools I hope students come away with, I’ve been rethinking my usual answers.

A programming language?

The Wikipedia slug for R says

R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing

And yeah, it is definitely a programming language. Here it is doing some programming language things:

2+2
[1] 4
2+2 < 5
[1] TRUE

And it can do statistical computing, like a linear model

cars_model <- lm(dist ~ speed, data = cars)
summary(cars_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

And it can do graphics.

# `PLOT_FONT <- "Fira Sans"` in my .Rprofile
par(family = PLOT_FONT)
plot(cars)
A base graphics scatter plot showing the relationship between speed and distance from the cars default data frame.
Figure 1: A plot

Obviously, none of these things are unique to R. In the other programming language I know best, Python, you can fit a linear model and make a scatter plot. What differentiates programming languages, in my experience, is what kinds of operations, data structures, and workflows it prioritizes.

For R, I think it’s uncontroversial to say it prioritizes rectangular data with mixed data-type rows & single data-type columns and also provides a lot of options for indexing column-wise. And a lot of the extensions to R have leaned into this prioritization hard.

An ecosystem?

But “R” isn’t just a programming language, it’s also an ecosystem of community created packages. “Learning R” involves learning about these packages, and how they’re interrelated. I grabbed the list of all packages on CRAN and the packages they import with the crandep package.

cran_df <- crandep::get_dep_all_packages()
Import Summaries
cran_df |> 
  filter(type == "imports", !reverse) |> 
  count(to) |> 
  arrange(desc(n)) |> 
  mutate(rank = 1:n()) -> imported

imported_10 <- imported |> slice(1:10)

If you count up how often each package gets imported and rank them, you get the familiar power-law plot. I’ve plotted this one out a bit non standard-ly so that frequency is on the x axis for both the main plot and the inset, and so that I could include the package names in the inset with horizontal text.

Plotting code
imported |> 
  ggplot(aes(n, rank))+
    geom_point()+
    scale_x_log10(labels = label_comma())+
    scale_y_log10(labels = label_comma())+
    labs(title = "Frequency by rank of imported R packages") -> mainplot

imported_10 |> 
  mutate(to = as.factor(to),
          to = fct_reorder(to, rank)) |> 
  ggplot(aes(n, to))+
    geom_col(fill = "white")+
    geom_text(aes(label = to,
                  x = 0), 
              color = "grey10",
              hjust = 0,
              nudge_x = 100, 
              family = PLOT_FONT,
              size = 4.5)+
    scale_x_continuous(expand = expansion(mult = 0),
                       labels = label_comma())+
    theme(axis.text.y = element_blank(),
          text = element_text(size = 10),
          panel.grid.minor  = element_blank(),
          panel.grid.major.y = element_blank())+
    labs(y = NULL,
         title = "top10") -> inset
 
mainplot + inset_element(inset, 0.05, 0.05, 0.5, 0.6)
The main figure shows the linear decreasing relationship between log(frequency) and log(rank). The inset is a bar graph showing the import frequency of the top 10 packages, which are stats, utils, methods, dplyr, ggplot2, Rcpp, graphics, magrittr, rlang, stringr
Figure 2: A log(rank) by log(frequency) plot of R imports

Here’s a network visualization of these imports and dependencies. I color coded the nodes according to common R package naming trends

  • gg* - Packages extending ggplot2

  • tidy* - Packages declaring their adherence to tidy-data principles (and the tidyverse more generally)

  • *r - Packages declaring that they are… R packages

Network graph setup
imported |>
  filter(n >= 5) |> 
  pull(to) -> to_network

cran_df |>
  filter(!reverse, type == "imports",
         to %in% to_network) |>
  df_to_graph(nodelist = cran_df |> rename(name = from)) -> cran_network

set.seed(300)
cran_flat <- ggnetwork(cran_network, layout = with_drl())

xclip <- quantile(cran_flat$x, c(0.0025, 0.9975))
yclip <- quantile(cran_flat$y, c(0.0025, 0.9975))
Network graph
cran_flat |> 
  mutate(name_pattern = case_when(str_detect(name, "[rR]$") ~ "thingr",
                                  str_detect(name, "tidy") ~ "tidy",
                                  str_detect(name, "^[Gg]g") ~ "gg",
                                  T ~ "else"),
         name_pattern = factor(name_pattern, levels = c("tidy", "gg", "thingr", "else"))) |> 
  arrange(desc(name_pattern)) |> 
  filter(x >= xclip[1], x <= xclip[2],
         y >= yclip[1], y <= yclip[2]) |> 
ggplot(aes(x = x, y = y, xend = xend, yend = yend, color = name_pattern))+
  #geom_nodes()+  
  geom_nodes(aes(alpha = name_pattern))+
  scale_color_bright(limits = c("tidy", "gg", "thingr", "else"),
                     labels = c("tidy*", "gg*", "*r", "else"))+
  dark_theme_void()+
  scale_alpha_manual(values = c(0.5,0.3, 0.08, 0.02), 
                     limits = c("tidy", "gg", "thingr", "else"),
                     guide = "none")+
  labs(color = NULL,
       title = "CRAN imports network visualization")+
  theme(legend.position = c(0.2,0.8), 
        legend.text = element_text(family = PLOT_FONT),
        text = element_text(family = PLOT_FONT),
        plot.title = element_text(hjust = 0.5))
A network diagram displaying just the nodes, color coded according to their name pattern: beginning with tidy, beginning with gg, ending with r, and else
Figure 3: CRAN network graph

A communications platform?

But beyond just the R packages that implement specific analysis or process data in a specific way, there are also all of the tools built around R (and mostly around the RStudio IDE) that also make R what I might call a “communications platform.” From Sweave to knitr to rmarkdown and now Quarto, the kind of literate programming you can do in R has moved from ugly1 Beamer slides to, well, full on blogs.

But, it’s not just for the novelty or nerd appeal that I think it’s important to learn about R authoring tools available. They’ve also changed my own discovery and learning process about new packages. You can always find the documentation for a package on CRAN, but you should really try to find its pkgdown site.2

What does it mean to “know R”?

When I think about what it means to “know R”, and my goal for the kind of knowledge my students should start getting a handle on, it involves all of these components: the programming syntax, the social graph of the ecosystem, and the authoring tools to use and seek out.

A lot of other programming languages have similar kinds of features, especially Python with pypi or conda keeping track of the ecosystem and Sphinx providing the authoring tools. There too I’d say that getting to “know Python” involves a lot more than learning its syntax.

Footnotes

  1. sorry, but they are↩︎

  2. Most often, click on the URL listed on CRAN, which takes you to the package’s github, which then probably has a link to the pkgdown site in its “about” box.↩︎

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2022,
  author = {Fruehwald, Josef},
  title = {What *Is* {R?}},
  series = {Væl Space},
  date = {2022-12-17},
  url = {https://jofrhwld.github.io/blog/posts/2022/12/2022-12-17},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2022. “What *Is* R?” Væl Space. December 17, 2022. https://jofrhwld.github.io/blog/posts/2022/12/2022-12-17.