Intro

~2 Minute Setup

Create your R Notebook for today and double check that your workspace is clear from last time.

Load the important packages for today:

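# You only need to run these two install lines once
install.packages("scales")
install.packages("ggthemes")
# Load these at the start of every session
library("ggplot2")
library("lsa2017")
library("tidyverse")
library("broom")
# Read in the data frame used in most of today's examples
I_jean <- read.delim("http://bit.ly/avml_ggplot2_data")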

Why Plot

We are having a lesson on plotting in the middle of a course on R modelling because it is essential for you to plot your data before you try to model it. I would go so far as to say that if you haven’t made a lot of graphs of your data, and have only looked at averages, correlations, and linear model results, that you don’t really understand your data.

There’s a classic illustration of this called Anscombe’s quartet, a set of four small data sets which, when plotted, show four very distinct patterns.

But if you fit linear models to them, they have nearly identical statistical properties.

fit_lm <- function(df){
  lm(y ~ x, data = df)
}
anscomb_models <- tidy_anscombe %>%
                    group_by(series)%>%
                    nest()%>%
                    mutate(model = map(data, fit_lm),
                           model_param_df = map(model, tidy),
                           model_glance = map(model, glance))
anscomb_models %>%
  unnest(model_param_df)%>%
  arrange(term)
anscomb_models %>%
  unnest(model_glance)
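
If you want to check the "nearly identical" claim without the course packages, here is a minimal sketch using the `anscombe` data frame that ships with base R. That is an assumption on my part: it is a stand-in for `tidy_anscombe`, with the four series stored as columns `x1`..`x4` and `y1`..`y4`, and the names `fits` and `coefs` are just illustrative.

```r
# Sketch using base R's built-in `anscombe` data frame
# (columns x1..x4 and y1..y4), not the course's `tidy_anscombe`.
fits <- lapply(1:4, function(i) {
  # Build the formula y_i ~ x_i for the i-th series
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})

# One row per series: intercept and slope of each fitted line
coefs <- t(sapply(fits, function(m) unname(coef(m))))
colnames(coefs) <- c("intercept", "slope")
rownames(coefs) <- paste0("series_", 1:4)
round(coefs, 2)
```

All four rows come out at roughly an intercept of 3 and a slope of 0.5, even though the four scatterplots look nothing alike.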

It has been even more humorously illustrated recently that you can produce data sets of almost any arbitrary shape that have nearly identical statistical properties.

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

Thinking about Plotting

It’s important to think of your figures as a report of your data. Try to take as much care in producing your plots as you do your writing, or reporting of your statistics. They are as important as (or for some readers, more important than) anything else in your paper.

“Accuracy”

When making a plot, you should strive for accuracy in:

  • Representing the properties of numbers.
  • Representing the nature of your data.

Take this very simple data set:

group  value
A      2
B      5

For our purposes, these numbers have three properties.

  1. Order: 2 < 5, or A < B
  2. Magnitude: 5 = 2.5 \(\times\) 2, or B = 2.5 \(\times\) A
  3. Contextual Magnitude: If A and B are bars, and these are measures of the cost of a pint, then A must be a real dive (and a good deal), and B must be a little bit better, but still not too fancy. If A and B are people, and these are their number of legs, then A has an unsurprising number of legs and B has a surprising number of legs.

Here is an example of an inaccurate plot:

It successfully captures the order of A and B, but fails to capture the correct magnitude of the difference. The magnitude of the difference is thrown off because the y-axis doesn’t start at 0. In this plot, the B line is 7\(\times\) longer than the A line, but the actual magnitude of the difference is 2.5\(\times\). This produces a “lie factor” of \(\frac{7}{2.5} = 2.8\).
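
The lie factor arithmetic is simple enough to do by hand, but here is a small sketch of it in R. The `baseline` value at the end is my own back-calculation of one y-axis starting point that would produce a 7× drawn ratio; it is not a value stated for the figure.

```r
# Values taken from the example table: A = 2, B = 5
a <- 2
b <- 5

data_ratio  <- b / a   # true ratio between the two values
drawn_ratio <- 7       # ratio of the drawn lengths, as stated in the text
lie_factor  <- drawn_ratio / data_ratio
lie_factor

# Assumption: one y-axis baseline that would produce a 7x drawn ratio
baseline <- 1.5
(b - baseline) / (a - baseline)
```

The general point is that any non-zero baseline changes the ratio of the drawn lengths while leaving the order intact.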

This isn’t just a hypothetical problem either. For example, British electoral mailers are notorious for inaccurately portraying the magnitude of differences.

Both academic researchers and the producers of these political mailers may counter by saying

But the axes were labelled accurately!

If readers would understand your data better if they ignored the graphical elements of your plot, then your plot is net-negative to accurate communication, and I can only assume that accuracy was never your primary goal anyway.

Think about degree of abstraction.

Another fraught issue in academic papers is how to accurately convey the volume of their data. For example, in the iy_ah data set, you might be tempted to plot all of the data.

iy_ah %>% 
  ggplot(aes(F2, F1, color = plt_vclass))+
    geom_point(alpha = 0.5)+
    scale_y_reverse()+
    scale_x_reverse()+
    scale_color_brewer(palette = "Dark2")+
    theme_minimal()

Or, you might be tempted to plot a summary of some sort.

iy_ah_summ <- iy_ah %>%
  group_by(plt_vclass) %>%
  summarise(F1_mean = mean(F1),
            F2_mean = mean(F2),
            F1_hi = F1_mean + 2*(sd(F1)),
            F1_lo = F1_mean - 2*(sd(F1)),
            F2_hi = F2_mean + 2*(sd(F2)),
            F2_lo = F2_mean - 2*(sd(F2)))
ggplot(iy_ah_summ, aes(F2_mean, F1_mean, color = plt_vclass))+
    geom_point()+
    geom_segment(aes(y = F1_mean, 
                     yend = F1_mean, 
                     x = F2_lo, 
                     xend = F2_hi))+
    geom_segment(aes(y = F1_lo, 
                     yend = F1_hi, 
                     x = F2_mean, 
                     xend = F2_mean))+
    stat_ellipse(data = iy_ah, aes(x = F2, y = F1))+
    scale_color_brewer(palette = "Dark2")+
    scale_y_reverse()+
    scale_x_reverse()+  
    theme_minimal()

Neither of these is really optimal. The first one looks like you’ve got a billion data points, but really, most of these are repetitions of the same vowel from the same speakers. The second one is an ok-ish summary, but hides more than it shows.

What I tend to do is calculate means per-speaker, and then plot those means:

iy_ah %>% 
  group_by(idstring, plt_vclass)%>%
  summarise(F1 = mean(F1),
            F2 = mean(F2),
            n = n())%>%
  ggplot(aes(F2, F1, color = plt_vclass))+
    geom_point(alpha = 0.5, aes(size = n))+
    scale_y_reverse()+
    scale_x_reverse()+
    scale_size_area()+
    scale_color_brewer(palette = "Dark2")+
    theme_minimal()

This is often a more accurate portrayal of how much data you have, since there is a lot of data from a smaller number of speakers.

ggplot2 basic concepts

Layers, Aesthetics, Geometries and Statistics

The first thing we’re going to do is build up to creating this plot, which is a visualization of the I_jean data frame (which you should have loaded in the setup block).

Layers

You should hopefully start looking at figures like this one like many of us look at the image below.

Those of us familiar with this kind of media know that the picture of the library is not what was originally captured by my phone. Rather, there are multiple layers of effects, filters and text on top of the base image, which produce the final image. And in fact, some of these layers are crucially ordered. For example, the text would look different if it was added to the image first, and then the filters, instead of vice versa.

So too with the ggplot2 plot above. These plots are constructed out of layers. Every component of the graph, from the underlying data it’s plotting, to the coordinate system it’s plotted on, to the statistical summaries overlaid on top, to the axis labels, are layers in the plot. The consequence of this is that your use of ggplot2 will probably involve iterative addition of layer upon layer until you’re pleased with the results.

Aesthetics

The graphical properties which encode the data you’re presenting are the aesthetics of the plot. These include things like

  • x position
  • y position
  • size of elements
  • shape of elements
  • color of elements

Geometries

The primary visual items on the plots are called geometries and include things like

  • points
  • lines
  • line segments
  • bars
  • text

Some of these geometries have their own specific aesthetic settings. For example,

  • points
    • point shape
  • text
    • text labels
  • lines
    • line weight
    • line type

Statistics

You’ll also frequently want to plot statistics overlaid on top of, or instead of the raw data. Some of these include

  • Smoothing and regression lines
  • One and two dimensional binning
  • Mean and medians with confidence intervals.

The aesthetics, geometries and statistics constitute the most important layers of a plot, but for fine tuning a plot for publication, there are a number of other things you’ll want to adjust. The most common of these are the scales, which encompass things like

  • A logarithmic x or y axis
  • Customized color scales
  • Customized point shapes, or linetypes

We’ll review many of these components as we build up the plot, and will circle back to more of them for greater detail.

Building the Plot

First, let’s refresh our memories of the graph we want to build.

This plot is composed of nine layers, which can be subdivided into five layer types. It’s not important for you to memorize these layer types, but it helps to structure the discussion.

Layers

The data layer

Every ggplot2 plot has a data layer, which defines the data set to plot, and the basic mappings of data to aesthetic elements. The data layer is created with the functions ggplot() and aes(), and looks like this

ggplot(data, aes(...))

The first argument to ggplot() is a data frame (it must be a data frame), and its second argument is aes(). You’re never going to use aes() in any other context except for inside of other ggplot2 functions, so it might be best not to think of aes() as its own function, but rather as a special way of defining data-to-aesthetic mappings.

For the plot from above, we’ll be using data from the I_jean data frame, which looks like this:

head(I_jean)

I’ve decided that an interesting relationship in this data is between the vowel duration (Dur_msec) and the normalized F1 of the vowel (F1.n). Specifically, I’d like to map Dur_msec to the x-axis, and F1.n to the y-axis. Here’s the ggplot2 code.

  p <- ggplot(I_jean, aes(x=Dur_msec, y=F1.n))
  p

You can think of this plot as the base image, before we’ve added any extra layers, text or Instagram filters to it. An important conceptual issue is that you are able to assign plots to variables (in this case, p). When you do this assignment, nothing special happens. But if you print out p, R will generate the plot.

The geometries layer

The next step, after defining the basic data-to-aesthetic mappings, is to add geometries to the data. We’ll discuss geometries in more detail below, but for now, we’ll add one of the simplest: points.

  p <- p + geom_point()
  p

There are a few things to take away from this step. First and foremost, the way you add new layers, of any kind, to a plot is with the + operator. And, as we’ll see in a moment, there’s no need to only add them one at a time. You can string together any number of layers to add to a plot, separated by +.

The next thing to notice is that all layers you add to a plot are, technically, functions. We didn’t pass any arguments to geom_point(), so the resulting plot represents the default behavior: solid black circular points.

If for no good reason at all we wanted to use a different point shape in the plot, we could specify it inside of geom_point().

ggplot(I_jean, aes(x=Dur_msec, y=F1.n)) +
  geom_point(shape = 3)

Or, if we wanted to use larger, red points, we could specify that in geom_point() as well.

ggplot(I_jean, aes(x=Dur_msec, y=F1.n)) +
  geom_point(color = "red", size = 3)
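
One way to convince yourself that + really does accumulate layers is to look inside the plot object. This is only a sketch: it uses the built-in mtcars data as a stand-in (an assumption, so that it runs without downloading I_jean), and it relies on the fact that ggplot2 keeps geoms and stats in the `layers` element of the plot, while scales and themes are stored elsewhere in the object.

```r
library(ggplot2)

# mtcars is a stand-in data set, used only to keep this self-contained
base <- ggplot(mtcars, aes(x = wt, y = mpg))
n_base <- length(base$layers)            # data layer only: no geoms or stats

with_points <- base + geom_point()
n_points <- length(with_points$layers)   # one geom added

with_smooth <- with_points + stat_smooth(method = "lm", formula = y ~ x)
n_smooth <- length(with_smooth$layers)   # a geom and a stat

c(n_base, n_points, n_smooth)
```

Note that `base` is unchanged by the later additions, so you can keep a base plot around and try out different layers on top of it.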

Speaking of defaults, we can see a few of the default settings of ggplot2 on display here. Most striking is the light grey background, with white grid lines. Opinion varies on whether or not this is aesthetically or technically pleasing, but don’t worry, it’s adjustable.

Another default is to label the x and y axes with the column names from the data frame. I’ll inject a bit of best practice advice here, and tell you to always change the axis names. It’s nearly guaranteed that your data frame column names will make for very poor axis labels. We’ll cover how to do that shortly.

Finally, note that we didn’t need to tell geom_point() about the x and y axes. This may seem trivial, but it’s a really important and powerful aspect of ggplot2. When you add any layer at all to a plot, it will inherit the data-to-aesthetic mappings which were defined in the data layer. We’ll discuss inheritance, and how to override or define new data-to-aesthetic mappings within any geom, below.

The statistics layer

The final figure also includes a smoothing line, which is one of many possible statistical layers we can add to a plot.

  p <- p + stat_smooth()
  p

We’ll go over the default behavior of stat_smooth() below, but in this plot, the smoothing line represents a loess smooth, and the semi-transparent ribbon surrounding the solid line is the 95% confidence interval.

One important thing to realize is that it’s not necessary to include the points in order to add a smoothing line. Here’s what the plot would look like with the points omitted.

ggplot(I_jean, aes(x = Dur_msec, y = F1.n))+
  stat_smooth()

Notice how the y-axis has zoomed in to just include the range of the smoothing line and standard error.

Scale transformations

I also wanted to make some alterations to the default x and y axis scales. For example, the y-axis is currently running in reverse to the intuitive direction of F1. Higher vowels have lower F1 values, so we want to flip the y-axis. Additionally, durations are typically best displayed along a logarithmic scale, so we should convert the x-axis as well.

p <- p + scale_x_log10(breaks = c(50, 100, 200, 300, 400))+
         scale_y_reverse()
p

It’s worth noting that the smoothing line here is calculated over the transformed data.
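
Here is a minimal sketch of what "calculated over the transformed data" means in practice. It uses mtcars as a stand-in data set (an assumption), and layer_data(), which returns the data frame a layer actually works with after the scales have been applied.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  scale_x_log10()

# x values as the point layer sees them after the scale is applied
built_x <- layer_data(p, 1)$x

# They match log10 of the original column, not the original values
on_log_scale <- isTRUE(all.equal(sort(built_x), sort(log10(mtcars$wt))))
on_log_scale
```

Any statistic added to this plot is therefore fit to the log-transformed x values. If you want the statistic computed on the original values and only the drawing transformed, the ggplot2 documentation describes coord_trans() for that purpose.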

Cosmetic alterations

Finally, I wanted to make some cosmetic adjustments to the plot. For example, the x-axis label “Dur_msec” is not quite as useful as “Vowel duration (msec)” would be. I also added a title to the plot, and changed the color theme to black and white.

p <- p + ylab("Normalized F1")+
         xlab("Vowel duration (msec)")+
         theme_bw()+
         ggtitle("394 tokens of 'I' from one speaker")
p


Here’s all the layers, put together all at the same time.

ggplot(I_jean, aes(x=Dur_msec, y=F1.n))+
  geom_point()+
  stat_smooth()+
  scale_x_log10(breaks = c(50, 100, 200, 300, 400))+
  scale_y_reverse()+
  ylab("Normalized F1")+
  xlab("Vowel duration (msec)")+
  theme_bw()+
  ggtitle("394 tokens of 'I' from one speaker")

Idiom

As with pipes %>%, I recommend putting a new line after every + in a chain of ggplot2 layers, and indenting the following line two spaces.

Elaborating & More Options

For better or worse, ggplot2 plots are easily identifiable, and you’ll inevitably start seeing them everywhere now. Sometimes you’ll see a plot that does something cool that you didn’t realize you could do in ggplot2. When that happens to me, I google “how do you X in ggplot2”, and it usually works.

Aesthetics

In ggplot2, aesthetics are the graphical elements which are mapped to data, and they are defined with aes(). To some extent, the aesthetics you need to define are dependent on the geometries you want to use, because line segments have different geometric properties than points, for example. However, there is also a great deal of uniformity in the aesthetics used across geometries. Here is a list of the most common aesthetics you’ll want to define.

  • x
    • x-axis location
  • y
    • y-axis location
  • color
    • The color of lines, points, and the outside borders of two dimensional geometries (polygons, bars, etc.). Hadley Wickham, the primary ggplot2 developer, is from New Zealand, so colour is also supported!
  • fill
    • The fill color of two dimensional geometries.
  • size
    • The size of points, or the weight of lines and borders of two dimensional geometries.
  • shape
    • This is specific to points, and defines the point shape. This is one of the few aesthetics to which you can’t map a continuous variable.
  • linetype
    • This defines the line type of any kind of line, path, or border of a two dimensional geometry. This is another aesthetic which cannot be mapped to a continuous variable.
  • alpha
    • This defines the opacity of any geometric property. It’s less commonly mapped to data, and more often hard coded to a single value as a solution for overplotting.
  • xend, yend
    • You’ll use these more rarely, usually when plotting a line segment, or arrow. The beginning of the line segment will be located at x, y, and the end of the line segment will be located at xend,yend.
  • ymin, ymax, (xmin, xmax)
    • ymin and ymax are reserved for geometries which are devoted to representing ranges of data, like error bars, and ribbons. For the most part, these will be expressed along the y-axis, but xmin and xmax are utilized for some geometries as well.

The most important thing to keep in mind about aesthetics is not what they’re called, though, but how they are inherited by the layers. Let’s start by mapping the Word to color. A large share of the tokens are just “I”, so let’s create a subset of the data that excludes “I” so it doesn’t visually swamp the plot.

Mapping Data to Aesthetics

I_subset <- subset(I_jean, Word != "I")
ggplot(I_subset, aes(Dur_msec, F1.n, color = Word))+
  geom_point()

Each point is now colored according to the word it corresponds to. ggplot2 has automatically generated a color palette of the right type and size, based on the data mapped to color, and created a legend to the side. As with everything, the specific color palette we use is adjustable, which will be discussed in more detail below under Scales. There are positive features of the default color palette, but some people don’t like it aesthetically, and its colors can be hard for some colorblind readers to tell apart. They’ll also all print to the same shade of grey!

Inheritance

If we add one more geometry (a line), we see that it also inherits the mapping of Word to color.

ggplot(I_subset, aes(Dur_msec, F1.n, color = Word))+
  geom_point()+
  geom_line()

There are a few important things to take note of in this plot. First, you can see that we have actually added four lines to the plot, one for each color. In most cases, when you map categorical data to an aesthetic like color, you are also defining sub-groupings of the data, and ggplot2 will draw lines, calculate statistics, etc. separately for every sub-grouping of the data.

The second important thing to notice is that geom_line() joins up points as they are ordered along the x-axis, not according to their order in the original data frame. There is a geom which will join up points that way called geom_path().
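
The difference between the two geoms is easiest to see on a tiny, deliberately unsorted data frame. The data frame `d` below is made up for illustration, and layer_data() is used to pull out the order in which each geom will connect the points.

```r
library(ggplot2)

# Made-up data whose rows are deliberately not sorted by x
d <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

# geom_line() reorders the rows by x before drawing
line_x <- layer_data(ggplot(d, aes(x, y)) + geom_line())$x

# geom_path() keeps the row order of the data frame
path_x <- layer_data(ggplot(d, aes(x, y)) + geom_path())$x

as.numeric(line_x)
as.numeric(path_x)
```

So geom_path() is the one to reach for when the row order itself is meaningful, for example a trajectory through time.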

The point here, though, is that it is possible to define data-to-aesthetic mappings inside of geom functions, also by using aes(). Here, instead of mapping Word to color inside of ggplot(), we’ll do it inside of geom_point().

ggplot(I_subset, aes(Dur_msec, F1.n))+
  geom_point(aes(color = Word))+
  geom_line()

The points are still colored according to the word, but there is only one, black line. We can also try passing aes(color = Word) to geom_line().

ggplot(I_subset, aes(Dur_msec, F1.n))+
  geom_point()+
  geom_line(aes(color = Word))

Now, the lines are colored according to the word, but the points are all black. This brings up the all important point about aesthetics:

Geoms inherit aesthetic mappings from the ggplot() data layer, and not from any other layer.
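
You can check this rule directly by looking at what each layer ends up with. The sketch below again uses mtcars as a stand-in (an assumption), with factor(cyl) playing the role of Word.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = factor(cyl))) +  # mapping is local to the points
  geom_line()                             # inherits only x and y from ggplot()

# Number of distinct colours each layer will actually draw with
n_point_colors <- length(unique(layer_data(p, 1)$colour))
n_line_colors  <- length(unique(layer_data(p, 2)$colour))
c(points = n_point_colors, lines = n_line_colors)
```

The points get one colour per cylinder count, while the line layer has a single default colour, because the colour mapping was never part of the ggplot() call it inherits from.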

Grouping

Let’s look at the effect of mapping Word to color on the calculation of statistics, like smoothing lines. Note, inside of stat_smooth() I’ve said se = F to turn off the display of standard errors.

ggplot(I_subset, aes(Dur_msec, F1.n, color=Word))+
  geom_point()+
  stat_smooth(se = F)

Just like separate lines were drawn for each group as defined by color=Word, ggplot2 has calculated separate smoothers for each subset. If we had only passed color=Word to geom_point(), though, stat_smooth() would not have inherited that mapping, resulting in a single smoother being calculated.

  ggplot(I_subset, aes(Dur_msec, F1.n))+
    geom_point(aes(color=Word))+
    stat_smooth(se = F)

It’s important to understand that when you map categorical variables to an aesthetic that you’re also defining sub-groupings. For example, if we map Word to shape, instead of color, the point shapes will now represent the word.

ggplot(I_subset, aes(Dur_msec, F1.n, shape=Word))+
  geom_point()

Now if we add a smoother to this plot, even though shape isn’t defined for lines, the smoother will still plot a different smoothing curve for each sub-grouping.

ggplot(I_subset, aes(Dur_msec, F1.n, shape=Word))+
  geom_point()+
  stat_smooth(se = F)

If you really only wanted a single smoother line for all of the data in this case, one solution would be to move the shape=Word mapping from the data layer to the geom_point() layer. But in most cases, it’s actually more desirable to override the aesthetic mapping. We can do this with the special aesthetic group.

group does exactly what it sounds like it ought to: it defines groups of data. When you want to override groups defined in the data layer, you can do so by saying group=1.

ggplot(I_subset, aes(Dur_msec, F1.n, shape=Word))+
  geom_point()+
  stat_smooth(se = F, aes(group = 1))

The effect it has on stat_smooth() is that just a single smoother is calculated. If we come back to color = Word, and then draw a line with group = 1, the effect is that we draw one line that varies in color.

ggplot(I_subset, aes(Dur_msec, F1.n, color=Word))+
  geom_line(aes(group = 1))
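
To see the grouping override at work without drawing anything, you can count the groups that the smoother layer is computed over. This is a sketch under a couple of assumptions: mtcars stands in for I_subset, and I use method = "lm" with an explicit formula only to keep the small example quiet, not because it is what the plots above use.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(wt, mpg, shape = factor(cyl))) +
  geom_point()

# Smoother inherits shape = factor(cyl), and with it the sub-groupings
by_shape <- base +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

# group = 1 overrides the inherited sub-groupings for this layer only
one_fit <- base +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = 1))

n_by_shape <- length(unique(layer_data(by_shape, 2)$group))
n_one_fit  <- length(unique(layer_data(one_fit, 2)$group))
c(n_by_shape, n_one_fit)
```

The first plot fits one line per cylinder count, the second fits a single line through all of the data, and the point shapes are untouched in both.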

More aesthetics and their use.

So far, we’ve only mapped categorical variables to color, but it’s also possible to map continuous variables to color. Here we’ll redundantly map F1.n to both y and color.

ggplot(I_jean, aes(Dur_msec, F1.n, color = F1.n))+
  geom_point()

Another important aesthetics distinction is between color and fill. If we wanted to create a bar chart of word frequencies, we could do so by mapping Word to the x-axis, and adding geom_bar() without any y-axis variable defined.

ggplot(I_jean, aes(Word))+
  geom_bar()

If you also wanted to color the bars according to the word, your first instinct would probably be to map color = Word. But the result is that only the colors of the bars’ outlines are mapped to Word.

ggplot(I_jean, aes(Word, color = Word))+
  geom_bar()

What is probably more advisable is to map Word to fill, which controls the filling color of two dimensional geoms.

ggplot(I_jean, aes(Word, fill = Word))+
  geom_bar()

As you might have figured out by now, it’s technically possible to map the fill color of bars to one variable, and the outline color to a different variable. My advice is to never do such a thing, because the results almost always come out a jumbled mess. Instead, I would suggest setting the color of the bars to black. I find it more pleasing to the eye, and it helps to emphasize the divisions between bars when they’re stacked. Compare this plot:

ggplot(I_subset, aes(Name, fill = Word))+
  geom_bar()

to this one.

ggplot(I_subset, aes(Name, fill = Word))+
  geom_bar(color = "black")

Geometries

So far, we’ve used the following geometries:

  • geom_point()
  • geom_line()
  • geom_bar()

All geometries begin with geom_, meaning you can get a full list using apropos().

apropos("^geom_")
 [1] "geom_abline"     "geom_area"       "geom_bar"       
 [4] "geom_bin2d"      "geom_blank"      "geom_boxplot"   
 [7] "geom_col"        "geom_contour"    "geom_count"     
[10] "geom_crossbar"   "geom_curve"      "geom_density"   
[13] "geom_density_2d" "geom_density2d"  "geom_dotplot"   
[16] "geom_errorbar"   "geom_errorbarh"  "geom_freqpoly"  
[19] "geom_hex"        "geom_histogram"  "geom_hline"     
[22] "geom_jitter"     "geom_label"      "geom_line"      
[25] "geom_linerange"  "geom_map"        "geom_path"      
[28] "geom_point"      "geom_pointrange" "geom_polygon"   
[31] "geom_qq"         "geom_quantile"   "geom_raster"    
[34] "geom_rect"       "geom_ribbon"     "geom_rug"       
[37] "geom_segment"    "geom_smooth"     "geom_spoke"     
[40] "geom_step"       "geom_text"       "geom_tile"      
[43] "geom_violin"     "geom_vline"     

This is quite an extensive list, and we won’t be able to cover them all today. Many of them are actually convenience functions for special settings of other geoms. For example, geom_histogram() is really just geom_bar() with special settings.

ggplot(I_jean, aes(F1.n))+
  geom_histogram()

Other geoms are just convenience functions for statistical layers. For example, you’ll notice geom_smooth(), which if you add it to a plot will have the same behavior as stat_smooth(), which we’ve already been using extensively.

ggplot(I_jean, aes(Dur_msec, F1.n))+
  geom_smooth()

stat_smooth() can actually plot many different kinds of smoothers. For a linear model, for example:

ggplot(I_jean, aes(Dur_msec, F1.n))+
  geom_smooth(method = 'lm')
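
If you’d like to confirm that geom_smooth() and stat_smooth() are two routes to the same layer, you can compare the data each one computes. This is a sketch with mtcars as a stand-in data set (an assumption); the explicit formula argument is there only to suppress the message about the default formula.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(wt, mpg))

# Fitted line and confidence band computed by each function
from_geom <- layer_data(base + geom_smooth(method = "lm", formula = y ~ x))
from_stat <- layer_data(base + stat_smooth(method = "lm", formula = y ~ x))

cols <- c("x", "y", "ymin", "ymax")
same_fit <- isTRUE(all.equal(from_geom[cols], from_stat[cols]))
same_fit
```

Which one you type is a matter of taste; I tend to think of it as a statistic, so I write stat_smooth().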

Some special geoms

Some geoms are both unique and common enough in their usage to warrant special mention.

geom_text() and geom_label()

Adding text and text labels to a plot is a very common task, and is done with geom_text() or geom_label(). There is a special aesthetic for these two geoms called label, which defines the column that should be used as the text label.

ggplot(I_subset, aes(Dur_msec, F1.n))+
  geom_text(aes(label = Word))

ggplot(I_subset, aes(Dur_msec, F1.n))+
  geom_label(aes(label = Word))

Positioning

Just like inheritance was the big idea for aesthetics, positioning is the big idea for geoms. For various reasons, you may want to adjust where geometries are plotted. As a solution to overplotting, for example, you may want to add some jitter to points. When dealing with bars, you need to decide whether they should be stacked, or arranged next to each other. These small adjustments are set with the position argument of a geom, and the most common options are:

  • identity
    • This is the default in most cases, simply plotting geometries where they’re defined by x and y.
  • jitter
    • This adds some random noise to the x position, the y position, or both, and is typically used just for points.
  • stack
    • This stacks geometries on top of each other. This is the default for bars
  • dodge
    • This pushes geometries out of each other’s way, to the left and right.
  • fill
    • This stacks geometries on top of each other, and expands or contracts them to fill the space between 0 and 1. Good for plotting proportions.
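
Here is a small sketch of how three of these position settings change where bars end up. It uses mtcars as a stand-in data set (an assumption), with cyl on the x-axis and gear as the fill variable, and inspects the computed bar coordinates instead of drawing the plots.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(factor(cyl), fill = factor(gear)))

stacked <- layer_data(base + geom_bar())                    # default: stack
filled  <- layer_data(base + geom_bar(position = "fill"))   # rescaled to 0-1
dodged  <- layer_data(base + geom_bar(position = "dodge"))  # side by side

max(stacked$ymax)                 # height of the tallest stack, in counts
max(filled$ymax)                  # every stack now tops out at 1
max(dodged$xmax - dodged$xmin)    # dodged bars are narrower than stacked ones
```

To look at the plots themselves, print `base + geom_bar(position = "dodge")` and so on.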

Jitter

Some people, like Andrew Gelman, hate boxplots.

ggplot(I_jean, aes(Word, F1.n))+
  geom_boxplot()

A frequent suggestion for a replacement to boxplots is just to plot the raw data points with some jitter. To get started, we’ll replace geom_boxplot() with geom_point().

ggplot(I_jean, aes(Word, F1.n))+
  geom_point()

And then add some jitter, by defining position = "jitter" inside of geom_point().

ggplot(I_jean, aes(Word, F1.n))+
  geom_point(position = "jitter")

In this example, you can see the benefit of jittered points over boxplots. With boxplots, there’s no hint that one category, “I”, has enormously more data than the others.

As a convenience, there’s a geom called geom_jitter(), which is just shorthand for geom_point(position = "jitter").

ggplot(I_jean, aes(Word, F1.n))+
  geom_jitter()
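
Since jitter deliberately moves points away from their true values, it is worth controlling how much noise is added. The sketch below uses the width and height arguments of geom_jitter() to spread points horizontally only, so the y values stay exactly where the data says they are. mtcars is a stand-in data set (an assumption), and the width of 0.2 is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_jitter(width = 0.2, height = 0)   # horizontal noise only

jittered <- layer_data(p)

# y values are untouched; x values stay within 0.2 of their category
y_unchanged <- isTRUE(all.equal(sort(as.numeric(jittered$y)), sort(mtcars$mpg)))
max_x_shift <- max(abs(as.numeric(jittered$x) - round(as.numeric(jittered$x))))
c(y_unchanged = y_unchanged, x_within_width = max_x_shift <= 0.2)
```

This keeps the plot honest on the axis that carries the measurement, which fits with the accuracy goals from earlier.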

Scales

Scales provide you more fine grained control over the presentation of aesthetics. For every aesthetic, there is a corresponding scale you can use. For more flexibility, you should also install and load the scales package.

  library(scales)

x and y scales

All scales begin with scale_ followed by the name of the aesthetic it controls, then followed by its sub-type. Here are all the x-axis scales, for which there are identical y-axis scales.

apropos("^scale_x_")
[1] "scale_x_continuous" "scale_x_date"       "scale_x_datetime"  
[4] "scale_x_discrete"   "scale_x_log10"      "scale_x_reverse"   
[7] "scale_x_sqrt"       "scale_x_time"      

scale_x_continuous, scale_x_discrete, scale_x_datetime and scale_x_date are the basic kinds of x and y axes you can construct in ggplot2. For the most part, ggplot2 will figure out which kind of scale to use, and you’ll only need to add one of these if you want to modify the default appearance.

scale_x_log10, scale_x_sqrt and scale_x_reverse are basic transformations to a continuous scale. Many more kinds of transformations are possible, and it’s even possible to define your own custom transformations, but these are the only ones for which there are special convenience scale_x_* functions.

scale_ arguments

There are a few basic arguments that you can pass to a scale.

  • name
    • This is the title that will be displayed on the scale, either on the axes for x and y scales, or in the legend for other scales.
  • limits
    • This defines the range of data to be presented in the plot. For discrete scales, it’ll define the order with which to display categorical factors.
  • breaks
    • These are the labeled points along the scale, either the major axis labels, or the labels on the legend.
  • labels
    • This defines the labels to use at each break point, if you want to override them
  • trans
    • This defines a numerical transformation you’d like to apply to a scale, and usually only applicable to continuous scales. We’ll talk more about transformations below.
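
As a quick illustration of the first four arguments together, here is a sketch that sets them all on one x-axis scale. The data set (mtcars), the axis title, and the specific limits, breaks and labels are all made up for the example.

```r
library(ggplot2)

# A scale is an ordinary object, so it can be built first and added later
x_scale <- scale_x_continuous(name   = "Weight (1000 lbs)",
                              limits = c(1, 6),
                              breaks = c(2, 4, 6),
                              labels = c("two", "four", "six"))

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  x_scale

# The settings are stored on the scale object itself
x_scale$name
x_scale$breaks
```

One design note: labels must line up one-to-one with breaks, so if you change one, change the other.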

scale_[xy]_continuous

Here’s our basic duration by F1 plot again, which has two continuous scales for the x and y axes.

ggplot(I_jean, aes(Dur_msec, F1.n))+
  geom_point()

If we want to change the x axis label, we can do so by adding scale_x_continuous() and passing our desired title to name.

ggplot(I_jean, aes(Dur_msec, F1.n))+
  geom_point()+
  scale_x_continuous(name = "Vowel Duration (msec)")

Since changing the axis titles is something you’re likely to do very frequently, there are two convenience functions for this purpose with shorter names that save you typing: xlab() and ylab().

ggplot(I_jean, aes(Dur_msec, F1.n))+
  geom_point()+
  xlab("Vowel Duration (msec)")

The next most important thing you’ll want to do to x and y axis scales is transform them. For example, duration measurements tend to be right-skewed, so a log transformation is advisable. The benefit of transforming the scale over simply plotting log2(Dur_msec) is that ggplot2 will very nicely label the axis according to the original values. Compare the labels of the x-axes of these two plots.

ggplot(I_jean, aes(log2(Dur_msec), F1.n))+
  geom_point()+
  scale_x_continuous("log2 Vowel Duration (msec)")

ggplot(I_jean, aes(Dur_msec, F1.n))+
  geom_point()+
  scale_x_continuous("Vowel Duration (msec)",
                     trans = "log2")

These are all of the pre-defined transformations.

apropos("_trans$")
 [1] "asn_trans"         "atanh_trans"      
 [3] "boxcox_trans"      "coord_trans"      
 [5] "date_trans"        "exp_trans"        
 [7] "hms_trans"         "identity_trans"   
 [9] "log_trans"         "log10_trans"      
[11] "log1p_trans"       "log2_trans"       
[13] "logit_trans"       "probability_trans"
[15] "probit_trans"      "reciprocal_trans" 
[17] "reverse_trans"     "sqrt_trans"       
[19] "time_trans"       

color and fill scales

Here are all of the color scales, for all of which there is an accompanying fill scale.

apropos("^scale_color_")
 [1] "scale_color_brewer"     "scale_color_continuous"
 [3] "scale_color_discrete"   "scale_color_distiller" 
 [5] "scale_color_gradient"   "scale_color_gradient2" 
 [7] "scale_color_gradientn"  "scale_color_grey"      
 [9] "scale_color_hue"        "scale_color_identity"  
[11] "scale_color_manual"    

There are two very distinct kinds of color scales here: categorical and gradient. The use of one over the other has everything to do with the kind of data which is passed to color and fill, and they have their own specific customizations.

Categorical color scales

Here is the default categorical color scale, which is called scale_[fill/color]_hue()

# Note: ggplot2 3.3.0 and later deprecate fun.y in favor of fun
ggplot(I_jean, aes(Word, Dur_msec, fill = Word))+
  stat_summary(fun.y = mean, geom = "bar")+
  scale_fill_hue()

Just like the x and y scales, you can change the title, limits, breaks and labels of this scale, which will control how it appears in the legend. Here’s an illustrative example, which actually detracts from the default.

ggplot(I_jean, aes(Word, Dur_msec, fill = Word))+
  stat_summary(fun.y = mean, geom = "bar")+
  scale_fill_hue(name = "Lexical Item",
                 limits = c("I'D","I'VE","I'LL","I'M","I"),
                 labels = c("'D","'VE","'LL","'M",""))

Many people don’t like the default color scheme, but there are other color palettes available, and you can define your own. One really nice set of color palettes comes from the package RColorBrewer. You can browse all of its palettes by running RColorBrewer::display.brewer.all(). A personal favorite of mine is called Set1, which you can apply to the plot with scale_fill_brewer():

ggplot(I_jean, aes(Word, Dur_msec, fill = Word))+
  stat_summary(fun = mean, geom = "bar")+
  scale_fill_brewer(palette = "Set1")
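To inspect the colors in a Brewer palette without drawing a plot, you can call RColorBrewer directly. A small sketch, again assuming five categories:

```r
library(RColorBrewer)

# Ask for the first five colors of the Set1 palette as hex codes
set1_fills <- brewer.pal(5, "Set1")
set1_fills
```

These hex codes can also be passed on to scale_fill_manual() if you want to reorder them or drop one.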

If you are particularly picky, or want to express your questionable aesthetic sense to the world, there is also scale_fill_manual(), where you define an arbitrary list of colors to use for the scale.

ggplot(I_jean, aes(Word, Dur_msec, fill = Word))+
  stat_summary(fun = mean, geom = "bar")+
  scale_fill_manual(values=c("bisque", "chartreuse4",
                             "hotpink","yellow", "red"))
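An unnamed vector of colors is matched to the categories in order. If you would rather pin a color to a particular word, you can name the elements of values. Here is a sketch with a small made-up data frame (toy_means is hypothetical and stands in for the real summary of I_jean):

```r
library(ggplot2)

# Made-up mean durations, for illustration only
toy_means <- data.frame(
  Word = c("I", "I'D", "I'LL"),
  Dur_msec = c(120, 150, 140)
)

# Names tie each color to a category, so the order of this vector doesn't matter
word_colors <- c("I'LL" = "hotpink", "I" = "bisque", "I'D" = "chartreuse4")

p_manual <- ggplot(toy_means, aes(Word, Dur_msec, fill = Word)) +
  geom_col() +
  scale_fill_manual(values = word_colors)

p_manual
```

Naming the colors keeps each word's color stable even if a category is dropped or the factor levels are reordered.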

And finally, if you’re preparing a plot for publication, and you want to be sure that the colors will be distinguishable when printed in black and white, there is scale_fill_grey().

ggplot(I_jean, aes(Word, Dur_msec, fill = Word))+
  stat_summary(fun = mean, geom = "bar")+
  scale_fill_grey()

Gradient color scales

There is also a nice set of customizable gradient color scales. Here’s the default gradient color scale, which is called scale_fill_gradient():

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradient()

scale_fill_gradient() constructs a color continuum from A to B: the lowest values in the plot get solid A, the highest values get solid B, and everything else falls along the gradient between them. You can override the two default colors by passing the colors you prefer to low and high.

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradient(low="darkblue",high="darkred")
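To get a feel for what falling along the gradient means, you can build the same kind of two-color ramp with the scales package and ask it for colors at a few positions. This is a sketch: positions run from 0 (the lowest value) to 1 (the highest), and the five positions are arbitrary.

```r
library(scales)

# A function mapping positions in [0, 1] to colors between the two endpoints
ramp <- seq_gradient_pal(low = "darkblue", high = "darkred")

# Colors at the low end, three intermediate steps, and the high end
ramp_colors <- ramp(c(0, 0.25, 0.5, 0.75, 1))
ramp_colors
```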

It’s also possible to define a gradient that passes from A to B via C with scale_fill_gradient2(). Here, you define low and high, as well as a third color you want the gradient to transition through, mid. You also need to define the value the scale should treat as a midpoint.

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradient2(low="darkblue",
                       high="darkred",
                       mid="white",
                       midpoint=0.5)
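The midpoint only does useful work if it falls inside the range of the plotted values (its default is 0). One option is to derive it from the data. Here is a sketch that uses ggplot2's built-in faithfuld density grid as a stand-in for the formant data; placing the midpoint halfway through the range is just one possible choice.

```r
library(ggplot2)

# Halfway between the smallest and largest density in the data
dens_mid <- mean(range(faithfuld$density))

p_mid <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_gradient2(low = "darkblue",
                       mid = "white",
                       high = "darkred",
                       midpoint = dens_mid)

p_mid
```

With data that have a natural center, such as differences around 0, you would normally set the midpoint to that value instead.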

And, finally, you’re able to define a color gradient passing through any arbitrary colors with scale_fill_gradientn(). Here’s a pretty ugly one:

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradientn(colours = c("bisque", 
                                   "chartreuse4",
                                   "hotpink",
                                   "yellow"))

A benefit of having scale_fill_gradientn() is that you can use some of R’s built-in color palettes, like rainbow(), terrain.colors() and topo.colors():

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradientn(colours = rainbow(6))

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradientn(colours = terrain.colors(6))

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradientn(colours = topo.colors(6))
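Each of these palette functions simply returns a vector of n colors, which scale_fill_gradientn() then interpolates between. A quick sketch of what gets passed to colours (the choice of 6 just mirrors the plots above):

```r
# Base R palette functions: each returns n colors as a character vector
n_colors <- 6

palettes <- list(
  rainbow = rainbow(n_colors),
  terrain = terrain.colors(n_colors),
  topo    = topo.colors(n_colors)
)

palettes
```

Asking for more colors gives the gradient more anchor points to pass through, which matters most for palettes that change hue quickly.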

Guides

A set of new mechanics for handling the presentation of legends was introduced with ggplot2 version 0.9.0, including a new kind of legend for continuous color scales: the colorbar. The legend guide displays the color value for specific breaks, but not the gradient in between, while the colorbar guide shows the entire gradient, with the breaks labeled alongside it. In current versions of ggplot2 the colorbar is already the default guide for gradient scales, so the plots above use it, but you can also request it explicitly with the guide argument.

ggplot(I_jean, aes(-F2.n, -F1.n))+
  stat_density2d(geom = "tile",
                 contour = F, 
                 aes(fill = ..density..))+
  scale_fill_gradientn(colours = rainbow(6),
                       guide = "colorbar")
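To compare the two guide types side by side, here is a minimal sketch on ggplot2's built-in faithfuld data (a stand-in for the density plots above). The guide argument accepts either name.

```r
library(ggplot2)

base_plot <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster()

# Continuous bar showing the whole gradient
with_bar <- base_plot + scale_fill_gradient(guide = "colorbar")

# Discrete keys, one per break, with no gradient in between
with_legend <- base_plot + scale_fill_gradient(guide = "legend")

with_bar
with_legend
```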

shape and linetype

The shape and linetype scales are much more limited. Despite the apparent existence of scale_shape_continuous and scale_linetype_continuous, you can’t actually pass a continuous variable to these aesthetics.

apropos("^scale_shape_")
[1] "scale_shape_continuous" "scale_shape_discrete"  
[3] "scale_shape_identity"   "scale_shape_manual"    
apropos("^scale_linetype_")
[1] "scale_linetype_continuous" "scale_linetype_discrete"  
[3] "scale_linetype_identity"   "scale_linetype_manual"    

The only time you’re likely to use the shape and linetype scales is when you want to manually control the point shapes and linetypes to be used. I actually find that the default shapes don’t contrast strongly with each other, and prefer to contrast point types based on filled vs hollow shapes.

ggplot(I_subset, aes(Dur_msec, F1.n, shape = Word))+
  geom_point()

ggplot(I_subset, aes(Dur_msec, F1.n, shape = Word))+
  geom_point()+
  scale_shape_manual(values=c(1,1, 19, 19))
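As with the manual fill scale, the shapes can be named so that it is explicit which word gets a hollow point (1) and which gets a filled one (19). This sketch uses a made-up four-row data frame, and the word labels are an assumption about what I_subset contains.

```r
library(ggplot2)

# Made-up rows, one per word, for illustration only
toy_points <- data.frame(
  Dur_msec = c(90, 110, 130, 150),
  F1.n     = c(1.2, 0.8, 1.0, 0.6),
  Word     = c("I'D", "I'LL", "I'M", "I'VE")
)

# 1 = hollow circle, 19 = filled circle
word_shapes <- c("I'D" = 1, "I'LL" = 1, "I'M" = 19, "I'VE" = 19)

p_shapes <- ggplot(toy_points, aes(Dur_msec, F1.n, shape = Word)) +
  geom_point() +
  scale_shape_manual(values = word_shapes)

p_shapes
```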

Other scales

apropos("^scale_size_")
[1] "scale_size_area"       "scale_size_continuous"
[3] "scale_size_date"       "scale_size_datetime"  
[5] "scale_size_discrete"   "scale_size_identity"  
[7] "scale_size_manual"    
apropos("^scale_alpha_")
[1] "scale_alpha_continuous" "scale_alpha_discrete"  
[3] "scale_alpha_identity"   "scale_alpha_manual"    

Faceting

A really powerful graphical technique is the small multiple, and ggplot2 allows for easy creation of small multiples via faceting. Let’s create an additional categorical variable for the entire data set (we did this already for the subset excluding “I”).

I_jean <- I_jean %>%
  mutate(Dur_cat = Dur_msec > mean(Dur_msec))
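Dur_cat is a logical column here, so the facets built from it will be labeled TRUE and FALSE. If you would prefer readable labels, one option is to create the variable as text instead. A sketch on made-up durations (the labels "long" and "short" are my own choice):

```r
library(dplyr)

# Made-up durations in msec; their mean is 125
toy <- data.frame(Dur_msec = c(80, 120, 200, 100))

toy <- toy %>%
  mutate(Dur_cat = if_else(Dur_msec > mean(Dur_msec), "long", "short"))

toy
```

Facets are ordered alphabetically for character variables, so "long" would come before "short" unless you convert the column to a factor with the levels in the order you want.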

facet_wrap

If we wanted to create an F2 x F1 plot for every word, we’d start out by creating a simple F2 x F1 plot:

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()

and then faceting by Word with facet_wrap().

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_wrap(~Word)

You can exercise some control over the layout of facets with ncol and nrow. For example, if you really only wanted there to be 2 columns of facets, you could make that happen by passing ncol = 2 to facet_wrap().

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_wrap(~Word, ncol = 2)

Or, if you wanted all the facets to be lined up in one row, you would pass nrow = 1.

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_wrap(~Word, nrow = 1)

By default, the axis ranges are fixed to be the same across all facets, and that should be changed only in very limited circumstances. You can free the x axis, the y axis, or both by passing "free_x", "free_y", or "free" to the scales argument of facet_wrap().

## Inadvisable
ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_wrap(~Word, scales = "free_x")

## Inadvisable
ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_wrap(~Word, scales = "free_y")

## Inadvisable
ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_wrap(~Word, scales = "free")

facet_grid

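facet_grid() is another form of faceting, which lays the facets out in two dimensions: the variable on the left of the formula defines the rows, and the variable on the right defines the columns.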

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_grid(Dur_cat~Word)

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_grid(Word~Dur_cat)

ggplot(I_jean, aes(-F2.n, -F1.n ))+
  geom_point()+
  facet_grid(Dur_cat~Word, scales = "free")
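With a logical variable like Dur_cat, the strip labels on the grid just read TRUE and FALSE. Passing labeller = label_both adds the variable name to each strip. Here is a sketch on a small made-up data frame with the same column names as I_jean:

```r
library(ggplot2)

# Made-up normalized formant values, for illustration only
toy_formants <- data.frame(
  F2.n    = c(-1.0, 0.5, 0.2, -0.4, 0.9, -0.7),
  F1.n    = c(0.3, -0.2, 1.1, 0.6, -0.5, 0.1),
  Word    = c("I", "I", "I'M", "I'M", "I'LL", "I'LL"),
  Dur_cat = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# label_both prints "variable: value" in each strip
p_labelled <- ggplot(toy_formants, aes(-F2.n, -F1.n)) +
  geom_point() +
  facet_grid(Dur_cat ~ Word, labeller = label_both)

p_labelled
```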

