Starting with the tidyverse

Author

Josef Fruehwald

Published

January 24, 2023

Let’s start by loading the tidyverse.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Tidyverse functions as verbs

Most tidyverse functions are named as verbs, take a data frame as their first argument, and return a data frame.

# a data frame
mtcars <- as_tibble(mtcars)
mtcars
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
# filter the dataframe to 
# only the rows with cyl==6
filter(mtcars, cyl == 6)
# A tibble: 7 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
4  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
5  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
6  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
7  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6
# count how many rows
# have each value of gear
count(mtcars, gear)
# A tibble: 3 × 2
   gear     n
  <dbl> <int>
1     3    15
2     4    12
3     5     5

Piping

Since tidyverse functions take data frames as input, and produce data frames as output, you might want to combine them.

What are the counts of gear for cars with cyl==6?

count(
  filter(
    mtcars, 
    cyl == 6
    ), 
  gear
  )
# A tibble: 3 × 2
   gear     n
  <dbl> <int>
1     3     2
2     4     4
3     5     1

A problem here is that you have to write and read your functions inside out. Wouldn’t it be great if we could write code that reads like this:

First take the mtcars data, and then filter it by cyl==6, then get the count of gears.

That’s where the pipe |> comes in. The pipe takes everything to its left, and inserts it as the first argument to the function on its right.

# this
mtcars |> filter(cyl == 6)

# is equivalent to this
filter(mtcars, cyl == 6)

This lets us chain tidyverse verbs together.

mtcars |> 
  filter(cyl == 6) |> 
  count(gear)
# A tibble: 3 × 2
   gear     n
  <dbl> <int>
1     3     2
2     4     4
3     5     1
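As a quick check on the claim that the pipe only rearranges how the code is written, the sketch below builds the same result both ways and compares them. It assumes R 4.1.0 or later (where the native pipe `|>` is available) and uses the built-in `mtcars` data frame directly rather than the tibble version above; the names `nested` and `piped` are only illustrative.

```r
library(dplyr)

# nested form: has to be read from the inside out
nested <- count(filter(mtcars, cyl == 6), gear)

# piped form: reads top to bottom, one verb per line
piped <- mtcars |>
  filter(cyl == 6) |>
  count(gear)

# compare the two results
identical(nested, piped)
```

Because each step of the piped version is on its own line, it is also easy to run the chain up to any intermediate verb and inspect what comes out.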

Work it out

The data frame starwars contains demographic and personal data for many characters from the Star Wars universe. Using dplyr verbs like the ones above:

  1. Find out which planet is the most common homeworld for humans.
  2. Find out who was the tallest Droid.
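Questions about "the most common" or "the tallest" usually involve sorting, which has not come up yet. The sketch below is one possible approach, not the intended solution: it assumes `arrange()` with `desc()` and the `sort` argument of `count()` are fair game, and it uses `mtcars` instead of `starwars` so the exercise stays open. The names `sorted` and `top_gear` are only illustrative.

```r
library(dplyr)

# sort the 6-cylinder cars from highest to lowest horsepower
sorted <- mtcars |>
  filter(cyl == 6) |>
  arrange(desc(hp))

# the first row is now the 6-cylinder car with the most horsepower
head(sorted, 1)

# count() can also sort its output so the most common value comes first
top_gear <- mtcars |>
  count(gear, sort = TRUE)

top_gear
```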

Grouping and summarizing

To find the average horsepower across all of the cars in mtcars, we can use summarise().

mtcars |> 
  summarise(hp = mean(hp))
# A tibble: 1 × 1
     hp
  <dbl>
1  147.

If we want the average horsepower for each number of cylinders, we can group_by() and then summarise().

mtcars |> 
  group_by(cyl) |> 
  summarise(hp = mean(hp))
# A tibble: 3 × 2
    cyl    hp
  <dbl> <dbl>
1     4  82.6
2     6 122. 
3     8 209. 
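A single `summarise()` call can compute more than one summary per group. The following is a minimal sketch using the built-in `mtcars` data frame; the column names `mean_hp`, `max_hp`, and `n_cars` are made up for this illustration.

```r
library(dplyr)

by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(
    mean_hp = mean(hp),  # average horsepower within each group
    max_hp  = max(hp),   # highest horsepower within each group
    n_cars  = n()        # number of rows in each group
  )

by_cyl
```

The result has one row per value of `cyl` and one column per summary, plus the grouping column itself.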

Mutating

To add new columns to a data frame, we can use mutate(). Inside mutate(), we can refer to any column in the data frame.

## horsepower by cylinder?
mtcars |> 
  mutate(hp_by_cyl = hp/cyl)
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb hp_by_cyl
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4      18.3
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4      18.3
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1      23.2
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1      18.3
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2      21.9
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1      17.5
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4      30.6
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2      15.5
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2      23.8
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4      20.5
# … with 22 more rows
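Columns created inside `mutate()` can also be used later in the same call. The sketch below assumes the built-in `mtcars` data frame; `above_avg` and `with_ratio` are hypothetical names chosen for this illustration.

```r
library(dplyr)

with_ratio <- mtcars |>
  mutate(
    hp_by_cyl = hp / cyl,                    # new column from two existing ones
    above_avg = hp_by_cyl > mean(hp_by_cyl)  # uses the column created just above
  )

# how many cars fall above and below the average ratio
with_ratio |>
  count(above_avg)
```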

Work it out

This will load all tokens of “uh” and “um” from the Philadelphia Neighborhood Corpus.

um <- read_tsv("https://bit.ly/3JdeSbx")
Rows: 26060 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (3): word, next_seg, idstring
dbl (11): start_time, end_time, vowel_start, vowel_end, nasal_start, nasal_e...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The column word codes whether “um”, “uh”, or some combination was spoken. Other important columns are:

  • start_time, end_time: the start and end times of the whole word.

  • vowel_start, vowel_end: the start and end times of the vowel in the word.

  • nasal_start, nasal_end: the start and end times of the nasal, for the word UM.

  • next_seg: the transcription of the following segment; "sp" means “pause”.

  • next_seg_start, next_seg_end: the start and end times of the following segment.

um
# A tibble: 26,060 × 14
   word  start…¹ end_t…² vowel…³ vowel…⁴ nasal…⁵ nasal…⁶ next_…⁷ next_…⁸ next_…⁹
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>
 1 UH       24.4    24.7    24.4    24.7    NA      NA   S          24.7    24.9
 2 UH       35.0    35.2    35.0    35.2    NA      NA   F          35.2    35.4
 3 UM       37.9    38.3    37.9    38.1    38.1    38.3 sp         38.3    38.4
 4 UH       44.5    44.7    44.5    44.7    NA      NA   DH         44.7    44.7
 5 UH       57.6    57.8    57.6    57.8    NA      NA   AY1        57.8    57.9
 6 UH       62.3    62.5    62.3    62.5    NA      NA   sp         62.5    63.0
 7 UH       73.9    74.2    73.9    74.2    NA      NA   sp         74.2    75.0
 8 UH       75.1    75.4    75.1    75.4    NA      NA   sp         75.4    75.7
 9 UM       81.6    82.0    81.6    81.8    81.8    82.0 sp         82.0    84.0
10 UH       92.6    92.9    92.6    92.9    NA      NA   sp         92.9    93.4
# … with 26,050 more rows, 4 more variables: chunk_start <dbl>,
#   chunk_end <dbl>, nwords <dbl>, idstring <chr>, and abbreviated variable
#   names ¹​start_time, ²​end_time, ³​vowel_start, ⁴​vowel_end, ⁵​nasal_start,
#   ⁶​nasal_end, ⁷​next_seg, ⁸​next_seg_start, ⁹​next_seg_end

Using dplyr verbs like the ones above:

  1. Figure out the average duration of the vowel for each kind of word.
  2. Figure out the average duration of the vowel for each kind of word when the following segment is a pause versus when it isn’t.
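Both questions need a duration column, and the second needs a column that says whether a pause follows. The sketch below shows one way to set those up with `mutate()`, and deliberately stops before the grouping and summarising step. It runs on a tiny made-up tibble that only borrows the column names from `um`; the values are invented for illustration and are not taken from the corpus, and `vowel_dur` and `pre_pause` are hypothetical names.

```r
library(dplyr)

# made-up rows for illustration only; NOT real corpus data
toy <- tibble(
  word        = c("UH", "UH", "UM", "UM"),
  vowel_start = c(1.0, 2.0, 3.0, 4.0),
  vowel_end   = c(1.2, 2.4, 3.1, 4.3),
  next_seg    = c("sp", "S", "sp", "DH")
)

toy_prepped <- toy |>
  mutate(
    vowel_dur = vowel_end - vowel_start,  # duration = end time minus start time
    pre_pause = next_seg == "sp"          # TRUE when the next segment is a pause
  )

toy_prepped
```

From here, the same `group_by()` and `summarise()` pattern used for `mtcars` applies, with one or two grouping columns depending on the question.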

Reuse

CC-BY-SA 4.0