Let’s start by loading the tidyverse.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0 ✔ purrr 1.0.1
✔ tibble 3.1.8 ✔ dplyr 1.1.0
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.3 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Tidyverse functions as verbs
Most tidyverse functions are written to be verbs: they take a data frame as their first argument and also return a data frame.
# a data frame
mtcars <- as_tibble(mtcars)
mtcars
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
# filter the dataframe to
# only the rows with cyl==6
filter(mtcars, cyl == 6)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
# count how many rows
# have each value of gear
count(mtcars, gear)
# A tibble: 3 × 2
gear n
<dbl> <int>
1 3 15
2 4 12
3 5 5
Piping
Since tidyverse functions take data frames as input, and produce data frames as output, you might want to combine them.
What are the counts of gear for cars with cyl == 6?
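One way to answer this, sketched here with the built-in mtcars data, is to nest one verb inside the other. The name nested_counts is only an illustrative choice, not something from the original material.

```r
library(dplyr)

# Nested calls run from the inside out:
# filter() keeps the rows with cyl == 6 first,
# then count() tallies the gear values in that result.
nested_counts <- count(filter(mtcars, cyl == 6), gear)
nested_counts
```

The innermost call is the first step of the analysis, even though it appears in the middle of the line.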
A problem here is that you have to write and read your functions inside out. Wouldn’t it be great if we could write code that looks like:
First take the mtcars data, and then filter it by cyl == 6, then get the count of gears.
That’s where the pipe |> comes in. The pipe takes everything to its left and inserts it as the first argument to the function on its right.
This lets us chain tidyverse verbs together.
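As a minimal sketch of such a chain, the gear counts for six-cylinder cars could be written with the pipe as follows. It assumes R 4.1 or later for the native |> pipe, and piped_counts is just an illustrative name.

```r
library(dplyr)

# Read top to bottom: take mtcars, then keep the rows
# with cyl == 6, then count the values of gear.
piped_counts <- mtcars |>
  filter(cyl == 6) |>
  count(gear)
piped_counts
```

Each step sits on its own line in the order it runs, which tends to make longer chains easier to read than nested calls.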
Grouping and summarizing
To find out the average horsepower across all of the cars in mtcars, we can use summarise().
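A sketch of what that might look like, using the built-in mtcars data. The column name mean_hp and the variable overall_hp are illustrative choices.

```r
library(dplyr)

# summarise() collapses the whole data frame to a single row;
# mean_hp holds the average of the hp column.
overall_hp <- mtcars |>
  summarise(mean_hp = mean(hp))
overall_hp
```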
If we want to find out the average horsepower by the number of cylinders, we can group_by() and then summarise().
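The grouped version could be sketched like this, again with the built-in mtcars data. The names mean_hp and mean_hp_by_cyl are illustrative.

```r
library(dplyr)

# group_by() marks the groups (one per value of cyl);
# summarise() then returns one row per group.
mean_hp_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp))
mean_hp_by_cyl
```

The result has one row for each distinct value of cyl rather than one row overall.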
Mutating
To add new columns to a data frame, we can use mutate(). Inside mutate(), we can refer to any column in the data frame.
## horsepower by cylinder?
mtcars |>
mutate(hp_by_cyl = hp/cyl)
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb hp_by_cyl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 18.3
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 18.3
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 23.2
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 18.3
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 21.9
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 17.5
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 30.6
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 15.5
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 23.8
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 20.5
# … with 22 more rows
This will load all tokens of “uh” and “um” from the Philadelphia Neighborhood Corpus.
um <- read_tsv("https://bit.ly/3JdeSbx")
Rows: 26060 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): word, next_seg, idstring
dbl (11): start_time, end_time, vowel_start, vowel_end, nasal_start, nasal_e...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The column word codes whether it was “um” or “uh” or some combo that was spoken. Other important columns are
- start_time, end_time: the start and end times for the whole word
- vowel_start, vowel_end: the start and end times of the vowel in the word
- nasal_start, nasal_end: the start and end times of the nasal, for the word UM
- next_seg: the transcription of the following segment. "sp" means “pause”
- next_seg_start, next_seg_end: the start and end times of the following segment
um
# A tibble: 26,060 × 14
word start…¹ end_t…² vowel…³ vowel…⁴ nasal…⁵ nasal…⁶ next_…⁷ next_…⁸ next_…⁹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 UH 24.4 24.7 24.4 24.7 NA NA S 24.7 24.9
2 UH 35.0 35.2 35.0 35.2 NA NA F 35.2 35.4
3 UM 37.9 38.3 37.9 38.1 38.1 38.3 sp 38.3 38.4
4 UH 44.5 44.7 44.5 44.7 NA NA DH 44.7 44.7
5 UH 57.6 57.8 57.6 57.8 NA NA AY1 57.8 57.9
6 UH 62.3 62.5 62.3 62.5 NA NA sp 62.5 63.0
7 UH 73.9 74.2 73.9 74.2 NA NA sp 74.2 75.0
8 UH 75.1 75.4 75.1 75.4 NA NA sp 75.4 75.7
9 UM 81.6 82.0 81.6 81.8 81.8 82.0 sp 82.0 84.0
10 UH 92.6 92.9 92.6 92.9 NA NA sp 92.9 93.4
# … with 26,050 more rows, 4 more variables: chunk_start <dbl>,
# chunk_end <dbl>, nwords <dbl>, idstring <chr>, and abbreviated variable
# names ¹start_time, ²end_time, ³vowel_start, ⁴vowel_end, ⁵nasal_start,
# ⁶nasal_end, ⁷next_seg, ⁸next_seg_start, ⁹next_seg_end
Using dplyr verbs like the ones above:
- Figure out the average duration of the vowel for each kind of word.
- Figure out the average duration of the vowel for each kind of word when the following segment is a pause versus when it isn’t.
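One possible approach is sketched below. To keep it self-contained, it runs on a small tibble, um_preview, built from the ten rows printed above. Those values are the rounded display values, so the durations are only approximate. The names um_preview, vowel_dur, pre_pause, and mean_vowel_dur are illustrative assumptions. For real answers, the same pipelines would be applied to the full um data frame.

```r
library(dplyr)

# The first ten rows of `um` as printed above (rounded display values)
um_preview <- tribble(
  ~word, ~vowel_start, ~vowel_end, ~next_seg,
  "UH",  24.4, 24.7, "S",
  "UH",  35.0, 35.2, "F",
  "UM",  37.9, 38.1, "sp",
  "UH",  44.5, 44.7, "DH",
  "UH",  57.6, 57.8, "AY1",
  "UH",  62.3, 62.5, "sp",
  "UH",  73.9, 74.2, "sp",
  "UH",  75.1, 75.4, "sp",
  "UM",  81.6, 81.8, "sp",
  "UH",  92.6, 92.9, "sp"
)

# Average vowel duration for each kind of word
dur_by_word <- um_preview |>
  mutate(vowel_dur = vowel_end - vowel_start) |>
  group_by(word) |>
  summarise(mean_vowel_dur = mean(vowel_dur))

# Same, split by whether the following segment is a pause ("sp")
dur_by_word_pause <- um_preview |>
  mutate(
    vowel_dur = vowel_end - vowel_start,
    pre_pause = next_seg == "sp"
  ) |>
  group_by(word, pre_pause) |>
  summarise(mean_vowel_dur = mean(vowel_dur), .groups = "drop")

dur_by_word
dur_by_word_pause
```

Creating the duration and pause columns with mutate() first keeps the group_by() and summarise() steps short and readable.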