Joins
A useful diagram for understanding joins https://r4ds.hadley.nz/joins.html#fig-join-left
Join these two datasets together (dplyr::left_join()
) and find out
Which filled pause did people born before 1930 use the most?
Which filled pause did people born after 1980 use the most?
Think of using functions like
Pivots
install.packages("nasapower")
Getting monthly temperature data
Here is monthly temperature data for Lexington according to NASA
lex_temp
NASA/POWER CERES/MERRA2 Native Resolution Monthly and Annual
Dates (month/day/year): 01/01/1985 through 12/31/2021
Location: Latitude 38.048 Longitude -84.5016
Elevation from MERRA-2: Average for 0.5 x 0.625 degree lat/lon region = 280.65 meters
The value for missing source data that cannot be computed or is outside of the sources availability range: NA
Parameter(s):
Parameters:
T2M MERRA-2 Temperature at 2 Meters (C)
# A tibble: 37 × 17
LON LAT PARAMETER YEAR JAN FEB MAR APR MAY JUN JUL AUG
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -84.5 38.0 T2M 1985 -4.87 -3.06 7.64 13.5 17.4 21.0 23.8 23.1
2 -84.5 38.0 T2M 1986 -1.52 1.77 6.53 12.8 18.1 23.4 26.5 23.4
3 -84.5 38.0 T2M 1987 -0.99 1.08 6.54 10.8 19.7 22.8 25.4 25.7
4 -84.5 38.0 T2M 1988 -2.38 -0.56 5.58 11.0 17.2 23.6 26.4 26.2
5 -84.5 38.0 T2M 1989 2.42 -0.93 6.44 10.7 14.7 21.1 23.5 23.1
6 -84.5 38.0 T2M 1990 2.61 4.76 7.74 10.8 15.8 21.3 23.7 23.8
7 -84.5 38.0 T2M 1991 -0.54 2.13 7.15 13.5 20.4 22.3 24.3 24.3
8 -84.5 38.0 T2M 1992 0.09 3.51 6.23 11.7 15.9 19.8 23.6 20.9
9 -84.5 38.0 T2M 1993 1.78 -0.27 4.3 10.6 17.4 21.8 26.3 24.8
10 -84.5 38.0 T2M 1994 -3.89 0.8 4.54 13.2 15.1 22.7 24.3 23.2
# … with 27 more rows, and 5 more variables: SEP <dbl>, OCT <dbl>, NOV <dbl>,
# DEC <dbl>, ANN <dbl>
Pivoting long
lex_long <-
lex_temp |>
pivot_longer(
cols = JAN:DEC,
names_to = "month",
values_to = "temp"
)
lex_long
# A tibble: 444 × 7
LON LAT PARAMETER YEAR ANN month temp
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 -84.5 38.0 T2M 1985 12.0 JAN -4.87
2 -84.5 38.0 T2M 1985 12.0 FEB -3.06
3 -84.5 38.0 T2M 1985 12.0 MAR 7.64
4 -84.5 38.0 T2M 1985 12.0 APR 13.5
5 -84.5 38.0 T2M 1985 12.0 MAY 17.4
6 -84.5 38.0 T2M 1985 12.0 JUN 21.0
7 -84.5 38.0 T2M 1985 12.0 JUL 23.8
8 -84.5 38.0 T2M 1985 12.0 AUG 23.1
9 -84.5 38.0 T2M 1985 12.0 SEP 19.8
10 -84.5 38.0 T2M 1985 12.0 OCT 15.5
# … with 434 more rows
Pivoting wide
You can do data operations, and then pivot the data back to wide.
lex_long |>
# converting year to a decade value
mutate(decade = floor(YEAR / 10)) |>
# grouping by decade and month
group_by(decade, month) |>
# getting average temperature within groups
summarise(mean_temp = mean(temp)) |>
# pivoting wide
pivot_wider(
names_from = month,
values_from = mean_temp
)
`summarise()` has grouped output by 'decade'. You can override using the
`.groups` argument.
# A tibble: 5 × 13
# Groups: decade [5]
decade APR AUG DEC FEB JAN JUL JUN MAR MAY NOV OCT
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 198 11.8 24.3 -0.564 -0.34 -1.47 25.1 22.4 6.55 17.4 7.50 12.4
2 199 11.6 23.7 1.98 2.14 0.077 24.4 21.8 5.72 17.2 6.16 13.1
3 200 12.7 24.5 1.08 1.23 0.158 24.1 22.3 6.86 17.7 7.34 13.6
4 201 12.9 24.1 2.46 1.47 -0.838 24.7 22.7 6.26 18.6 6.08 14.0
5 202 11.2 24.2 4.20 0.68 1.48 24.8 22.2 9.12 16.2 6.88 15.0
# … with 1 more variable: SEP <dbl>
With the “UM” data
- Calculate the average duration of the vowel for each person for each word. Think of using functions like
- Figure out how many times longer the vowel is in “UM” than for “UH” for each person. Think of using functions like
Untidy data examples
Source: {untidydata}
by Joseph Casillas
install.packages("devtools")
devtools::install_github("jvcasillas/untidydata")
library(untidydata)
This is just the nettle
dataset all over again, but starts out “long”
language_diversity
# A tibble: 444 × 4
Continent Country Measurement Value
<chr> <chr> <chr> <dbl>
1 Africa Algeria Langs 18
2 Africa Angola Langs 42
3 Oceania Australia Langs 234
4 Asia Bangladesh Langs 37
5 Africa Benin Langs 52
6 Americas Bolivia Langs 38
7 Africa Botswana Langs 27
8 Americas Brazil Langs 209
9 Africa Burkina Faso Langs 75
10 Africa CAR Langs 94
# … with 434 more rows
Using a pivot _*()
functions, convert the untidydata::language_diversity
data to the format we’ve seen the Nettle data in.
Stretch goals
Here’s Spanish vowel data, also from {untidydata}
.
spanish_vowels
# A tibble: 750 × 4
label rep f1 f2
<chr> <int> <dbl> <dbl>
1 p01-male-a 1 615. 1231.
2 p01-male-a 2 645. 1282.
3 p01-male-a 3 608. 1248.
4 p01-male-e 1 477. 1612.
5 p01-male-e 2 457. 1839.
6 p01-male-e 3 445. 1849.
7 p01-male-i 1 309. 2153.
8 p01-male-i 2 259. 2176.
9 p01-male-i 3 337. 2015.
10 p01-male-o 1 478. 865.
# … with 740 more rows
The spanish_vowels$label
column has three different variable smushed together: speaker id, speaker gender, vowel class. This is good file naming convention. Poor data column convention.
Look over the docs for tidyr::separate()
and try to get the speaker id, speaker gender, and vowel class separated out into their own columns.