Continuing with the tidyverse

Author

Josef Fruehwald

Published

January 26, 2023

Joins

A useful diagram for understanding joins https://r4ds.hadley.nz/joins.html#fig-join-left

Work it out
um <- read_tsv("https://bit.ly/3JdeSbx")
demo <- read_csv("https://bit.ly/3wOfcGx")

Join these two datasets together (dplyr::left_join()) and find out

  1. Which filled pause did people born before 1930 use the most?

  2. Which filled pause did people born after 1980 use the most?

Think of using functions like

Pivots

install.packages("nasapower")

Getting monthly temperature data

lex_temp <- 
  get_power(
    community = "ag",
    temporal_api = "monthly",
    pars = "T2M",
    dates = c("1985-01-01", "2021-12-31"),
    lonlat = c(-84.501640,  38.047989)
  )

Here is monthly temperature data for Lexington according to NASA

lex_temp
NASA/POWER CERES/MERRA2 Native Resolution Monthly and Annual  
 Dates (month/day/year): 01/01/1985 through 12/31/2021  
 Location: Latitude  38.048   Longitude -84.5016  
 Elevation from MERRA-2: Average for 0.5 x 0.625 degree lat/lon region = 280.65 meters 
 The value for missing source data that cannot be computed or is outside of the sources availability range: NA  
 Parameter(s):  
 
 Parameters: 
 T2M     MERRA-2 Temperature at 2 Meters (C)  
 
# A tibble: 37 × 17
     LON   LAT PARAMETER  YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG
   <dbl> <dbl> <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 -84.5  38.0 T2M        1985 -4.87 -3.06  7.64  13.5  17.4  21.0  23.8  23.1
 2 -84.5  38.0 T2M        1986 -1.52  1.77  6.53  12.8  18.1  23.4  26.5  23.4
 3 -84.5  38.0 T2M        1987 -0.99  1.08  6.54  10.8  19.7  22.8  25.4  25.7
 4 -84.5  38.0 T2M        1988 -2.38 -0.56  5.58  11.0  17.2  23.6  26.4  26.2
 5 -84.5  38.0 T2M        1989  2.42 -0.93  6.44  10.7  14.7  21.1  23.5  23.1
 6 -84.5  38.0 T2M        1990  2.61  4.76  7.74  10.8  15.8  21.3  23.7  23.8
 7 -84.5  38.0 T2M        1991 -0.54  2.13  7.15  13.5  20.4  22.3  24.3  24.3
 8 -84.5  38.0 T2M        1992  0.09  3.51  6.23  11.7  15.9  19.8  23.6  20.9
 9 -84.5  38.0 T2M        1993  1.78 -0.27  4.3   10.6  17.4  21.8  26.3  24.8
10 -84.5  38.0 T2M        1994 -3.89  0.8   4.54  13.2  15.1  22.7  24.3  23.2
# … with 27 more rows, and 5 more variables: SEP <dbl>, OCT <dbl>, NOV <dbl>,
#   DEC <dbl>, ANN <dbl>

Pivoting long

lex_long <- 
  lex_temp |> 
  pivot_longer(
    cols = JAN:DEC,
    names_to = "month",
    values_to = "temp"
  )

lex_long
# A tibble: 444 × 7
     LON   LAT PARAMETER  YEAR   ANN month  temp
   <dbl> <dbl> <chr>     <dbl> <dbl> <chr> <dbl>
 1 -84.5  38.0 T2M        1985  12.0 JAN   -4.87
 2 -84.5  38.0 T2M        1985  12.0 FEB   -3.06
 3 -84.5  38.0 T2M        1985  12.0 MAR    7.64
 4 -84.5  38.0 T2M        1985  12.0 APR   13.5 
 5 -84.5  38.0 T2M        1985  12.0 MAY   17.4 
 6 -84.5  38.0 T2M        1985  12.0 JUN   21.0 
 7 -84.5  38.0 T2M        1985  12.0 JUL   23.8 
 8 -84.5  38.0 T2M        1985  12.0 AUG   23.1 
 9 -84.5  38.0 T2M        1985  12.0 SEP   19.8 
10 -84.5  38.0 T2M        1985  12.0 OCT   15.5 
# … with 434 more rows

Pivoting wide

You can do data operations, and then pivot the data back to wide.

lex_long |> 
  # converting year to a decade value
  mutate(decade = floor(YEAR / 10)) |> 
  # grouping by decade and month
  group_by(decade, month) |> 
  # getting average temperature within groups
  summarise(mean_temp = mean(temp)) |> 
  # pivoting wide
  pivot_wider(
    names_from = month,
    values_from = mean_temp
  )
`summarise()` has grouped output by 'decade'. You can override using the
`.groups` argument.
# A tibble: 5 × 13
# Groups:   decade [5]
  decade   APR   AUG    DEC   FEB    JAN   JUL   JUN   MAR   MAY   NOV   OCT
   <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    198  11.8  24.3 -0.564 -0.34 -1.47   25.1  22.4  6.55  17.4  7.50  12.4
2    199  11.6  23.7  1.98   2.14  0.077  24.4  21.8  5.72  17.2  6.16  13.1
3    200  12.7  24.5  1.08   1.23  0.158  24.1  22.3  6.86  17.7  7.34  13.6
4    201  12.9  24.1  2.46   1.47 -0.838  24.7  22.7  6.26  18.6  6.08  14.0
5    202  11.2  24.2  4.20   0.68  1.48   24.8  22.2  9.12  16.2  6.88  15.0
# … with 1 more variable: SEP <dbl>
Work it out

With the “UM” data

  1. Calculate the average duration of the vowel for each person for each word. Think of using functions like
  2. Figure out how many times longer the vowel is in “UM” than for “UH” for each person. Think of using functions like

Untidy data examples

Source: {untidydata} by Joseph Casillas

install.packages("devtools")
devtools::install_github("jvcasillas/untidydata")
library(untidydata)

This is just the nettle dataset all over again, but starts out “long”

language_diversity
# A tibble: 444 × 4
   Continent Country      Measurement Value
   <chr>     <chr>        <chr>       <dbl>
 1 Africa    Algeria      Langs          18
 2 Africa    Angola       Langs          42
 3 Oceania   Australia    Langs         234
 4 Asia      Bangladesh   Langs          37
 5 Africa    Benin        Langs          52
 6 Americas  Bolivia      Langs          38
 7 Africa    Botswana     Langs          27
 8 Americas  Brazil       Langs         209
 9 Africa    Burkina Faso Langs          75
10 Africa    CAR          Langs          94
# … with 434 more rows
Work it out

Using a pivot _*() functions, convert the untidydata::language_diversity data to the format we’ve seen the Nettle data in.

Stretch goals

Here’s Spanish vowel data, also from {untidydata}.

spanish_vowels
# A tibble: 750 × 4
   label        rep    f1    f2
   <chr>      <int> <dbl> <dbl>
 1 p01-male-a     1  615. 1231.
 2 p01-male-a     2  645. 1282.
 3 p01-male-a     3  608. 1248.
 4 p01-male-e     1  477. 1612.
 5 p01-male-e     2  457. 1839.
 6 p01-male-e     3  445. 1849.
 7 p01-male-i     1  309. 2153.
 8 p01-male-i     2  259. 2176.
 9 p01-male-i     3  337. 2015.
10 p01-male-o     1  478.  865.
# … with 740 more rows

The spanish_vowels$label column has three different variable smushed together: speaker id, speaker gender, vowel class. This is good file naming convention. Poor data column convention.

Work it out

Look over the docs for tidyr::separate() and try to get the speaker id, speaker gender, and vowel class separated out into their own columns.

Reuse

CC-BY-SA 4.0