── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble 3.1.8 ✔ dplyr 1.1.0
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.3 ✔ forcats 0.5.2
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
What is “tidy data”
The actual paper on the topic is worth reading: Wickham (2014). There are three rules to tidy data:
- Every column is one, and only one, variable.
- Every row is one, and only one, observation.
- Every cell is a single value.
Pivoting
install.packages("nasapower")
Here is an “untidy” data set. There are multiple observations per row.
lex_temp |>
rmarkdown::paged_table()
LON <dbl> | LAT <dbl> | PARAMETER <chr> | YEAR <dbl> | JAN <dbl> | FEB <dbl> | MAR <dbl> | APR <dbl> | MAY <dbl> | JUN <dbl> | |
---|---|---|---|---|---|---|---|---|---|---|
-84.50164 | 38.04799 | T2M | 1985 | -4.87 | -3.06 | 7.64 | 13.50 | 17.44 | 21.03 | |
-84.50164 | 38.04799 | T2M | 1986 | -1.52 | 1.77 | 6.53 | 12.84 | 18.10 | 23.43 | |
-84.50164 | 38.04799 | T2M | 1987 | -0.99 | 1.08 | 6.54 | 10.80 | 19.68 | 22.83 | |
-84.50164 | 38.04799 | T2M | 1988 | -2.38 | -0.56 | 5.58 | 10.99 | 17.15 | 23.56 | |
-84.50164 | 38.04799 | T2M | 1989 | 2.42 | -0.93 | 6.44 | 10.67 | 14.72 | 21.07 | |
-84.50164 | 38.04799 | T2M | 1990 | 2.61 | 4.76 | 7.74 | 10.80 | 15.85 | 21.31 | |
-84.50164 | 38.04799 | T2M | 1991 | -0.54 | 2.13 | 7.15 | 13.49 | 20.39 | 22.32 | |
-84.50164 | 38.04799 | T2M | 1992 | 0.09 | 3.51 | 6.23 | 11.67 | 15.91 | 19.85 | |
-84.50164 | 38.04799 | T2M | 1993 | 1.78 | -0.27 | 4.30 | 10.59 | 17.37 | 21.81 | |
-84.50164 | 38.04799 | T2M | 1994 | -3.89 | 0.80 | 4.54 | 13.18 | 15.07 | 22.68 |
To tidy this data up, we need to “pivot” it longer. Simply saying pivot_longer()
and telling it which columns we want to turn long-wise (JAN
through DEC
) will do the trick.
lex_temp |>
pivot_longer(
cols = JAN:DEC
) |>
rmarkdown::paged_table()
LON <dbl> | LAT <dbl> | PARAMETER <chr> | YEAR <dbl> | ANN <dbl> | name <chr> | value <dbl> |
---|---|---|---|---|---|---|
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | JAN | -4.87 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | FEB | -3.06 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | MAR | 7.64 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | APR | 13.50 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | MAY | 17.44 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | JUN | 21.03 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | JUL | 23.85 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | AUG | 23.11 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | SEP | 19.83 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | OCT | 15.49 |
Before, there was 1 row per year, and 1 column per month. Now there are 12 rows per year, with the 12 months’ column names moved into a column called name
and the values in a column called value
.
We can tell pivot_wider()
what to call the new columns by passing it arguments to names_to=
and values_to=
lex_temp |>
pivot_longer(
cols = JAN:DEC,
names_to = "month",
values_to = "temp"
) ->
lex_temp_long
lex_temp_long |>
rmarkdown::paged_table()
LON <dbl> | LAT <dbl> | PARAMETER <chr> | YEAR <dbl> | ANN <dbl> | month <chr> | temp <dbl> |
---|---|---|---|---|---|---|
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | JAN | -4.87 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | FEB | -3.06 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | MAR | 7.64 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | APR | 13.50 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | MAY | 17.44 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | JUN | 21.03 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | JUL | 23.85 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | AUG | 23.11 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | SEP | 19.83 |
-84.50164 | 38.04799 | T2M | 1985 | 11.97 | OCT | 15.49 |
If we try turning this into a plot or heat map right away, it won’t go quite according to plan.
lex_temp_long |>
ggplot(aes(month, YEAR, fill = temp))+
geom_raster()+
scale_fill_viridis_c(option = "magma")
By default, it’s plotting the months in alphabetical order.
lex_temp_long |>
mutate(month = str_to_lower(month)) |>
ggplot(aes(month, YEAR, fill = temp))+
geom_raster() +
scale_fill_viridis_c(option = "magma") +
scale_x_discrete(limits = str_to_lower(month.abb))
lex_temp_long |>
mutate(month = str_to_lower(month)) |>
ggplot(aes(YEAR, month, fill = temp))+
geom_raster() +
scale_fill_viridis_c(option = "magma") +
scale_y_discrete(limits = rev(str_to_lower(month.abb)))+
coord_fixed()
Separating
When there’s more than one value in a cell, we need to separate_
them. Let’s say three people described cities, and we recorded them in this dataframe.
tribble(
~person, ~city, ~description,
"joe", "Philadelphia, PA", "Gritty, weird",
"skylar", "Columbus, OH", "Cheap, easy, comfortable",
"robin", "Nashville, TN", "musical"
) -> cities
cities
# A tibble: 3 × 3
person city description
<chr> <chr> <chr>
1 joe Philadelphia, PA Gritty, weird
2 skylar Columbus, OH Cheap, easy, comfortable
3 robin Nashville, TN musical
The column city
contains two values, the city and the state, and the column description
contains between 1 and 3 values, depending on each person.
To separate the city
column into a separate city
and state
, we need to use separate_wider_delim()
.
cities |>
separate_wider_delim(
cols = city,
delim = ", ",
names = c("city", "state")
)
# A tibble: 3 × 4
person city state description
<chr> <chr> <chr> <chr>
1 joe Philadelphia PA Gritty, weird
2 skylar Columbus OH Cheap, easy, comfortable
3 robin Nashville TN musical
We’ll want to separate the description
column into different rows, which we can do with separate_longer_delim()
.
cities |>
separate_wider_delim(
cols = city,
delim = ", ",
names = c("city", "state")
) |>
separate_longer_delim(
cols = description,
delim = ", "
)
# A tibble: 6 × 4
person city state description
<chr> <chr> <chr> <chr>
1 joe Philadelphia PA Gritty
2 joe Philadelphia PA weird
3 skylar Columbus OH Cheap
4 skylar Columbus OH easy
5 skylar Columbus OH comfortable
6 robin Nashville TN musical