A&S500/Lin517 - Census Data

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
✔ purrr   1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Census

In order to load this data, you’ll have to download it yourself from the Lexington data hub.

Lexington data source

lex_data <- read_csv("data/Census_2020_-_Race_18_and_Over_by_Precinct.csv")

Here’s what the data looks like

head(lex_data) |> 
  rmarkdown::paged_table()

This data isn’t especially “tidy”. There is 1 row per precinct, and the column names are not descriptive. If you google the first column ID, you’ll find some tables that provide descriptive labels that look like this:

label	id
Total:	P0030001
Population of one race:	P0030002
White alone	P0030003
Black or African American alone	P0030004
American Indian and Alaska Native alone	P0030005
Asian alone	P0030006
Native Hawaiian and Other Pacific Islander alone	P0030007
Some Other Race alone	P0030008
Two or More Races:	P0030009
…	…

A few things to note:

We need to get these descriptive labels into the census data.
There are data columns associated with summary values (e.g. “Total:” as well as with unsummarized data.

Step 1 requires a “data dictionary”, which you can load with the following line of code¹:

source("https://jofrhwld.github.io/AandS500_2023/class_notes/2023-02-16_realmessy/data_dictionary.R")

race_id |> 
  rmarkdown::paged_table()

Some more things to note:

Not everyone who included “Black or African American” (for example) on the census is represented in the “Black or African American Alone” data.
The total number of people who included “Black or African American” on the census will have their data distributed across many different columns in lex_data.

race_id |> 
  filter(str_detect(label, "Black or African American")) |> 
  rmarkdown::paged_table()

There are 32 different data columns that include people who included “Black or African American” on the census.

Step 1: Pivot longer

We need to pivot lex_data long. We only want to pivot the data columns starting with "P003" long, so we’ll use starts_with() for that.

lex_data |> 
  pivot_longer(
    starts_with("P003"),
    names_to = "id",
    values_to = "n"
  ) ->
  lex_long

The dataframe lex_long now has 1 row for every P003* data column for every precinct.

Step 2: Joining

I planned ahead with the names_to="id" argument in pivot_longer() so that I could quickly join the race_id data dictionary onto it.

lex_long |> 
  left_join(race_id) |> 
  relocate(label, .before = id)->
  lex_labelled

Joining with `by = join_by(id)`

Now, we have descriptive labels on the data!

Step 3: Subdividing summary from detailed data.

One problem remaining is that there are still summary data rows in among our detailed data. The total population of each precinct is has a row, along with the detailed race identification. We’ll want to subdivide these out into different dataframes.

lex_labelled |> 
  filter(
    str_detect(label, "Total")
  ) ->
  lex_totals

lex_labelled |> 
  filter(
    str_detect(label, "Population")
  ) ->
  lex_popof

lex_labelled |> 
  filter(
    str_detect(label, "Total", negate = T),
    str_detect(label, "Population", negate = T)
  ) ->
  lex_detailed

Now, lex_detailed contains just the detailed information on what race people included in the census.

Step 4: String work

There’s a few more things we’ll want to do before we make the data as granular as possible.

When people included just one race on the census, the label has " alone" at the end, which we’ll want to remove to make the label align with the multiple race ids.
We’ll want to get a count of how many races people included before we break up the data.

For the first issue, we’ll use stringr::str_remove(), and for the second, we’ll use stringr::str_count() to count how many semicolons are in each label.

lex_detailed |> 
  mutate(
    label = str_remove(label, " alone"),
    n_id = str_count(label, ";") + 1,
    with_id = row_number()
  ) ->
  lex_detailed_tidy

Step 5: Separating long

Now, we’ll want to separate the label column long, with a new row for each label, separating by "; "

lex_detailed_tidy |> 
  separate_longer_delim(label, delim = "; ") ->
  lex_detailed_long

Some Dataviz

Now, a thing to keep in mind is that individual people are now double counted in lex_detailed_long. For example, before we separated long, there was just 1 row for the 35 people in Gibson Park who included both “White” and “Black or African American” on the census.

lex_detailed_tidy |> 
  select(NAME, label, n, n_id, with_id) |> 
  filter(
    label == "White; Black or African American",
    n_id == 2,
    NAME == "GIBSON PARK"
  ) |> 
  rmarkdown::paged_table()

After we separated long, all 35 of these people appear in both the “White” and the “Black or African American” rows.

lex_detailed_long |> 
  select(NAME, label, n, n_id, with_id) |> 
  filter(
    with_id == 511
  ) |> 
  rmarkdown::paged_table()

Keeping that in mind, this is the easiest format to count up everyone who included a specific race identity in the census, whether that was “alone” or along with multiple. We can also count up how many people included multiple race identities.

lex_detailed_long |> 
  # Count the total number of people
  # who included a race on the census
  # by the number of races they included
  summarise(
    .by = c(label, n_id),
    n = sum(n)
  ) |> 
  # Calculate the proportion of people
  # who put down N race ids 
  # by race id
  mutate(
    .by = label,
    prop = n/sum(n)
  ) |> 
  ggplot(aes(n_id, prop)) +
    geom_col(fill = "grey90") +
    ylim(0,1)+
    facet_wrap(~str_wrap(label, 15)) +
    theme(strip.text = element_text(size = 12))

For some race ids, the most common response was just to put down just 1, but for others, like American Indian and Alaska Native, the most common response was to include 2 race ids. If we had made maps or did other analyses, and used just the data from “American Indian and Alaska Native alone”, we would have dramatically undercounted all of the people who included American Indian and Alaska Native on their census.

Footnotes

I “baked this at home” with a lot of copy-pasting from a pdf.↩︎