I had a lot of fun working with the XKCD color survey data, and I think I’ll keep messing around with it here and there in the future. But I also think I’ll want access to the full data. The tidytuesday data set was necessarily boiled down. The answers
data frame contained just one hex code associated with one color label, not every color label given to every hex code in the survey. The full data set is available as a sqlite dump linked to from the xkcd blog, so this post is just about how I set up access to it within my blog RStudio project.
Necessary Libraries
For this workflow, I used
Downloading the database
If you want to hang onto the tar file locally (in case of linkrot), you could download it to a real location, but I just want to send it to a temp file. You can get the temp directory for your R session with fs::path_temp()
, and I created the destination file using fs::path()
.
download.file(
"http://xkcd.com/color/colorsurvey.tar.gz",
fs::path(
fs::path_temp(),
"colorsurvey.tar.gz"
)
)
Then, it’s a matter of untaring it.
untar(
fs::path(fs::path_temp(), "colorsurvey.tar.gz"),
exdir = fs::path(fs::path_temp(), "colorsurvey")
)
You can get a list of the available files like so:
I’m not actually including this code in executable chunks to avoid downloading the tar file on every re-render, but dump_files
will look something like
[1] /var/folders/xyz/mainsurvey_sqldump.txt
[2] /var/folders/xyz/satfaces_sqldump.txt
It’s the mainsurvey_sqldump.txt
file that we want to process.
dump_files |>
keep(
~str_detect(.x, "mainsurvey")
) ->
survey_dump
Creating the local database
To set up the database, we need to read the sqldump file. Apparently there’s a sqlite command .read
that will do this, but I couldn’t figure out how to run it within a DBI or RSQLite function, so I have to use a system()
command.
First, I have decide where this database is going, and since I don’t want it to wind up being duplicated in every post directory, I’ll create a top level project directory called data
.
fs::dir_create(
here::here("data")
)
Then, I need to decide on the name of the database file.
local_sql_path <- here::here("data", "colors.sqlite")
I’ll glue my variables into the system command I want:
creation_command <- str_glue(
"sqlite3 {local_sql_path} < {survey_dump}"
)
And then finally run it.
system(creation)
Updating .gitignore
You’ll probably want to add *.sqlite
file to your gitignore, which you can either do do by hand or with usethis::use_git_ignore()
usethis::use_git_ignore("*.sqlite")
Looking at the database
Now we can connect to the database and look at it.
dbListTables(colordb)
[1] "answers" "names" "users"
We can access tables from the database like it’s a dataframe with tbl()
. It doesn’t actually read the whole thing into memory.
colors <- tbl(colordb, "answers")
colors
# Source: table<`answers`> [?? x 7]
# Database: sqlite 3.50.1 [/Users/joseffruehwald/Documents/blog/data/colors.sqlite]
id user_id datestamp r g b colorname
<int> <int> <dbl> <int> <int> <int> <chr>
1 1 1 1267418734 72 100 175 pastel blue
2 2 1 1267418739 204 177 246 faint violet
3 3 1 1267418769 182 226 245 baby blue
4 4 1 1267418773 130 64 234 purple
5 5 2 1267419006 75 49 234 blue
6 6 2 1267419010 76 215 249 light blue
7 7 2 1267419015 111 145 122 olive green
8 8 2 1267419019 88 70 1 brown
9 9 2 1267419021 218 35 156 pink
10 10 4 1267419023 154 42 159 purple
# ℹ more rows
Now, we can run the dplyr-like commands on this table thanks to dbplyr and only load the rows we’re interested in.
colors |>
# get just blue and light blue
filter(
colorname %in% c("blue", "light blue")
) |>
# manually get some hex values
mutate(
rh = sql("printf('%02X', r)"),
gh = sql("printf('%02X', g)"),
bh = sql("printf('%02X', b)")
) |>
mutate(
hex = sql("rh||gh||bh")
)
# Source: SQL [?? x 11]
# Database: sqlite 3.50.1 [/Users/joseffruehwald/Documents/blog/data/colors.sqlite]
id user_id datestamp r g b colorname rh gh bh hex
<int> <int> <dbl> <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
1 5 2 1267419006 75 49 234 blue 4B 31 EA 4B31EA
2 17 2 1267419032 41 201 234 blue 29 C9 EA 29C9EA
3 22 2 1267419040 73 97 236 blue 49 61 EC 4961EC
4 27 4 1267419062 33 115 229 blue 21 73 E5 2173E5
5 45 6 1267419091 14 47 164 blue 0E 2F A4 0E2FA4
6 49 4 1267419096 64 128 225 blue 40 80 E1 4080E1
7 57 7 1267419102 74 89 253 blue 4A 59 FD 4A59FD
8 87 6 1267419139 120 158 209 blue 78 9E D1 789ED1
9 100 7 1267419170 74 107 231 blue 4A 6B E7 4A6BE7
10 115 10 1267419193 95 196 210 blue 5F C4 D2 5FC4D2
# ℹ more rows
I’m kind of glad I messed around with manually converting the rgb values to hex values, cause it turns out I don’t know what the right way is to convert these rgb values to hex codes. My manual approach gives me #4B31EA for the first row, but using the colorspace package, I get two different hex codes depending on whether I assume r
, g
, b
are RGB or sRGB values.
swatchplot(
"RGB" = "#9479F6",
"sRGB" = "#4B31EA"
)
🤷♂️
dbDisconnect(colordb)
Reuse
Citation
@online{fruehwald2025,
author = {Fruehwald, Josef},
title = {Getting the Xkcd Color Survey Sqlite Database},
series = {Væl Space},
date = {2025-07-16},
url = {https://jofrhwld.github.io/blog/posts/2025/07/2025-07-16_color-sql/},
langid = {en}
}