Getting the xkcd color survey sqlite database

Josef Fruehwald

doi:10.59350/r7fpp-kxk10

I had a lot of fun working with the XKCD color survey data, and I think I’ll keep messing around with it here and there in the future. But I also think I’ll want access to the full data. The tidytuesday data set was necessarily boiled down. The answers data frame contained just one hex code associated with one color label, not every color label given to every hex code in the survey. The full data set is available as a sqlite dump linked to from the xkcd blog, so this post is just about how I set up access to it within my blog RStudio project.

# basic setup
source(here::here("_defaults.R"))
library(tidyverse)
library(colorspace)

Necessary Libraries

For this workflow, I used

fs: for working with file systems
DBI and RSQLite: for working with the sqlite database
dbplyr: for using dplyr commands on the database.

library(fs)
library(DBI)
library(RSQLite)
library(dbplyr)

Downloading the database

If you want to hang onto the tar file locally (in case of linkrot), you could download it to a real location, but I just want to send it to a temp file. You can get the temp directory for your R session with fs::path_temp(), and I created the destination file using fs::path().

download.file(
  "http://xkcd.com/color/colorsurvey.tar.gz",
  fs::path(
    fs::path_temp(),
    "colorsurvey.tar.gz"
  )
)

Then, it’s a matter of untaring it.

untar(
  fs::path(fs::path_temp(), "colorsurvey.tar.gz"),
  exdir = fs::path(fs::path_temp(), "colorsurvey")
)

You can get a list of the available files like so:

path(
  path_temp(),
  "colorsurvey"
) |>
  dir_ls() ->
  dump_files

I’m not actually including this code in executable chunks to avoid downloading the tar file on every re-render, but dump_files will look something like

[1] /var/folders/xyz/mainsurvey_sqldump.txt
[2] /var/folders/xyz/satfaces_sqldump.txt

It’s the mainsurvey_sqldump.txt file that we want to process.

dump_files |>
  keep(
    ~str_detect(.x, "mainsurvey")
  ) ->
  survey_dump

Creating the local database

To set up the database, we need to read the sqldump file. Apparently there’s a sqlite command .read that will do this, but I couldn’t figure out how to run it within a DBI or RSQLite function, so I have to use a system() command.

First, I have decide where this database is going, and since I don’t want it to wind up being duplicated in every post directory, I’ll create a top level project directory called data.

fs::dir_create(
  here::here("data")
)

Then, I need to decide on the name of the database file.

local_sql_path <- here::here("data", "colors.sqlite")

I’ll glue my variables into the system command I want:

creation_command <- str_glue(
  "sqlite3 {local_sql_path} < {survey_dump}"
)

And then finally run it.

system(creation_command)

Updating `.gitignore`

You’ll probably want to add *.sqlite file to your gitignore, which you can either do do by hand or with usethis::use_git_ignore()

usethis::use_git_ignore("*.sqlite")

Looking at the database

Now we can connect to the database and look at it.

colordb <- dbConnect(
  RSQLite::SQLite(), 
  here::here("data", "colors.sqlite")
)

dbListTables(colordb)

[1] "answers" "names"   "users"

We can access tables from the database like it’s a dataframe with tbl(). It doesn’t actually read the whole thing into memory.

colors <- tbl(colordb, "answers")
colors

# Source:   table<`answers`> [?? x 7]
# Database: sqlite 3.50.1 [/Users/joseffruehwald/Documents/blog/data/colors.sqlite]
      id user_id  datestamp     r     g     b colorname   
   <int>   <int>      <dbl> <int> <int> <int> <chr>       
 1     1       1 1267418734    72   100   175 pastel blue 
 2     2       1 1267418739   204   177   246 faint violet
 3     3       1 1267418769   182   226   245 baby blue   
 4     4       1 1267418773   130    64   234 purple      
 5     5       2 1267419006    75    49   234 blue        
 6     6       2 1267419010    76   215   249 light blue  
 7     7       2 1267419015   111   145   122 olive green 
 8     8       2 1267419019    88    70     1 brown       
 9     9       2 1267419021   218    35   156 pink        
10    10       4 1267419023   154    42   159 purple      
# ℹ more rows

Now, we can run the dplyr-like commands on this table thanks to dbplyr and only load the rows we’re interested in.

colors |> 
  # get just blue and light blue
  filter(
    colorname %in% c("blue", "light blue")
  ) |> 
  # manually get some hex values
  mutate(
    rh = sql("printf('%02X', r)"),
    gh = sql("printf('%02X', g)"),
    bh = sql("printf('%02X', b)")
  ) |> 
  mutate(
    hex = sql("rh||gh||bh")
  )

# Source:   SQL [?? x 11]
# Database: sqlite 3.50.1 [/Users/joseffruehwald/Documents/blog/data/colors.sqlite]
      id user_id  datestamp     r     g     b colorname rh    gh    bh    hex   
   <int>   <int>      <dbl> <int> <int> <int> <chr>     <chr> <chr> <chr> <chr> 
 1     5       2 1267419006    75    49   234 blue      4B    31    EA    4B31EA
 2    17       2 1267419032    41   201   234 blue      29    C9    EA    29C9EA
 3    22       2 1267419040    73    97   236 blue      49    61    EC    4961EC
 4    27       4 1267419062    33   115   229 blue      21    73    E5    2173E5
 5    45       6 1267419091    14    47   164 blue      0E    2F    A4    0E2FA4
 6    49       4 1267419096    64   128   225 blue      40    80    E1    4080E1
 7    57       7 1267419102    74    89   253 blue      4A    59    FD    4A59FD
 8    87       6 1267419139   120   158   209 blue      78    9E    D1    789ED1
 9   100       7 1267419170    74   107   231 blue      4A    6B    E7    4A6BE7
10   115      10 1267419193    95   196   210 blue      5F    C4    D2    5FC4D2
# ℹ more rows

I’m kind of glad I messed around with manually converting the rgb values to hex values, cause it turns out I don’t know what the right way is to convert these rgb values to hex codes. My manual approach gives me #4B31EA for the first row, but using the colorspace package, I get two different hex codes depending on whether I assume r, g, b are RGB or sRGB values.

RGB(
  75/255,
  49/255,
  234/255
) |> 
  hex()

[1] "#9479F6"

sRGB(
  75/255,
  49/255,
  234/255
) |> 
  hex()

[1] "#4B31EA"

swatchplot(
  "RGB" = "#9479F6",
  "sRGB" = "#4B31EA"
)

🤷‍♂️

dbDisconnect(colordb)

Reuse

CC-BY 4.0

Citation

BibTeX citation:

@online{fruehwald2025,
  author = {Fruehwald, Josef},
  title = {Getting the Xkcd Color Survey Sqlite Database},
  series = {Væl Space},
  date = {2025-07-16},
  url = {https://jofrhwld.github.io/blog/posts/2025/07/2025-07-16_color-sql/},
  doi = {10.59350/r7fpp-kxk10},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2025. “Getting the Xkcd Color Survey Sqlite Database.” Væl Space. July 16, 2025. https://doi.org/10.59350/r7fpp-kxk10.