Classifying documents, quanteda

Author

Josef Fruehwald

Published

April 11, 2023

Setup

New Install

if(!require(tm)){
  install.packages("tm")
  library(tm)
}
if(!require(lsa)){
  install.packages("lsa")
  library(lsa)
}
if(!require(quanteda)){
  install.packages("quanteda")
  library(quanteda)
}
if(!require(quanteda.textstats)){
  install.packages("quanteda.textstats")
  library(quanteda.textstats)
}
if(!require(quanteda.corpora)){
  remotes::install_github("quanteda/quanteda.corpora")
  library(quanteda.corpora)
}
if(!require(quanteda.tidy)){
  remotes::install_github("quanteda/quanteda.tidy")
  library(quanteda.tidy)
}
if(!require(rainette)){
  install.packages("rainette")
  library(rainette)
}

if(!require(GGally)){
  install.packages("GGally")
  library(GGally)
}

Loading

library(tidyverse)
library(tidytext)
library(gutenbergr)

Prepping data

frank <- gutenberg_download(84)

frank |> 
  mutate(                
    text = str_squish(text),
    section = case_when(
      str_detect(text, "^Letter \\d+$") ~ text,
      str_detect(text, "^Chapter \\d+$") ~ text
    )
  ) |>             
  fill(section) |>
  filter(text != section,
         !is.na(section),
         str_length(text) > 0) ->
  frankenstein
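The fill(section) step is doing the heavy lifting here: it carries the most recent section heading down to every following line. A minimal base-R sketch of the same idea (assuming, as in the pipeline above, that the first element is non-missing):

```r
# What tidyr::fill() does here: carry the most recent
# non-NA section label down to later rows.
section <- c("Letter 1", NA, NA, "Chapter 1", NA)
# index of the last non-NA value at or before each position
idx <- cumsum(!is.na(section))
filled <- section[!is.na(section)][idx]
filled
```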

frankenstein |> 
  unnest_tokens(
    word,
    text,
    token = "words"
  ) |> 
  mutate(word = str_remove_all(word, "_"))->
  frank_tokens
frank_tokens |> 
  slice(1:50) |> 
  rmarkdown::paged_table()

The quanteda-Verse

While tidytext is part of the larger “tidyverse” of packages, there is also another -verse of packages for analyzing text in R: the quanteda-verse.

Quanteda corpus

summary(data_corpus_inaugural, n = 10)
Corpus consisting of 59 documents, showing 10 documents:

            Text Types Tokens Sentences Year  President   FirstName
 1789-Washington   625   1537        23 1789 Washington      George
 1793-Washington    96    147         4 1793 Washington      George
      1797-Adams   826   2577        37 1797      Adams        John
  1801-Jefferson   717   1923        41 1801  Jefferson      Thomas
  1805-Jefferson   804   2380        45 1805  Jefferson      Thomas
    1809-Madison   535   1261        21 1809    Madison       James
    1813-Madison   541   1302        33 1813    Madison       James
     1817-Monroe  1040   3677       121 1817     Monroe       James
     1821-Monroe  1259   4886       131 1821     Monroe       James
      1825-Adams  1003   3147        74 1825      Adams John Quincy
                 Party
                  none
                  none
            Federalist
 Democratic-Republican
 Democratic-Republican
 Democratic-Republican
 Democratic-Republican
 Democratic-Republican
 Democratic-Republican
 Democratic-Republican

Being “tidy”

There is some “tidy” support for quanteda objects via the quanteda.tidy package.

data_corpus_inaugural |> 
  mutate(decade = floor(Year/10) * 10) |> 
  summary(n = 10)
Corpus consisting of 59 documents, showing 10 documents:

            Text Types Tokens Sentences Year  President   FirstName
 1789-Washington   625   1537        23 1789 Washington      George
 1793-Washington    96    147         4 1793 Washington      George
      1797-Adams   826   2577        37 1797      Adams        John
  1801-Jefferson   717   1923        41 1801  Jefferson      Thomas
  1805-Jefferson   804   2380        45 1805  Jefferson      Thomas
    1809-Madison   535   1261        21 1809    Madison       James
    1813-Madison   541   1302        33 1813    Madison       James
     1817-Monroe  1040   3677       121 1817     Monroe       James
     1821-Monroe  1259   4886       131 1821     Monroe       James
      1825-Adams  1003   3147        74 1825      Adams John Quincy
                 Party decade
                  none   1780
                  none   1790
            Federalist   1790
 Democratic-Republican   1800
 Democratic-Republican   1800
 Democratic-Republican   1800
 Democratic-Republican   1810
 Democratic-Republican   1810
 Democratic-Republican   1820
 Democratic-Republican   1820
data_corpus_inaugural |> 
  slice(1:10)
Corpus consisting of 10 documents and 4 docvars.
1789-Washington :
"Fellow-Citizens of the Senate and of the House of Representa..."

1793-Washington :
"Fellow citizens, I am again called upon by the voice of my c..."

1797-Adams :
"When it was first perceived, in early times, that no middle ..."

1801-Jefferson :
"Friends and Fellow Citizens: Called upon to undertake the du..."

1805-Jefferson :
"Proceeding, fellow citizens, to that qualification which the..."

1809-Madison :
"Unwilling to depart from examples of the most revered author..."

[ reached max_ndoc ... 4 more documents ]
data_corpus_inaugural |> 
  filter(Year >= 1960, Year <= 1980)
Corpus consisting of 5 documents and 4 docvars.
1961-Kennedy :
"Vice President Johnson, Mr. Speaker, Mr. Chief Justice, Pres..."

1965-Johnson :
"My fellow countrymen, on this occasion, the oath I have take..."

1969-Nixon :
"Senator Dirksen, Mr. Chief Justice, Mr. Vice President, Pres..."

1973-Nixon :
"Mr. Vice President, Mr. Speaker, Mr. Chief Justice, Senator ..."

1977-Carter :
"For myself and for our Nation, I want to thank my predecesso..."

Tidy to Quanteda

frankenstein |> 
  mutate(text = str_remove_all(text, "_")) |> 
  group_by(section) |> 
  summarise(text = str_flatten(text, collapse = " ")) |> 
  # corpus() here is quanteda::corpus(),
  # so we need to quote column names
  corpus(
    docid_field = "section", 
    text_field = "text"
  )->
  frank_corpus
frank_corpus |> 
  summary(n = 4)
Corpus consisting of 28 documents, showing 4 documents:

       Text Types Tokens Sentences
  Chapter 1   765   1955        75
 Chapter 10   917   2745       129
 Chapter 11   921   3309       111
 Chapter 12   754   2361        90

Tokenizing

quanteda tokenizing is done with quanteda::tokens().

data_corpus_inaugural |> 
  tokens(remove_punct = TRUE) ->
  inaugural_tokens

There are also many functions for operating on tokens objects, all prefixed with tokens_.

inaugural_tokens |> 
  textstat_collocations(size = 3) |> 
  slice(1:50) |> 
  rmarkdown::paged_table()

Lowercasing, dropping stop words, and stemming

inaugural_tokens |> 
  tokens_tolower() |> 
  tokens_remove(pattern = stopwords()) |> 
  tokens_wordstem() ->
  inaugural_stem
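To see what each step does, here is a rough base-R sketch on a toy token vector. The stopword list and the suffix rule below are illustrative stand-ins for quanteda's stopwords() and the Snowball stemmer behind tokens_wordstem():

```r
# A rough base-R sketch of the lowercase -> stopword -> stem pipeline.
toks <- c("The", "Nations", "are", "calling")
toks <- tolower(toks)
stops <- c("the", "are", "a", "of")   # toy stand-in for stopwords()
toks <- toks[!toks %in% stops]
toks <- sub("(ing|s)$", "", toks)     # crude stand-in for real stemming
toks
```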

Describing documents

We’ve looked before at how individual data points can be described along multiple dimensions, and at how those dimensions relate to one another.

plotting code
library(palmerpenguins)
penguins |> 
  select(
    species,
    ends_with("_mm"),
    ends_with("_g")
  ) |> 
  drop_na() |> 
  ggpairs(aes(color = species))+
    khroma::scale_color_bright()+
    khroma::scale_fill_bright()+
    theme(strip.text = element_text(size = 6))

Dimensions of documents

The way fields like text mining and natural language processing give documents “dimensions” is to count how often each word appears in each document, creating what’s called a “document-term matrix.”

frank_tokens |> 
  count(section, word) |> 
  pivot_wider(names_from = word, values_from = n) |> 
  select(1:30) |> 
  rmarkdown::paged_table()
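The pivot above is really just a cross-tabulation of documents by words, which base R's table() makes explicit on a toy example:

```r
# A document-term matrix is a cross-tabulation of documents by words.
doc  <- c("d1", "d1", "d1", "d2", "d2")
word <- c("the", "the", "monster", "the", "ship")
dtm <- table(doc, word)
dtm
```

Cells count how often each word occurs in each document; words absent from a document get 0.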

Document Term Matrix in quanteda

inaugural_stem |> 
  dfm() ->
  inaugural_dfm
inaugural_dfm
Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
                 features
docs              fellow-citizen senat hous repres among vicissitud incid life
  1789-Washington              1     1    2      2     1          1     1    1
  1793-Washington              0     0    0      0     0          0     0    0
  1797-Adams                   3     1    3      3     4          0     0    2
  1801-Jefferson               2     0    0      1     1          0     0    1
  1805-Jefferson               0     0    0      0     7          0     0    2
  1809-Madison                 1     0    0      1     0          1     0    1
                 features
docs              event fill
  1789-Washington     2    1
  1793-Washington     0    0
  1797-Adams          0    0
  1801-Jefferson      0    0
  1805-Jefferson      1    0
  1809-Madison        0    1
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
inaugural_dfm |> 
  dfm_tfidf()->
  inaugural_tfidf
inaugural_tfidf
Document-feature matrix of: 59 documents, 5,458 features (89.34% sparse) and 4 docvars.
                 features
docs              fellow-citizen    senat     hous    repres     among
  1789-Washington      0.4920984 0.624724 1.249448 0.8972654 0.1373836
  1793-Washington      0         0        0        0         0        
  1797-Adams           1.4762952 0.624724 1.874172 1.3458982 0.5495342
  1801-Jefferson       0.9841968 0        0        0.4486327 0.1373836
  1805-Jefferson       0         0        0        0         0.9616849
  1809-Madison         0.4920984 0        0        0.4486327 0        
                 features
docs              vicissitud    incid       life     event      fill
  1789-Washington  0.9927008 0.925754 0.08065593 0.9841968 0.5947608
  1793-Washington  0         0        0          0         0        
  1797-Adams       0         0        0.16131186 0         0        
  1801-Jefferson   0         0        0.08065593 0         0        
  1805-Jefferson   0         0        0.16131186 0.4920984 0        
  1809-Madison     0.9927008 0        0.08065593 0         0.5947608
[ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,448 more features ]
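With its default settings, dfm_tfidf() weights each raw count by a base-10 inverse document frequency, i.e. count × log10(ndoc / docfreq) (see ?dfm_tfidf for the scheme arguments). A hand computation of that weighting on a toy matrix:

```r
# Toy counts for two terms across three documents.
counts <- matrix(c(1, 0, 3,    # "senat": in 2 of 3 documents
                   2, 2, 2),   # "life": in every document
                 nrow = 3,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("senat", "life")))
docfreq <- colSums(counts > 0)             # documents containing each term
idf <- log10(nrow(counts) / docfreq)       # base-10 inverse document frequency
tfidf <- sweep(counts, 2, idf, "*")        # count * idf
tfidf
```

Note that "life", appearing in every document, gets a weight of 0 everywhere: tf-idf down-weights words that don't distinguish documents.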

Similarity

inaugural_tfidf |> 
  textstat_simil(method = "cosine") |> 
  as_tibble() |> 
  arrange(desc(cosine)) 
# A tibble: 1,711 × 3
   document1     document2     cosine
   <fct>         <fct>          <dbl>
 1 1817-Monroe   1821-Monroe    0.367
 2 1837-VanBuren 1841-Harrison  0.317
 3 1841-Harrison 1845-Polk      0.302
 4 2001-Bush     2021-Biden     0.292
 5 1897-McKinley 1909-Taft      0.290
 6 1837-VanBuren 1853-Pierce    0.289
 7 1825-Adams    1845-Polk      0.289
 8 1841-Harrison 1857-Buchanan  0.289
 9 1845-Polk     1857-Buchanan  0.288
10 1889-Harrison 1897-McKinley  0.285
# ℹ 1,701 more rows
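method = "cosine" measures the angle between two documents' tf-idf row vectors: the dot product divided by the product of the vector norms. A base-R sketch:

```r
# Cosine similarity between two numeric vectors.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

a  <- c(1, 2, 0, 3)
b  <- c(2, 4, 0, 6)   # same direction as a, just scaled
c2 <- c(0, 0, 5, 0)   # no overlapping terms with a
cosine_sim(a, b)      # 1: identical direction
cosine_sim(a, c2)     # 0: no shared terms
```

Because it only depends on direction, cosine similarity is insensitive to overall document length, which is why it's a common choice for comparing documents.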

Clustering

We can cluster documents over a distance matrix.

inaugural_tfidf |> 
  textstat_dist() |> 
  # moving over to base R!
  as.dist() |> 
  hclust() -> 
  inaugural_clust1
plot(inaugural_clust1)
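The same dist() → hclust() workflow on a toy matrix makes the behavior easy to check: two tight pairs of points should come out as two clusters.

```r
# Hierarchical clustering on a toy distance matrix.
m <- rbind(a = c(0, 0),
           b = c(0, 1),
           c = c(10, 10),
           d = c(10, 11))
toy_clust <- hclust(dist(m))
cutree(toy_clust, k = 2)   # a and b in one cluster, c and d in the other
```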

A more interactive clustering

data_corpus_inaugural |> 
  split_segments() ->
  inaugural_corpus2 

inaugural_corpus2 |> 
  tokens(remove_punct = TRUE) |> 
  tokens_tolower() |> 
  tokens_remove(pattern = stopwords()) |> 
  tokens_wordstem() ->
  inaugural_tokens2

inaugural_tokens2 |> 
  dfm() |> 
  dfm_trim(min_docfreq = 10)->
  inaugural_dfm2
clust2 <- rainette(inaugural_dfm2)
Warning in rainette(inaugural_dfm2): some documents don't have any term, they
won't be assigned to any cluster.
rainette_explor(clust2, inaugural_dfm2, inaugural_corpus2)
inaugural_corpus2$group <- cutree(clust2, 5)
inaugural_corpus2 |> 
  docvars() |> 
  drop_na() |> 
  count(Year, segment_source, group) |> 
  group_by(segment_source) |> 
  mutate(prop = n/sum(n)) |> 
  ggplot(aes(factor(group), prop))+
    geom_col()+
    facet_wrap(~segment_source)+
    theme(strip.text = element_text(size = 8))