More Text

Author

Josef Fruehwald

Published

March 30, 2023

Package setup

library(tidyverse)
library(tidytext)
library(gutenbergr)

“Book keeping” with text

{tidytext}’s tokenizers don’t operate across rows, so if you want to do a kind of tokenization where tokens might be split across rows, you’ll need to “flatten” the strings first.

Sentences example

tribble(
  ~text,
  "This is a sentence.",
  "This is",
  "another. And yet another!"
) |>
  unnest_tokens(
    sentences,
    text,
    token = "sentences"
  )
1. This creates a few sentences with weird line breaks between them.
2. This separates the text into a separate row for each token.
3. The new column will be called sentences.
4. The column we’re tokenizing is text.
5. The tokenizer is a sentence tokenizer.
# A tibble: 4 × 1
  sentences          
  <chr>              
1 this is a sentence.
2 this is            
3 another.           
4 and yet another!   
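Note that the sentences come back lowercased: unnest_tokens() lowercases tokens by default. Its to_lower argument can be set to FALSE to keep the original capitalization (a quick sketch on a toy string):

```r
library(tibble)
library(tidytext)

# to_lower = FALSE keeps the original capitalization
tibble(text = "This is a sentence.") |>
  unnest_tokens(
    sentences,
    text,
    token = "sentences",
    to_lower = FALSE
  )
# the sentences column contains "This is a sentence."
```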

The results aren’t properly tokenized sentences. To get that, we need to “flatten” the text.

tribble(
  ~text,
  "This is a sentence.",
  "This is",
  "another. And yet another!"
) |>
  summarise(
    text = str_flatten(text, collapse = " ")
  ) ->
  flattened

flattened
1. Same as above.
2. Taking multiple rows and condensing them down into one can be done with summarise().
3. str_flatten() takes all of the strings across the rows and collapses them together into one.
# A tibble: 1 × 1
  text                                                 
  <chr>                                                
1 This is a sentence. This is another. And yet another!
flattened |> 
  unnest_tokens(
    sentences, 
    text, 
    token = "sentences"
  )
# A tibble: 3 × 1
  sentences          
  <chr>              
1 this is a sentence.
2 this is another.   
3 and yet another!   

ngrams example

ngrams

A sequence of \(n\) tokens.

When talking about extracting “ngrams” from a text, we mean we’re extracting every sequence of \(n\) tokens.

tribble(
  ~text,
  "this is",
  "an example of it"
) ->
  to_ngramize
to_ngramize |> 
  unnest_ngrams(
    bigrams,
    text,
    n = 2
  )
1. We could also use unnest_tokens(), but unnest_ngrams() is more specific.
2. The new column will be called bigrams.
3. n is the number of tokens we want to include in a sequence.
# A tibble: 4 × 1
  bigrams   
  <chr>     
1 this is   
2 an example
3 example of
4 of it     

The example above is missing the bigram “is an”, though, because it spans the two rows. To get it, we’ll need to glue the two rows together.

to_ngramize |> 
  summarise(
    text = str_flatten(text, collapse = " ")
  ) |>
  unnest_ngrams(
    bigrams,
    text,
    n = 2
  )
1. Same “flattening” we did above.
2. Same bigram unnesting we did above.
# A tibble: 5 × 1
  bigrams   
  <chr>     
1 this is   
2 is an     
3 an example
4 example of
5 of it     
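The same idea extends to longer sequences: setting n = 3 gives trigrams. A quick sketch on the same toy text:

```r
library(tibble)
library(tidytext)

# Trigrams (n = 3) from the flattened toy text
tibble(text = "this is an example of it") |>
  unnest_ngrams(
    trigrams,
    text,
    n = 3
  )
# 4 rows: "this is an", "is an example", "an example of", "example of it"
```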

Making a Jane Austen Autocomplete

This block filters all of Project Gutenberg’s metadata to get just the IDs of Jane Austen’s novels.

gutenberg_metadata |> 
  filter(
    str_detect(author, "Austen, Jane"),
    language == "en" 
  ) |> 
  select(gutenberg_id, title) |> 
  group_by(title) |> 
  filter(gutenberg_id == min(gutenberg_id),
         gutenberg_id < 2000) -> 
  jane_austen_df

This downloads all of her novels.

austen_works <- gutenberg_download(jane_austen_df$gutenberg_id, meta_fields = "title")

Flatten

austen_works |> 
  group_by(title) |> 
  summarise(text = str_flatten(text, collapse = " ")) ->
  austen_works_flat

Get the bigrams

austen_works_flat |> 
  unnest_ngrams(
    bigram,
    text, 
    n = 2
  ) ->
  austen_bigrams

Count the bigrams

austen_bigrams |> 
  count(bigram) |> 
  arrange(-n) ->
  austen_bigram_count

austen_bigram_count
# A tibble: 228,082 × 2
   bigram       n
   <chr>    <int>
 1 of the    3235
 2 to be     2927
 3 in the    2585
 4 it was    1851
 5 i am      1757
 6 of her    1578
 7 to the    1508
 8 she had   1500
 9 she was   1417
10 had been  1339
# … with 228,072 more rows

Separate into each word

austen_bigram_count |> 
  separate_wider_delim(
    bigram, 
    delim = " ",
    names = c("word1", "word2")) ->
  austen_bigram_sep

austen_bigram_sep |> 
  head()
# A tibble: 6 × 3
  word1 word2     n
  <chr> <chr> <int>
1 of    the    3235
2 to    be     2927
3 in    the    2585
4 it    was    1851
5 i     am     1757
6 of    her    1578

Now, we can get all of the words that follow “the”:

austen_bigram_sep |> 
  filter(word1 == "the")
# A tibble: 4,165 × 3
   word1 word2       n
   <chr> <chr>   <int>
 1 the   same      555
 2 the   first     451
 3 the   world     432
 4 the   house     375
 5 the   room      352
 6 the   whole     329
 7 the   other     328
 8 the   most      325
 9 the   very      298
10 the   subject   282
# … with 4,155 more rows

We can even get a random word, weighted by how often it appeared after “the”:

austen_bigram_sep |> 
  filter(word1 == "the") |> 
  sample_n(
    size = 1,
    weight = n
  )
# A tibble: 1 × 3
  word1 word2     n
  <chr> <chr> <int>
1 the   house   375
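This is ordinary weighted sampling: base R’s sample() with a prob argument does the same thing as sample_n() with weight. A sketch with made-up follower counts:

```r
# Hypothetical counts of words following "the" (not the real Austen counts)
word2 <- c("same", "first", "world")
n     <- c(555, 451, 432)

# Draw one word, with probability proportional to its count
sample(word2, size = 1, prob = n)
```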

We could write a little function to produce a whole string of tokens.

generate_sequence <- function(bigram_df, num = 10, first_word = "the"){
  out <- c(first_word)
  search_word <- first_word
  for(i in seq_len(num)){
    # Sample the next word, weighted by how often it followed the current one
    new_word <- bigram_df |> 
      filter(word1 == search_word) |> 
      sample_n(size = 1, weight = n) |> 
      pull(word2)
    out <- c(out, new_word)
    search_word <- new_word
  }
  return(out)
}
generate_sequence(austen_bigram_sep)
 [1] "the"     "dangers" "of"      "his"     "sister"  "turn"    "out"    
 [8] "of"      "mind"    "was"     "not"
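To turn the generated tokens back into a single string, we can reuse the same str_flatten() from earlier (a sketch, assuming generate_sequence() and austen_bigram_sep from above):

```r
library(stringr)

# Collapse the vector of generated tokens into one string
generate_sequence(austen_bigram_sep) |>
  str_flatten(collapse = " ")
```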