Package setup

library(tidyverse)
library(tidytext)
library(gutenbergr)

“Book keeping” with text
{tidytext}’s tokenizers don’t operate across rows, so if there’s some kind of tokenization you want to do where tokens might be split across rows, you’ll need to “flatten” the strings.
Sentences example
tribble(
  ~text,
  "This is a sentence.",
  "This is",
  "another. And yet another!"
) |>
  unnest_tokens(
    sentences,
    text,
    token = "sentences"
  )
1. This creates a few sentences with weird line breaks between them.
2. This separates text out into separate rows for each token.
3. The new column will be called sentences.
4. The column we’re tokenizing is text.
5. The tokenizer is a sentence tokenizer.
# A tibble: 4 × 1
sentences
<chr>
1 this is a sentence.
2 this is
3 another.
4 and yet another!
The results aren’t properly tokenized sentences. To get that, we need to “flatten” the text.
tribble(
  ~text,
  "This is a sentence.",
  "This is",
  "another. And yet another!"
) |>
  summarise(
    text = str_flatten(text, collapse = " ")
  ) ->
  flattened

flattened
1. Same as above.
2. Taking multiple rows and condensing them down into one can be done with summarise().
3. str_flatten() will take all of the strings in the column and collapse them together into one.
# A tibble: 1 × 1
text
<chr>
1 This is a sentence. This is another. And yet another!
flattened |>
  unnest_tokens(
    sentences,
    text,
    token = "sentences"
  )
# A tibble: 3 × 1
sentences
<chr>
1 this is a sentence.
2 this is another.
3 and yet another!
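Under the hood, this flattening step is just string concatenation. A small base-R sketch of the same idea, using paste() in place of str_flatten():

```r
# Base-R equivalent of the str_flatten() step above:
lines <- c("This is a sentence.", "This is", "another. And yet another!")
flattened_text <- paste(lines, collapse = " ")
flattened_text
#> [1] "This is a sentence. This is another. And yet another!"
```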
ngrams example
- ngrams
-
A sequence of \(n\) tokens.
When talking about extracting “ngrams” from a text, we mean we’re extracting every sequence of \(n\) tokens.
tribble(
  ~text,
  "this is",
  "an example of it"
) ->
  to_ngramize

to_ngramize |>
  unnest_ngrams(
    bigrams,
    text,
    n = 2
  )
1. We could also use unnest_tokens(), but unnest_ngrams() is more specific.
2. The new column will be called bigrams.
3. n is the number of tokens we want to include in a sequence.
# A tibble: 4 × 1
bigrams
<chr>
1 this is
2 an example
3 example of
4 of it
The example above is missing the bigram "is an", though. To get it, we’ll need to glue the two rows together.
to_ngramize |>
  summarise(
    text = str_flatten(text, collapse = " ")
  ) |>
  unnest_ngrams(
    bigrams,
    text,
    n = 2
  )
1. Same “flattening” we did above.
2. Same bigram unnesting we did above.
# A tibble: 5 × 1
bigrams
<chr>
1 this is
2 is an
3 an example
4 example of
5 of it
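For intuition, here’s a rough base-R sketch of what bigram extraction amounts to (split the text into words, then pair each word with its successor). This is just an illustration, not how unnest_ngrams() is actually implemented:

```r
# Split the flattened text into words, then pair each word
# with the word that follows it:
words <- strsplit("this is an example of it", " ")[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
#> [1] "this is"    "is an"      "an example" "example of" "of it"
```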
Making a Jane Austen Autocomplete
This block filters all of Project Gutenberg’s metadata to get just the IDs of Jane Austen’s novels.
gutenberg_metadata |>
  filter(
    str_detect(author, "Austen, Jane"),
    language == "en"
  ) |>
  select(gutenberg_id, title) |>
  group_by(title) |>
  filter(
    gutenberg_id == min(gutenberg_id),
    gutenberg_id < 2000
  ) ->
  jane_austen_df
This downloads all of her novels.
austen_works <- gutenberg_download(jane_austen_df$gutenberg_id, meta_fields = "title")
Flatten
austen_works |>
  group_by(title) |>
  summarise(text = str_flatten(text, collapse = " ")) ->
  austen_works_flat
Get the bigrams
austen_works_flat |>
  unnest_ngrams(
    bigram,
    text,
    n = 2
  ) ->
  austen_bigrams
Count the bigrams
austen_bigrams |>
  count(bigram) |>
  arrange(-n) ->
  austen_bigram_count
austen_bigram_count
# A tibble: 228,082 × 2
bigram n
<chr> <int>
1 of the 3235
2 to be 2927
3 in the 2585
4 it was 1851
5 i am 1757
6 of her 1578
7 to the 1508
8 she had 1500
9 she was 1417
10 had been 1339
# … with 228,072 more rows
Separate into each word
austen_bigram_count |>
  separate_wider_delim(
    bigram,
    delim = " ",
    names = c("word1", "word2")
  ) ->
  austen_bigram_sep
austen_bigram_sep |>
  head()
# A tibble: 6 × 3
word1 word2 n
<chr> <chr> <int>
1 of the 3235
2 to be 2927
3 in the 2585
4 it was 1851
5 i am 1757
6 of her 1578
Now, we can get all of the words that follow “the”.
austen_bigram_sep |>
  filter(word1 == "the")
# A tibble: 4,165 × 3
word1 word2 n
<chr> <chr> <int>
1 the same 555
2 the first 451
3 the world 432
4 the house 375
5 the room 352
6 the whole 329
7 the other 328
8 the most 325
9 the very 298
10 the subject 282
# … with 4,155 more rows
We can even get a random word, weighted by how often it appeared after “the”.
austen_bigram_sep |>
  filter(word1 == "the") |>
  sample_n(
    size = 1,
    weight = n
  )
# A tibble: 1 × 3
word1 word2 n
<chr> <chr> <int>
1 the house 375
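sample_n(size = 1, weight = n) is doing weighted random sampling. The same idea in base R with sample(), using a few counts copied from the table above as illustration:

```r
# Weighted sampling with base R's sample(); words and counts here
# are just the top few rows of the "the"-filtered table above.
words  <- c("same", "first", "world")
counts <- c(555, 451, 432)
set.seed(2024)  # arbitrary seed, only so the draw is reproducible
sample(words, size = 1, prob = counts)
```

Words with larger counts are proportionally more likely to be drawn; sample() normalizes the prob weights for us.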
We could write a little function to produce a whole string of tokens.
generate_sequence <- function(bigram_df, num = 10, first_word = "the"){
  out <- c(first_word)
  search_word <- first_word
  for(i in 1:num){
    new_word <- bigram_df |>
      filter(word1 == search_word) |>
      sample_n(size = 1, weight = n) |>
      pull(word2)
    out <- c(out, new_word)
    search_word <- new_word
  }
  return(out)
}
generate_sequence(austen_bigram_sep)
[1] "the" "dangers" "of" "his" "sister" "turn" "out"
[8] "of" "mind" "was" "not"
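The function returns a character vector of tokens, so collapsing it back into a single string is one more paste() (or str_flatten()) away. A small sketch, with a hard-coded token vector standing in for the function’s random output:

```r
# Stand-in for one run of generate_sequence(); the real output is random.
tokens <- c("the", "dangers", "of", "his", "sister")
sentence <- paste(tokens, collapse = " ")
sentence
#> [1] "the dangers of his sister"
```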