library(tidyverse)
We did “strings”, but “text” is different.
String operations are things like “split a string into substrings.”
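For instance, a quick sketch of that kind of operation (the comma-separated string here is made up for illustration):

str_split_1("call,me,Ishmael", ",")
# [1] "call"    "me"      "Ishmael"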
String things
Here’s an example using directory names from the `course_notes` site:
library(fs)

fs::dir_ls("..") |>
  str_remove("../") ->
  notes_directory

tibble(
  notes = notes_directory
) ->
  notes_df

notes_df |>
  rmarkdown::paged_table()
1. This gets a list of all files and directories in the directory above.
2. This removes the `../` component of the directory names.
3. Putting it in a tibble so we can do tibble things.
| notes <chr> |
|---|
| 2023-01-10 |
| 2023-01-12_gh-onboarding |
| 2023-01-12_inclass |
| 2023-01-17_REco |
| 2023-01-17_RLang |
| 2023-01-19_viz |
| 2023-01-19_viz2 |
| 2023-01-23_viz3 |
| 2023-01-26_text-labels |
| 2023-01-31_theming |
notes_df |>
  filter(
    str_starts(notes, "2023")
  ) |>
  separate_wider_delim(
    notes,
    delim = "_",
    names = c("date", "name"),
    too_few = "align_start"
  ) |>
  rmarkdown::paged_table()
| date <chr> | name <chr> |
|---|---|
| 2023-01-10 | NA |
| 2023-01-12 | gh-onboarding |
| 2023-01-12 | inclass |
| 2023-01-17 | REco |
| 2023-01-17 | RLang |
| 2023-01-19 | viz |
| 2023-01-19 | viz2 |
| 2023-01-23 | viz3 |
| 2023-01-26 | text-labels |
| 2023-01-31 | theming |
Text Stuff
But text, or written language, is a bit different.
<- "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
sentence
<- tibble(
sentences_df sentence = sentence
)
|>
sentences_df ::paged_table() rmarkdown
| sentence <chr> |
|---|
| Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. |
How do we split this up? Are “words” just where the spaces go?
sentences_df |>
  separate_longer_delim(
    sentence,
    delim = " "
  ) |>
  rmarkdown::paged_table()
| sentence <chr> |
|---|
| Call |
| me |
| Ishmael. |
| Some |
| years |
| ago—never |
| mind |
| how |
| long |
| precisely—having |
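Splitting on spaces alone left the period stuck to “Ishmael.” and never split “ago—never” at all. One rough workaround (a sketch, not a real tokenizer) is to split on runs of non-letter characters instead, dropping any empty strings the split leaves behind:

sentences_df |>
  separate_longer_delim(
    sentence,
    # treat any run of non-letter characters as a delimiter
    delim = stringr::regex("[^\\p{L}]+")
  ) |>
  filter(sentence != "")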
What even is a word??
tribble(
  ~lexeme, ~baseform, ~contraction,
  "be",    "are",     "aren't",
  "be",    "is",      "isn't",
  "be",    "was",     "wasn't",
  "be",    "were",    "weren't",
  "do",    "do",      "don't",
  "do",    "does",    "doesn't",
  "do",    "did",     "didn't",
  "will",  "will",    "won't",
  "can",   "can",     "can't",
  "could", "could",   "couldn't",
  "would", "would",   "wouldn't"
) ->
  not_contractions
not_contractions |>
  rmarkdown::paged_table()
| lexeme <chr> | baseform <chr> | contraction <chr> |
|---|---|---|
| be | are | aren't |
| be | is | isn't |
| be | was | wasn't |
| be | were | weren't |
| do | do | don't |
| do | does | doesn't |
| do | did | didn't |
| will | will | won't |
| can | can | can't |
| could | could | couldn't |
If we wanted to treat the contraction as something separate from the word it’s contracted onto…¹
not_contractions |>
  mutate(
    minus_contraction = str_remove(contraction, "n't")
  ) |>
  filter(
    baseform != minus_contraction
  )
# A tibble: 2 × 4
lexeme baseform contraction minus_contraction
<chr> <chr> <chr> <chr>
1 will will won't wo
2 can can can't ca
Usually, text we might want to analyze is caught up in a lot of markup!
tibble(
  website = read_lines("https://www.uky.edu/", n_max = 50)
) |>
  rmarkdown::paged_table()
| website <chr> |
|---|
| <!DOCTYPE html> |
| <html lang="en" dir="ltr"> |
| <head> |
| <meta charset="utf-8" /> |
| <link rel="canonical" href="https://www.uky.edu/" /> |
| <link rel="shortlink" href="https://www.uky.edu/" /> |
| <meta name="Generator" content="Drupal 9 (https://www.drupal.org)" /> |
| <meta name="MobileOptimized" content="width" /> |
| <meta name="HandheldFriendly" content="true" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
Or, heaven forbid, we want to analyze text that’s in a PDF.
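For PDFs, the readtext package (installed in the next section) can pull out the text layer. A minimal sketch, where "some_paper.pdf" is a stand-in file name:

library(readtext)

# returns a data frame with a doc_id column and a text column
readtext("some_paper.pdf")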
Getting started
Key libraries:
# tidytext
if(!require(tidytext)){
  install.packages("tidytext")
}

# gutenbergr
if(!require(gutenbergr)){
  install.packages("gutenbergr")
}

# readtext
if(!require(readtext)){
  install.packages("readtext")
}

# quanteda
if(!require(quanteda)){
  install.packages("quanteda")
}

# text
if(!require(text)){
  install.packages("text")
}
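Installing isn’t the same as attaching; we also need to load the two packages used directly below:

library(tidytext)
library(gutenbergr)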
Tidytext and tokenizing
We’ll be following *Text Mining with R*:

> A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
sentences_df |>
  unnest_tokens(text, sentence, token = "words")
# A tibble: 43 × 1
text
<chr>
1 call
2 me
3 ishmael
4 some
5 years
6 ago
7 never
8 mind
9 how
10 long
# … with 33 more rows
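Notice that unnest_tokens() lowercased everything (“ishmael”) and stripped the punctuation for us. If we wanted to keep the original casing, there’s a to_lower argument (the word tokenizer still drops punctuation):

sentences_df |>
  # to_lower = FALSE keeps "Call" and "Ishmael" capitalized
  unnest_tokens(text, sentence, token = "words", to_lower = FALSE)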
wuthering_heights <- gutenberg_download(768)
wuthering_heights |>
  slice(1:10) |>
  rmarkdown::paged_table()
| gutenberg_id <int> | text <chr> |
|---|---|
| 768 | Wuthering Heights |
| 768 | |
| 768 | by Emily Brontë |
| 768 | |
| 768 | |
| 768 | |
| 768 | |
| 768 | CHAPTER I |
| 768 | |
| 768 | |
We can unnest words
wuthering_heights |>
  unnest_tokens(word, text, token = "words") |>
  slice(1:10) |>
  rmarkdown::paged_table()
| gutenberg_id <int> | word <chr> |
|---|---|
| 768 | wuthering |
| 768 | heights |
| 768 | by |
| 768 | emily |
| 768 | brontë |
| 768 | chapter |
| 768 | i |
| 768 | 1801 |
| 768 | i |
| 768 | have |
We can unnest sentences
wuthering_heights |>
  summarise(
    one_big_line = str_flatten(text, collapse = " ")
  ) |>
  unnest_tokens(sentence, one_big_line, token = "sentences") |>
  slice(1:10) |>
  rmarkdown::paged_table()
| sentence <chr> |
|---|
| wuthering heights by emily brontë chapter i 1801—i have just returned from a visit to my landlord—the solitary neighbour that i shall be troubled with. |
| this is certainly a beautiful country! |
| in all england, i do not believe that i could have fixed on a situation so completely removed from the stir of society. |
| a perfect misanthropist’s heaven—and mr. |
| heathcliff and i are such a suitable pair to divide the desolation between us. |
| a capital fellow! |
| he little imagined how my heart warmed towards him when i beheld his black eyes withdraw so suspiciously under their brows, as i rode up, and when his fingers sheltered themselves, with a jealous resolution, still further in his waistcoat, as i announced my name. |
| “mr. |
| heathcliff?” |
| i said. |
We can unnest ngrams
wuthering_heights |>
  summarise(
    one_big_line = str_flatten(text, collapse = " ")
  ) |>
  unnest_tokens(sentence, one_big_line, token = "ngrams", n = 2) |>
  slice(1:10) |>
  rmarkdown::paged_table()
| sentence <chr> |
|---|
| wuthering heights |
| heights by |
| by emily |
| emily brontë |
| brontë chapter |
| chapter i |
| i 1801 |
| 1801 i |
| i have |
| have just |
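Bigrams start paying off once we count them. A quick sketch, reusing the flattening step from above:

wuthering_heights |>
  summarise(
    one_big_line = str_flatten(text, collapse = " ")
  ) |>
  unnest_tokens(bigram, one_big_line, token = "ngrams", n = 2) |>
  # sort = TRUE puts the most frequent bigrams on top
  count(bigram, sort = TRUE)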
The classic power law
We can do some basic lexicostatistics.
wuthering_heights |>
  unnest_tokens(word, text, token = "words") |>
  mutate(word = str_remove_all(word, "_")) |>
  count(word) |>
  mutate(
    rank = rank(
      desc(n),
      ties.method = "random"
    )
  ) ->
  wh_freqs
1. Unnesting each token onto its own line.
2. Removing underscores.
3. Counting how often each word appeared.
4. Adding on word ranks (most frequent = 1, second most frequent = 2, etc.).
wh_freqs |>
  ggplot(aes(rank, n)) +
    geom_point()
wh_freqs |>
  ggplot(aes(rank, n)) +
    geom_point() +
    scale_x_log10() +
    scale_y_log10()
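That straight line on the log-log plot is Zipf’s law: a word’s frequency is roughly proportional to 1 / rank. We can eyeball the exponent by fitting a line to the logged values (a rough sketch, not a serious model; the slope should land somewhere near -1):

lm(log10(n) ~ log10(rank), data = wh_freqs)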
Footnotes
1. We only got lucky here with “don’t”.↩︎