Lemmatizing and Stemming
What tokenizing does not get for you
Coming back to the example sentences from the first data processing lecture, properly tokenizing these sentences will only partly help us with our linguistic analysis.
```python
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from tabulate import tabulate

phrase = """The 2019 film CATS is a movie about cats.
Cats appear in every scene.
A cat can always be seen"""

# case folding
phrase_lower = phrase.lower()

# tokenization
tokens = word_tokenize(phrase_lower)

# counting
token_count = Counter(tokens)

# cat focus
cat_list = [[k, token_count[k]] for k in token_count if "cat" in k]

print(tabulate(cat_list, headers=["type", "count"]))
```
| type | count |
|---|---|
| cats | 3 |
| cat | 1 |
We’ve still got the plural `cats` being counted as a separate word from `cat`, which, for our weird use case, we don’t want. Our options here are to either “stem” or “lemmatize” our tokens.
Stemming
Stemming is focused on cutting off morphemes and, to some degree, providing a consistent stem across all types that share one, so the outcomes aren’t always recognizable words. The way it does this is entirely rule-based. For example, the first step of the Porter stemmer contains the following rewrite rules.
i. sses -> ss
ii. ies -> i
iii. ss -> ss
iv. s ->
When a word comes into the first step, if its end matches any of the left-hand sides, it gets rewritten as the right-hand side. If it could match multiple rules, the one with the longest match wins, so:

- “passes” matches i., so it gets rewritten as “pass”
- “pass” matches iii. and iv., but has the largest overlap with iii., so it gets rewritten as “pass”
- “parties” matches ii., so it gets rewritten as “parti”
- “pas” (as in “faux pas”) matches iv., so it gets rewritten as “pa”
- “cats” matches iv., so it gets rewritten as “cat”

This works basically correctly for the various /+z/ morphemes in English, but it overdoes it (“pas” should be left alone), and it produces some stems that don’t look like the actual root word (“parti” vs. “party”).
After this step, the stemmer applies many more hand-crafted rules (e.g. ational -> ate).
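To make the longest-match behavior concrete, here’s a minimal sketch of just that first step. It’s purely illustrative (the rule table and the `step_1a` function name are mine, not nltk’s), but it applies the four rules above in the same longest-match-wins fashion.

```python
# Toy version of the Porter stemmer's first step (illustrative only).
STEP_1A_RULES = [
    ("sses", "ss"),  # i.
    ("ies", "i"),    # ii.
    ("ss", "ss"),    # iii.
    ("s", ""),       # iv.
]

def step_1a(word):
    # Collect every rule whose left-hand side matches the end of the word,
    # then apply the one with the longest left-hand side.
    matches = [(old, new) for old, new in STEP_1A_RULES if word.endswith(old)]
    if not matches:
        return word
    old, new = max(matches, key=lambda rule: len(rule[0]))
    return word[:len(word) - len(old)] + new

for w in ["passes", "pass", "parties", "pas", "cats"]:
    print(w, "->", step_1a(w))
```

The full stemmers in nltk apply all of the steps, and can be run over our tokens directly: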
```python
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

p_stemmer = PorterStemmer()
p_stemmed = [p_stemmer.stem(t) for t in tokens]

for t in p_stemmed:
    print(f"`{t}` |", end=" ")
```
`the` | `2019` | `film` | `cat` | `is` | `a` | `movi` | `about` | `cat` | `.` | `cat` | `appear` | `in` | `everi` | `scene` | `.` | `a` | `cat` | `can` | `alway` | `be` | `seen` |
```python
s_stemmer = SnowballStemmer("english")
s_stemmed = [s_stemmer.stem(t) for t in tokens]

for t in s_stemmed:
    print(f"`{t}` |", end=" ")
```
`the` | `2019` | `film` | `cat` | `is` | `a` | `movi` | `about` | `cat` | `.` | `cat` | `appear` | `in` | `everi` | `scene` | `.` | `a` | `cat` | `can` | `alway` | `be` | `seen` |
Just to focus on how the stemmers operate over a specific paradigm:
= ["cry", "cries", "crying", "cried", "crier"]
cry
print(
tabulate(for c in cry],
[[c, s_stemmer.stem(c)] =["token", "stem"]
headers
) )
| token | stem |
|---|---|
| cry | cri |
| cries | cri |
| crying | cri |
| cried | cri |
| crier | crier |
Also, when inflectional morphology changes the stem itself (like the vowel change in “ran”), the stemmer won’t undo it.
= ["run", "runs", "running", "ran", "runner"]
run
print(
tabulate(for r in run],
[[r, s_stemmer.stem(r)] =["token", "stem"]
headers
) )
| token | stem |
|---|---|
| run | run |
| runs | run |
| running | run |
| ran | ran |
| runner | runner |
Lemmatizing
Lemmatizing involves a more complex morphological analysis of words, and as such requires language-specific models to work.
nltk lemmatizing
nltk uses WordNet for its English lemmatizing. WordNet is a large, hand-annotated database of lexical relations whose development began in the 1980s. The lemmatizer’s outputs are always recognizable words.
```python
wnl = nltk.WordNetLemmatizer()

print(
    tabulate(
        [[c, wnl.lemmatize(c)] for c in cry],
        headers=["token", "lemma"]
    )
)
```
| token | lemma |
|---|---|
| cry | cry |
| cries | cry |
| crying | cry |
| cried | cried |
| crier | crier |
```python
print(
    tabulate(
        [[r, wnl.lemmatize(r)] for r in run],
        headers=["token", "lemma"]
    )
)
```
| token | lemma |
|---|---|
| run | run |
| runs | run |
| running | running |
| ran | ran |
| runner | runner |
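One likely reason `cried`, `running`, and `ran` come through unchanged is that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless you pass a part of speech. Here’s a small sketch; the word list is just my own example.

```python
# WordNetLemmatizer defaults to pos="n"; telling it these are verbs
# changes the analysis.
for word in ["cried", "running", "ran"]:
    print(word, "->", wnl.lemmatize(word, pos="v"))
```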
spaCy lemmatizing
spaCy has a number of models that do lemmatizing. The spaCy documentation lists WordNet along with a few other data sources for its English models.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
```
= nlp(" ".join(cry))
doc print(
tabulate(for c in doc],
[[c.text, c.lemma_] =["token", "lemma"]
headers
) )
| token | lemma |
|---|---|
| cry | cry |
| cries | cry |
| crying | cry |
| cried | cry |
| crier | crier |
= nlp(" ".join(run))
doc print(
tabulate(for r in doc],
[[r.text, r.lemma_] =["token", "lemma"]
headers
) )
| token | lemma |
|---|---|
| run | run |
| runs | run |
| running | run |
| ran | run |
| runner | runner |
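Part of why spaCy does better on `cried` and `ran` is that its lemmatizer runs after the part-of-speech tagger in the pipeline, so every token already has a tag attached when it gets lemmatized. Here’s a quick sketch that also prints the tag; the extra `pos` column is just my addition, and it assumes `en_core_web_sm` is installed.

```python
# Each spaCy token carries a part-of-speech tag assigned earlier in the
# pipeline, which the lemmatizer can make use of.
doc = nlp(" ".join(run))
print(
    tabulate(
        [[t.text, t.pos_, t.lemma_] for t in doc],
        headers=["token", "pos", "lemma"]
    )
)
```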
The use of lemmatizing and stemming
For a lot of the NLP tasks we’re going to be learning about, lemmatizing and stemming don’t factor in as part of the pre-processing pipeline. However, they’re useful tools to have handy when doing linguistic analyses. For example, for all of the importance of “word frequency” in the linguistics literature, there’s often not much clarity about how the text was pre-processed to get those word frequencies.
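To see how much that choice matters, here’s a small sketch (reusing the objects defined above) that counts “cat” three different ways. The exact numbers aren’t the point; the point is that “the frequency of cat” isn’t a single number until you say whether you counted raw tokens, stems, or lemmas.

```python
# "How frequent is cat?" depends entirely on the pre-processing choices.
raw_counts = Counter(tokens)
stem_counts = Counter(s_stemmer.stem(t) for t in tokens)
lemma_counts = Counter(t.lemma_ for t in nlp(phrase_lower))

print("raw tokens:", raw_counts["cat"])
print("stems:     ", stem_counts["cat"])
print("lemmas:    ", lemma_counts["cat"])
```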