Lemmatizing and Stemming
What tokenizing does not get for you
Coming back to the example sentences from the first data processing lecture, properly tokenizing these sentences will only partly help us with our linguistic analysis.
```python
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from tabulate import tabulate

phrase = """The 2019 film CATS is a movie about cats.
Cats appear in every scene.
A cat can always be seen"""

# case folding
phrase_lower = phrase.lower()

# tokenization
tokens = word_tokenize(phrase_lower)

# counting
token_count = Counter(tokens)

# cat focus
cat_list = [[k, token_count[k]] for k in token_count if "cat" in k]

print(tabulate(cat_list, headers=["type", "count"]))
```
| type | count |
|---|---|
| cats | 3 |
| cat | 1 |
We’ve still got the plural `cats` being counted as a separate word from `cat`, which, for our weird use case, we don’t want. Our options here are to either “stem” or “lemmatize” our tokens.
Stemming
Stemming is focused on cutting off morphemes and, to some degree, providing a consistent stem across all types that share one, so the outcomes aren’t always recognizable words. The way it does this is entirely rule-based. For example, the first step of the Porter stemmer contains the following rewrite rules.
i. sses -> ss
ii. ies -> i
iii. ss -> ss
iv. s ->
When a word comes into the first step, if its end matches any of the left-hand sides, it gets rewritten as the right-hand side. If it could match multiple rules, the one with the longest match wins, so:

- “passes” matches i., so it gets rewritten as “pass”
- “pass” matches iii. and iv., but has the largest overlap with iii., so it gets rewritten as “pass”
- “parties” matches ii., so it gets rewritten as “parti”
- “pas” (as in “faux pas”) matches iv., so it gets rewritten as “pa”
- “cats” matches iv., so it gets rewritten as “cat”

This works basically correctly for the various /+z/ morphemes in English, but it overdoes it (“pas” should be left alone), and it produces some stems that don’t look like the actual root word (“parti” vs. “party”).
After this step, the stemmer applies many more hand-crafted rules (e.g. ational -> ate).
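To make the longest-match behavior concrete, here’s a minimal sketch of just that first step. It’s purely illustrative (the rule table and the `step_1a` function name are mine, not nltk’s), but it applies the four rules above in the same longest-match-wins fashion.

```python
# Toy version of the Porter stemmer's first step (illustrative only).
STEP_1A_RULES = [
    ("sses", "ss"),  # i.
    ("ies", "i"),    # ii.
    ("ss", "ss"),    # iii.
    ("s", ""),       # iv.
]

def step_1a(word):
    # Collect every rule whose left-hand side matches the end of the word,
    # then apply the one with the longest left-hand side.
    matches = [(old, new) for old, new in STEP_1A_RULES if word.endswith(old)]
    if not matches:
        return word
    old, new = max(matches, key=lambda rule: len(rule[0]))
    return word[:len(word) - len(old)] + new

for w in ["passes", "pass", "parties", "pas", "cats"]:
    print(w, "->", step_1a(w))
```

The full stemmers in nltk apply all of the steps, and can be run over our tokens directly: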
```python
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

p_stemmer = PorterStemmer()
p_stemmed = [p_stemmer.stem(t) for t in tokens]

for t in p_stemmed:
    print(f"`{t}` |", end=" ")
```
`the` | `2019` | `film` | `cat` | `is` | `a` | `movi` | `about` | `cat` | `.` | `cat` | `appear` | `in` | `everi` | `scene` | `.` | `a` | `cat` | `can` | `alway` | `be` | `seen` |
```python
s_stemmer = SnowballStemmer("english")
s_stemmed = [s_stemmer.stem(t) for t in tokens]

for t in s_stemmed:
    print(f"`{t}` |", end=" ")
```
`the` | `2019` | `film` | `cat` | `is` | `a` | `movi` | `about` | `cat` | `.` | `cat` | `appear` | `in` | `everi` | `scene` | `.` | `a` | `cat` | `can` | `alway` | `be` | `seen` |
Just to focus on how the stemmers operate over a specific paradigm:
= ["cry", "cries", "crying", "cried", "crier"]
cry
print(
tabulate(for c in cry],
[[c, s_stemmer.stem(c)] =["token", "stem"]
headers
) )
| token | stem |
|---|---|
| cry | cri |
| cries | cri |
| crying | cri |
| cried | cri |
| crier | crier |
Also, when inflectional morphology changes the stem itself (like the vowel change in “ran”), the stemmer won’t undo it.
= ["run", "runs", "running", "ran", "runner"]
run
print(
tabulate(for r in run],
[[r, s_stemmer.stem(r)] =["token", "stem"]
headers
) )
| token | stem |
|---|---|
| run | run |
| runs | run |
| running | run |
| ran | ran |
| runner | runner |
Lemmatizing
Lemmatizing involves a more complex morphological analysis of words, and as such requires language-specific models to work.
nltk lemmatizing
nltk uses WordNet for its English lemmatizing. WordNet is a large, hand-annotated database of lexical relations whose development began in the 1980s. The lemmatizer’s outputs are always recognizable words.
```python
wnl = nltk.WordNetLemmatizer()

print(
    tabulate(
        [[c, wnl.lemmatize(c)] for c in cry],
        headers=["token", "lemma"]
    )
)
```
| token | lemma |
|---|---|
| cry | cry |
| cries | cry |
| crying | cry |
| cried | cried |
| crier | crier |
```python
print(
    tabulate(
        [[r, wnl.lemmatize(r)] for r in run],
        headers=["token", "lemma"]
    )
)
```
| token | lemma |
|---|---|
| run | run |
| runs | run |
| running | running |
| ran | ran |
| runner | runner |
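One likely reason `cried`, `running`, and `ran` come through unchanged is that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless you pass a part of speech. Here’s a small sketch; the word list is just my own example.

```python
# WordNetLemmatizer defaults to pos="n"; telling it these are verbs
# changes the analysis.
for word in ["cried", "running", "ran"]:
    print(word, "->", wnl.lemmatize(word, pos="v"))
```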
spaCy lemmatizing
spaCy has a number of models that do lemmatizing. The spaCy documentation lists WordNet along with a few other data sources for its English models.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
```
= nlp(" ".join(cry))
doc print(
tabulate(for c in doc],
[[c.text, c.lemma_] =["token", "lemma"]
headers
) )
| token | lemma |
|---|---|
| cry | cry |
| cries | cry |
| crying | cry |
| cried | cry |
| crier | crier |
= nlp(" ".join(run))
doc print(
tabulate(for r in doc],
[[r.text, r.lemma_] =["token", "lemma"]
headers
) )
| token | lemma |
|---|---|
| run | run |
| runs | run |
| running | run |
| ran | run |
| runner | runner |
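Part of why spaCy does better on `cried` and `ran` is that its lemmatizer runs after the part-of-speech tagger in the pipeline, so every token already has a tag attached when it gets lemmatized. Here’s a quick sketch that also prints the tag; the extra `pos` column is just my addition, and it assumes `en_core_web_sm` is installed.

```python
# Each spaCy token carries a part-of-speech tag assigned earlier in the
# pipeline, which the lemmatizer can make use of.
doc = nlp(" ".join(run))
print(
    tabulate(
        [[t.text, t.pos_, t.lemma_] for t in doc],
        headers=["token", "pos", "lemma"]
    )
)
```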
The use of lemmatizing and stemming
For a lot of the NLP tasks we’re going to be learning about, lemmatizing and stemming don’t factor in as part of the pre-processing pipeline. However, they’re useful tools to have handy when doing linguistic analyses. For example, for all of the importance of “word frequency” in the linguistics literature, there’s often not much clarity about how the text was pre-processed to get those word frequencies.
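To see how much that choice matters, here’s a small sketch (reusing the objects defined above) that counts “cat” three different ways. The exact numbers aren’t the point; the point is that “the frequency of cat” isn’t a single number until you say whether you counted raw tokens, stems, or lemmas.

```python
# "How frequent is cat?" depends entirely on the pre-processing choices.
raw_counts = Counter(tokens)
stem_counts = Counter(s_stemmer.stem(t) for t in tokens)
lemma_counts = Counter(t.lemma_ for t in nlp(phrase_lower))

print("raw tokens:", raw_counts["cat"])
print("stems:     ", stem_counts["cat"])
print("lemmas:    ", lemma_counts["cat"])
```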