Lemmatizing and Stemming

Author

Josef Fruehwald

Published

September 13, 2022

What tokenizing does not get for you

Coming back to the example sentences from the first data processing lecture, properly tokenizing these sentences will only partly help us with our linguistic analysis.

import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from tabulate import tabulate

phrase = """The 2019 film CATS is a movie about cats. 
Cats appear in every scene. 
A cat can always be seen"""

# case folding
phrase_lower = phrase.lower()

# tokenization
tokens = word_tokenize(phrase_lower)

# counting
token_count = Counter(tokens)

# cat focus
cat_list = [[k,token_count[k]] for k in token_count if "cat" in k]

print(tabulate(cat_list,
               headers = ["type", "count"]))
type      count
------  -------
cats          3
cat           1

We’ve still got the plural cats being counted as a separate word from cat, which, for our weird use case, we don’t want. Our options here are either to “stem” or to “lemmatize” our tokens.

Stemming

Stemming is focused on cutting off morphemes and, to some degree, on providing a consistent stem across all types that share one, so the outcomes aren’t always recognizable words. The way it does this is entirely rule-based. For example, the first step of the Porter stemmer contains the following rewrite rules.

i.   sses -> ss
ii.  ies -> i
iii. ss -> ss
iv.  s -> 

When a word comes into the first step, if its end matches any of the left-hand sides, it gets rewritten as the corresponding right-hand side. If it could match more than one rule, the rule with the longest matching suffix wins, so

  • “passes” matches i. , so it gets rewritten as “pass”

  • “pass” matches iii. and iv., but has the largest overlap with iii. so it gets rewritten as “pass”

  • “parties” matches ii., so it gets rewritten as “parti”

  • “pas” (as in “faux pas”) matches iv. so it gets rewritten as “pa”

  • “cats” matches iv. so it gets rewritten as “cat”

This works basically correctly for the various /+z/ morphemes in English, but it overdoes it (“pas” should be left alone), and it produces some stems that don’t look like the actual root word (“parti” vs. “party”).

After this step, the stemmer applies a lot more hand-crafted rules (e.g. ational -> ate).
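To make the “longest match wins” behaviour concrete, here is a minimal toy sketch of just the four rules listed above, written as a longest-match suffix rewrite. This is not the actual Porter implementation (the real stemmer also checks things like stem structure before applying a rule), just an illustration of the matching logic:

# A toy version of Porter step 1a: only the four rules above.
# NOT the real Porter stemmer, just longest-match suffix rewriting.
step1a_rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(word):
  # try the rules longest-suffix-first, so the longest match wins
  for suffix, replacement in sorted(step1a_rules, key = lambda r: -len(r[0])):
    if word.endswith(suffix):
      return word[:len(word) - len(suffix)] + replacement
  return word

for w in ["passes", "pass", "parties", "pas", "cats"]:
  print(w, "->", step1a(w))

passes -> pass
pass -> pass
parties -> parti
pas -> pa
cats -> cat

The real stemmers, of course, come ready-made in nltk: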

from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

# stem every token with the Porter stemmer
p_stemmer = PorterStemmer()
p_stemmed = [p_stemmer.stem(t) for t in tokens]
for t in p_stemmed:
  print(f"`{t}` |", end = " ")

the | 2019 | film | cat | is | a | movi | about | cat | . | cat | appear | in | everi | scene | . | a | cat | can | alway | be | seen |

# the same tokens through the Snowball (English) stemmer
s_stemmer = SnowballStemmer("english")
s_stemmed = [s_stemmer.stem(t) for t in tokens]
for t in s_stemmed:
  print(f"`{t}` |", end = " ")

the | 2019 | film | cat | is | a | movi | about | cat | . | cat | appear | in | everi | scene | . | a | cat | can | alway | be | seen |

Just to focus on how the stemmers operate over a specific paradigm:

cry = ["cry", "cries", "crying", "cried", "crier"]

print(
  tabulate(
    [[c, s_stemmer.stem(c)] for c in cry],
    headers=["token", "stem"]
  )
)
token    stem
-------  ------
cry      cri
cries    cri
crying   cri
cried    cri
crier    crier

Also, when inflectional morphology changes the stem itself (like ran from run), the stemmer won’t undo that change.

run = ["run", "runs", "running", "ran", "runner"]

print(
  tabulate(
    [[r, s_stemmer.stem(r)] for r in run],
    headers=["token", "stem"]
  )
)
token    stem
-------  ------
run      run
runs     run
running  run
ran      ran
runner   runner

Lemmatizing

Lemmatizing involves a more complex morphological analysis of words, and as such requires language-specific models to work.

nltk lemmatizing

nltk uses WordNet for its English lemmatizing. WordNet is a large database of lexical relations that has been hand-annotated since the 1980s. Its outputs are always recognizable words.

wnl = nltk.WordNetLemmatizer()
print(
  tabulate(
    [[c, wnl.lemmatize(c)] for c in cry],
    headers=["token", "lemma"]
  )
)
token    lemma
-------  -------
cry      cry
cries    cry
crying   cry
cried    cried
crier    crier

print(
  tabulate(
    [[r, wnl.lemmatize(r)] for r in run],
    headers=["token", "lemma"]
  )
)
token    lemma
-------  --------
run      run
runs     run
running  running
ran      ran
runner   runner
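One reason for the less helpful results with cried, running, and ran is that wnl.lemmatize() treats every token as a noun unless you tell it otherwise. Passing a part of speech (pos = "v" for verbs) changes the analysis. This is just a quick illustration, not part of the original pipeline:

# lemmatize() defaults to treating its input as a noun (pos = "n")
print(wnl.lemmatize("running"), wnl.lemmatize("running", pos = "v"))
print(wnl.lemmatize("ran"), wnl.lemmatize("ran", pos = "v"))

running run
ran run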

spaCy lemmatizing

spaCy has a number of models that do lemmatizing. The documentation lists WordNet along with a few other data sources for the English models.

import spacy

# load spaCy's small English pipeline, which includes a lemmatizer component
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
doc = nlp(" ".join(cry))
print(
  tabulate(
    [[c.text, c.lemma_] for c in doc],
    headers=["token", "lemma"]
  )
)
token    lemma
-------  -------
cry      cry
cries    cry
crying   cry
cried    cry
crier    crier

doc = nlp(" ".join(run))
print(
  tabulate(
    [[r.text, r.lemma_] for r in doc],
    headers=["token", "lemma"]
  )
)
token    lemma
-------  -------
run      run
runs     run
running  run
ran      run
runner   runner
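Part of why spaCy resolves ran to run is that the whole pipeline runs before the lemmatizer sees the tokens, so each token already carries a part-of-speech tag. As a small sketch reusing the nlp pipeline loaded above (not part of the original lecture code, and the exact tags will depend on the model version), you can print those tags alongside the lemmas:

# each token's part-of-speech tag is available to the lemmatizer
doc = nlp("The cats ran while the runner was running.")
print(
  tabulate(
    [[t.text, t.pos_, t.lemma_] for t in doc],
    headers = ["token", "pos", "lemma"]
  )
)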

The use of lemmatizing and stemming

For a lot of the NLP tasks we’re going to be learning about, lemmatizing and stemming don’t factor into the pre-processing pipeline. However, they’re useful tools to have handy when doing linguistic analyses. For example, for all of the importance of “word frequency” in the linguistics literature, there’s often not much clarity about how the text was pre-processed to get those word frequencies.
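As a small illustration of why that matters, here is the frequency of cat in the toy passage from the top of the lecture, depending on whether we count raw tokens, Snowball stems, or WordNet lemmas (reusing the objects defined above):

# the "frequency of cat" depends on the pre-processing choice
print("raw tokens:", Counter(tokens)["cat"])
print("stems:", Counter(s_stemmed)["cat"])
print("lemmas:", Counter([wnl.lemmatize(t) for t in tokens])["cat"])

raw tokens: 1
stems: 4
lemmas: 4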