The notes on Perplexity describe how we can get a measure of how well a given n-gram model predicts strings in a test set of data. Roughly speaking:

- The better the model gets, the higher the probability it will assign to each \(P(w_i|w_{i-1})\).
- The higher the probabilities, the lower the perplexity.
- The lower the perplexity, the better the model.
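Just to make that chain concrete before getting to the real model, here's a quick toy calculation with made-up probabilities (nothing to do with Frankenstein yet):

```python
import numpy as np

def toy_perplexity(probs):
    """2 raised to the average surprisal (bits per word)."""
    surprisals = [-np.log2(p) for p in probs]
    return 2 ** (np.sum(surprisals) / len(surprisals))

toy_perplexity([0.5, 0.4, 0.5, 0.25])   # ~2.5: higher probabilities, lower perplexity
toy_perplexity([0.1, 0.05, 0.1, 0.02])  # ~17.8: lower probabilities, higher perplexity
```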
As a quick demonstration, I’ve written some code here in collapsible sections to build a bigram model of Frankenstein, and to get the conditional probabilities for every bigram in an input sentence.
```python
import nltk
from collections import Counter
import gutenbergpy.textget
from tabulate import tabulate
import numpy as np
```
```python
def getbook(book, outfile):
    """
    Download a book from Project Gutenberg and save it
    to the specified outfile
    """
    print(f"Downloading Project Gutenberg ID {book}")
    raw_book = gutenbergpy.textget.get_text_by_id(book)
    clean_book = gutenbergpy.textget.strip_headers(raw_book)
    if not outfile:
        outfile = f'{book}.txt'
    print(f"Saving book as {outfile}")
    with open(outfile, 'wb') as file:
        file.write(clean_book)
```
```python
getbook(book = 84, outfile = "gen/frankenstein.txt")
```

Downloading Project Gutenberg ID 84
```python
def ngramize(filename, n = 2):
    """
    given a file name, generate the ngrams and n-1 grams
    """
    with open(filename, 'r') as f:
        lines = f.read()

    sentences = nltk.sent_tokenize(lines)
    sentences = [sent.strip().replace("\n", " ")
                 for sent in sentences]
    sentences_tok = [nltk.word_tokenize(sent)
                     for sent in sentences]
    sentences_padn = [list(nltk.lm.preprocessing.pad_both_ends(sent, n = n))
                      for sent in sentences_tok]
    sentences_ngram = [list(nltk.ngrams(sent, n = n))
                       for sent in sentences_padn]
    sentences_ngram_minus = [list(nltk.ngrams(sent, n = n-1))
                             for sent in sentences_padn]
    flat_ngram = sum(sentences_ngram, [])
    flat_ngram_minus = sum(sentences_ngram_minus, [])

    return(flat_ngram, flat_ngram_minus)
```
```python
bigram, unigram = ngramize("gen/frankenstein.txt", n = 2)
```
```python
bigram_count = Counter(bigram)
unigram_count = Counter(unigram)
```
```python
def get_conditional_prob(x, bigram_count, unigram_count):
    """
    for a tuple x, get the conditional probability of x[1] | x[0]
    """
    if x in bigram_count:
        cond = bigram_count[x] / unigram_count[x[0:-1]]
    else:
        cond = 0

    return(cond)
```
```python
def get_sentence_probs(sentence, bigram_count, unigram_count, n = 2):
    """
    given a sentence, get its list of conditional probabilities
    """
    sent_tokens = nltk.word_tokenize(sentence)
    sent_pad = nltk.lm.preprocessing.pad_both_ends(sent_tokens, n = n)
    sent_ngram = nltk.ngrams(sent_pad, n = n)
    sent_conditionals = [get_conditional_prob(gram, bigram_count, unigram_count)
                         for gram in sent_ngram]
    return(sent_conditionals)
```
```python
def get_conditional_strings(sentence, n = 2):
    """
    given a sentence, return the string of conditionals
    """
    sent_tokens = nltk.word_tokenize(sentence)
    sent_pad = nltk.lm.preprocessing.pad_both_ends(sent_tokens, n = n)
    # escape the angle brackets in <s> and </s> so they display in the tables
    sent_pad = [x.replace("<", "&lt;").replace(">", "&gt;") for x in sent_pad]
    sent_ngram = nltk.ngrams(sent_pad, n = n)
    out_cond = [f"P({x[-1]} | {' '.join(x[0:-1])})" for x in sent_ngram]
    return(out_cond)
```
Having built the bigram model with the code above, we can take this sample sentence:
I saw the old man.
We can calculate the conditional probability of every word in the sentence given the word before, as well as the surprisal for each word.1
```python
sentence = "I saw the old man."
cond_probs = get_sentence_probs(sentence, bigram_count, unigram_count, n = 2)
cond_surp = [-np.log2(x) for x in cond_probs]
cond_strings = get_conditional_strings(sentence, n = 2)
```
conditional | probability | surprisal |
---|---|---|
P(I | <s>) | 0.1876 | 2.4139 |
P(saw | I) | 0.0162 | 5.9476 |
P(the | saw) | 0.2340 | 2.0952 |
P(old | the) | 0.0064 | 7.2865 |
P(man | old) | 0.6800 | 0.5564 |
P(. | man) | 0.1364 | 2.8745 |
P(</s> | .) | 0.9993 | 0.0011 |
Summing up the surprisal column, we get the total surprisal of the sentence (about 21 bits). Dividing by the number of words gives us about 3 bits per word, and raising 2 to that power gives us our ngram perplexity for the sentence (about 8). A quick sketch of that arithmetic follows the table.
total surprisal | surprisal/word | perplexity |
---|---|---|
21.1752 | 3.0250 | 8.1400 |
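The code that produced those summary numbers isn't shown above, but a minimal sketch, reusing the `cond_surp` list from the previous chunk, could look like this:

```python
# total surprisal, bits per word, and perplexity for the sentence
total_surprisal = np.sum(cond_surp)                    # ~21.18 bits
surprisal_per_word = total_surprisal / len(cond_surp)  # ~3.02 bits
perplexity = 2 ** surprisal_per_word                   # ~8.14
```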
But not everything is so neat and tidy. Let's try this again for the sentence
I saw the same man.
```python
sentence = "I saw the same man."
cond_probs = get_sentence_probs(sentence, bigram_count, unigram_count, n = 2)
cond_surp = [-np.log2(x) for x in cond_probs]
cond_strings = get_conditional_strings(sentence, n = 2)
```
conditional | probability | surprisal |
---|---|---|
P(I | <s>) | 0.1876 | 2.4139 |
P(saw | I) | 0.0162 | 5.9476 |
P(the | saw) | 0.2340 | 2.0952 |
P(same | the) | 0.0154 | 6.0235 |
P(man | same) | 0.0000 | |
P(. | man) | 0.1364 | 2.8745 |
P(</s> | .) | 0.9993 | 0.0011 |
total surprisal | surprisal/word | perplexity |
---|---|---|
∞ | ∞ | ∞ |
It looks like the bigram ("same", "man") just didn't appear in the novel, and this zero percolates up through all of our calculations.
\[ C(\text{same man}) = 0 \]
\[ P(\text{same man}) = \frac{C(\text{same man})}{N} = \frac{0}{N} = 0 \]
\[ P(\text{man}~|~\text{same}) = \frac{P(\text{same man})}{P(\text{same})} = \frac{0}{P(\text{same})} = 0 \]
\[ s(\text{man}~|~\text{same}) = -\log_2(P(\text{man}~|~\text{same})) = -\log_2(0) = \infty \]
\[ pp(\text{I saw the same man.}) = 2^{\frac{1}{N}\sum_{i=1}^Ns(w_i|w_{i-1})} = 2^{\frac{\dots+\infty+\dots}{N}} = \infty \]
In other words, our bigram model's “mind” is completely blown by a sentence with the sequence same man in it.
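Here's a rough check (not from the model code above) of how that zero percolates through the numpy arithmetic: `np.log2(0)` warns about a division by zero and returns `-inf`, and the infinity survives both the sum and the exponentiation.

```python
surprisals = [2.41, 5.95, 2.10, 6.02, -np.log2(0), 2.87, 0.001]  # rounded values from the table
total = np.sum(surprisals)                    # inf
perplexity = 2 ** (total / len(surprisals))   # inf
```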
This is, of course, data sparsity rearing its head again. On the one hand, we are building an n-gram model out of a fairly small corpus. But on the other, the data sparsity problem will never go away, and we are always going to be left with the following two issues:
- Out Of Vocabulary items
- Missing ngrams of words that were in the vocabulary.
“Out Of Vocabulary” (OOV) problems are going to come up if you ever do any computational work with language of any variety. Our example of perplexity blowing up was due to a specific bigram, ('same', 'man'), not appearing in the corpus, even though each individual word does appear. The same thing will happen if any individual word in a sentence is OOV.
```python
# literally blowing the mind of a victorian child eating a cool ranch dorito
sentence = "I ate a cool ranch Dorito."
cond_probs = get_sentence_probs(sentence, bigram_count, unigram_count, n = 2)
cond_surp = [-np.log2(x) for x in cond_probs]
cond_strings = get_conditional_strings(sentence, n = 2)
```
conditional | probability | surprisal |
---|---|---|
P(I | <s>) | 0.1876 | 2.4139 |
P(ate | I) | 0.0007 | 10.4712 |
P(a | ate) | 0.2500 | 2.0000 |
P(cool | a) | 0.0000 | |
P(ranch | cool) | 0.0000 | |
P(Dorito | ranch) | 0.0000 | |
P(. | Dorito) | 0.0000 | |
P(</s> | .) | 0.9993 | 0.0011 |
One approach SLP suggests is to convert every vocabulary item that occurs below a certain frequency to <UNK>, then re-estimate all of the ngram values. Here, I'm converting every word that occurred only once to <UNK>, then re-counting the unigrams and bigrams.
```python
# Getting a list of unigrams that occurred once
to_unk = [x for x in unigram_count if unigram_count[x] == 1]

# <UNK> conversion
unigram_unk = [("<UNK>",) if x in to_unk else x for x in unigram]
bigram_unk = [("<UNK>", "<UNK>") if ((x[0],) in to_unk and (x[1],) in to_unk) else
              ("<UNK>", x[1]) if (x[0],) in to_unk else
              (x[0], "<UNK>") if (x[1],) in to_unk else
              x
              for x in bigram]

# <UNK> count
unigram_unk_count = Counter(unigram_unk)
bigram_unk_count = Counter(bigram_unk)
```
```python
def get_sentence_unk_probs(sentence, bigram_count, unigram_count, n = 2):
    """
    given a sentence, get its list of conditional probabilities
    """
    sent_tokens = nltk.word_tokenize(sentence)
    sent_tokens_unk = [x if (x,) in unigram_count else "<UNK>" for x in sent_tokens]
    sent_pad = nltk.lm.preprocessing.pad_both_ends(sent_tokens_unk, n = n)
    sent_ngram = nltk.ngrams(sent_pad, n = n)
    sent_conditionals = [get_conditional_prob(gram, bigram_count, unigram_count)
                         for gram in sent_ngram]
    return(sent_conditionals)
```
= "I ate a Dorito."
sentence = get_sentence_unk_probs(sentence, bigram_unk_count, unigram_unk_count, n = 2)
cond_probs = [-np.log2(x) for x in cond_probs]
cond_surp = get_conditional_unk_strings(sentence, unigram_count, n = 2) cond_strings
conditional | probability | surprisal |
---|---|---|
P(I | <s>) | 0.1876 | 2.4139 |
P(ate | I) | 0.0007 | 10.4712 |
P(a | ate) | 0.2500 | 2.0000 |
P(<UNK> | a) | 0.1173 | 3.0912 |
P(. | <UNK>) | 0.0600 | 4.0588 |
P(</s> | .) | 0.9993 | 0.0011 |
Converting low frequency words to <UNK> means that now, when the ngram model meets a word it doesn't know, like Dorito, there is still some probability it can assign.

This <UNK>ification of the data doesn't solve everything, though. Here's the longer sentence:
= "I ate a cool ranch Dorito."
sentence = get_sentence_unk_probs(sentence, bigram_unk_count, unigram_unk_count, n = 2)
cond_probs = [-np.log2(x) for x in cond_probs]
cond_surp = get_conditional_unk_strings(sentence, unigram_unk_count, n = 2) cond_strings
conditional | probability | surprisal |
---|---|---|
P(I | <s>) | 0.1876 | 2.4139 |
P(ate | I) | 0.0007 | 10.4712 |
P(a | ate) | 0.2500 | 2.0000 |
P(cool | a) | 0.0000 | |
P(<UNK> | cool) | 0.0000 | |
P(<UNK> | <UNK>) | 0.0391 | 4.6782 |
P(. | <UNK>) | 0.0600 | 4.0588 |
P(</s> | .) | 0.9993 | 0.0011 |
The problem here is that there is a known word, cool, that just happens never to occur in the bigrams (a, cool) or (cool, <UNK>). Maybe what we want is some way of assigning a small probability to bigrams that could have happened, but didn't.
The first, simple idea is to make a grid of all possible bigrams and add 1 to all of their counts.
```python
def get_conditional_prob_add1(x, bigram_count, unigram_count):
    """
    for a tuple x, get the conditional probability of x[1] | x[0]
    """
    if x in bigram_count:
        cond = (bigram_count[x] + 1) / (unigram_count[x[0:-1]] + len(unigram_count))
    else:
        cond = 1 / (unigram_count[x[0:-1]] + len(unigram_count))

    return(cond)
```
```python
def get_sentence_unk_probs_add1(sentence, bigram_count, unigram_count, n = 2):
    """
    given a sentence, get its list of conditional probabilities
    """
    sent_tokens = nltk.word_tokenize(sentence)
    sent_tokens_unk = [x if (x,) in unigram_count else "<UNK>" for x in sent_tokens]
    sent_pad = nltk.lm.preprocessing.pad_both_ends(sent_tokens_unk, n = n)
    sent_ngram = nltk.ngrams(sent_pad, n = n)
    sent_conditionals = [get_conditional_prob_add1(gram, bigram_count, unigram_count)
                         for gram in sent_ngram]
    return(sent_conditionals)
```
= "I ate a cool ranch Dorito."
sentence = get_sentence_unk_probs_add1(sentence, bigram_unk_count, unigram_unk_count, n = 2)
cond_probs = [-np.log2(x) for x in cond_probs]
cond_surp = get_conditional_unk_strings(sentence, unigram_unk_count, n = 2) cond_strings
conditional | probability | surprisal |
---|---|---|
P(I | <s>) | 0.0797 | 3.6498 |
P(ate | I) | 0.0004 | 11.1921 |
P(a | ate) | 0.0005 | 11.0307 |
P(cool | a) | 0.0002 | 12.4299 |
P(<UNK> | cool) | 0.0002 | 12.0300 |
P(<UNK> | <UNK>) | 0.0180 | 5.7941 |
P(. | <UNK>) | 0.0276 | 5.1784 |
P(</s> | .) | 0.3912 | 1.3539 |
Two things to notice here:

1. None of the conditional probabilities are zero anymore, so none of the surprisals blow up to infinity.
2. The probabilities jumped around, because by adding 1 to every bigram count we've given many bigrams a larger portion of the probability pie than they had before, and in a probability space everything has to sum to 1. That means we've also taken away a portion of the probability space from many other bigrams.
conditional | bigram count | w1 count | add 1 prob | implied counts |
---|---|---|---|---|
P(I | <s>) | 577 | 3,075 | 0.0797 | 244.9828 |
P(ate | I) | 2 | 2,839 | 0.0004 | 1.2134 |
P(a | ate) | 1 | 4 | 0.0005 | 0.0019 |
P(cool | a) | 0 | 1,338 | 0.0002 | 0.2425 |
P(<UNK> | cool) | 0 | 2 | 0.0002 | 0.0005 |
P(<UNK> | <UNK>) | 138 | 3,533 | 0.0180 | 63.6700 |
P(. | <UNK>) | 212 | 3,533 | 0.0276 | 97.5663 |
P(</s> | .) | 2,686 | 2,688 | 0.3912 | 1,051.6389 |
The add 1 method effectively shaved off a little bit of probability from bigrams we did see in order to give it to bigrams we didn't see. For example, we had 2 observations of (I, ate), but after redistributing probabilities, we'd effectively shaved off 0.79 observations. Things are even more extreme for other bigrams, like (<s>, I), which had about 332 observations shaved off to redistribute to unseen bigrams.
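To make the implied counts and shaved-off amounts concrete, here's a small sketch, reusing the add-1 function and the <UNK> counts defined above, for the (<s>, I) bigram:

```python
# implied count = add-1 conditional probability scaled back up by the count of w1
p_add1 = get_conditional_prob_add1(("<s>", "I"), bigram_unk_count, unigram_unk_count)
implied_count = p_add1 * unigram_unk_count[("<s>",)]          # ~245, versus 577 actually observed
shaved_off = bigram_unk_count[("<s>", "I")] - implied_count   # ~332 observations given away
```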
The idea behind Absolute Discounting is that instead of shaving variable amounts of probability off of every ngram, we shave off a fixed amount. The Greek letter \(\delta\) is used to indicate this “shave off” amount.
Our total number of observed bigram types, after <UNK>ifying, is 36,744. If we shaved 0.25 observations off of each of them, that would give us \(36,744\times0.25=9,186\) observations to spread around to the bigrams we didn't observe. If we just did that uniformly, the unobserved bigrams would each get just a sliver of that probability mass. There are 4,179 unigrams in our data, meaning there are \(4179^2=17,464,041\) possible bigrams, so \(17,464,041-36,744 = 17,427,297\) unseen bigrams are trying to get a piece of those 9,186 observations we just shaved off, coming out to just about 0.0005 observations each.
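There's no absolute-discounting code in this post, but a minimal sketch of the uniform-redistribution version just described, written as a hypothetical helper over `bigram_unk_count` and `unigram_unk_count`, could look like this:

```python
def get_conditional_prob_abs_discount(x, bigram_count, unigram_count, delta = 0.25):
    """
    Sketch of absolute discounting for a bigram tuple x = (w1, w2):
    shave delta off every *seen* bigram count for the context w1, and
    spread the reclaimed probability mass uniformly over w1's unseen continuations.
    """
    context = x[0:-1]
    context_count = unigram_count[context]
    # distinct continuations of this context that we did observe
    seen = len([b for b in bigram_count if b[0:-1] == context])
    unseen = len(unigram_count) - seen
    # probability mass reclaimed by the shaving
    reclaimed = (delta * seen) / context_count
    if x in bigram_count:
        return (bigram_count[x] - delta) / context_count
    else:
        return reclaimed / unseen

# e.g. the unseen bigram (a, cool) from the examples above
get_conditional_prob_abs_discount(("a", "cool"), bigram_unk_count, unigram_unk_count)
```

Because every seen bigram gives up exactly \(\delta\), the probabilities for each context still sum to 1.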
Some more clever approaches don't distribute the reclaimed probability evenly, though. For example, Kneser-Ney smoothing tries to distribute it proportionally to how often the \(w_i\) word in a \((w_{i-1}w_i)\) bigram appears as the second word in a bigram.
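There's no Kneser-Ney code in this post either, but the core “continuation count” idea can be sketched from the counts we already have: for each word, count how many distinct bigram types it appears in as the second word.

```python
# number of distinct bigram types (w', w) that each word w closes out
continuation_count = Counter(b[1] for b in bigram_unk_count)
total_bigram_types = len(bigram_unk_count)

# a word that follows many different words gets a bigger share of the reclaimed mass
p_continuation = {w: c / total_bigram_types for w, c in continuation_count.items()}
p_continuation["man"]
```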
\(-\log_2(p)\)↩︎