Comprehensions and Useful Things

python
Author

Josef Fruehwald

Published

September 23, 2022

Instructions

Setup

We’re going to be exploring the way the spaCy package does tokenization.

If you get an error at the very beginning when hitting Run, run this code to download the spaCy model in the shell.

# bash
python -m spacy download en_core_web_sm

Currently, the code in main.py

  1. loads the spaCy English model
  2. reads in Frankenstein
  3. strips leading and trailing whitespace from each line
  4. concatenates all of the lines into one megastring
  5. uses the spaCy analyzer to (among other things) tokenize the book

import spacy
from collections import Counter
from collections import defaultdict

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Open and read in Frankenstein
with open("gen/texts/frank.txt", 'r') as f:
  lines = f.readlines()

# Remove leading and trailing whitespace
lines = [line.strip() for line in lines]

# Concatenate Frankenstein into one huge string
frank_one_string = " ".join(lines)

# Tokenize all of Frankenstein
frank_doc = nlp(frank_one_string)

print(frank_doc[500:600])
river. But supposing all these conjectures to be false, you cannot contest the inestimable benefit which I shall confer on all mankind, to the last generation, by discovering a passage near the pole to those countries, to reach which at present so many months are requisite; or by ascertaining the secret of the magnet, which, if at all possible, can only be effected by an undertaking such as mine.  These reflections have dispelled the agitation with which I began my letter, and I feel my heart glow
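
Since frank_doc behaves like a list, we can also ask how many tokens the analyzer found. (Just a sketch; the exact count depends on the text file and the model version, so no output is shown.)

print(len(frank_doc))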

spaCy token structure

We can treat frank_doc like a list, but it’s actually a special data structure. The same goes for each token inside frank_doc. If you just say

print(frank_doc[506])
conjectures

It will print conjectures. But if you say

print(
  dir(frank_doc[506])
)
['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_dep', 'has_extension', 'has_head', 'has_morph', 'has_vector', 'head', 'i', 'idx', 'iob_strings', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'set_morph', 'shape', 'shape_', 'similarity', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'tensor', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']

You’ll see a lot more values and methods associated with the token than you normally would for a string. For example, frank_doc[506].text will give us the text of the token, and frank_doc[506].lemma_ will give us the lemma.

print(
  f"The word '{frank_doc[506].text}' is lemmatized as '{frank_doc[506].lemma_}'"
)
The word 'conjectures' is lemmatized as 'conjecture'

Or we can get the guessed part of speech with frank_doc[506].pos_

print(
  f"The word '{frank_doc[506].text}' is given the part of speech '{frank_doc[506].pos_}'"
)
The word 'conjectures' is given the part of speech 'VERB'

Or we can pull out the guessed morphological information:

print(
  f"spacy guesses '{frank_doc[506].text}' is '{frank_doc[506].morph}'"
)
spacy guesses 'conjectures' is 'Number=Sing|Person=3|Tense=Pres|VerbForm=Fin'
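
Many of the other names in that long dir() listing are simple True/False attributes. As a quick sketch (no output shown, but for this token we'd expect an alphabetic, non-punctuation word), we can check a few of them:

print(
  f"'{frank_doc[506].text}': alpha={frank_doc[506].is_alpha}, punct={frank_doc[506].is_punct}, stop={frank_doc[506].is_stop}"
)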

if-statements to control code (like loops)

We can use if statements to control how our code runs. An if statement checks to see if its logical comparison is true, and if it is, it executes its code.

## This is not true, so it doesn't print
if frank_doc[506].pos_ == "NOUN":
  print("it's a noun!")

## This is true, so it prints
if frank_doc[506].pos_ == "VERB":
  print("it's a verb!")
it's a verb!
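
Combining a for loop with an if statement gives us a way to filter tokens, which is exactly the pattern the tasks below call for. A small sketch (not one of the tasks): print every adjective among the first 100 tokens.

for token in frank_doc[0:100]:
  if token.pos_ == "ADJ":
    print(token.text)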
Note
💡 TASK 1

Print the .text of every word whose .lemma_ is "monster"

Note
💡 TASK 2

With a for loop, create a list called five_letter which contains every 5-letter word from the book (i.e., its .text is 5 characters long).

Comprehensions

“Comprehensions” are a great shortcut around writing out a whole for loop. Let’s take the following list:

rain_list = "The rain in Spain stays mainly on the plain".split(" ")
print(rain_list)
['The', 'rain', 'in', 'Spain', 'stays', 'mainly', 'on', 'the', 'plain']

If I wanted to capitalize all of those words, one way I could do it is with a for loop

upper_rain = []
for word in rain_list:
  upper_rain.append(word.upper())

print(upper_rain)
['THE', 'RAIN', 'IN', 'SPAIN', 'STAYS', 'MAINLY', 'ON', 'THE', 'PLAIN']

Alternatively, I could do it with a “list comprehension”:

upper_rain2 = [word.upper() for word in rain_list]

print(upper_rain2)
['THE', 'RAIN', 'IN', 'SPAIN', 'STAYS', 'MAINLY', 'ON', 'THE', 'PLAIN']

List comprehensions keep the for word in rain_list part the same, but instead of needing to initialize a whole empty list, we wrap the whole thing inside [ ], which tells Python we're going to capture the results in a list. The expression at the beginning (the variable and whatever we do to it) is what gets captured.
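
That expression at the front can be anything, not just a method call on the variable. For example, applying len() to each word:

word_lengths = [len(word) for word in rain_list]

print(word_lengths)
[3, 4, 2, 5, 5, 6, 2, 3, 5]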

We can use if statements too.

ai_words = [word for word in rain_list if "ai" in word]

print(ai_words)
['rain', 'Spain', 'mainly', 'plain']
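
An if at the end of a comprehension filters items out. If instead you want to keep every item but transform only some of them, an if/else can go before the for (a side note, not something we'll need for the tasks):

marked = [word.upper() if "ai" in word else word for word in rain_list]

print(marked)
['The', 'RAIN', 'in', 'SPAIN', 'stays', 'MAINLY', 'on', 'the', 'PLAIN']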

We can even have nested for statements:

rain_list = "The rain in Spain stays mainly on the plain".split(" ")
letters = [letter
            for word in rain_list
              for letter in word]
print(letters)
['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n', 's', 't', 'a', 'y', 's', 'm', 'a', 'i', 'n', 'l', 'y', 'o', 'n', 't', 'h', 'e', 'p', 'l', 'a', 'i', 'n']
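
The order of the for clauses matches how we'd write the equivalent nested for loop, with the outer loop coming first:

letters2 = []
for word in rain_list:
  for letter in word:
    letters2.append(letter)

print(letters2 == letters)
True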
Note
💡 TASK 3

With a list comprehension, create a list called five_letter2 which contains every 5-letter word from the book (i.e., its .text is 5 characters long).

Note
💡 TASK 4

By whatever means necessary (but I recommend using a list comprehension), create a list containing all of the words with VERB as their .pos_

set()

A set is another special Python data structure that, among other things, will “uniquify” a list.

bman_list = "na na na na na na na na na na na na na na na na Batman".split(" ")
bman_set = set(bman_list)
print(bman_set)
{'na', 'Batman'}
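
Because set() drops duplicates, wrapping a list in set() and then taking len() is a quick way to count unique items, a pattern worth remembering for the tasks below:

print(len(bman_list))
print(len(bman_set))
17
2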
Note
💡 TASK 5

Find out how many total words there are in Frankenstein, excluding tokens with a .pos_ of PUNCT or SPACE

Note
💡 TASK 6

Find out how many total unique words (.text) there are in Frankenstein, excluding tokens with a .pos_ of PUNCT or SPACE

Note
💡 TASK 7

Find out how many total unique lemmas (.lemma_) there are in Frankenstein, excluding tokens with a .pos_ of PUNCT or SPACE

Counter()

There is a handy dandy class called Counter that we can import from the collections module like so:

from collections import Counter

When we pass Counter() a list, it will return a dictionary-like object that counts how many times each item appears in that list.

bman_list = "na na na na na na na na na na na na na na na na Batman".split(" ")
bman_count = Counter(bman_list)
print(bman_count)
Counter({'na': 16, 'Batman': 1})
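
Counter objects also have a .most_common() method, which returns (item, count) pairs sorted from most to least frequent:

print(bman_count.most_common(2))
[('na', 16), ('Batman', 1)]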
Note
💡 TASK 8

Create a counter dictionary of all of the forms of “be” (.lemma_ == "be") in Frankenstein