Module 13: Texts¶
We'll use spaCy and wordcloud to play with text data. spaCy is probably the best Python package for analyzing text data: it's capable and very fast. Let's install them.
pip install wordcloud spacy
To use spaCy, you also need to download a model. Run:
python -m spacy download en_core_web_sm
SpaCy basics¶
import spacy
import wordcloud
nlp = spacy.load('en_core_web_sm')
Usually the first step of text analysis is tokenization, which is the process of breaking a document into "tokens". You can roughly think of it as extracting each word.
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
print(token)
Apple is looking at buying U.K. startup for $ 1 billion
As you can see, the result is not exactly the same as doc.split(). You'd want $ to be a separate token because it has a particular meaning (USD). In fact, as shown in this example (https://spacy.io/usage/spacy-101#annotations-pos-deps), spaCy figures out a lot about these tokens. For instance,
for token in doc:
print(token.text, token.lemma_, token.pos_, token.tag_)
Apple apple PROPN NNP
is be VERB VBZ
looking look VERB VBG
at at ADP IN
buying buy VERB VBG
U.K. u.k. PROPN NNP
startup startup NOUN NN
for for ADP IN
$ $ SYM $
1 1 NUM CD
billion billion NUM CD
It figured out that Apple is a proper noun ("PROPN" and "NNP"; see here for the part-of-speech tags).
spaCy has a visualizer too.
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})
It even recognizes entities and can visualize them.
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
Let's read a book¶
Shall we load a serious book? You can use any book that you can find as a text file.
import requests
metamorphosis_book = requests.get('http://www.gutenberg.org/cache/epub/5200/pg5200.txt').content
metamorphosis_book[:1000]
b'\xef\xbb\xbfThe Project Gutenberg EBook of Metamorphosis, by Franz Kafka\r\nTranslated by David Wyllie.\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever. You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **\r\n** Please follow the copyright guidelines in this file. **\r\n\r\n\r\nTitle: Metamorphosis\r\n\r\nAuthor: Franz Kafka\r\n\r\nTranslator: David Wyllie\r\n\r\nRelease Date: August 16, 2005 [EBook #5200]\r\nFirst posted: May 13, 2002\r\nLast updated: May 20, 2012\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\r\n\r\n\r\n\r\n\r\nCopyright (C) 2002 David Wyllie.\r\n\r\n\r\n\r\n\r\n\r\n Metamorphosis\r\n Franz Kafka\r\n\r\nTranslated by David Wyllie\r\n\r\n\r\n\r\nI\r\n\r\n\r\nOne morning, when Gregor Samsa woke from troubled dreams, he found\r\nhimself transformed in his bed into a horrible vermi'
Looks like we have successfully loaded the book. If you were doing a serious analysis, you'd want to remove the parts at the beginning and the end that are not part of the book, but let's ignore them for now. Let's try to feed this directly into spaCy.
# Try to run this cell. Is it giving exception? Why doesn't it work? Think about it for a moment before moving on!
try:
doc_metamor = nlp(metamorphosis_book)
except TypeError as e:
print("metamorphosis_book is not converted to string:", str(e))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8d8f96556a77> in <module>
----> 1 doc_metamor = nlp(metamorphosis_book)

~/anaconda3/envs/dviz/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable)
    338             raise ValueError(Errors.E088.format(length=len(text),
    339                                                 max_length=self.max_length))
--> 340         doc = self.make_doc(text)
    341         for name, proc in self.pipeline:
    342             if name in disable:

~/anaconda3/envs/dviz/lib/python3.7/site-packages/spacy/language.py in make_doc(self, text)
    370
    371     def make_doc(self, text):
--> 372         return self.tokenizer(text)
    373
    374     def update(self, docs, golds, drop=0., sgd=None, losses=None):

TypeError: Argument 'string' has incorrect type (expected str, got bytes)
On encodings¶
Why are we getting this error? What does it mean? It says the nlp function expects the str type, but we passed bytes.
type(metamorphosis_book)
bytes
Indeed, the type of metamorphosis_book is bytes. But as we saw above, we can read the book contents, right? What's going on?
Well, the problem is that a byte sequence is not yet a proper string until we know how to decode it. A string is an abstract object, and we need to specify an encoding to write it into a file. For instance, if I have a string of Korean characters like "안녕", there are several encodings that I can choose to write it into a file, and depending on the encoding, the resulting byte sequences can be totally different from each other. This is a really important (and confusing) topic, but because it's beyond the scope of the course, I'll just link a nice post about encodings: http://kunststube.net/encoding/
"안녕".encode('utf8')
b'\xec\x95\x88\xeb\x85\x95'
# b'\xec\x95\x88\xeb\x85\x95'.decode('euc-kr') <- what happens if you do this?
b'\xec\x95\x88\xeb\x85\x95'.decode('utf8')
'안녕'
"안녕".encode('euc-kr')
b'\xbe\xc8\xb3\xe7'
b'\xbe\xc8\xb3\xe7'.decode('euc-kr')
'안녕'
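To answer the question in the comment above: the UTF-8 bytes of "안녕" are not a valid EUC-KR byte sequence, so decoding them as EUC-KR doesn't quietly produce garbage; it fails outright. A small sketch:

```python
# Decoding UTF-8 bytes with the EUC-KR codec raises UnicodeDecodeError,
# because 0x95 is not a valid second byte in EUC-KR.
try:
    b'\xec\x95\x88\xeb\x85\x95'.decode('euc-kr')
except UnicodeDecodeError as e:
    print('decode failed:', e)
```

So a mismatched codec can either raise an error like this, or (as with latin-1 below) silently produce mojibake.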
You can decode with "wrong" encoding too.
b'\xbe\xc8\xb3\xe7'.decode('latin-1')
'¾È³ç'
As you can see, the same string can be encoded into different byte sequences depending on the encoding. It's an annoying (but fun) topic, and if you need to deal with text data, you must have a good understanding of it.
I know that Project Gutenberg uses utf-8 encoding. So let's decode the byte sequence into a string.
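As a reminder, bytes objects have a .decode() method that takes the encoding name. Here is the round trip on a short stand-in string (so as not to spoil the exercise below):

```python
# encode: str -> bytes; decode: bytes -> str
raw = "One morning, when Gregor Samsa woke".encode('utf-8')  # bytes, like the downloaded book
text = raw.decode('utf-8')                                   # back to str
print(type(raw), type(text))
```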
# TODO: Convert the bytes into string and replace below dummy value with your code.
metamorphosis_book_str = 'DUMMY PLACEHOLDER'
type(metamorphosis_book_str)
str
Shall we try again?
doc_metamor = nlp(metamorphosis_book_str)
words = [token.text for token in doc_metamor
if token.is_stop != True and token.is_punct != True]
Let's count!¶
from collections import Counter
Counter(words).most_common(5)
[('\r\n', 1970), (' ', 670), ('Gregor', 298), ("'s", 199), ('\r\n\r\n', 148)]
A lot of newline characters and multiple spaces! A quick and dirty way to remove them is split & join: split the document with split(), which breaks on any run of whitespace, and then join the pieces with a single space ' '. Can you implement it and print the 10 most common words?
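Here is the split & join trick on a tiny stand-in string (a sketch of the idea, not the full solution):

```python
# split() with no argument splits on any run of whitespace,
# including '\r\n', so joining with ' ' normalizes the text.
messy = 'One  morning,\r\n\r\nwhen Gregor Samsa\r\nwoke'
clean = ' '.join(messy.split())
print(clean)  # -> 'One morning, when Gregor Samsa woke'
```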
# Implement
[('Gregor', 298),
("'s", 199),
('room', 131),
('sister', 101),
('father', 99),
('I', 90),
('door', 87),
('He', 85),
('mother', 85),
('Project', 84)]
Let's keep the object with word count.
word_cnt = Counter(words)
Some wordclouds?¶
import matplotlib.pyplot as plt
%matplotlib inline
Can you check out the wordcloud package documentation and create a word cloud from the word count object that we created from the book above and plot it?
# Implement: create a word cloud object
<wordcloud.wordcloud.WordCloud at 0x7faf5a245ba8>
# Implement: plot the word cloud object
(-0.5, 999.5, 499.5, -0.5)
Q: Can you create a word cloud for a certain part of speech, such as nouns, verbs, proper nouns, etc. (pick one)?
# Implement
(-0.5, 999.5, 499.5, -0.5)
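The filtering step boils down to checking each token's pos_ attribute. A sketch of the logic with stand-in tokens (hypothetical namedtuples imitating spaCy tokens; in the notebook you'd iterate over doc_metamor instead):

```python
from collections import Counter, namedtuple

# stand-ins for spaCy tokens; real ones come from doc_metamor
Token = namedtuple('Token', ['text', 'pos_'])
tokens = [Token('Gregor', 'PROPN'), Token('room', 'NOUN'),
          Token('woke', 'VERB'), Token('sister', 'NOUN')]

# keep only nouns, then count them as before
nouns = [t.text for t in tokens if t.pos_ == 'NOUN']
print(Counter(nouns).most_common())
```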