Data Acquisition 3 | Unix and Python Text Processing Tools, and Spacy

1. Unix Text Processing Tools
Suppose we are given the following poem. Please store it on your computer as poem.txt.
(0) Recall: Commands we are not going to cover
There are many other useful Unix commands that we will not introduce in detail here, since we have already met some of them in the command-line part.
Here is a list of useful Unix commands for text processing that we are not going to cover, but we will assume you already understand them:
- echo
- cat
- grep
- sort
- uniq -c
- tr
- sed
- cut
- paste
- rev
- comm
- join
- shuf
Please look them up if you are not sure of their usage. If any of them are difficult to understand, please comment on this article so that we can add more references.
(1) head and tail Command
The head command prints the first n lines of a file, for example the first three,
$ head -3 poem.txt
The tail command prints the last n lines of a file,
$ tail -3 poem.txt
(2) wc Command
The wc command prints the newline count, word count, and byte count of a file, in that order.
$ wc poem.txt
(3) od Command
The od command dumps the bytes of a file. With -c it shows each byte as a printable character or a backslash escape, and the -b, -x, and -t flags control how the raw byte values are displayed.
Show the bytes as octal 1-byte units alongside the characters,
$ od -cb poem.txt
Show the bytes as hexadecimal 2-byte units,
$ od -cx poem.txt
Show the bytes as hexadecimal 4-byte units,
$ od -ctx poem.txt
Most useful: show the bytes as hexadecimal 1-byte units,
$ od -c -t xC poem.txt
Note that a non-ASCII character may take more than one byte to store. In the -c output, od prints ** for a continuation byte that belongs to the character before it. For example, the character á is stored as the two bytes c3 a1,
á **
c3 a1
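We can verify the same thing from Python with the built-in str.encode, which is a quick way to see how many bytes a character occupies in UTF-8,
# Inspect the UTF-8 bytes of a single character.
ch = 'á'
encoded = ch.encode('utf-8')
print(len(encoded))    # 2 -- this character needs two bytes
print(encoded.hex())   # 'c3a1', the same bytes od shows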
(4) iconv Command
Suppose we want to keep only the ASCII characters in a file and drop all the non-ASCII characters. We can do this with the iconv command, where -f gives the source encoding, -t gives the target encoding, and -c discards characters that cannot be converted. For example,
$ iconv -c -f utf-8 -t ascii poem.txt
(5) curl vs. wget Command
Both the curl command and the wget command allow us to download content from a server over the Internet. In most cases, we can use either one,
$ curl <url> > <path>
or
$ wget -O <path> <url>
But there are some differences between them,
- wget is better at downloading recursively (e.g. with its -r option)
- curl supports more Internet protocols
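If we would rather stay in Python, the standard library can handle a simple download as well. This is a minimal sketch using urllib.request; the URL and output path are placeholders,
from urllib.request import urlretrieve
# Download a page to a local file, similar to curl <url> > <path>.
url = "https://example.com/index.html"   # placeholder URL
path = "/tmp/index.html"                 # placeholder output path
urlretrieve(url, path)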
2. Python Text Processing Tools
Suppose we are given a list of words stored in a variable called words,
(1) Case Insensitivity
This can be done with a list comprehension that lowercases every word,
w = [word.lower() for word in words]
(2) Porter Stemmer
Suppose the list contains the same word in different inflected forms, for example watering and waters, or city and cities. We can collapse these with the Porter stemmer. We have to install the NLTK package first,
$ pip install -q -U nltk
then we import the Porter stemmer and create an instance,
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
To stem the whole list, we apply the stemmer to each (lowercased) word,
w = [stemmer.stem(word) for word in w]
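As a quick sanity check, here is a small sketch showing what the Porter stemmer does to the example words above; different inflections collapse to the same stem, and note that a stem such as citi is not necessarily a dictionary word,
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# Different forms of the same word collapse to one stem,
# e.g. watering/waters -> water, city/cities -> citi.
for word in ['watering', 'waters', 'city', 'cities']:
    print(word, '->', stemmer.stem(word))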
(3) Word Counter
We could use a defaultdict to count the word frequencies in the list, but an easier way is to use a Counter. To use it, first we have to import it,
from collections import Counter
then we can use Counter on our list,
ctr = Counter(w)
this gives us a dictionary-like object mapping each word to its frequency in the list. We can also grab the most common items with the .most_common method,
ctr.most_common(5)
This will give us the five most common items in the list.
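For example, on a small toy list (made up just for illustration) the Counter behaves like this,
from collections import Counter
toy = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the']
ctr = Counter(toy)
print(ctr['the'])           # 3
print(ctr.most_common(2))   # [('the', 3), ('cat', 1)]; ties among the count-1 words may come out in any order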
(4) Filter Out the English Stop Words
It is likely that our frequency dictionary is dominated by English stop words (e.g. the, of, etc.), which are usually uninformative for the analysis that follows. One way to filter them out is to use the stop-word list that ships with scikit-learn. To use ENGLISH_STOP_WORDS, first we have to import it in Python,
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
We can inspect the list with,
print(list(ENGLISH_STOP_WORDS))
To make things easier, we first stem the stop words so they match our stemmed list, and then use the filter function to extract the words that are not stop words,
STOP_WORDS = [stemmer.stem(item) for item in ENGLISH_STOP_WORDS]
w = list(filter(lambda i: i not in list(STOP_WORDS), w))
Then the Counter will not count all these stop words.
ctr = Counter(w)
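A small variant of the same idea, assuming the stemmer, the word list w, and ENGLISH_STOP_WORDS from above: keeping the stemmed stop words in a set makes membership tests faster, and a plain list comprehension reads a bit more directly than filter,
# Stem the stop words once and keep them in a set for fast lookups.
stop_words = {stemmer.stem(item) for item in ENGLISH_STOP_WORDS}
w = [word for word in w if word not in stop_words]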
(5) Word Cloud
Basically, a word cloud is more of a fun visualization than a rigorous analysis, but it is a quick way to see which words dominate a text. First of all, we have to install the wordcloud package,
$ pip install wordcloud
To create a word cloud of our list, we have to import,
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
Similarly, we have to make our list a frequency dictionary in order to create the word cloud,
ctr = Counter(w)
then we create a WordCloud object and fit our frequency dictionary to it with the .fit_words method, which lays the words out as an image,
wordcloud = WordCloud()
wordcloud.fit_words(ctr)
Finally, we are going to plot this word cloud by matplotlib with the .imshow() method,
fig=plt.figure(figsize=(6, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
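If we are running outside a notebook, or simply want to keep the image, the wordcloud package can also write the rendered cloud straight to a file with the .to_file method; the output path here is just an example,
# Save the rendered word cloud to an image file.
wordcloud.to_file("/tmp/wordcloud.png")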
(6) Ignore Non-ASCII Characters
Suppose we want to drop all the non-ASCII characters from a text in Python. One way to do it is,
text = [c for c in text if ord(c)<=127]
text = ''.join(text)
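An equivalent, slightly more compact way to do the same thing is to round-trip the text through the ascii codec and let Python drop anything it cannot encode,
# Drop every non-ASCII character in one step.
text = text.encode('ascii', errors='ignore').decode('ascii')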
3. spaCy
spaCy bills itself as "Industrial-Strength Natural Language Processing" and is a Python package. It is a powerful tool for text processing.
(1) Installation
To install spaCy, we have to use pip,
$ pip install spacy
then we also have to download an English language model,
$ python -m spacy download en_core_web_sm
There are models for other languages as well, but we are not going to use them here. For the following part, we are going to grab Tesla's IPO filing from the SEC,
$ curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html
(2) Extract Text from HTML
Recall what we learned about BeautifulSoup. It is a good fit for extracting the plain text from the HTML file,
import sys
from bs4 import BeautifulSoup
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text
with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
(3) Tokenizing with spaCy
To tokenize the text with spaCy, which will also give us more information about the text, we load the English model and run it on our string,
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(tsla)
then each of the items in the doc variable is a token.
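For example, we can peek at the first few tokens and confirm that each one carries its own text,
# Print the text of the first ten tokens in the document.
for token in doc[:10]:
    print(token.text)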
(4) Extract Parts of Speech
import pandas as pd
winfo = []
for token in doc:
    winfo.append([token.text, token.pos_, token.is_stop])
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])
this will output a table with the words in the file, their parts of speech, and whether or not each word is a stop word.
(5) Extract Information About the Entities
Named entities are the real-world items mentioned in the text, such as years, dates, numbers, organization names, people's names, and so on. The following code walks through all the entities in the text and also tells us the type of each one (i.e. its label),
winfo = []
for ent in doc.ents:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])
To see a full list of how to interpret these labels, refer to the spaCy documentation. Note that the predicted labels can contain some mistakes.
Another quick way to show the entity information is to use the displacy visualizer that ships with spaCy,
from spacy import displacy
displacy.render(doc, style='ent')
This will render the text with the entities highlighted along with their labels.
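displacy.render draws the visualization inline in a Jupyter notebook. If we are running a plain Python script instead, one option is to ask render for the raw markup and save it ourselves; the output path below is just an example,
from spacy import displacy
# Outside a notebook, render returns the markup as a string.
html = displacy.render(doc, style='ent', jupyter=False)
with open('/tmp/tesla_entities.html', 'w') as f:   # example output path
    f.write(html)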
(6) Calculate Word Vectors
Word vectors can be used to estimate how close in meaning two words are, measured by the distance between their vectors. spaCy can directly give us the word vector for each token,
winfo = []
for t in doc:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])
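Once we have vectors, the usual way to compare two words is their (cosine) similarity, which spaCy exposes as the .similarity method. Here is a minimal sketch; note that the small en_core_web_sm model does not ship true static word vectors, so a larger model such as en_core_web_md gives more meaningful scores,
# Compare tokens by the similarity of their vectors.
tokens = nlp("car vehicle banana")
car, vehicle, banana = tokens
print(car.similarity(vehicle))   # expected to be relatively high
print(car.similarity(banana))    # expected to be lower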
(7) Split the Text Into Sentences
spaCy can also be used to split the text into sentences.
winfo = []
for s in doc.sents:
    winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])