Data Acquisition 3 | Unix and Python Text Processing Tools, and Spacy

1. Unix Text Processing Tools
Suppose we are given the following poem. Please store it on your computer as poem.txt.
(0) Recall: Commands we are not going to cover
There are many other useful Unix commands that we will not introduce in detail here, since we have already met some of them in the command-line part.
Here is a list of useful Unix commands for text processing that we are not going to cover, but we will assume you already understand them:
- echo
- cat
- grep
- sort
- uniq -c
- tr
- sed
- cut
- paste
- rev
- comm
- join
- shuf
Please look them up if you are not sure of their usage. If any of them are difficult to understand, please comment on this article so that we can add more references.
(1) head and tail Command
The head command prints the first n lines of a file, for example the first three,
$ head -3 poem.txt
The tail command prints the last n lines of a file,
$ tail -3 poem.txt
(2) wc Command
The wc command prints the newline count, word count, and byte count of a file, in that order.
$ wc poem.txt
(3) od Command
The od command dumps the bytes of a file. With -c it shows each byte as a printable character or a backslash escape, and the -b, -x, and -t flags control how the raw byte values are displayed.
Show the bytes as octal 1-byte units alongside the characters,
$ od -cb poem.txt
Show the bytes as hexadecimal 2-byte units,
$ od -cx poem.txt
Show the bytes as hexadecimal 4-byte units,
$ od -ctx poem.txt
Most useful: show the bytes as hexadecimal 1-byte units,
$ od -c -t xC poem.txt
Note that a non-ASCII character may take more than one byte to store. In the -c output, od prints ** for a continuation byte that belongs to the character before it. For example, the character á is stored as the two bytes c3 a1,
á **
c3 a1
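We can verify the same thing from Python with the built-in str.encode, which is a quick way to see how many bytes a character occupies in UTF-8,
# Inspect the UTF-8 bytes of a single character.
ch = 'á'
encoded = ch.encode('utf-8')
print(len(encoded))    # 2 -- this character needs two bytes
print(encoded.hex())   # 'c3a1', the same bytes od shows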
(4) iconv Command
Suppose we want to keep only the ASCII characters in a file and drop all the non-ASCII characters. We can do this with the iconv command, where -f gives the source encoding, -t gives the target encoding, and -c discards characters that cannot be converted. For example,
$ iconv -c -f utf-8 -t ascii poem.txt
(5) curl vs. wget Command
Both the curl command and the wget command allow us to download content from a server over the Internet. In most cases, we can use either one,
$ curl <url> > <path>
or
$ wget -O <path> <url>
But there are some differences between them,
- wget is better at downloading recursively (e.g. with its -r option)
- curl supports more Internet protocols
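If we would rather stay in Python, the standard library can handle a simple download as well. This is a minimal sketch using urllib.request; the URL and output path are placeholders,
from urllib.request import urlretrieve
# Download a page to a local file, similar to curl <url> > <path>.
url = "https://example.com/index.html"   # placeholder URL
path = "/tmp/index.html"                 # placeholder output path
urlretrieve(url, path)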
2. Python Text Processing Tools
Suppose we are given a list of words stored in a variable called words,
(1) Case Insensitivity
This can be done with a list comprehension that lowercases every word,
w = [word.lower() for word in words]
(2) Porter Stemmer
Suppose the list contains the same word in different inflected forms, for example watering and waters, or city and cities. We can collapse these with the Porter stemmer. We have to install the NLTK package first,
$ pip install -q -U nltk
then we import the Porter stemmer and create an instance,
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
To stem the whole list, we apply the stemmer to each (lowercased) word,
w = [stemmer.stem(word) for word in w]
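As a quick sanity check, here is a small sketch showing what the Porter stemmer does to the example words above; different inflections collapse to the same stem, and note that a stem such as citi is not necessarily a dictionary word,
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# Different forms of the same word collapse to one stem,
# e.g. watering/waters -> water, city/cities -> citi.
for word in ['watering', 'waters', 'city', 'cities']:
    print(word, '->', stemmer.stem(word))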
(3) Word Counter
We could use a defaultdict to count the word frequencies in the list, but an easier way is to use a Counter. To use it, first we have to import it,
from collections import Counter
then we can use Counter on our list,
ctr = Counter(w)
this gives us a dictionary-like object mapping each word to its frequency in the list. We can also grab the most common items with the .most_common method,
ctr.most_common(5)
This will give us the five most common items in the list.
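For example, on a small toy list (made up just for illustration) the Counter behaves like this,
from collections import Counter
toy = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the']
ctr = Counter(toy)
print(ctr['the'])           # 3
print(ctr.most_common(2))   # [('the', 3), ('cat', 1)]; ties among the count-1 words may come out in any order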
(4) Filter Out the English Stop Words
It is likely that our frequency dictionary is dominated by English stop words (e.g. the, of, etc.), which are usually uninformative for the analysis that follows. One way to filter them out is to use the stop-word list that ships with scikit-learn. To use ENGLISH_STOP_WORDS, first we have to import it in Python,
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
We can inspect the list with,
print(list(ENGLISH_STOP_WORDS))
To make things easier, we first stem the stop words so they match our stemmed list, and then use the filter function to extract the words that are not stop words,
STOP_WORDS = [stemmer.stem(item) for item in ENGLISH_STOP_WORDS]
w = list(filter(lambda i: i not in list(STOP_WORDS), w))
Then the Counter will not count all these stop words.
ctr = Counter(w)
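A small variant of the same idea, assuming the stemmer, the word list w, and ENGLISH_STOP_WORDS from above: keeping the stemmed stop words in a set makes membership tests faster, and a plain list comprehension reads a bit more directly than filter,
# Stem the stop words once and keep them in a set for fast lookups.
stop_words = {stemmer.stem(item) for item in ENGLISH_STOP_WORDS}
w = [word for word in w if word not in stop_words]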
(5) Word Cloud
Basically, a word cloud is more of a fun visualization than a rigorous analysis, but it is a quick way to see which words dominate a text. First of all, we have to install the wordcloud package,
$ pip install wordcloud
To create a word cloud of our list, we have to import,
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
Similarly, we have to make our list a frequency dictionary in order to create the word cloud,
ctr = Counter(w)
then we create a WordCloud object and fit our frequency dictionary to it with the .fit_words method, which lays the words out as an image,
wordcloud = WordCloud()
wordcloud.fit_words(ctr)
Finally, we are going to plot this word cloud by matplotlib with the .imshow() method,
fig=plt.figure(figsize=(6, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
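If we are running outside a notebook, or simply want to keep the image, the wordcloud package can also write the rendered cloud straight to a file with the .to_file method; the output path here is just an example,
# Save the rendered word cloud to an image file.
wordcloud.to_file("/tmp/wordcloud.png")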
(6) Ignore Non-ASCII Characters
Suppose we want to drop all the non-ASCII characters from a text in Python. One way to do it is,
text = [c for c in text if ord(c)<=127]
text = ''.join(text)
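An equivalent, slightly more compact way to do the same thing is to round-trip the text through the ascii codec and let Python drop anything it cannot encode,
# Drop every non-ASCII character in one step.
text = text.encode('ascii', errors='ignore').decode('ascii')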
3. spaCy
spaCy bills itself as "Industrial-Strength Natural Language Processing" and is a Python package. It is a powerful tool for text processing.
(1) Installation
To install spaCy, we have to use pip,
$ pip install spacy
then we also have to download an English language model,
$ python -m spacy download en_core_web_sm
There are models for other languages as well, but we are not going to use them here. For the following part, we are going to grab Tesla's IPO filing from the SEC,
$ curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html
(2) Extract Text from HTML
Recall what we learned about BeautifulSoup. It is a good fit for extracting the plain text from the HTML file,
import sys
from bs4 import BeautifulSoup
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text
with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
(3) Tokenizing with spaCy
To tokenize the text with spaCy, which will also give us more information about the text, we load the English model and run it on our string,
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(tsla)
then each of the items in the doc variable is a token.
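For example, we can peek at the first few tokens and confirm that each one carries its own text,
# Print the text of the first ten tokens in the document.
for token in doc[:10]:
    print(token.text)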
(4) Extract Parts of Speech
import pandas as pd
winfo = []
for token in doc:
    winfo.append([token.text, token.pos_, token.is_stop])
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])
this will output a table with the words in the file, their parts of speech, and whether or not each word is a stop word.
(5) Extract Information About the Entities
Named entities are the real-world items mentioned in the text, such as years, dates, numbers, organization names, people's names, and so on. The following code walks through all the entities in the text and also tells us the type of each one (i.e. its label),
winfo = []
for ent in doc.ents:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])
To see a full list of how to interpret these labels, refer to the spaCy documentation. Note that the predicted labels can contain some mistakes.
Another quick way to show the entity information is to use the displacy visualizer that ships with spaCy,
from spacy import displacy
displacy.render(doc, style='ent')
This will render the text with the entities highlighted along with their labels.
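displacy.render draws the visualization inline in a Jupyter notebook. If we are running a plain Python script instead, one option is to ask render for the raw markup and save it ourselves; the output path below is just an example,
from spacy import displacy
# Outside a notebook, render returns the markup as a string.
html = displacy.render(doc, style='ent', jupyter=False)
with open('/tmp/tesla_entities.html', 'w') as f:   # example output path
    f.write(html)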
(6) Calculate Word Vectors
Word vectors can be used to estimate how close in meaning two words are, measured by the distance between their vectors. spaCy can directly give us the word vector for each token,
winfo = []
for t in doc:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])
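Once we have vectors, the usual way to compare two words is their (cosine) similarity, which spaCy exposes as the .similarity method. Here is a minimal sketch; note that the small en_core_web_sm model does not ship true static word vectors, so a larger model such as en_core_web_md gives more meaningful scores,
# Compare tokens by the similarity of their vectors.
tokens = nlp("car vehicle banana")
car, vehicle, banana = tokens
print(car.similarity(vehicle))   # expected to be relatively high
print(car.similarity(banana))    # expected to be lower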
(7) Split the Text Into Sentences
spaCy can also be used to split the text into sentences.
winfo = []
for s in doc.sents:
    winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])