(1) Two NLP Workflows
(2) Classical NLP Techniques
Here is a list of techniques used in classical NLP; we will explain each of them in the following parts of this section.
(3) Classical NLP Problems
(4) Word Representation Techniques
(5) Text Cleaning Techniques
(1) Text Tokenization
Text tokenization is the task of chopping text up into pieces called tokens. It is usually used to build a vocabulary that determines the inputs to the model. The difference from splitting by spaces is that tokenization may also split at non-space positions. For example, splitting by spaces gives,
```
"They're not going to the Sam's Market."
["They're", "not", "going", "to", "the", "Sam's", "Market."]
```
However, with tokenization, this text will be split based on the tokens,
```
"They're not going to the Sam's Market."
["They", "'re", "not", "going", "to", "the", "Sam", "'s", "Market", "."]
```
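As an illustrative sketch (the regex pattern is an assumption for demonstration, not a standard tokenizer), this kind of clitic- and punctuation-aware splitting can be approximated with a small regular expression:

```python
import re

def tokenize(text):
    """Split text into word tokens, clitics ('re, 's), and punctuation."""
    # \w+ matches runs of word characters, '\w+ matches clitics,
    # and [^\w\s] matches single punctuation characters.
    return re.findall(r"\w+|'\w+|[^\w\s]", text)
```

Real tokenizers add many more rules (hyphens, URLs, numbers), but the idea is the same.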
(2) Subword Tokenization: Byte Pair Encoding (BPE)
The word-level tokenization technique is not perfect because we can end up with many tokens sharing the same meaning. For example, `low lower lowest` will be split by word-level tokenization into,
```
["low", "lower", "lowest"]
```
However, with BPE, this string will be split into the subword tokens,
```
["low", "er", "est"]
```
Variants of this technique are used by famous NLP models (GPT uses BPE; BERT uses the closely related WordPiece), and the goal of BPE is to represent the text with as few tokens as possible. So now let's see an example. Suppose we have the following string,
```
"low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new"
```
First, let's split by characters and the vocabulary should be,
```
["d", "e", "i", "l", "n", "o", "r", "s", "t", "w", " "]
```
Then, let's find the character pair with the highest frequency in the string, which should be `er`, so we merge it into the token `er` and add it to the vocabulary. So we have,
```
Merge = ("e", "r"), Vocab = ["d", "e", "i", "l", "n", "o", "r", "s", "t", "w", " ", "er"]
```
Continuing this process, we merge the following pairs in sequence,
```
("e", "r")
("er", " ")
("n", "e")
("ne", "w")
("l", "o")
("lo", "w")
("new", "er ")
("low", " ")
```
And the final vocabulary should be,
```
["d", "e", "i", "l", "n", "o", "r", "s", "t", "w", " ", "er", "er ", "ne", "new", "lo", "low", "newer ", "low "]
```
So according to the vocabulary above, for the following string,
```
"newer lower net"
```
We will split it as,
```
["newer ", "low", "er ", "ne", "t"]
```
The BPE training algorithm can be summarized as: repeatedly count all adjacent symbol pairs in the corpus and merge the most frequent pair into a new vocabulary token, until the desired number of merges is reached.
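A toy Python sketch of this training loop (word-internal, ignoring the space symbol used in the worked example above; the helper structure is an illustrative assumption):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-separated corpus (toy sketch)."""
    # Represent each word as a tuple of symbols, keeping word frequencies.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

On the corpus of the worked example, the first learned merge is `("e", "r")`, matching the sequence above.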
(3) Build Vocabulary From Tokenization
So far, we have talked about how to tokenize a text; now we would like to see how to build a vocabulary from the tokenization result. Our goal is to build a vocabulary with the following features,
In order to achieve these goals, we typically cap the vocabulary at the most frequent tokens and map every out-of-vocabulary token to a special UNK token.
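A minimal vocabulary builder along these lines might look like the following sketch (the `<unk>` name and the `max_size`/`min_freq` parameters are illustrative assumptions):

```python
from collections import Counter

def build_vocab(tokens, max_size=None, min_freq=1):
    """Build a token -> index mapping, reserving index 0 for the UNK token."""
    counts = Counter(tokens)
    # Keep the most frequent tokens that meet the frequency threshold.
    kept = [t for t, c in counts.most_common(max_size) if c >= min_freq]
    vocab = {"<unk>": 0}
    for t in kept:
        vocab[t] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Unknown tokens fall back to the UNK index.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]
```

Capping `max_size` keeps the model's input/output layers small; `encode` never fails on unseen words because of the UNK fallback.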
(4) Lemmatization and Stemming
Lemmatization is the process where we take individual tokens from a sentence and try to reduce them to their base (dictionary) form. Stemming finds the stem of a token by crude suffix stripping, which can produce meaningless or ambiguous tokens.
For example, the lemmatization of the following words are,
```
cars -> car
car's -> car
careness -> care
carefully -> care
```
However, for stemming (e.g. Porter stemming), the result will be,
```
cars -> car
car's -> car
careness -> car
carefully -> car
```
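Real stemmers such as the Porter stemmer apply cascades of suffix rules; a drastically simplified toy version (the suffix list is an assumption for illustration only, not the real Porter algorithm) could look like:

```python
def toy_stem(token):
    """Strip a few common suffixes, Porter-style (toy sketch)."""
    for suffix in ("fully", "eness", "ness", "'s", "s"):
        # Only strip when enough of the word remains to be a plausible stem.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```

Note how blunt suffix stripping maps `careness` to the ambiguous stem `car`, as in the Porter example above.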
With lemmatization and stemming, some different words will be mapped to the same token. For example,
```
am, is, are => be
I, you, she, he => pron
```
But because lemmatization and stemming throw away information, we commonly lose some precision about the meanings of the words. So these techniques are generally used for,
(5) Sentence Segmentation
We commonly cannot use the period `.` alone to split sentences, because the period is ambiguous: it also appears in abbreviations such as `Inc.` or `Dr.`, in numbers such as `4.3`, and inside quotations such as `It is "Not good." that he said.`
Because of these issues, we can build a binary classifier (e.g. a decision tree) to decide where the ends of sentences (EOS) are. Here are some common decision features,
- Is there a blank line after?
- Is the token before "." an abbreviation?
- Is the word after "." capitalized?
- The probability of the word after ".".
- The probability of the word before ".".
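These features can be combined into a hand-written rule classifier; the sketch below (the abbreviation list and rule order are illustrative assumptions) labels whether a token is an end of sentence:

```python
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}  # toy list

def is_eos(tokens, i):
    """Rule-based guess whether tokens[i] ends a sentence (toy sketch)."""
    tok = tokens[i]
    if tok != ".":
        # Periods attached to abbreviations or numbers are not sentence ends.
        if tok.lower() in ABBREVIATIONS:
            return False
        if any(ch.isdigit() for ch in tok):
            return False
        if not tok.endswith("."):
            return False
    # A following capitalized word (or end of input) is evidence for EOS.
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    return nxt is None or nxt[:1].isupper()
```

A learned decision tree would use the same features but fit the thresholds and rule order from labeled data.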
(6) POS (part of speech) tagging
Determine the part-of-speech tag for a particular instance of a word. Note that the same word can have more than one POS depending on its context. For example,
```
Apple -> PROPN
is -> VERB
at -> ADP
$ -> SYM
1 -> NUM
```
(7) NER (named entity recognition)
Find and classify different named entities in the text. For example,
```
Adam -> Name
2010 -> Date
Walgreens -> Organization
Los Angeles -> Location
```
This technique is frequently used for,
(8) Dependency Parsing
The dependency structure of a text shows which words depend on which other words.
(9) N-Gram Models
An n-gram model uses contiguous sequences of n items from a given text. For example, suppose we have the following string for training,
```
"I would like to go"
```
And the unigram model has the following training set,

```
["I", "would", "like", "to", "go"]
```
The bigram model has the following training set,

```
[["I", "would"],
 ["would", "like"],
 ["like", "to"],
 ["to", "go"]]
```
And the trigram model has the following training set,

```
[["I", "would", "like"],
 ["would", "like", "to"],
 ["like", "to", "go"]]
```
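The three training sets above come from the same simple sliding-window construction, which can be sketched as:

```python
def ngrams(tokens, n):
    """Return all n-grams of a token sequence, each as a list of n tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
```

Calling it with `n = 1`, `2`, `3` reproduces the unigram, bigram, and trigram sets (the unigram case yields single-element lists rather than bare strings).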
(10) BOW (Bag of Words) Model
The BOW model can be considered a unigram model in which each word is a feature. For example, suppose we have the following two texts,
```
"I would like to go"
"I would like not not to go"
```
Then the vocabulary of these texts should be,
```
["I", "would", "like", "to", "go", "not"]
```
Now, let's view each of the texts above as an unordered bag of words; then the texts can be represented by the following two vectors holding the frequency of each word in its bag.
```
[1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 1, 2]
```
Because each word is treated as an independent feature, this is exactly the unigram model we discussed above.
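A minimal sketch of building these BOW vectors (with the vocabulary in first-occurrence order, as in the example above):

```python
def bow_vectors(texts):
    """Build a shared vocabulary and a bag-of-words count vector per text."""
    vocab = []
    for text in texts:
        for tok in text.split():
            if tok not in vocab:
                vocab.append(tok)  # first-occurrence order
    vectors = []
    for text in texts:
        counts = [0] * len(vocab)
        for tok in text.split():
            counts[vocab.index(tok)] += 1
        vectors.append(counts)
    return vocab, vectors
```

Running it on the two example texts reproduces the vocabulary and the two frequency vectors shown above.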
(1) Downsides of WordNet and One-Hot Encoding
We have encoded the data with the techniques we talked about above, but how can we actually encode the meaning of a word?
One way is to use WordNet to explain the word, but there are some downsides,
Another way is to use one-hot encoding, but there are also some defects,
Therefore, in order to build a vector representation of a word, we have to develop some other techniques.
(2) Word2Vec Embedding: CBOW
The continuous bag-of-words (CBOW) model is used for creating the embedding matrix of the vocabulary; from this embedding matrix, we can then look up the corresponding row to turn a word into a vector. Given a context window of size $m$ (here $m = 2$), we take the $m$ words before the center word and the $m$ words after it as the input.
x1["I","would","to","go"] -> like
So for each continuous context bag of $2m$ words, we look up each context word's row in the embedding matrix and sum these embedding vectors (the bag is unordered, so summation is a natural way to pool the context). Finally, we fit a linear layer to derive the output prediction over the vocabulary.
```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # Sum the context word embeddings into a single vector.
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear(embeds)  # scores over the vocabulary
        return out

    def get_word_embedding(self, word_idx):
        return self.embeddings(word_idx).view(1, -1)
```
Suppose the vocabulary size is $V$ and the embedding size is $d$. Then the number of parameters is $Vd$ for the embedding matrix plus $dV + V$ for the linear layer (weight and bias), i.e. $2Vd + V$ in total.
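As a quick sanity check (assuming the linear layer keeps its bias term, as in the sketch above), the parameter count can be computed directly:

```python
def cbow_param_count(vocab_size, embedding_dim):
    """Parameters of the CBOW model above: embedding + linear weight + bias."""
    embedding = vocab_size * embedding_dim
    linear = embedding_dim * vocab_size + vocab_size  # weight + bias
    return embedding + linear
```

For example, with a vocabulary of 10,000 words and 300-dimensional embeddings, the model has 6,010,000 parameters.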
(3) Deep CBOW
Because we have only one linear layer in the CBOW model above, the performance may not be good enough. One improvement is to add multiple linear layers followed by activations, making it a DeepCBOW model.
```python
import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        # Sum the context word embeddings into a single vector.
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear1(embeds)
        out = self.relu(out)
        out = self.linear2(out)  # scores over the vocabulary
        return out

    def get_word_embedding(self, word_idx):
        return self.embeddings(word_idx).view(1, -1)
```
(4) N-gram of Neural Bags
One problem with CBOW is that it cannot capture word order, because the words in a bag are unordered. We can leverage the n-gram model to recover some sequence information. However, this model has several downsides,
Therefore, most NLP tasks are instead framed as sequence representation learning problems, which use a neural sequence model.
(5) Word2Vec Embedding: Skip-gram
Skip-gram is the inverse of CBOW: it selects the center word as the training input, and then randomly selects one word from the context of that word as its target. So for the example above, we can have the training set,
```
like -> would
like -> I
like -> to
like -> go
```
Denote the embedding vector of the center word $c$ as $v_c$ and the embedding vector of a context word $o$ as $u_o$. Then we estimate the target probability with a softmax over the vocabulary $V$,

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

Therefore,
```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding_x = nn.Embedding(vocab_size, embedding_dim)  # center words
        self.embedding_y = nn.Embedding(vocab_size, embedding_dim)  # context words

    def forward(self, x, y):
        x = self.embedding_x(x)
        y = self.embedding_y(y)
        # Dot product of center and context embeddings gives the pair score.
        return (x * y).sum(dim=-1)

    def get_word_embedding(self, word_idx):
        return self.embedding_x(word_idx).view(1, -1)
```
Note that we can also write the model with a bias-free linear layer representing the output (y) embeddings.
```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # The bias-free linear layer plays the role of the context embeddings.
        self.linear = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)  # scores over all context words
        return x

    def get_word_embedding(self, word_idx):
        return self.embedding(word_idx).view(1, -1)
Note that here we select the input embedding (`embedding_x`, or `embedding` in the linear version) as our final embedding, and in both of the cases we have $2Vd$ parameters.
(6) Skip-gram with Negative Sampling
To make this problem easier, a trick is to convert the softmax problem into a binary classification problem by labeling all the observed (center, context) pairs as 1.
```
like would 1
like I 1
like go 1
like to 1
```
Similar to a matrix factorization problem, this dataset contains no negative examples, so we have to conduct negative sampling. Commonly, we sample 5-20 negatives per positive for a small dataset and 2-5 negatives per positive for a large one. So we add these negatives to our dataset,
```
like would 1
like car 0
like shop 0
like apple 0
like note 0
```
Because we draw these negative samples from the whole corpus, there is another problem: stop words appear so often that we have to reduce the probability of choosing them. For example,
```
like the 0
```
So the probability of choosing a word $w_i$ for negative sampling is computed as,

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$$

where $f(w_i)$ is the frequency of the word.
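This smoothed unigram distribution can be sketched as follows (the 3/4 exponent is the value used in the word2vec paper):

```python
def negative_sampling_probs(freqs, power=0.75):
    """Unigram distribution raised to the 3/4 power, then renormalized."""
    weights = [f ** power for f in freqs]
    total = sum(weights)
    return [w / total for w in weights]
```

Raising frequencies to a power below 1 shrinks the gap between common and rare words, so stop words are sampled less often relative to their raw frequency.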
```python
import torch
import torch.nn as nn

class SkipGram_neg(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding_x = nn.Embedding(vocab_size, embedding_dim)  # center words
        self.embedding_y = nn.Embedding(vocab_size, embedding_dim)  # context words

    def forward(self, x, y):
        x = self.embedding_x(x)
        y = self.embedding_y(y)
        # Sigmoid of the pair score: probability that (x, y) is a true pair.
        return torch.sigmoid((x * y).sum(dim=-1))

    def get_word_embedding(self, word_idx):
        return self.embedding_x(word_idx).view(1, -1)
```
Note that we should apply `binary_cross_entropy` as our loss function here.
(7) GloVe Embedding
We will not look into the details of this one, but the basic idea of GloVe is to cap the loss contribution of very frequent co-occurrences (like those involving stop words) while downweighting rare ones. Let's define $X_{ij}$ as the number of times word $j$ appears in the context of word $i$. Then the loss function of GloVe is,

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where,

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

From this formula, we can see that the number of parameters in this model is $2Vd + 2V$ (two embedding matrices plus two bias vectors), and the final embedding should be the average of the embedding matrices $W$ and $\tilde{W}$,

$$E = \frac{W + \tilde{W}}{2}$$
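The weighting function can be sketched as follows (`x_max = 100` and `alpha = 3/4` are the values suggested in the GloVe paper):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): downweights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

Because the weight is capped at 1, extremely frequent pairs (e.g. those involving stop words) cannot dominate the loss, while very rare pairs contribute almost nothing.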
(8) t-SNE
t-SNE is a nonlinear nondeterministic algorithm that tries to preserve local neighbourhoods in the data, often at the expense of distorting the global structure.