Advanced Machine Learning 5 | Introduction to NLP


1. Natural Language Processing Techniques

(1) Two NLP Workflows

(2) Classical NLP Techniques

Here is a list of the techniques that we use for classical NLP; we will explain them in the following parts of this section.

(3) Classical NLP Problems

(4) Word Representation Techniques

(5) Text Cleaning Techniques

2. Classical NLP Techniques

(1) Text Tokenization

Text tokenization is the task of chopping up text into pieces called tokens. It is usually used to build a vocabulary that determines the inputs to the model. The difference from splitting by spaces is that tokenization also splits at some non-space positions (for example, separating punctuation from words). For example, with splitting by spaces, we have,

However, with tokenization, this text will be split into the following tokens,
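As a quick illustration (the sentence below is a made-up example, not the one from the lecture), we can compare splitting by spaces with a crude regex tokenizer that also splits at non-space positions:

```python
import re

text = "Mr. O'Neill didn't pay $3.50, did he?"

# Splitting by spaces keeps punctuation attached to the words.
print(text.split())
# ["Mr.", "O'Neill", "didn't", "pay", "$3.50,", "did", "he?"]

# A simple regex tokenizer also splits at some non-space positions,
# e.g. separating punctuation from the words it touches.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Mr', '.', 'O', "'", 'Neill', 'didn', "'", 't', 'pay', '$', '3', '.', '50', ',', 'did', 'he', '?']
```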

(2) Subword Tokenization: Byte Pair Encoding (BPE)

Word-level tokenization is not perfect because we can end up with many tokens that share the same meaning. For example, low lower lowest will be split by word-level tokenization into,

However, with BPE, this string will be split into the tokens,

This technique and its close variants (e.g. WordPiece) are used by well-known NLP models such as BERT, and the goal of BPE is to represent the text with as few tokens as possible. So now let's see an example. Suppose we have the following string,

First, let's split by characters and the vocabulary should be,

Then, let's find the character pair with the highest frequency in the string, which should be er, so we merge er into a single token and add it to the vocabulary. So we have,

Continuing this process, we will merge the following tokens in sequence,

And the final vocabulary should be,

So according to the vocabulary above, for the following string,

We will split it as,

The pseudocode for the BPE algorithm is sketched below.

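A minimal Python sketch of the BPE training loop (the toy corpus and the number of merges below are illustrative assumptions):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs over the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of characters plus an
# end-of-word marker, mapped to its frequency (made-up numbers).
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # hyperparameter: how many merge operations to learn
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```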

(3) Build Vocabulary From Tokenization

So far, we have talked about how to tokenize a text, and now we would like to see how to build a vocabulary from the tokenization result. Our goal is to build a vocabulary with the following features,

In order to achieve these goals, we have to,
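As a concrete sketch, here is one common way to build such a vocabulary, using special <pad> and <unk> tokens and a minimum-frequency cutoff (these particular conventions are common defaults, not necessarily the exact ones intended here):

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2):
    """Map tokens to integer ids, reserving ids for padding and unknown words."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:             # drop rare tokens to limit vocabulary size
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace out-of-vocabulary tokens with <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat"]]
vocab = build_vocab(corpus, min_freq=2)
print(vocab)                                   # {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3, 'sat': 4}
print(encode(["the", "bird", "sat"], vocab))   # [2, 1, 4]
```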

(4) Lemmatization and Stemming

Lemmatization is the process where we take individual tokens from a sentence and try to reduce them to their base form (lemma). Stemming finds the stem of a token by stripping suffixes, which can lead to meaningless or ambiguous tokens.

For example, the lemmatization of the following words are,

However, for stemming (e.g. Porter stemming), the result will be,

With lemmatization and stemming, some different words will be mapped to the same token. For example,
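A small sketch using NLTK's WordNetLemmatizer and PorterStemmer (the word list is made up, and the WordNet resource may need to be downloaded first):

```python
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("wordnet", quiet=True)   # needed once for the lemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

words = ["running", "ran", "studies", "better", "universities"]

# Lemmatization reduces words to a valid base form (here treating them as verbs).
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# e.g. ['run', 'run', 'study', 'better', 'universities']

# Stemming chops suffixes and can produce tokens that are not real words.
print([stemmer.stem(w) for w in words])
# e.g. ['run', 'ran', 'studi', 'better', 'univers']
```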

But because lemmatization and stemming discard some information, we commonly lose some precision about the meanings of the words. So these techniques are generally used for,

(5) Sentence Segmentation

We commonly cannot use the period . alone to split sentences, because it is ambiguous (it also appears in abbreviations and decimal numbers). For example,

Because of these issues, we can build a binary classifier (e.g. a decision tree) for deciding where the ends of sentences (EOSs) are. Here are some common decision rules,
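A hand-rolled sketch of such a rule-based EOS decision (the abbreviation list and the specific rules below are illustrative assumptions; a real system would learn them from labeled data):

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}

def is_end_of_sentence(tokens, i):
    """Decide whether the period-like token at position i ends a sentence."""
    if tokens[i] not in {".", "!", "?"}:
        return False
    prev_tok = tokens[i - 1].lower() + "." if i > 0 else ""
    if prev_tok in ABBREVIATIONS:           # "Dr.", "e.g." etc. do not end a sentence
        return False
    if i + 1 < len(tokens) and not tokens[i + 1][0].isupper():
        return False                        # next word is lowercase -> likely not an EOS
    return True

tokens = ["Dr", ".", "Smith", "arrived", "yesterday", ".", "He", "stayed", "a", "week", "."]
print([i for i in range(len(tokens)) if is_end_of_sentence(tokens, i)])
# [5, 10]: only the periods after "yesterday" and "week" end sentences
```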

(6) POS (Part of Speech) Tagging

POS tagging determines the part-of-speech tag for a particular instance of a word. Note that the same word can have more than one POS depending on its context. For example,
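For instance, using NLTK's off-the-shelf tagger (the required resources may need to be downloaded, and exact tags can vary by tagger version):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# "refuse" and "permit" receive different POS tags depending on context.
sentence = "They refuse to permit us to obtain the refuse permit"
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
#       ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
```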

(7) NER (Named Entity Recognition)

NER finds and classifies named entities in the text. For example,
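For instance, using spaCy's pretrained pipeline (assuming the en_core_web_sm model has been downloaded):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion in 2024.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple -> ORG, U.K. -> GPE, $1 billion -> MONEY, 2024 -> DATE
```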

This technique is frequently used for,

(8) Dependency Parsing

The dependency structure of a sentence shows which words depend on (modify or are arguments of) which other words.
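A quick sketch with spaCy's dependency parser (again assuming the small English model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes en_core_web_sm is installed
doc = nlp("The quick brown fox jumps over the lazy dog")

# Each token points to its syntactic head via a typed dependency relation.
for token in doc:
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")
# e.g. fox --nsubj--> jumps, over --prep--> jumps, dog --pobj--> over, ...
```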

(9) N-Gram Models

An n-gram model uses contiguous sequences of n items (e.g. words) from a given text. For example, let's say we have the following string for training,

And the unigram model has the following training set,

The bigram model has the following training set,

And the trigram model has the following training set,
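A minimal sketch that generates these training items from a made-up sentence:

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()   # made-up training sentence

print(ngrams(tokens, 1))   # unigrams: [('the',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
print(ngrams(tokens, 2))   # bigrams:  [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
print(ngrams(tokens, 3))   # trigrams: [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]
```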

(10) BOW (Bag of Words) Model

The BOW model can be considered a unigram model in which each word is treated as a feature. For example, suppose we have the following two texts,

Then the vocabulary of these texts should be,

Now, let's view each of the texts above as an unordered bag of words; the texts can then be represented by the following two vectors containing the frequency of each word in its bag.
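A small sketch using scikit-learn's CountVectorizer on two made-up texts (not the original examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up example texts.
texts = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```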

We can see that, because each individual word is treated as a feature, this is exactly the unigram model we talked about above.

3. Word Representation Techniques

(1) Downsides of WordNet and One-Hot Encoding

We have encoded the data with the techniques we talked about above, but how can we actually encode the meaning of a word?

One way is to use WordNet to describe the word, but there are some downsides,

Another way is to use one-hot encoding, but there are also some defects,

Therefore, in order to obtain a vector representation of a word, we have to develop some other techniques.

(2) Word2Vec Embedding: CBOW

The continuous bag of words (CBOW) is a model used for learning an embedding matrix for the vocabulary. From this embedding matrix, we can then look up the corresponding row to turn a word into a vector. Given a context window of size $m$, we take the $m$ words before the center word and the $m$ words after the center word as the input.

So for each continuous context bag containing $2m$ words, we look up each word's row in the embedding matrix and sum these $2m$ embedding vectors into a single context vector (summing is order-insensitive, which is why this is a bag of words). Finally, we fit a linear layer on this vector to produce the output predictions over the vocabulary.

Suppose the vocabulary size is $|V|$ and the embedding size is $d$. Then the number of parameters is $|V| \times d$ for the embedding matrix plus $d \times |V|$ (and the bias) for the linear layer, i.e. roughly $2|V|d$ in total.
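A minimal PyTorch sketch of this model (the vocabulary size, embedding size, and window size below are made-up assumptions):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict the center word from the sum of its context word embeddings."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # |V| x d parameters
        self.linear = nn.Linear(embed_dim, vocab_size)        # d x |V| (+ bias) parameters

    def forward(self, context_ids):            # context_ids: (batch, 2m)
        vectors = self.embedding(context_ids)  # (batch, 2m, d)
        bag = vectors.sum(dim=1)               # sum the context embeddings
        return self.linear(bag)                # (batch, |V|) scores over the vocabulary

model = CBOW(vocab_size=10_000, embed_dim=64)   # made-up sizes
context = torch.randint(0, 10_000, (8, 4))      # batch of 8, window m = 2 -> 4 context words
logits = model(context)
print(logits.shape)                             # torch.Size([8, 10000])
```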

(3) Deep CBOW

Because we have only one linear layer in the CBOW model above, the performance may not be good enough. One improvement is to add multiple linear layers, each followed by an activation, to the CBOW model, turning it into a Deep CBOW model.
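A sketch of this change, simply swapping the single linear layer for a small MLP (the hidden size and activation are assumptions):

```python
import torch.nn as nn

class DeepCBOW(nn.Module):
    """CBOW with a small MLP (linear layers + nonlinearities) instead of one linear layer."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, context_ids):                    # context_ids: (batch, 2m)
        bag = self.embedding(context_ids).sum(dim=1)   # summed context embeddings
        return self.mlp(bag)                           # (batch, |V|) scores

model = DeepCBOW(vocab_size=10_000, embed_dim=64, hidden_dim=128)
```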

(4) Neural Bag of N-Grams

One problem with CBOW is that it cannot capture the order of the words, because the words in the bag are unordered. So we can leverage the n-gram model to obtain some local ordering information. However, there are several downsides to this model,

Therefore, most NLP tasks are instead framed as sequence representation learning problems, which use a neural sequence model.

(5) Word2Vec Embedding: Skip-gram

Skip-gram is the inverse of the CBOW model: it selects a center word as the training input, and then randomly selects one word from the context of this word as its target. So for the example above, we can construct the training set as,
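A minimal sketch of generating such (center word, sampled context word) pairs, following the "randomly pick one context word" description above (the sentence is made up):

```python
import random

def skipgram_pairs(tokens, window=2):
    """For each center word, pair it with one randomly chosen word from its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((center, random.choice(context)))
    return pairs

tokens = "the quick brown fox jumps".split()   # made-up sentence
print(skipgram_pairs(tokens, window=2))
# e.g. [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'brown')]
```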

Denote the embedding vector of the center word $w$ as $v_w$ and the embedding vector of a context word $c$ as $u_c$. Then we estimate the target probability with a softmax over the vocabulary,

$$P(c \mid w) = \frac{\exp(u_c^{\top} v_w)}{\sum_{c' \in V} \exp(u_{c'}^{\top} v_w)}$$

Therefore, the training objective is to maximize $\log P(c \mid w)$ over the training pairs, i.e. to minimize the cross-entropy loss.

Note that we can also implement this as a naive model in which a linear layer represents the output (context-side) embedding.

Note that here we select the center-word embeddings $v$ as our final embedding, and in both cases (CBOW and skip-gram) we have roughly $2|V|d$ parameters.

(6) Skip-gram with Negative Sampling

In order to make this problem easier, a trick is to convert the softmax problem into a binary classification problem by assigning the label 1 to all observed (word, context) pairs.

Similar to a matrix factorization problem, this dataset contains no negative labels, so we have to add some negative samples. Commonly, we sample 5-20 negatives for each positive pair on a small dataset and 2-5 negatives for each positive pair on a large dataset. So what we can do is add these negative pairs to our dataset,

Because we take these negative samples from the whole corpus, there is another problem: stop words appear far too often in the corpus, so we have to reduce the probability of choosing them. For example,

So the probability of choosing a word $w$ for sampling is computed as,

$$P(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}}$$

Where $f(w)$ is the frequency of the word $w$ in the corpus.

Note that we should apply binary_cross_entropy as our loss function here.
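Putting these pieces together, here is a minimal PyTorch sketch of skip-gram with negative sampling; it uses the binary_cross_entropy_with_logits variant for numerical stability, and all sizes and frequencies below are stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGNS(nn.Module):
    """Skip-gram with negative sampling: score (word, context) pairs with a dot product."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)   # center-word embeddings (final embedding)
        self.out_embed = nn.Embedding(vocab_size, embed_dim)  # context-word embeddings

    def forward(self, centers, contexts):
        v = self.in_embed(centers)            # (batch, d)
        u = self.out_embed(contexts)          # (batch, d)
        return (v * u).sum(dim=-1)            # dot-product logits

vocab_size, embed_dim, k = 1000, 64, 5        # made-up sizes; k = negatives per positive
freqs = torch.rand(vocab_size)                # stand-in word frequencies
sampling_probs = freqs ** 0.75                # damp frequent (stop) words
sampling_probs /= sampling_probs.sum()

centers = torch.randint(0, vocab_size, (32,))
positives = torch.randint(0, vocab_size, (32,))
negatives = torch.multinomial(sampling_probs, 32 * k, replacement=True)

model = SGNS(vocab_size, embed_dim)
pos_logits = model(centers, positives)
neg_logits = model(centers.repeat_interleave(k), negatives)

# Positives get label 1, sampled negatives get label 0.
loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits)) + \
       F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
print(loss.item())
```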

(7) GloVe Embedding

We will not look into the details of this one, but the basic idea of GloVe is to down-weight the contribution of frequently appearing words (like stop words). Let's define $X_{ij}$ as the number of times word $j$ appears in the context of word $i$. Then the loss function of GloVe is,

$$J = \sum_{i, j = 1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Where the weighting function $f$ caps the influence of very frequent co-occurrences,

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

From this formula, we can see that the number of parameters in this model is $2|V|d$ for the two embedding matrices (plus $2|V|$ bias terms), and the final embedding should be the average of the embedding matrices $W$ and $\tilde{W}$,

$$W_{\text{final}} = \frac{W + \tilde{W}}{2}$$
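A small NumPy sketch of the weighting function and loss above (the co-occurrence counts and sizes are stand-ins):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weight rare co-occurrences and cap frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted squared error between dot products and log co-occurrence counts."""
    mask = X > 0                                   # the loss only covers observed pairs
    dots = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = (dots - np.log(np.where(mask, X, 1.0))) ** 2
    return np.sum(glove_weight(X) * err * mask)

V, d = 50, 8                                       # made-up vocabulary and embedding sizes
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)).astype(float)    # stand-in co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)

print(glove_loss(W, W_tilde, b, b_tilde, X))
final_embedding = (W + W_tilde) / 2                # average the two embedding matrices
```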

(8) t-SNE

t-SNE is a nonlinear nondeterministic algorithm that tries to preserve local neighbourhoods in the data, often at the expense of distorting the global structure.
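For example, a typical usage with scikit-learn's TSNE on a stand-in embedding matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in word embeddings (e.g. rows of a trained embedding matrix).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))     # 200 made-up words, 64-dimensional

# Perplexity roughly controls the size of the local neighbourhood that is preserved.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(embeddings)
print(coords.shape)                         # (200, 2), ready for a 2-D scatter plot
```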