(1) Two NLP Workflows
(2) Classical NLP Techniques
Here is a list of techniques used in classical NLP; we will explain each of them in the following parts of this section.
(3) Classical NLP Problems
(4) Word Representation Techniques
(5) Text Cleaning Techniques
(1) Text Tokenization
Text tokenization is the task of chopping text up into pieces called tokens. It is usually used to build a vocabulary that determines the inputs to the model. The difference from splitting by spaces is that tokenization may also split at non-space positions. For example, splitting by spaces gives,
```
"They're not going to the Sam's Market."
["They're", "not", "going", "to", "the", "Sam's", "Market."]
```
However, with tokenization, this text will be split based on the tokens,
```
"They're not going to the Sam's Market."
["They", "'re", "not", "going", "to", "the", "Sam", "'s", "Market", "."]
```
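As an illustrative sketch (the regex pattern is an assumption for demonstration, not a standard tokenizer), this kind of clitic- and punctuation-aware splitting can be approximated with a small regular expression:

```python
import re

def tokenize(text):
    """Split text into word tokens, clitics ('re, 's), and punctuation."""
    # \w+ matches runs of word characters, '\w+ matches clitics,
    # and [^\w\s] matches single punctuation characters.
    return re.findall(r"\w+|'\w+|[^\w\s]", text)
```

Real tokenizers add many more rules (hyphens, URLs, numbers), but the idea is the same.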
(2) Subword Tokenization: Byte Pair Encoding (BPE)
The word-level tokenization technique is not perfect because we can end up with many tokens sharing the same meaning. For example, `low lower lowest` will be split by word-level tokenization into,
```
["low", "lower", "lowest"]
```
However, with BPE, this string will be split into the subword tokens,
```
["low", "er", "est"]
```
Variants of this technique are used by famous NLP models (GPT uses BPE; BERT uses the closely related WordPiece), and the goal of BPE is to represent the text with as few tokens as possible. So now let's see an example. Suppose we have the following string,
```
"low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new"
```
First, let's split by characters and the vocabulary should be,
```
["d", "e", "i", "l", "n", "o", "r", "s", "t", "w", " "]
```
Then, let's find the character pair with the highest frequency in the string, which should be `er`, so we merge it into the token `er` and add it to the vocabulary. So we have,
```
Merge = ("e", "r"), Vocab = ["d", "e", "i", "l", "n", "o", "r", "s", "t", "w", " ", "er"]
```
Continuing this process, we merge the following pairs in sequence,
```
("e", "r")
("er", " ")
("n", "e")
("ne", "w")
("l", "o")
("lo", "w")
("new", "er ")
("low", " ")
```
And the final vocabulary should be,
```
["d", "e", "i", "l", "n", "o", "r", "s", "t", "w", " ", "er", "er ", "ne", "new", "lo", "low", "newer ", "low "]
```
So according to the vocabulary above, for the following string,
```
"newer lower net"
```
We will split it as,
```
["newer ", "low", "er ", "ne", "t"]
```
The BPE training algorithm can be summarized as: repeatedly count all adjacent symbol pairs in the corpus and merge the most frequent pair into a new vocabulary token, until the desired number of merges is reached.
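A toy Python sketch of this training loop (word-internal, ignoring the space symbol used in the worked example above; the helper structure is an illustrative assumption):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-separated corpus (toy sketch)."""
    # Represent each word as a tuple of symbols, keeping word frequencies.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

On the corpus of the worked example, the first learned merge is `("e", "r")`, matching the sequence above.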
(3) Build Vocabulary From Tokenization
So far, we have talked about how to tokenize a text; now we would like to see how to build a vocabulary from the tokenization result. Our goal is to build a vocabulary with the following features,
In order to achieve these goals, we typically cap the vocabulary at the most frequent tokens and map every out-of-vocabulary token to a special UNK token.
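A minimal vocabulary builder along these lines might look like the following sketch (the `<unk>` name and the `max_size`/`min_freq` parameters are illustrative assumptions):

```python
from collections import Counter

def build_vocab(tokens, max_size=None, min_freq=1):
    """Build a token -> index mapping, reserving index 0 for the UNK token."""
    counts = Counter(tokens)
    # Keep the most frequent tokens that meet the frequency threshold.
    kept = [t for t, c in counts.most_common(max_size) if c >= min_freq]
    vocab = {"<unk>": 0}
    for t in kept:
        vocab[t] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Unknown tokens fall back to the UNK index.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]
```

Capping `max_size` keeps the model's input/output layers small; `encode` never fails on unseen words because of the UNK fallback.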
(4) Lemmatization and Stemming
Lemmatization is the process where we take individual tokens from a sentence and try to reduce them to their base (dictionary) form. Stemming finds the stem of a token by crude suffix stripping, which can produce meaningless or ambiguous tokens.
For example, the lemmatization of the following words are,
```
cars -> car
car's -> car
careness -> care
carefully -> care
```
However, for stemming (e.g. Porter stemming), the result will be,
```
cars -> car
car's -> car
careness -> car
carefully -> car
```
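Real stemmers such as the Porter stemmer apply cascades of suffix rules; a drastically simplified toy version (the suffix list is an assumption for illustration only, not the real Porter algorithm) could look like:

```python
def toy_stem(token):
    """Strip a few common suffixes, Porter-style (toy sketch)."""
    for suffix in ("fully", "eness", "ness", "'s", "s"):
        # Only strip when enough of the word remains to be a plausible stem.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```

Note how blunt suffix stripping maps `careness` to the ambiguous stem `car`, as in the Porter example above.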
With lemmatization and stemming, some different words will be mapped to the same token. For example,
```
am, is, are => be
I, you, she, he => pron
```
But because lemmatization and stemming throw away information, we commonly lose some precision about the meanings of the words. So these techniques are generally used for,
(5) Sentence Segmentation
We commonly cannot use the period `.` alone to split sentences, because the period is ambiguous: it also appears in abbreviations such as `Inc.` or `Dr.`, in numbers such as `4.3`, and inside quotations such as `It is "Not good." that he said.`
Because of these issues, we can build a binary classifier (e.g. a decision tree) to decide where the ends of sentences (EOS) are. Here are some common decision features,
- Is there a blank line after?
- Is the token before "." an abbreviation?
- Is the word after "." capitalized?
- The probability of the word after ".".
- The probability of the word before ".".
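These features can be combined into a hand-written rule classifier; the sketch below (the abbreviation list and rule order are illustrative assumptions) labels whether a token is an end of sentence:

```python
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}  # toy list

def is_eos(tokens, i):
    """Rule-based guess whether tokens[i] ends a sentence (toy sketch)."""
    tok = tokens[i]
    if tok != ".":
        # Periods attached to abbreviations or numbers are not sentence ends.
        if tok.lower() in ABBREVIATIONS:
            return False
        if any(ch.isdigit() for ch in tok):
            return False
        if not tok.endswith("."):
            return False
    # A following capitalized word (or end of input) is evidence for EOS.
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    return nxt is None or nxt[:1].isupper()
```

A learned decision tree would use the same features but fit the thresholds and rule order from labeled data.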
(6) POS (part of speech) tagging
Determine the part-of-speech tag for a particular instance of a word. Note that the same word can have more than one POS depending on its context. For example,
```
Apple -> PROPN
is -> VERB
at -> ADP
$ -> SYM
1 -> NUM
```
(7) NER (named entity recognition)
Find and classify different named entities in the text. For example,
```
Adam -> Name
2010 -> Date
Walgreens -> Organization
Los Angeles -> Location
```
This technique is frequently used for,
(8) Dependency Parsing
The dependency structure of a text shows which words depend on which other words.
(9) N-Gram Models
An n-gram model uses contiguous sequences of n items from a given text. For example, suppose we have the following string for training,
```
"I would like to go"
```
And the unigram model has the following training set,

```
["I", "would", "like", "to", "go"]
```
The bigram model has the following training set,

```
[["I", "would"],
 ["would", "like"],
 ["like", "to"],
 ["to", "go"]]
```
And the trigram model has the following training set,

```
[["I", "would", "like"],
 ["would", "like", "to"],
 ["like", "to", "go"]]
```
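The three training sets above come from the same simple sliding-window construction, which can be sketched as:

```python
def ngrams(tokens, n):
    """Return all n-grams of a token sequence, each as a list of n tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
```

Calling it with `n = 1`, `2`, `3` reproduces the unigram, bigram, and trigram sets (the unigram case yields single-element lists rather than bare strings).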
(10) BOW (Bag of Words) Model
The BOW model can be considered a unigram model in which each word is a feature. For example, suppose we have the following two texts,
```
"I would like to go"
"I would like not not to go"
```
Then the vocabulary of these texts should be,
```
["I", "would", "like", "to", "go", "not"]
```
Now, let's view each of the texts above as an unordered bag of words; then the texts can be represented by the following two vectors holding the frequency of each word in its bag.
```
[1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 1, 2]
```
Because each word is treated as an independent feature, this is exactly the unigram model we discussed above.
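A minimal sketch of building these BOW vectors (with the vocabulary in first-occurrence order, as in the example above):

```python
def bow_vectors(texts):
    """Build a shared vocabulary and a bag-of-words count vector per text."""
    vocab = []
    for text in texts:
        for tok in text.split():
            if tok not in vocab:
                vocab.append(tok)  # first-occurrence order
    vectors = []
    for text in texts:
        counts = [0] * len(vocab)
        for tok in text.split():
            counts[vocab.index(tok)] += 1
        vectors.append(counts)
    return vocab, vectors
```

Running it on the two example texts reproduces the vocabulary and the two frequency vectors shown above.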
(1) Downsides of WordNet and One-Hot Encoding
We have encoded the data with the techniques we talked about above, but how can we actually encode the meaning of a word?
One way is to use WordNet to explain the word, but there are some downsides,
Another way is to use one-hot encoding, but there are also some defects,
Therefore, in order to build a vector representation of a word, we have to develop some other techniques.
(2) Word2Vec Embedding: CBOW
The continuous bag-of-words (CBOW) model is used for creating the embedding matrix of the vocabulary; from this embedding matrix, we can then look up the corresponding row to turn a word into a vector. Given a context window of size $m$ (here $m = 2$), we take the $m$ words before the center word and the $m$ words after it as the input.
x1["I","would","to","go"] -> like
So for each continuous context bag of $2m$ words, we look up each context word's row in the embedding matrix and sum these embedding vectors (the bag is unordered, so summation is a natural way to pool the context). Finally, we fit a linear layer to derive the output prediction over the vocabulary.
```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # Sum the context word embeddings into a single vector.
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear(embeds)  # scores over the vocabulary
        return out

    def get_word_embedding(self, word_idx):
        return self.embeddings(word_idx).view(1, -1)
```
Suppose the vocabulary size is $V$ and the embedding size is $d$. Then the number of parameters is $Vd$ for the embedding matrix plus $dV + V$ for the linear layer (weight and bias), i.e. $2Vd + V$ in total.
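As a quick sanity check (assuming the linear layer keeps its bias term, as in the sketch above), the parameter count can be computed directly:

```python
def cbow_param_count(vocab_size, embedding_dim):
    """Parameters of the CBOW model above: embedding + linear weight + bias."""
    embedding = vocab_size * embedding_dim
    linear = embedding_dim * vocab_size + vocab_size  # weight + bias
    return embedding + linear
```

For example, with a vocabulary of 10,000 words and 300-dimensional embeddings, the model has 6,010,000 parameters.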
(3) Deep CBOW
Because we have only one linear layer in the CBOW model above, the performance may not be good enough. One improvement is to add multiple linear layers followed by activations, making it a DeepCBOW model.
```python
import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        # Sum the context word embeddings into a single vector.
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear1(embeds)
        out = self.relu(out)
        out = self.linear2(out)  # scores over the vocabulary
        return out

    def get_word_embedding(self, word_idx):
        return self.embeddings(word_idx).view(1, -1)
```
(4) N-gram of Neural Bags
One problem with CBOW is that it cannot capture word order, because the words in a bag are unordered. We can leverage the n-gram model to recover some sequence information. However, this model has several downsides,
Therefore, most NLP tasks are instead framed as sequence representation learning problems, which use a neural sequence model.
(5) Word2Vec Embedding: Skip-gram
Skip-gram is the inverse of CBOW: it selects the center word as the training input, and then randomly selects one word from the context of that word as its target. So for the example above, we can have the training set,
```
like -> would
like -> I
like -> to
like -> go
```
Denote the embedding vector of the center word $c$ as $v_c$ and the embedding vector of a context word $o$ as $u_o$. Then we estimate the target probability with a softmax over the vocabulary $V$,

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

Therefore,
```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding_x = nn.Embedding(vocab_size, embedding_dim)  # center words
        self.embedding_y = nn.Embedding(vocab_size, embedding_dim)  # context words

    def forward(self, x, y):
        x = self.embedding_x(x)
        y = self.embedding_y(y)
        # Dot product of center and context embeddings gives the pair score.
        return (x * y).sum(dim=-1)

    def get_word_embedding(self, word_idx):
        return self.embedding_x(word_idx).view(1, -1)
```
Note that we can also write the model with a bias-free linear layer representing the output (y) embeddings.
```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # The bias-free linear layer plays the role of the context embeddings.
        self.linear = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)  # scores over all context words
        return x

    def get_word_embedding(self, word_idx):
        return self.embedding(word_idx).view(1, -1)
Note that here we select the input embedding (`embedding_x`, or `embedding` in the linear version) as our final embedding, and in both of the cases we have $2Vd$ parameters.
(6) Skip-gram with Negative Sampling
To make this problem easier, a trick is to convert the softmax problem into a binary classification problem by labeling all the observed (center, context) pairs as 1.
```
like would 1
like I 1
like go 1
like to 1
```
Similar to a matrix factorization problem, this dataset contains no negative examples, so we have to conduct negative sampling. Commonly, we sample 5-20 negatives per positive for a small dataset and 2-5 negatives per positive for a large one. So we add these negatives to our dataset,
```
like would 1
like car 0
like shop 0
like apple 0
like note 0
```
Because we draw these negative samples from the whole corpus, there is another problem: stop words appear so often that we have to reduce the probability of choosing them. For example,
```
like the 0
```
So the probability of choosing a word $w_i$ for negative sampling is computed as,

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$$

where $f(w_i)$ is the frequency of the word.
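This smoothed unigram distribution can be sketched as follows (the 3/4 exponent is the value used in the word2vec paper):

```python
def negative_sampling_probs(freqs, power=0.75):
    """Unigram distribution raised to the 3/4 power, then renormalized."""
    weights = [f ** power for f in freqs]
    total = sum(weights)
    return [w / total for w in weights]
```

Raising frequencies to a power below 1 shrinks the gap between common and rare words, so stop words are sampled less often relative to their raw frequency.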
```python
import torch
import torch.nn as nn

class SkipGram_neg(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding_x = nn.Embedding(vocab_size, embedding_dim)  # center words
        self.embedding_y = nn.Embedding(vocab_size, embedding_dim)  # context words

    def forward(self, x, y):
        x = self.embedding_x(x)
        y = self.embedding_y(y)
        # Sigmoid of the pair score: probability that (x, y) is a true pair.
        return torch.sigmoid((x * y).sum(dim=-1))

    def get_word_embedding(self, word_idx):
        return self.embedding_x(word_idx).view(1, -1)
```
Note that we should apply `binary_cross_entropy` as our loss function here.
(7) GloVe Embedding
We will not look into the details of this one, but the basic idea of GloVe is to cap the loss contribution of very frequent co-occurrences (like those involving stop words) while downweighting rare ones. Let's define $X_{ij}$ as the number of times word $j$ appears in the context of word $i$. Then the loss function of GloVe is,

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where,

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

From this formula, we can see that the number of parameters in this model is $2Vd + 2V$ (two embedding matrices plus two bias vectors), and the final embedding should be the average of the embedding matrices $W$ and $\tilde{W}$,

$$E = \frac{W + \tilde{W}}{2}$$
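The weighting function can be sketched as follows (`x_max = 100` and `alpha = 3/4` are the values suggested in the GloVe paper):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): downweights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

Because the weight is capped at 1, extremely frequent pairs (e.g. those involving stop words) cannot dominate the loss, while very rare pairs contribute almost nothing.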
(8) t-SNE
t-SNE is a nonlinear nondeterministic algorithm that tries to preserve local neighbourhoods in the data, often at the expense of distorting the global structure.