A Mathematical dive into word embeddings (skip grams)

Amay Gada · Published in Analytics Vidhya · Jul 10, 2021


Why word embeddings?

Let us consider that our NLP model learns the phrase:

That is a beautiful painting

It would be tough for it to predict the word ‘painting’ in the phrase: That is a gorgeous ________.
This is because, as far as the model is concerned, there is no direct relationship between the words beautiful and gorgeous. So the next question is: what do we have to blame for this?

One Hot Vectors

Earlier, each word was described by a one-hot vector over the vocabulary (the set of unique words in the corpus). Hence, a vocabulary of a million words would make each word’s vector a million dimensions long, and no two of these vectors would share any information.
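To see this concretely, here is a minimal sketch with a made-up five-word vocabulary (the words and indices are purely illustrative): any two distinct one-hot vectors are orthogonal, so the encoding says nothing about how related two words are.

```python
import numpy as np

# Toy vocabulary of 5 words -- a realistic one could have millions.
vocab = ["that", "is", "a", "beautiful", "gorgeous"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# The dot product (and hence cosine similarity) of any two distinct
# one-hot vectors is 0: 'beautiful' looks no closer to 'gorgeous'
# than it does to 'that'.
print(one_hot("beautiful") @ one_hot("gorgeous"))  # 0.0
print(one_hot("beautiful") @ one_hot("that"))      # 0.0
```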

Word Embeddings

As you may have figured, the main objective is to build an encoding for words that encapsulates certain relationships between them. These encodings are called word embeddings. In case you are looking for an intuition: we aim to create embeddings such that words with similar contexts cluster together (cosine similarity can be used to find how close two word embeddings are).
Moreover, this reduces the dimensionality of the input vectors as well.
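For intuition, here is a toy cosine-similarity check on hypothetical, hand-picked 4-dimensional vectors (real embeddings are learned and typically have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up embeddings, chosen only to illustrate the idea.
beautiful = np.array([0.9, 0.1, 0.4, 0.0])
gorgeous  = np.array([0.8, 0.2, 0.5, 0.1])
pizza     = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(beautiful, gorgeous))  # ~0.98: similar contexts, close together
print(cosine_similarity(beautiful, pizza))     # ~0.11: unrelated words, far apart
```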

How to generate word embeddings?

Hopefully, now you have a general idea of what we are trying to do.

Now comes the time to ponder upon how we may actually go about generating these encodings!

Let’s begin with skip grams.

Skip Grams

The skip-gram neural network model is actually surprisingly simple in its most basic form. It involves training a neural network with one hidden layer. However, equally important and interesting as the architecture, is how we use it to generate word embeddings.

Initially, we try to solve a completely different problem. From a large corpus of English text, we keep choosing phrases whose length depends on a previously chosen window size. From this window, we choose the central word and call it the target. The rest of the words in that phrase are called the context. We build a model such that, given a target word, we predict its context.

I want two slices of mozzarella pizza
window size = 7
Wi -> slices
Wo -> I, want, two, of, mozzarella, pizza

P(Wo|Wi) -> the probability of getting a context word Wo given the input (target) word Wi
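Here is a minimal sketch of how such (target, context) pairs can be generated; skipgram_pairs is a hypothetical helper (not from any particular library) that slides a window of size 2c+1 over a tokenised sentence:

```python
def skipgram_pairs(tokens, window_size):
    """Yield (target, context) pairs, assuming a window of 2c+1 words
    centred on each target (c words on either side)."""
    c = window_size // 2
    for i, target in enumerate(tokens):
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                yield target, tokens[j]

sentence = "I want two slices of mozzarella pizza".split()
# With window size 7, the target "slices" is paired with every other word.
for wi, wo in skipgram_pairs(sentence, window_size=7):
    if wi == "slices":
        print(wi, "->", wo)
```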

The objective of the neural network, for a single window of size 2c+1 centred on the target word w_t — (eq 1):

$$\max \prod_{\substack{-c \,\le\, j \,\le\, c \\ j \neq 0}} P\left(w_{t+j} \mid w_t\right)$$

(fig 1: skip-gram model architecture)

Shape of W1 = (300, 10000)
Shape of W2 = (10000, 300)
Hidden = W1 . X, with shape (300, 1)
(Note that no activation function is applied here, and neither is a bias term involved.)
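The forward pass implied by fig 1 can be sketched in a few lines of NumPy. The weights below are random placeholders and the input index (42) is arbitrary; only the shapes matter here:

```python
import numpy as np

V, N = 10_000, 300              # vocabulary size and embedding dimension, as above
rng = np.random.default_rng(0)
W1 = rng.normal(size=(N, V))    # hidden-layer weights, shape (300, 10000)
W2 = rng.normal(size=(V, N))    # output-layer weights, shape (10000, 300)

x = np.zeros((V, 1))
x[42] = 1.0                     # one-hot vector for some target word

hidden = W1 @ x                 # shape (300, 1); no bias, no activation
logits = W2 @ hidden            # shape (10000, 1); one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()            # softmax over the whole vocabulary
```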

Looking at the architecture, it is easy to see how training it maximises the objective of predicting context words for a given target word. The question is: how does it help us generate word embeddings?

The answer lies in the hidden layer weight matrix W1.

While learning the different contexts for a target word, we are in a way developing a model that encapsulates the relationships between the target and its context. Hence, we could conclude that the hidden layer is in fact the word embedding for a given input word. However, we need the word embeddings for every word in the vocabulary, so we turn to the weight matrix W1.

W1 is of shape (300, 10000). A matrix multiplication with a one-hot input vector of shape (10000, 1) yields a vector of shape (300, 1). Given that the input is a one-hot encoded vector, only the node with value 1 plays a role in the matrix multiplication; the other nodes, being 0, contribute nothing.

Therefore, we could say that the word corresponding to the 10th element of the one-hot encoded vector has the 10th column of the W1 matrix as its word embedding. (Think about this for a moment.)

(fig 2: a word’s embedding is the column of W1 selected by its one-hot vector)
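A quick sanity check of this claim, using random placeholder weights: multiplying W1 by a one-hot vector returns exactly the corresponding column.

```python
import numpy as np

V, N = 10_000, 300
rng = np.random.default_rng(0)
W1 = rng.normal(size=(N, V))

idx = 9                  # the "10th element" of the one-hot vector (0-based index 9)
x = np.zeros(V)
x[idx] = 1.0

# The matrix product just selects one column of W1, so the embedding
# can be read off directly without doing any multiplication at all.
assert np.allclose(W1 @ x, W1[:, idx])
```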

The softmax layer

The details of the output softmax layer were deliberately kept vague. Now is the time to unveil the mystery behind it. To understand this, it is necessary to realise that we are actually learning two word embeddings for each word: one when the word is the target and one when the word is the context. The former is given by the weight matrix W1 and the latter by the weight matrix W2.

e1 = W1 . I      [I is the input vector]
e2 = (W2)T . O [O is the one hot vector of the softmax argmax]

For each input word, the hidden layer holds 300 values (Vwi) taken from the weight matrix W1. Vwi is then multiplied by the weight matrix W2 to give the softmax output. Each node j of the softmax output is driven by 300 parameters (Uwj) from the W2 matrix (one set for each output word j). According to what we saw earlier, Vwi = e1 and Uwj = e2. Using this, we can finally say that the dot product:
(Vwi)T . Uwj plays a role in calculating the probability at the output node j.

Note: The above calculation is for one input vector Vwi

(fig 3: diagram for visualising Vwi and Uwj)

Using this, we can say that the softmax function for each input looks like this — (eq 2):

$$P(w_j \mid w_i) = \frac{\exp\left(U_{w_j}^{\top} V_{w_i}\right)}{\sum_{w \in \text{vocab}} \exp\left(U_{w}^{\top} V_{w_i}\right)}$$
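Equation 2 can be sketched directly in NumPy, taking Vwi as a column of W1 and the rows of W2 as the context embeddings Uwj (the weights and the chosen word index below are placeholders):

```python
import numpy as np

def skipgram_softmax(v_wi, U):
    """Equation 2: P(w_j | w_i) for every j, where v_wi is the target
    embedding and the rows of U are the context embeddings u_wj."""
    scores = U @ v_wi                     # u_wj . v_wi for every output word j
    scores -= scores.max()                # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # denominator sums over the whole vocabulary

V, N = 10_000, 300
rng = np.random.default_rng(0)
W1 = rng.normal(size=(N, V))              # target embeddings (columns)
W2 = rng.normal(size=(V, N))              # context embeddings (rows)

probs = skipgram_softmax(W1[:, 42], W2)   # word index 42 chosen arbitrarily
print(probs.shape, probs.sum())           # (10000,) and a total of ~1.0
```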

The Cost Function

To define the cost function, we must go back to when we actually defined our objective. Our goal is to maximize the probability of getting context words given a target word, as described in equation 1. Let’s talk a little more about it.

Firstly, let’s answer the question: why a multiplication operation?
As in most machine learning problems, we make simplifying assumptions. Here we assume independence, i.e. all context-target pairs are independent of each other (just like successive coin flips!).

This gives us something like this:

P(w1,w2,w3 | Wi) = P(w1|Wi) . P(w2|Wi) . P(w3|Wi)

Moreover, do you see something missing?

Equation 1 only considers one window of size 2c+1, whereas we need to consider all the windows in the entire English text corpus. Hence, equation 1 must be extended over all T windows that we encounter.

Finally, for the sake of simpler derivatives and to prevent numerical underflow, we take the log of the objective function. This leaves us with the following cost, where we divide by T so that it is not affected by the size of the corpus — (eq 3):

$$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \,\le\, j \,\le\, c \\ j \neq 0}} \log P\left(w_{t+j} \mid w_t\right)$$

The probability P can be calculated using equation 2.
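A quick illustration of the underflow point, using made-up probability values: the raw product of many small probabilities collapses to zero in floating point, while the sum of their logs stays perfectly representable.

```python
import numpy as np

# 2,000 hypothetical P(context | target) values of 1e-4 each.
probs = np.full(2_000, 1e-4)

print(np.prod(probs))          # 0.0 -- the product underflows
print(np.sum(np.log(probs)))   # about -18420.7 -- no underflow in log space
```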

Computational Issues

The skip-gram model works well for smaller vocabularies. However, vocabularies are not always small: they can easily scale up to millions of words, and this poses a serious problem for the skip-gram model.

To understand this better, we must go back to equation 2 and look at the summation in the denominator. With a vocabulary of around 10 million words, this summation quickly becomes infeasible. Moreover, it has to be computed for every context-target pair in the corpus. This makes the skip-gram model very slow.
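A back-of-the-envelope estimate of that denominator, assuming the 300-dimensional embeddings used earlier and a 10-million-word vocabulary:

```python
# Each term of the denominator in equation 2 is a 300-dimensional dot product,
# and there is one term per vocabulary word.
V, N = 10_000_000, 300
ops_per_pair = V * N
print(f"{ops_per_pair:,} multiply-adds for a single context-target pair")  # 3,000,000,000
```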

So the next plausible question is: How do we fix it?

There are a few ways in which we can reduce this complexity.
1. Subsampling of frequent words
2. Noise contrastive estimation
3. Negative sampling
4. Hierarchical softmax

An in-depth explanation of the above solutions is out of the scope of this blog post. Hopefully, I will write about them in the near future and link them here.

References

  1. https://www.youtube.com/watch?v=Co-YYk0731I&list=PLxlkzujLkmQ8erDs5VSteKTmVg-y9Z1dR&index=7
  2. Tomas Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”, 2013.
  3. Andrew Ng, deeplearning.ai Deep Learning Specialization (Course 5, Week 2).
