What:
A neural network approach to learning the vectors in a word embedding.
Approach 1: Continuous Bag of Words:
Imagine we’re trying to find the vector for the target word “mat”, given the sentence “the cat sat on the _____“.
- Using bag-of-words style averaging, we take the embedding vectors for “the”, “cat”, “sat”, and “on” and average them.
- I.e. we average the first component of each vector, then the second component, etc.
- These averages form a single new vector.
- This becomes the input for our neural network.
- The output layer is the length of your vocabulary, where each output neuron represents a possible word.
- Apply Softmax Function to the output.
- We compare our predicted word with the actual word.
- We then repeatedly do Gradient Descent, changing both the output weights and the initial word vectors themselves (see the sketch after this list).
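A minimal sketch of the CBOW forward pass in NumPy. The tiny vocabulary, the dimension `D`, and names like `W_out` are illustrative assumptions, not from any specific library:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                              # vocab size, embedding dimension (illustrative)

rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.1, size=(V, D))   # one learned row per word
W_out = rng.normal(scale=0.1, size=(D, V))        # learned output-layer weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

context, target = ["the", "cat", "sat", "on"], "mat"

# 1. Element-wise average of the context word vectors -> one input vector.
h = embeddings[[word_to_idx[w] for w in context]].mean(axis=0)

# 2. Project to vocabulary-sized scores, then softmax.
probs = softmax(h @ W_out)

# 3. Cross-entropy loss against the actual target word.
loss = -np.log(probs[word_to_idx[target]])

# 4. Gradient descent would now update W_out and the context rows of `embeddings`.
print(f"predicted: {vocab[int(probs.argmax())]}, loss: {loss:.3f}")
```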
Approach 2: Skip-Gram:
By contrast, Skip-Gram predicts the surrounding context words given a single target word.
Imagine we’re trying to predict “the”, “cat”, “sat”, and “on” given “mat”.
Step-by-Step Training
- You take the target word and turn it into a one-hot encoding. By multiplying it by the (learned) embedding matrix, you’ve extracted a vector for the word.
- You take this word-vector, and multiply it by another(!) learned matrix - called the “context matrix” or “unembedding matrix”. This gives you a vector of scores for every word in the vocabulary.
- The unembedding matrix is essentially converting the vector back into words by giving them all scores of likelihood.
- We softmax over all of the scores to turn them into a probability distribution over the vocabulary. We use cross-entropy against the actual context word to find our loss.
- We then do normal Gradient Descent to learn the correct target word’s embedding and the context matrix (excluding the target word’s column in the context matrix!).
  - Because we’re predicting the context words from the target word, not from the target word and itself!
- We repeat across the corpus (one training step is sketched below).
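A minimal sketch of one skip-gram training step on the same toy vocabulary; `embedding_matrix` and `context_matrix` are illustrative names for the two learned matrices described above:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.1, size=(V, D))  # the "embedding" matrix
context_matrix = rng.normal(scale=0.1, size=(D, V))    # the "context" / "unembedding" matrix

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

target, context_word = "mat", "cat"

# 1. One-hot times the embedding matrix is just selecting the target word's row.
one_hot = np.zeros(V)
one_hot[word_to_idx[target]] = 1.0
v_target = one_hot @ embedding_matrix      # == embedding_matrix[word_to_idx[target]]

# 2. Multiply by the context matrix to score every word in the vocabulary.
scores = v_target @ context_matrix

# 3. Softmax over the scores, then cross-entropy against the actual context word.
probs = softmax(scores)
loss = -np.log(probs[word_to_idx[context_word]])

# Gradient descent would then update the target word's embedding row and the context matrix.
print(f"loss for predicting '{context_word}' from '{target}': {loss:.3f}")
```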
Note on Efficiency: We use Negative Sampling to dramatically improve performance: instead of computing softmax over the entire vocab, we only compute the scores for the correct context word plus a few randomly sampled incorrect ones, and we only update those (sketched below).
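A sketch of the negative-sampling loss for one (target, context) pair, reusing the matrix shapes from the skip-gram sketch above. The uniform sampler and `k=5` are simplifying assumptions (word2vec actually samples from a smoothed unigram distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(embedding_matrix, context_matrix, target_idx, context_idx, k=5, rng=None):
    rng = rng or np.random.default_rng()
    V = context_matrix.shape[1]
    v = embedding_matrix[target_idx]                      # target word's vector
    pos = sigmoid(v @ context_matrix[:, context_idx])     # push towards 1
    neg_idx = rng.integers(0, V, size=k)                  # k random "incorrect" words
    neg = sigmoid(-(v @ context_matrix[:, neg_idx]))      # push their raw scores down
    # Only the target row, the positive column, and the k sampled columns receive gradients.
    return -np.log(pos) - np.log(neg).sum()
```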
Note on Training: We actually create multiple different training samples from the one sentence, because in practice we’re only trying to predict a single correct context word at a time (see the pair-generation sketch below).
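A sketch of how one sentence becomes many (target, context) pairs; the window size of 2 is an arbitrary choice for illustration:

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# "sat" alone yields ("sat", "the"), ("sat", "cat"), ("sat", "on"), ("sat", "the") -
# each pair is its own training example with a single correct context word.
print(pairs)
```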
Note 2: In Skip-Gram, the embedding and unembedding matrices are 2 distinct matrices. But that’s not always the case! In Transformers, it’s far more common for them to be a single shared matrix (weight tying).
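For contrast, a one-line illustration of what weight tying looks like, reusing `embedding_matrix` and `word_to_idx` from the skip-gram sketch above (illustrative only):

```python
# With tied weights, the unembedding is just the transpose of the embedding matrix,
# so there is only one set of word vectors to learn.
scores_tied = embedding_matrix[word_to_idx["mat"]] @ embedding_matrix.T   # shape: (V,)
```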
🔹 Key Concept Behind Both Approaches
Similar words should appear in similar contexts.
This means that if you swap “cat” with “feline”, the surrounding words remain similar. Skip-Gram and CBOW both exploit this pattern to learn meaningful word embeddings.
🔹 Comparison Table
| Method | Predicts | Fast? | Handles Rare Words? | Best For |
|---|---|---|---|---|
| CBOW | Target word from context | ✅ Faster | ❌ No (poor on infrequent words) | Frequent words, classification |
| Skip-Gram | Context words from target | ❌ Slower | ✅ Yes (better for rare words, since each word yields multiple training examples) | Rare words, deep semantic relationships |
Drawbacks
- It’s essentially a static embedding: each word gets one vector that doesn’t change per context (e.g. “bank” in “river bank” and “bank account” gets the same vector).