What:
A neural network approach to learning the vectors in a word embedding.
Approach 1: Continuous Bag of Words (CBOW):
Imagine we’re trying to predict the target word “mat”, given the sentence “the cat sat on the ______“.
- Treating the context as a bag of words (word order ignored), we average the embedding vectors for “the”, “cat”, “sat”, and “on”.
- I.e. we average the first component of each vector, then the second component, and so on.
- The result is a single averaged context vector.
- This becomes the input for our neural network.
- The output layer has one neuron per word in the vocabulary, each representing a possible target word.
- We apply a softmax function to the output scores to turn them into probabilities.
- We compare the predicted word with the actual target word (“mat”) using a loss.
- We then repeatedly apply gradient descent, updating both the network weights and the context word vectors themselves (see the sketch below).
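As a rough, minimal NumPy sketch of one CBOW training step (the toy vocabulary, dimensions, and learning rate below are illustrative assumptions, not from any particular library):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]          # illustrative toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 8, 0.05                      # vocab size, embedding dim, learning rate

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, D))              # word embeddings (one row per word)
W_out = rng.normal(scale=0.1, size=(D, V))          # output-layer weights

context, target = ["the", "cat", "sat", "on"], "mat"

# Average the context word vectors element-wise
h = E[[word_to_idx[w] for w in context]].mean(axis=0)

# Score every vocabulary word, softmax, then cross-entropy against the true target
scores = h @ W_out
probs = np.exp(scores - scores.max()); probs /= probs.sum()
loss = -np.log(probs[word_to_idx[target]])

# Backpropagate and take one gradient-descent step on weights AND embeddings
d_scores = probs.copy(); d_scores[word_to_idx[target]] -= 1.0
dh = W_out @ d_scores
W_out -= lr * np.outer(h, d_scores)
for w in context:
    E[word_to_idx[w]] -= lr * dh / len(context)     # each context vector shares the gradient
```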
Approach 2: Skip-Gram:
By contrast, Skip-Gram predicts the surrounding context words given a single target word.
Imagine we’re trying to predict “the”, “cat”, “sat”, and “on” given “mat”.
Step-by-Step Training
1. One-Hot Encoding of the Target Word:
- Convert the target word (“mat”) into a one-hot vector (length = vocabulary size).
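For illustration (toy five-word vocabulary assumed), the one-hot vector is just a zero vector with a 1 at the target word’s index:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

one_hot = np.zeros(len(vocab))
one_hot[word_to_idx["mat"]] = 1.0    # -> [0., 0., 0., 0., 1.]
print(one_hot)
```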
2. Word Embedding Lookup:
- The hidden layer is actually an embedding matrix (a lookup table).
- Multiplying the one-hot vector by the embedding matrix directly selects the word embedding (i.e., the row corresponding to “mat”).
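A small sketch of why this multiplication is just a lookup: with a one-hot input, the matrix product picks out a single row of the embedding matrix (sizes here are illustrative):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
E = np.random.default_rng(0).normal(size=(len(vocab), 8))   # embedding matrix: one row per word

one_hot = np.zeros(len(vocab))
one_hot[word_to_idx["mat"]] = 1.0

via_matmul = one_hot @ E              # full matrix multiplication
via_lookup = E[word_to_idx["mat"]]    # direct row selection
assert np.allclose(via_matmul, via_lookup)
```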
3. Predict Context Words (One at a Time):
- The embedding vector is passed to the output layer, which has neurons equal to vocabulary size.
- The model tries to assign high scores to the actual context words and low scores to others.
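Sketching this step: the embedding is multiplied by a second weight matrix to give one raw score per vocabulary word (matrix shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 8                               # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, D))               # embedding matrix
W_out = rng.normal(size=(D, V))           # output-layer weights

h = E[4]                                  # embedding of the target word (e.g. "mat")
scores = h @ W_out                        # one unnormalised score per vocabulary word
```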
4. Softmax Activation:
- Convert output scores into probabilities for each word in the vocabulary.
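The softmax itself is a couple of lines; subtracting the maximum score first is a standard trick to keep the exponentials numerically stable:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())   # shift for numerical stability
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1, -1.0, 3.0]))
assert np.isclose(probs.sum(), 1.0)       # probabilities over the vocabulary sum to 1
```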
5. Loss Function & Gradient Descent:
- Compute cross-entropy loss for each (target, context) pair.
- Adjust:
- Word embeddings (so that similar words cluster together).
- Output layer weights (for better context prediction).
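A hedged sketch of the loss and one update for a single (target, context) pair; the gradient of softmax followed by cross-entropy reduces to “probabilities minus one-hot”, which is what `d_scores` computes below (the setup values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 5, 8, 0.05
E = rng.normal(scale=0.1, size=(V, D))     # embeddings
W_out = rng.normal(scale=0.1, size=(D, V)) # output-layer weights
target_idx, context_idx = 4, 1             # e.g. ("mat", "cat")

h = E[target_idx]
scores = h @ W_out
probs = np.exp(scores - scores.max()); probs /= probs.sum()

loss = -np.log(probs[context_idx])         # cross-entropy for this one pair

d_scores = probs.copy()
d_scores[context_idx] -= 1.0               # softmax + cross-entropy gradient
dh = W_out @ d_scores                      # gradient w.r.t. the embedding (pre-update W_out)
W_out -= lr * np.outer(h, d_scores)        # adjust output-layer weights
E[target_idx] -= lr * dh                   # adjust the target word's embedding
```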
6. Optimisation: Negative Sampling (for Large Vocabularies)
- Instead of computing softmax over the entire vocabulary, the model only updates the correct context words + a few random “negative” words.
- This dramatically speeds up training.
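A minimal sketch of one skip-gram update with negative sampling, assuming the usual formulation with sigmoid scores instead of a full softmax (vocabulary, k, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D, lr, k = len(vocab), 8, 0.05, 2              # k = number of negative samples

E_in = rng.normal(scale=0.1, size=(V, D))         # target-word embeddings
E_out = rng.normal(scale=0.1, size=(V, D))        # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target, positive = word_to_idx["mat"], word_to_idx["cat"]
negatives = rng.choice([i for i in range(V) if i != positive], size=k, replace=False)

v_t = E_in[target]
grad_t = np.zeros(D)

# One positive pair (label 1) plus k randomly sampled negative pairs (label 0)
for idx, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
    score = sigmoid(v_t @ E_out[idx])
    g = score - label                             # logistic-loss gradient for this pair
    grad_t += g * E_out[idx]
    E_out[idx] -= lr * g * v_t                    # only these k+1 output rows get touched

E_in[target] -= lr * grad_t                       # single update to the target embedding
```

Only k + 1 rows of the output matrix are updated per pair, rather than all V, which is where the speed-up comes from.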
🔹 Key Concept Behind Both Approaches
Similar words should appear in similar contexts.
This means that if you swap “cat” with “feline”, the surrounding words remain similar. Skip-Gram and CBOW both exploit this pattern to learn meaningful word embeddings.
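One common way to see this after training is to compare embeddings with cosine similarity; the helper below is generic, and the words named in the comment are just the example from above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# With trained embeddings E and an index mapping word_to_idx (as in the sketches above),
# we'd expect cosine_similarity(E[word_to_idx["cat"]], E[word_to_idx["feline"]])
# to be high, because the two words occur in similar contexts.
```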
🔹 Comparison Table
| Method | Predicts | Fast? | Handles Rare Words? | Best For |
|---|---|---|---|---|
| CBOW | Target word from context | ✅ Faster | ❌ No (poor on infrequent words) | Frequent words, classification |
| Skip-Gram | Context words from target | ❌ Slower | ✅ Yes (better for rare words) | Rare words, deep semantic relationships |
Drawbacks
- Both CBOW and Skip-Gram produce static embeddings: each word gets one fixed vector that doesn’t change with context, so e.g. “bank” has the same vector in “river bank” and “bank account”.