What:

A neural network approach to learning the vectors in a vector embedding.

Approach 1: Continuous Bag of Words:

Imagine we’re trying to find the vector for the target word “mat”, given the sentence “the cat sat on the _____“.

  1. Using the word embeddings (bag-of-words style), we get the average vector of “the”, “cat”, “sat”, and “on”.
    1. I.e. we average the first component of each vector, then the second component, etc.
    2. These averages form a single new vector.
  2. This becomes the input for our neural network.
  3. The output layer is the length of your vocabulary, where each output neuron represents a possible word.
  4. Apply the softmax function to the output to turn the scores into probabilities.
  5. We compare our predicted word with the actual word.
  6. We then repeatedly do Gradient Descent, changing both the weights and the initial word vectors themselves.
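
A minimal sketch of the CBOW forward pass and loss in NumPy, using a hypothetical toy vocabulary (the matrix names `E` and `W`, the sizes, and the setup are illustrative assumptions, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes (illustrative only).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                       # vocab size, embedding dimension

E = rng.normal(scale=0.1, size=(V, D))     # word embeddings (learned)
W = rng.normal(scale=0.1, size=(D, V))     # output weights (learned)

context = ["the", "cat", "sat", "on"]
target = "mat"

# Steps 1-2: average the context vectors component-wise -> the network input.
h = E[[word_to_id[w] for w in context]].mean(axis=0)

# Steps 3-4: one score per vocabulary word, then softmax into probabilities.
scores = h @ W
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Step 5: cross-entropy loss against the actual word "mat".
loss = -np.log(probs[word_to_id[target]])
print(f"predicted: {vocab[int(probs.argmax())]}, loss: {loss:.3f}")
# Step 6 would backpropagate this loss into both W and the rows of E.
```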

Approach 2: Skip-Gram:

By contrast, Skip-Gram predicts the surrounding context words given a single target word.

Imagine we’re trying to predict “the”, “cat”, “sat”, and “on” given “mat”.

Step-by-Step Training

  1. You take the target word and turn it into a one-hot encoding. By multiplying it by the (learned) embedding matrix, you’ve extracted a vector for the word.
  2. You take this word-vector, and multiply it by another(!) learned matrix - called the “context matrix” or “unembedding matrix”. This gives you a vector of scores for every word in the vocabulary.
    • The unembedding matrix essentially converts the vector back into words by assigning every word in the vocabulary a likelihood score.
  3. We softmax over all of the scores to find the most likely word, and use cross-entropy to compute our loss.
  4. We then do normal Gradient Descent to learn the target word’s embedding and the context matrix (excluding the target word’s own column in the context matrix!).
    1. Because the word shouldn’t count as its own context: we’re learning the relationship between a word and its surrounding words, not between the word and itself.
  5. We repeat across the corpus.
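
Here’s a rough NumPy sketch of steps 1–4 for a single (target, context) pair, using the plain full-softmax formulation. The matrix names `E` and `C`, the toy vocabulary, and the learning rate are illustrative assumptions, and for simplicity this version updates every column of the context matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

E = rng.normal(scale=0.1, size=(V, D))    # embedding matrix (step 1)
C = rng.normal(scale=0.1, size=(D, V))    # context / "unembedding" matrix (step 2)

target, context_word = "mat", "cat"

# 1. A one-hot vector times E is just a row lookup: the target word's vector.
v = E[word_to_id[target]]

# 2-3. Scores for every vocabulary word, softmax, then cross-entropy loss.
scores = v @ C
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[word_to_id[context_word]])

# 4. One gradient-descent step (full-softmax version, for simplicity).
grad_scores = probs.copy()
grad_scores[word_to_id[context_word]] -= 1.0   # d(loss)/d(scores)
grad_C = np.outer(v, grad_scores)              # gradient for C, shape (D, V)
grad_v = C @ grad_scores                       # gradient for the target embedding

lr = 0.1
C -= lr * grad_C
E[word_to_id[target]] -= lr * grad_v
print(f"loss before update: {loss:.3f}")
```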

Note on Efficiency: We use Negative Sampling to dramatically improve performance: instead of computing a softmax over the entire vocabulary, we only compute scores for the correct context word plus a few randomly sampled incorrect ones, and we only update those parameters.
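
A hedged sketch of what one negative-sampling update might look like, reusing `E` (embeddings, shape V×D) and `C` (context matrix, shape D×V) from the sketch above. The function name, learning rate, and the uniform negative sampler are simplifying assumptions (real Word2Vec samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_sampling_step(E, C, target_id, context_id, k=5, lr=0.025):
    """One sketched update: score only the true context word plus k noise words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    v = E[target_id]                              # target word's embedding
    neg_ids = rng.choice(E.shape[0], size=k)      # k randomly sampled "wrong" context words
    ids = np.concatenate(([context_id], neg_ids))
    labels = np.array([1.0] + [0.0] * k)          # 1 = real context word, 0 = noise

    # Only k+1 scores instead of one per vocabulary word.
    preds = sigmoid(C[:, ids].T @ v)
    errs = preds - labels

    # Only these k+1 context columns and the single target embedding get updated.
    grad_v = C[:, ids] @ errs
    C[:, ids] -= lr * np.outer(v, errs)
    E[target_id] -= lr * grad_v

# Hypothetical usage with random matrices:
V, D = 10_000, 100
E = rng.normal(scale=0.1, size=(V, D))
C = rng.normal(scale=0.1, size=(D, V))
negative_sampling_step(E, C, target_id=42, context_id=7)
```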

Note on Training: We actually create multiple different samples from a single sentence. This is because, in practice, we only try to predict one correct context word at a time, so each (target, context) pair becomes its own training example.
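
For instance, a small helper like the hypothetical one below (the function name and window size are made up for illustration) turns one sentence into many single-pair training examples:

```python
def make_skipgram_pairs(tokens, window=2):
    """Turn one tokenised sentence into (target, context) training pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        # Every context word inside the window becomes its own training example.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(make_skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# e.g. ('cat', 'the'), ('cat', 'sat'), ('cat', 'on'), ...
```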

Note 2: In Skip-Gram, the embedding and unembedding matrices are 2 distinct matrices. But that’s not always the case! In Transformers, it’s far more common for them to be a single shared matrix (weight tying).
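
As a sketch of what sharing them looks like, here’s a minimal PyTorch-style example of weight tying (the class name and sizes are assumptions, and the actual Transformer layers are omitted):

```python
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Minimal sketch: the unembedding layer reuses the embedding weights."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)
        self.unembed.weight = self.embed.weight   # weight tying: one shared matrix

    def forward(self, token_ids):
        h = self.embed(token_ids)   # ...Transformer layers would go here...
        return self.unembed(h)      # logits over the vocabulary
```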


🔹 Key Concept Behind Both Approaches

Similar words should appear in similar contexts.
This means that if you swap “cat” with “feline”, the surrounding words remain similar. Skip-Gram and CBOW both exploit this pattern to learn meaningful word embeddings.

🔹 Comparison Table

| Method | Predicts | Fast? | Handles Rare Words? | Best For |
| --- | --- | --- | --- | --- |
| CBOW | Target word from context | ✅ Faster | ❌ No (poor on infrequent words) | Frequent words, classification |
| Skip-Gram | Context words from target | ❌ Slower | ✅ Yes (better for rare words, since you make multiple examples per word) | Rare words, deep semantic relationships |

Drawbacks

  • The result is a static embedding: each word gets a single vector that doesn’t change with the context it appears in.