What:

A neural network approach to learning the vectors in a vector embedding.

Approach 1: Continuous Bag of Words:

Imagine we’re trying to find the vector for the target word “mat”, given the sentence “the cat sat on the ______”.

  1. Treating the context as a bag of words, we take the average of the embedding vectors for “the”, “cat”, “sat”, and “on”.
    1. I.e. we average the first component of each vector, then the second component, and so on.
    2. These averages form a single new vector.
  2. This becomes the input for our neural network.
  3. The output layer has one neuron per word in the vocabulary, so its size equals the vocabulary size.
  4. Apply the softmax function to the output to turn the scores into probabilities.
  5. We compare our predicted word with the actual word.
  6. We then repeatedly apply gradient descent, updating both the network weights and the word vectors themselves (a code sketch of these steps follows this list).
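
To make these steps concrete, here is a minimal NumPy sketch of a single CBOW training step. The toy vocabulary, the embedding dimension, and the names (`cbow_step`, `E`, `W_out`) are illustrative assumptions, not anything from the original write-up.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (illustrative)
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                                 # vocabulary size, embedding dimension

E = rng.normal(scale=0.1, size=(V, D))               # trainable word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))           # trainable output-layer weights

def cbow_step(E, W_out, context_words, target_word, lr=0.1):
    """One CBOW training step: average the context vectors, predict the target,
    then update both the output weights and the context word vectors (in place)."""
    ctx = [word_to_idx[w] for w in context_words]
    tgt = word_to_idx[target_word]

    h = E[ctx].mean(axis=0)                          # element-wise average of context vectors
    scores = h @ W_out                               # one score per word in the vocabulary
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax

    loss = -np.log(probs[tgt])                       # cross-entropy against the actual word

    d_scores = probs.copy()                          # gradient of softmax + cross-entropy
    d_scores[tgt] -= 1.0
    d_h = W_out @ d_scores                           # gradient w.r.t. the averaged context vector
    W_out -= lr * np.outer(h, d_scores)              # update the output weights
    E[ctx] -= lr * d_h / len(ctx)                    # update the context word vectors themselves
    return loss

loss = cbow_step(E, W_out, ["the", "cat", "sat", "on"], "mat")
print(f"loss: {loss:.3f}")
```

In practice word2vec runs this over many (context, target) windows drawn from a large corpus; the single step above just shows how the averaged context vector, the softmax, and the gradient updates fit together.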

Approach 2: Skip-Gram:

By contrast, Skip-Gram predicts the surrounding context words given a single target word.

Imagine we’re trying to predict “the”, “cat”, “sat”, and “on” given “mat”.

Step-by-Step Training

  1. One-Hot Encoding of the Target Word:

    • Convert the target word (“mat”) into a one-hot vector (length = vocabulary size).
  2. Word Embedding Lookup:

    • The hidden layer’s weight matrix is the embedding matrix (a lookup table).
    • Multiplying the one-hot vector by the embedding matrix directly selects the word embedding (i.e., the row corresponding to “mat”).
  3. Predict Context Words (One at a Time):

    • The embedding vector is passed to the output layer, which has neurons equal to vocabulary size.
    • The model tries to assign high scores to the actual context words and low scores to others.
  4. Softmax Activation:

    • Convert output scores into probabilities for each word in the vocabulary.
  5. Loss Function & Gradient Descent:

    • Compute cross-entropy loss for each (target, context) pair (steps 1–5 are sketched in code after this list).
    • Adjust:
      • Word embeddings (so that similar words cluster together).
      • Output layer weights (for better context prediction).
  6. Optimisation: Negative Sampling (for Large Vocabularies)

    • Instead of computing softmax over the entire vocabulary, the model only updates the correct context words + a few random “negative” words.
    • This dramatically speeds up training.
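
Steps 1–5 can be sketched in the same style. Again, the toy vocabulary, the dimension, and the name `skipgram_step` are illustrative assumptions; the one-hot multiplication from step 2 is written directly as a row lookup, which is equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (illustrative)
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

E = rng.normal(scale=0.1, size=(V, D))               # embedding matrix, i.e. the lookup table
W_out = rng.normal(scale=0.1, size=(D, V))           # output-layer weights

def skipgram_step(E, W_out, target_word, context_words, lr=0.1):
    """One Skip-Gram step: predict each context word from the single target word,
    updating the output weights and the target word's embedding in place."""
    tgt = word_to_idx[target_word]
    total_loss = 0.0
    for ctx_word in context_words:                   # one (target, context) pair at a time
        ctx = word_to_idx[ctx_word]
        h = E[tgt].copy()                            # one-hot vector × embedding matrix == row lookup
        scores = h @ W_out                           # one score per vocabulary word
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                         # softmax over the full vocabulary
        total_loss += -np.log(probs[ctx])            # cross-entropy for this (target, context) pair

        d_scores = probs.copy()
        d_scores[ctx] -= 1.0
        d_h = W_out @ d_scores                       # gradient w.r.t. the target embedding
        W_out -= lr * np.outer(h, d_scores)          # update the output-layer weights
        E[tgt] -= lr * d_h                           # update the target word's embedding
    return total_loss

loss = skipgram_step(E, W_out, "mat", ["the", "cat", "sat", "on"])
print(f"total loss: {loss:.3f}")
```

Note that every (target, context) pair pays for a softmax over the whole vocabulary, which is exactly the cost that negative sampling (step 6) avoids.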

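And a rough sketch of step 6, negative sampling, under the same toy setup. Uniform sampling of negatives and the `num_negatives` default are simplifications; word2vec actually draws negatives from a unigram distribution raised to the 0.75 power.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary (illustrative)
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

E = rng.normal(scale=0.1, size=(V, D))               # word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))           # output-layer weights

def neg_sampling_step(E, W_out, target_word, context_word, num_negatives=2, lr=0.1):
    """One Skip-Gram update with negative sampling: score the true context word plus a few
    random "negative" words with a sigmoid, and update only those output columns instead of
    computing a softmax over the whole vocabulary."""
    tgt, ctx = word_to_idx[target_word], word_to_idx[context_word]
    # Uniform sampling keeps the sketch short; word2vec samples negatives from a
    # unigram distribution raised to the 0.75 power.
    candidates = [i for i in range(V) if i != ctx]
    negatives = rng.choice(candidates, size=num_negatives, replace=False)

    h = E[tgt].copy()                                # the target word's embedding
    d_h = np.zeros_like(h)
    loss = 0.0
    for idx, label in [(ctx, 1.0)] + [(int(n), 0.0) for n in negatives]:
        p = 1.0 / (1.0 + np.exp(-(h @ W_out[:, idx])))   # sigmoid: "is this a real context word?"
        loss += -np.log(p) if label == 1.0 else -np.log(1.0 - p)
        grad = p - label                             # binary cross-entropy gradient
        d_h += grad * W_out[:, idx]                  # accumulate the embedding gradient
        W_out[:, idx] -= lr * grad * h               # update only this output column
    E[tgt] -= lr * d_h                               # one update to the target word's vector
    return loss

loss = neg_sampling_step(E, W_out, "mat", "cat")
print(f"loss: {loss:.3f}")
```

Because only `num_negatives + 1` output columns are touched per pair, each update costs O(num_negatives × D) instead of O(V × D), which is where the speed-up comes from.
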
🔹 Key Concept Behind Both Approaches

Similar words should appear in similar contexts.
This means that if you swap “cat” with “feline”, the surrounding words remain similar. Skip-Gram and CBOW both exploit this pattern to learn meaningful word embeddings.

🔹 Comparison Table

| Method | Predicts | Fast? | Handles Rare Words? | Best For |
| --- | --- | --- | --- | --- |
| CBOW | Target word from context | ✅ Faster | ❌ No (poor on infrequent words) | Frequent words, classification |
| Skip-Gram | Context words from target | ❌ Slower | ✅ Yes (better for rare words) | Rare words, deep semantic relationships |

Drawbacks

  • The result is a static embedding: each word gets a single vector that does not change with context.