Markov Assumption

This is all based on the Markov Assumption: the probability of a word depends only on a fixed number of previous words. This means N-Gram language models are not that good for long contexts.
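Written out, the chain rule over a whole sentence gets approximated by conditioning each word on only the previous $N-1$ words (this is the standard statement of the assumption):

$$
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
$$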

What:

Before Transformers or LSTMs, we had N-Gram language models. They were quick, explainable, and efficient statistical language models.

An N-Gram model predicts the probability of a word given its previous words (conditional probability).

How?

First, N-Grams:

A sequence of n words from a given text (quick code sketch after the list below).

  • Unigram (n=1): Each word is independent (e.g., “The”, “dog”, “runs”).
  • Bigram (n=2): Considers pairs of words (e.g., “The dog”, “dog runs”).
  • Trigram (n=3): Uses three-word sequences (e.g., “The dog runs”).
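Here’s a minimal sketch of pulling these out of a tokenized sentence (a plain whitespace split stands in for real tokenization):

```python
# Minimal sketch: extract n-grams from a token list.
def ngrams(tokens, n):
    """Return the n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The dog runs in the park".split()

print(ngrams(tokens, 1))  # unigrams: ('The',), ('dog',), ...
print(ngrams(tokens, 2))  # bigrams:  ('The', 'dog'), ('dog', 'runs'), ...
print(ngrams(tokens, 3))  # trigrams: ('The', 'dog', 'runs'), ...
```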

Predicting The Next Word:

In essence, we’re trying to predict the word given the words before it.

How?

We use the formula below:

$$
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
$$

where:

  • $C(w_{n-1}\, w_n)$ is the count of the word sequence $w_{n-1}\, w_n$.
  • $C(w_{n-1})$ is the count of the preceding word $w_{n-1}$.
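As a rough sketch of that estimate for bigrams (the toy corpus and function name here are just for illustration):

```python
from collections import Counter

# Toy corpus, purely for illustration.
corpus = "the dog runs . the dog sleeps . the cat runs .".split()

# C(w_{n-1} w_n): bigram counts, and C(w_{n-1}): unigram counts.
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(prev_word, word):
    """P(word | prev_word) = C(prev_word word) / C(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen context; this is exactly what smoothing (below) deals with
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "dog"))   # 2/3 — "the" appears 3 times, "the dog" twice
print(bigram_prob("dog", "runs"))  # 1/2 — "dog" appears twice, "dog runs" once
```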

What if either is 0?

Often, with larger contexts / corpora, either the numerator or the denominator will be zero. We need to add smoothing (similarly to Naive Bayes). How? A couple of ways:

1. Backoff Smoothing

What if we use less context when we’ve never seen the full thing before? Say the full context ends in “on a” (e.g. something like “sat on a”) and we’ve never seen that full sequence.

We would then back off and check the count of “on a”.

If that’s still zero, we would check for “a”.

If still zero, we would just take the unigram probability of the word itself.
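A rough sketch of that back-off in code. It just keeps shortening the context whenever the needed count is zero (the simple “use less context” idea, not a full Katz back-off with discounting), and the toy corpus is made up:

```python
from collections import Counter

# Toy corpus, purely for illustration.
tokens = "the cat sat on a mat . a dog sat on a rug .".split()

# Count all n-grams up to length 4 so a 3-word context plus the next word fits.
counts = Counter()
for n in (1, 2, 3, 4):
    counts.update(zip(*[tokens[i:] for i in range(n)]))
total_words = len(tokens)

def backoff_prob(context, word):
    """Shorten the context until we have a usable count, then use relative counts."""
    context = tuple(context)
    while context:
        if counts[context + (word,)] > 0:
            return counts[context + (word,)] / counts[context]
        context = context[1:]  # e.g. ("sat", "on", "a") -> ("on", "a") -> ("a",)
    return counts[(word,)] / total_words  # last resort: the unigram probability

print(backoff_prob(("sat", "on", "a"), "mat"))       # full context seen: 1/2
print(backoff_prob(("stood", "on", "a"), "mat"))     # backs off to "on a": 1/2
print(backoff_prob(("never", "seen", "it"), "mat"))  # falls all the way back to the unigram: 1/14
```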

2. Linear Interpolation:

We would mix all of the probabilities of the unigram, bigram, trigram, etc. The exact weighting of each ($\lambda_i$) would be learned (and the sum of all weights adds up to 1).
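A quick sketch of that mix for a trigram model, with hard-coded weights standing in for learned ones (in practice the $\lambda_i$ are tuned on held-out data; the toy corpus is again made up):

```python
from collections import Counter

# Toy corpus, purely for illustration.
tokens = "the cat sat on a mat . a dog sat on a rug .".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*[tokens[i:] for i in range(n)]))
total_words = len(tokens)

def mle(context, word):
    """Plain relative-frequency estimate; 0 if the context was never seen."""
    denom = counts[context] if context else total_words
    return counts[context + (word,)] / denom if denom else 0.0

def interpolated_prob(w1, w2, word, lambdas=(0.6, 0.3, 0.1)):
    """P(word | w1 w2) as a weighted mix of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas  # the weights sum to 1; learned on held-out data in practice
    return (l3 * mle((w1, w2), word)
            + l2 * mle((w2,), word)
            + l1 * mle((), word))

print(interpolated_prob("sat", "on", "a"))  # stays non-zero even when a higher-order count is missing
```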


3. Laplace Smoothing

Also used in Naive Bayes, we simply add 1 (or a small $k$) to every single count. That’s it lol. It works surprisingly well.
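For bigrams, the add-one version looks like this, where $V$ is the vocabulary size (replacing the 1 with a small $k$ gives add-$k$ smoothing):

$$
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}
$$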