Markov Assumption
This is all based on the Markov Assumption: the probability of a word depends only on a fixed number of previous words. This means N-Gram language models are not that good at handling long contexts.
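In symbols (a standard way to write the assumption, with $n$ as the order of the model):

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$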
What:
Before Transformers or LSTMs, we had N-Gram language models. They were quick, explainable, and efficient statistical language models.
An N-Gram model predicts the probability of a word given its previous words (conditional probability).
How?
First, N-Grams:
A sequence of n words from a given text.
- Unigram (n=1): Each word is independent (e.g., “The”, “dog”, “runs”).
- Bigram (n=2): Considers pairs of words (e.g., “The dog”, “dog runs”).
- Trigram (n=3): Uses three-word sequences (e.g., “The dog runs”).
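As a quick sketch of what extracting n-grams looks like in practice (the function name here is mine, not from any library), a simple sliding window does the job:

```python
# Minimal n-gram extraction sketch (illustrative, not from any library).
def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "dog", "runs"]
print(ngrams(tokens, 1))  # [('The',), ('dog',), ('runs',)]
print(ngrams(tokens, 2))  # [('The', 'dog'), ('dog', 'runs')]
print(ngrams(tokens, 3))  # [('The', 'dog', 'runs')]
```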
Predicting The Next Word:
In essence, we’re trying to predict the next word given the words before it.
How?
We use the formula below.
$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})}$$

where:
- $n$ is the $n$ in $n$-gram lol. So is it a bigram, trigram, etc.
- $C(w_{i-n+1} \dots w_i)$ is the count of the full word sequence from $w_{i-n+1}$ to $w_i$.
- $C(w_{i-n+1} \dots w_{i-1})$ is the count of the preceding words $w_{i-n+1}$ to $w_{i-1}$.
Essentially:

$$P(\text{word} \mid \text{context}) = \frac{\text{count(context followed by word)}}{\text{count(context)}}$$
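A minimal sketch of that count-based estimate, assuming a toy corpus and a bigram model (all names here are illustrative):

```python
from collections import Counter

# Count bigrams and unigrams from a toy corpus, then estimate
# P(word | prev) = C(prev word) / C(prev).
corpus = "the dog runs and the dog sleeps and the cat sleeps".split()

unigram_counts = Counter(corpus)                  # C(w_{i-1})
bigram_counts = Counter(zip(corpus, corpus[1:]))  # C(w_{i-1} w_i)

def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "dog"))   # 2/3: "the" appears 3 times, "the dog" twice
print(bigram_prob("dog", "runs"))  # 1/2: "dog" appears twice, "dog runs" once
```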
What if either count is 0?
Often, with larger contexts, many word sequences never show up in your corpus, so your numerator (or even your denominator) will be zero. We need to add smoothing (similarly to Naive Bayes). How? A couple of ways:
1. Backoff Smoothing
What if we use less context when we’ve never seen the full thing before?
Say we’re using a trigram model to predict the word “a”, but we’ve never seen the full three-word sequence ending in “on a”.
We would then back off and check the count of the bigram “on a”.
If that’s still zero, we would back off again to the single word “a” and just take the unigram probability of the word itself.
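A simplified sketch of the idea, assuming a toy corpus and a trigram model (this skips the discounting a proper Katz backoff would do, so it’s closer to “stupid backoff”):

```python
from collections import Counter

# Back off from trigram to bigram to unigram when counts are zero.
corpus = "the cat sat on a mat and the dog sat on the mat".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(uni.values())

def backoff_prob(w2, w1, w):
    """P(w | w2 w1): use the trigram if seen, else the bigram, else the unigram."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return bi[(w1, w)] / uni[w1]
    return uni[w] / total

print(backoff_prob("sat", "on", "a"))  # trigram "sat on a" was seen: 1/2
print(backoff_prob("dog", "on", "a"))  # unseen trigram, falls back to bigram "on a": 1/2
```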
2. Linear Interpolation:
We would mix all of the probabilities from the unigram, bigram, trigram, etc. The exact weight of each ($\lambda_i$) would be learned (and the sum of all weights adds up to 1).
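A sketch of the same toy setup with interpolation. The $\lambda$ values below are hand-picked just to show the mechanics; in practice they would be learned (e.g. tuned on held-out data) so that they sum to 1:

```python
from collections import Counter

# Mix trigram, bigram, and unigram estimates with weights that sum to 1.
corpus = "the cat sat on a mat and the dog sat on the mat".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = sum(uni.values())

def interp_prob(w2, w1, w, lambdas=(0.6, 0.3, 0.1)):
    """lambda_3 * P(w | w2 w1) + lambda_2 * P(w | w1) + lambda_1 * P(w)."""
    l3, l2, l1 = lambdas
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p1 = uni[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

print(interp_prob("sat", "on", "a"))  # non-zero even when a higher-order count is missing
```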
3. Laplace Smoothing
Also used in Naive Bayes: we simply add 1 (or a small $\delta$) to every single count, and add $V$ (the vocabulary size, scaled by $\delta$) to each denominator so everything still sums to 1. That’s it lol. It works surprisingly well.
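A sketch of add-1 (or add-$\delta$) smoothing for bigrams, again on a toy corpus:

```python
from collections import Counter

# Add-1 (Laplace) smoothing for bigrams: no estimate is ever zero.
corpus = "the cat sat on a mat and the dog sat on the mat".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
V = len(uni)  # vocabulary size

def laplace_bigram_prob(prev, word, delta=1.0):
    """(C(prev word) + delta) / (C(prev) + delta * V)."""
    return (bi[(prev, word)] + delta) / (uni[prev] + delta * V)

print(laplace_bigram_prob("the", "dog"))  # seen bigram, slightly discounted
print(laplace_bigram_prob("the", "sat"))  # unseen bigram, but no longer zero
```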
Other Challenges:
Long-Range Dependency:
In language, one word often agrees with another, even when there’s a long gap in between.
Sam/Dogs sleeps/sleep soundly.
Sam, the man with red hair who is my cousin, sleeps soundly.
N-Grams would struggle to capture this, especially if the agreeing words fall outside their context window (i.e. more than N words apart).