Markov Assumption

This is all based on the Markov Assumption: the probability of a word depends only on a fixed number of previous words. This means N-Gram language models are not that good for long contexts.
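Written out, the chain rule over a whole sentence gets approximated by conditioning each word on only the previous $N-1$ words (this is the standard statement of the assumption):

$$
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
$$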

What:

Before Transformers or LSTMs, we had N-Gram language models. They were quick, explainable, and efficient statistical language models.

An N-Gram model predicts the probability of a word given its previous words (conditional probability).

How?

First, N-Grams:

A sequence of n words from a given text (quick code sketch after the list below).

  • Unigram (n=1): Each word is independent (e.g., “The”, “dog”, “runs”).
  • Bigram (n=2): Considers pairs of words (e.g., “The dog”, “dog runs”).
  • Trigram (n=3): Uses three-word sequences (e.g., “The dog runs”).
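Here’s a minimal sketch of pulling these out of a tokenized sentence (a plain whitespace split stands in for real tokenization):

```python
# Minimal sketch: extract n-grams from a token list.
def ngrams(tokens, n):
    """Return the n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The dog runs in the park".split()

print(ngrams(tokens, 1))  # unigrams: ('The',), ('dog',), ...
print(ngrams(tokens, 2))  # bigrams:  ('The', 'dog'), ('dog', 'runs'), ...
print(ngrams(tokens, 3))  # trigrams: ('The', 'dog', 'runs'), ...
```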

Predicting The Next Word:

In essence, we’re trying to predict the word given the words before it.

How?

We use the formula below:

$$
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
$$

where:

  • $C(w_{n-1}\, w_n)$ is the count of the word sequence $w_{n-1}\, w_n$.
  • $C(w_{n-1})$ is the count of the preceding word $w_{n-1}$.
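As a rough sketch of that estimate for bigrams (the toy corpus and function name here are just for illustration):

```python
from collections import Counter

# Toy corpus, purely for illustration.
corpus = "the dog runs . the dog sleeps . the cat runs .".split()

# C(w_{n-1} w_n): bigram counts, and C(w_{n-1}): unigram counts.
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(prev_word, word):
    """P(word | prev_word) = C(prev_word word) / C(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen context; this is exactly what smoothing (below) deals with
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "dog"))   # 2/3 — "the" appears 3 times, "the dog" twice
print(bigram_prob("dog", "runs"))  # 1/2 — "dog" appears twice, "dog runs" once
```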

What if either is 0?

Often, with larger contexts / corpora, either the numerator or the denominator will be zero. We need to add smoothing (similarly to Naive Bayes). How? A couple of ways:

1. Backoff Smoothing

What if we use less context when we’ve never seen the full thing before? Say the full context ends in “on a” (e.g. something like “sat on a”) and we’ve never seen that full sequence.

We would then back off and check the count of “on a”.

If that’s still zero, we would check for “a”.

If still zero, we would just take the unigram probability of the word itself.
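A rough sketch of that back-off in code. It just keeps shortening the context whenever the needed count is zero (the simple “use less context” idea, not a full Katz back-off with discounting), and the toy corpus is made up:

```python
from collections import Counter

# Toy corpus, purely for illustration.
tokens = "the cat sat on a mat . a dog sat on a rug .".split()

# Count all n-grams up to length 4 so a 3-word context plus the next word fits.
counts = Counter()
for n in (1, 2, 3, 4):
    counts.update(zip(*[tokens[i:] for i in range(n)]))
total_words = len(tokens)

def backoff_prob(context, word):
    """Shorten the context until we have a usable count, then use relative counts."""
    context = tuple(context)
    while context:
        if counts[context + (word,)] > 0:
            return counts[context + (word,)] / counts[context]
        context = context[1:]  # e.g. ("sat", "on", "a") -> ("on", "a") -> ("a",)
    return counts[(word,)] / total_words  # last resort: the unigram probability

print(backoff_prob(("sat", "on", "a"), "mat"))       # full context seen: 1/2
print(backoff_prob(("stood", "on", "a"), "mat"))     # backs off to "on a": 1/2
print(backoff_prob(("never", "seen", "it"), "mat"))  # falls all the way back to the unigram: 1/14
```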

2. Linear Interpolation:

We would mix all of the probabilities of the unigram, bigram, trigram, etc. The exact weighting of each ($\lambda_i$) would be learned (and the sum of all weights adds up to 1).
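A quick sketch of that mix for a trigram model, with hard-coded weights standing in for learned ones (in practice the $\lambda_i$ are tuned on held-out data; the toy corpus is again made up):

```python
from collections import Counter

# Toy corpus, purely for illustration.
tokens = "the cat sat on a mat . a dog sat on a rug .".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*[tokens[i:] for i in range(n)]))
total_words = len(tokens)

def mle(context, word):
    """Plain relative-frequency estimate; 0 if the context was never seen."""
    denom = counts[context] if context else total_words
    return counts[context + (word,)] / denom if denom else 0.0

def interpolated_prob(w1, w2, word, lambdas=(0.6, 0.3, 0.1)):
    """P(word | w1 w2) as a weighted mix of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas  # the weights sum to 1; learned on held-out data in practice
    return (l3 * mle((w1, w2), word)
            + l2 * mle((w2,), word)
            + l1 * mle((), word))

print(interpolated_prob("sat", "on", "a"))  # stays non-zero even when a higher-order count is missing
```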


3. Laplace Smoothing

Also used in Naive Bayes, we simply add 1 (or a small $k$) to every single count. That’s it lol. It works surprisingly well.
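For bigrams, the add-one version looks like this, where $V$ is the vocabulary size (replacing the 1 with a small $k$ gives add-$k$ smoothing):

$$
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}
$$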