Background:
As we saw in Part-of-Speech Tagging, for a sequence of words there are many possible state paths (e.g. tag sequences) that could have generated it, each with some probability. Ideally, we find the state path that most likely generated the words.
What are HMMs:
A Markov Process is the assumption that the next state depends only on the current state - not the full history; i.e. $P(s_t \mid s_1, s_2, \dots, s_{t-1}) = P(s_t \mid s_{t-1})$
Therefore, given:
- Sequential observations (e.g. words, sounds etc)
- Latent/hidden underlying states that caused those observations (e.g. if observation is item of clothing worn, then underlying state may be the weather.)
- The hidden states follow a Markov Process: The assumption that the next state depends only on the current state and not the full history
HMMs are models that see a sequence of things (words, sounds etc.) and try to infer the hidden underlying states (e.g. POS tags), using probability over time.
Viterbi: How tf do we get the most likely sequence of hidden states?
$$v_t(j) = \max_i \left[\, v_{t-1}(i) \cdot a_{ij} \,\right] \cdot b_j(o_t)$$
- $v_{t-1}(i)$: best score up to the previous time step in state $i$
- $a_{ij}$: transition probability from $i$ to the current state $j$
- $b_j(o_t)$: emission probability of the observation at time $t$ from state $j$
- The max selects the most likely path into state $j$ at time $t$
- The starting step doesn’t include a transition probability, just the start probability × emission: $v_1(j) = \pi_j \cdot b_j(o_1)$
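A minimal Python sketch of one recurrence step, using the toy tagging numbers from the example below (the dict layout and variable names are my own choice, not part of the notes):

```python
# One Viterbi step: for each current state j, pick the best previous state i.
# v_prev holds the best score for each state at the previous time step.
v_prev = {"Noun": 0.54, "Verb": 0.04}
trans = {("Noun", "Noun"): 0.1, ("Noun", "Verb"): 0.9,
         ("Verb", "Noun"): 0.5, ("Verb", "Verb"): 0.5}
emit = {("Noun", "bark"): 0.2, ("Verb", "bark"): 0.8}

v_curr, backptr = {}, {}
for j in ["Noun", "Verb"]:
    # max over previous states i of v_prev[i] * a_ij, then multiply by b_j(o_t)
    best_i = max(v_prev, key=lambda i: v_prev[i] * trans[(i, j)])
    v_curr[j] = v_prev[best_i] * trans[(best_i, j)] * emit[(j, "bark")]
    backptr[j] = best_i  # remember which previous state won, for backtracking
```

The `backptr` dict is what later lets us recover the path, not just the score.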
Given:
- Observations $o_1, o_2, \dots, o_T$
- States $s_1, s_2, \dots, s_N$
- Emissions $b_j(o_t)$ (how likely an observation is for a given state)
- Transitions $a_{ij}$ (how likely one state follows another)
- Start Probabilities $\pi_j$
Setup:
- The sentence “Dogs bark”, which we observe as
["Dogs", "bark"]
- Our hidden states are:
Noun (N), Verb (V)
- Emission Probabilities:
Noun: “Dogs” = 0.9, “bark” = 0.2
Verb: “Dogs” = 0.1, “bark” = 0.8
- I.e. “Dogs” is very likely to be a noun
- Transition Probabilities $a_{ij}$:
Noun -> Noun = 0.1
Noun -> Verb = 0.9
Verb -> Noun = 0.5
Verb -> Verb = 0.5
- Probability of Starting:
Noun = 0.6
Verb = 0.4
Viterbi Table Setup:
Word 1:
Word | Noun (N) | Verb (V) |
---|---|---|
“Dogs” | v[Noun] = start[Noun] × emission[Noun]["Dogs"] = 0.6 × 0.9 = 0.54 | v[Verb] = start[Verb] × emission[Verb]["Dogs"] = 0.4 × 0.1 = 0.04 |
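As a quick sanity check on the arithmetic above, a minimal Python sketch of the initialization step (same numbers as this example):

```python
# Initialization: v_1(j) = start[j] * emission[j][first word]; no transition yet.
start = {"Noun": 0.6, "Verb": 0.4}
emission = {"Noun": {"Dogs": 0.9}, "Verb": {"Dogs": 0.1}}

v1 = {state: start[state] * emission[state]["Dogs"] for state in start}
# v1 ≈ {"Noun": 0.54, "Verb": 0.04}
```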
Word 2:
Now, we compute the probabilities of “bark”, given it came from either noun or verb.
Assuming “bark” is a noun (i.e. v["bark"][Noun]):
Noun → Noun: 0.54 × 0.1 × 0.2 = 0.0108
Verb → Noun: 0.04 × 0.5 × 0.2 = 0.004
Since 0.0108 is larger, we can assume that, given “bark” is a noun, it’s most likely that “Dogs” is also a noun.
Assuming “bark” is a verb (i.e. v["bark"][Verb]):
Noun → Verb: 0.54 × 0.9 × 0.8 = 0.3888
Verb → Verb: 0.04 × 0.5 × 0.8 = 0.016
Since 0.3888 is larger, we can assume that, given “bark” is a verb, it’s most likely that “Dogs” is also a noun.
Updating the table:
Word | Noun (N) | Verb (V) | Backpointer |
---|---|---|---|
“Dogs” | 0.54 | 0.04 | start |
“bark” | 0.0108 | 0.3888 | Noun |
Once we’ve hit the end, we backtrack:
Since the highest end score was 0.3888 (in the Verb column), we backtrack from that state to the most likely previous state that gave us that score (its backpointer: Noun). We repeat that until we get back to the start. The back-tracked path, Noun → Verb, is thus the most likely path: “Dogs” is a noun and “bark” is a verb.
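Putting the whole walkthrough together, a minimal sketch of Viterbi with backtracking (function shape and variable names are mine; the numbers are the ones from this example):

```python
def viterbi(obs, states, start, trans, emit):
    """Return (best_path, best_score) for a sequence of observations."""
    # Initialization: start probability * emission, no transition.
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    backptr = [{}]
    for t in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for j in states:
            # Best previous state i, maximising v[t-1][i] * trans[i][j]
            best_i = max(states, key=lambda i: v[t - 1][i] * trans[i][j])
            v[t][j] = v[t - 1][best_i] * trans[best_i][j] * emit[j][obs[t]]
            backptr[t][j] = best_i
    # Backtrack from the highest-scoring final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

states = ["Noun", "Verb"]
start = {"Noun": 0.6, "Verb": 0.4}
trans = {"Noun": {"Noun": 0.1, "Verb": 0.9},
         "Verb": {"Noun": 0.5, "Verb": 0.5}}
emit = {"Noun": {"Dogs": 0.9, "bark": 0.2},
        "Verb": {"Dogs": 0.1, "bark": 0.8}}

path, score = viterbi(["Dogs", "bark"], states, start, trans, emit)
# path == ["Noun", "Verb"], score ≈ 0.3888 — same as the table above
```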
Where tf did the probabilities come from?
Take a labelled corpus, and simply count and divide (like in normal Probability, duh): $a_{ij} = \frac{\text{count}(s_i \to s_j)}{\text{count}(s_i)}$, $b_j(w) = \frac{\text{count}(s_j,\, w)}{\text{count}(s_j)}$, and $\pi_j$ from how often each tag starts a sentence.
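A sketch of the counting, assuming a tiny hand-labelled corpus (the corpus here is made up for illustration):

```python
from collections import Counter

# Tiny made-up labelled corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("Dogs", "Noun"), ("bark", "Verb")],
    [("Cats", "Noun"), ("sleep", "Verb")],
    [("Dogs", "Noun"), ("chase", "Verb"), ("cats", "Noun")],
]

tag_counts = Counter()
emit_counts = Counter()   # keyed by (tag, word)
trans_counts = Counter()  # keyed by (prev_tag, tag)

for sent in corpus:
    prev = None
    for word, tag in sent:
        tag_counts[tag] += 1
        emit_counts[(tag, word)] += 1
        if prev is not None:
            trans_counts[(prev, tag)] += 1
        prev = tag

# Count and divide: P(word | tag) and P(tag | prev_tag).
# (Simplification: the transition denominator counts all occurrences of the
# previous tag, including sentence-final ones.)
emission = {k: c / tag_counts[k[0]] for k, c in emit_counts.items()}
transition = {k: c / tag_counts[k[0]] for k, c in trans_counts.items()}
```

In practice you'd also smooth these counts, since any unseen (tag, word) pair would otherwise get probability zero.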
Problems with Viterbi
Let’s say, by pure chance, the best-scoring path goes through 2+ words that are incredibly uncommon in the training data. After that point, the scores are so small that the model’s behaviour becomes incredibly unpredictable. The same is also true in language models: whenever you feed one solidGoldMagikarp (distinctly random, non-human text), the model begins to freak out / act randomly.