What:
Typically in ML, we don’t feed the Vector Embeddings of full raw words or full images into Neural Networks. Instead, we feed in tokens - parts of a word or image.
Why?
- We could feed individual characters into the model, but that's inefficient: sequences get very long and each character carries little meaning on its own.
- We could take whole words - but then how do we handle new or rare words we've never seen?
Instead, it's far more useful to take common word chunks: common words get their own token, and new or rare words can be built efficiently from smaller pieces.
How?
Turning text into tokens is called tokenisation. We often use Byte Pair Encoding (BPE - the scheme used by GPT models such as ChatGPT).
Algorithm
Start from a base vocabulary of individual characters (or bytes), then repeat until the vocabulary reaches the desired size (a minimal code sketch follows this list):
- Count token pairs: how many times each adjacent pair of tokens occurs in the training data;
- Find the most frequent pair of tokens;
- Merge this pair - add a merge to the merge table, and the new token to the vocabulary.
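In code, the training loop is only a few lines. Here is a minimal, illustrative Python sketch - the name learn_bpe and the flat-list treatment of the text are my own simplifications; real implementations work per word (with word frequencies), operate on bytes, and define a deterministic tie-break for equally frequent pairs:

```python
from collections import Counter

def learn_bpe(tokens, num_merges):
    """Learn a BPE merge table from a flat list of initial tokens (e.g. characters)."""
    merges = []          # ordered merge table, e.g. [("l", "o"), ("lo", "w"), ...]
    vocab = set(tokens)  # the vocabulary starts as the set of initial tokens
    for _ in range(num_merges):
        # 1. Count how often each adjacent pair of tokens occurs in the data.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # 2. Find the most frequent pair.
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:
            break        # every remaining pair occurs only once - stop merging
        # 3. Merge it: add it to the merge table / vocabulary and rewrite the data.
        merges.append((a, b))
        vocab.add(a + b)
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                new_tokens.append(a + b)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return merges, vocab, tokens

merges, vocab, final_tokens = learn_bpe(list("lowlowerlowestbar"), num_merges=10)
print(merges)  # learns ('l', 'o'), then ('lo', 'w'), then ('low', 'e'); remaining pairs occur only once
```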
(Contrived) Example:
1. Start with Characters:
Text: "low, lower, lowest, Bar"
- Initial tokens:
["l", "o", "w", ",", "l", "o", "w", "e", "r", "l", "o", "w", "e", "s", "t", "b", "a", "r"]
2. Iteratively Merge the Most Frequent Pairs Until Desired Vocab Size is Met:
Find the most common character pairs and merge them into tokens.
- Iteration 1:
"l" + "o"
→["lo", "w", "lo", "w", "e", "r", "lo", "w", "e", "s", "t", "b", "a", "r"]
- Iteration 2:
"lo" + "w"
→["low", "low", "e", "r", "low", "e", "s", "t", "b", "a", "r"]
- Iteration 3:
"low" + "e"
→["low", "lowe", "r", "lowe", "s", "t", "b", "a", "r"]
- Iteration 4:
"lowe" + "r"
→["low", "lower", "lowe", "s", "t", "b", "a", "r"]
- …
- Final tokens:
["low", "lower", "lowest", "bar"]
(after enough further merges that each word becomes a single token; a real BPE vocabulary also keeps the individual characters and every intermediate merge, not just these final tokens).
This (admittedly contrived) example also lets us build new words efficiently - e.g. "barlow" = "bar" + "low", as sketched in the code below.
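Encoding new text with a trained BPE tokeniser is just a replay of the merge table over the characters. The sketch below uses a hypothetical hand-written toy merge table in the spirit of the example above (a real table would be learned from data, as in the earlier sketch):

```python
def apply_merges(word, merges):
    """Segment a new word by greedily replaying a learned merge table."""
    tokens = list(word)                  # start from individual characters
    for a, b in merges:                  # apply merges in the order they were learned
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                new_tokens.append(a + b)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens

# Toy merge table for illustration only - not from a real tokeniser:
toy_merges = [("l", "o"), ("lo", "w"), ("b", "a"), ("ba", "r")]
print(apply_merges("barlow", toy_merges))  # ['bar', 'low']
```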
Problems:
1️⃣ Segments Based on Frequency, not Meaning or Morphology:
"running" → ["run", "##ning"]
"runner" → ["run", "##ner"]
But "runs" → ["runs"] (no split!)
The model might not generalise well between “running”/“runner”, and “runs”, even though they share the same root.
2️⃣ Inconsistent Across Training Datasets:
- With a tokeniser trained on one corpus:
"New York"
→["New", " York"]
- With a tokeniser trained on another:
"New York"
→["New", "Y", "ork"]
The same string can end up with different token sequences depending on the corpus the tokeniser was trained on, hurting consistency in downstream tasks.
3️⃣ Fixed Vocab Size:
Once BPE is trained, the vocabulary is fixed.
Effect: new words that weren't in the tokeniser's training data get awkwardly split into many subwords, even if they later become common.
Example:
"COVID19"
→["CO", "VID", "19"]
Inconsistent tokenisation for out-of-vocabulary words leads to poor generalisation in real-world applications.
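To see how these splits look with a real, production BPE vocabulary, you can decode each token id back to its text piece. The snippet below uses OpenAI's tiktoken library with its cl100k_base encoding; the exact splits (including whether "COVID19" really becomes ["CO", "VID", "19"]) depend on which encoding you load, so treat this as an inspection recipe rather than a guaranteed segmentation:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a pretrained byte-level BPE vocabulary

for text in ["running", "runner", "runs", "New York", "COVID19"]:
    ids = enc.encode(text)                   # token ids
    pieces = [enc.decode([i]) for i in ids]  # the text piece behind each id
    print(f"{text!r} -> {pieces}")
```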
Funnily enough, even after all of those problems - it’s still the best we’ve got.