What:

Typically in ML, we don’t feed the Vector Embeddings of whole raw words or entire images into Neural Networks. Instead, we feed in tokens - parts of a word or image.

Why?

  • We could take individual characters and feed those into the model. But that’s inefficient: sequences become very long and each character carries little meaning on its own.
  • We could take whole words - but then how do we deal with new words we’ve never seen?

Instead, it’s far more useful to take common word chunks - common words get their own token, and new words can be built efficiently from existing pieces.

How?

Turning text into tokens is called tokenisation. We often use Byte Pair Encoding (BPE - the approach behind ChatGPT’s tokeniser).

Algorithm

Repeat:

  • Count pairs of tokens: how many times each pair occurs together in the training data;
  • Find the most frequent pair of tokens;
  • Merge this pair - add a merge to the merge table, and the new token to the vocabulary (see the sketch below).
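
Here’s a minimal sketch of that loop in Python. It’s illustrative only - the function name learn_bpe, the lowercased whitespace-split corpus, and the first-seen tie-breaking are my own assumptions, not how a production tokeniser is written - but the first merges it finds match the worked example below.

```python
from collections import Counter

def learn_bpe(corpus: str, num_merges: int):
    """Learn a BPE merge table from a toy corpus (illustrative sketch only)."""
    # Each word is a tuple of its current tokens (single characters to start).
    words = Counter(tuple(word) for word in corpus.lower().split())

    merges = []                                    # ordered merge table
    vocab = {ch for word in words for ch in word}  # start from the base characters

    for _ in range(num_merges):
        # 1. Count how often each adjacent pair of tokens occurs in the training data.
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge

        # 2. Find the most frequent pair (ties broken by first-seen order here).
        best = max(pair_counts, key=pair_counts.get)

        # 3. Merge this pair everywhere; record the merge and the new vocabulary entry.
        merges.append(best)
        vocab.add(best[0] + best[1])
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words

    return merges, vocab

merges, vocab = learn_bpe("low lower lowest bar", num_merges=8)
print(merges[:4])  # [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 'r')]
```

Real byte-level BPE (the variant behind GPT-style tokenisers) works on UTF-8 bytes rather than characters, so spaces and punctuation are handled explicitly rather than ignored.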

(Contrived) Example:

1. Start with Characters:

Text: "low, lower, lowest, Bar"

  • Initial tokens: ["l", "o", "w", ",", "l", "o", "w", "e", "r", "l", "o", "w", "e", "s", "t", "b", "a", "r"]
2. Iteratively Merge the Most Frequent Pairs Until Desired Vocab Size is Met:

Find the most common character pairs and merge them into tokens.

  • Iteration 1: "l" + "o" → ["lo", "w", "lo", "w", "e", "r", "lo", "w", "e", "s", "t", "b", "a", "r"]
  • Iteration 2: "lo" + "w" → ["low", "low", "e", "r", "low", "e", "s", "t", "b", "a", "r"]
  • Iteration 3: "low" + "e" → ["low", "lowe", "r", "lowe", "s", "t", "b", "a", "r"]
  • Iteration 4: "lowe" + "r" → ["low", "lower", "lowe", "s", "t", "b", "a", "r"]
  • Final tokens: ["low", "lower", "lowest", "bar"] (after a few more merges; note the vocabulary also keeps the single characters and intermediate tokens like "lo" and "lowe").

This (admittedly contrived and simplified) example also lets us build new words efficiently - e.g. "Barlow" = "bar" + "low".
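
To make the "Barlow" point concrete, here’s a companion sketch that segments a new word by replaying the learned merge table (encode_word is a made-up name; real tokenisers apply the highest-priority applicable merge at each step and work on bytes, but the idea is the same):

```python
def encode_word(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a word by replaying the learned merges in order (toy sketch)."""
    tokens = list(word.lower())
    for pair in merges:  # replay merges in the order they were learned
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(pair[0] + pair[1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# The merge table learnt by the sketch above on "low lower lowest bar".
merges = [("l", "o"), ("lo", "w"), ("low", "e"), ("lowe", "r"),
          ("lowe", "s"), ("lowes", "t"), ("b", "a"), ("ba", "r")]

print(encode_word("Barlow", merges))  # ['bar', 'low']
```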

Problems:

1️⃣ Segments Based on Frequency, not Meaning or Morphology:

"running" → ["run", "##ning"]
"runner" → ["run", "##ner"]
But "runs" → ["runs"] (no split!)

The model might not generalise well between “running”/“runner” and “runs”, even though they share the same root.

2️⃣ Inconsistent Across Training Datasets:
  • In one dataset: "New York" → ["New", " York"]
  • In another dataset: "New York" → ["New", "Y", "ork"]

The same word might have different representations depending on the corpus used, hurting consistency in downstream tasks.
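
You can reproduce this effect with the toy learn_bpe / encode_word sketches from above (the two corpora are invented purely for illustration):

```python
# The same word segments differently under tokenisers trained on different corpora.
merges_a, _ = learn_bpe("lower lower lower", num_merges=4)
merges_b, _ = learn_bpe("low low low er er er", num_merges=4)

print(encode_word("lower", merges_a))  # ['lower']
print(encode_word("lower", merges_b))  # ['low', 'er']
```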

3️⃣ Fixed Vocab Size:

Once BPE is trained, the vocabulary is fixed.
Effect: New words that weren’t in the training data are awkwardly split into subwords, even if they are common in future datasets.

Example:

  • "COVID19" → ["CO", "VID", "19"]

Inconsistent tokenisation for out-of-vocabulary words leads to poor generalisation in real-world applications.
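
With the toy sketches above, the fallback behaviour looks like this - an unseen word reuses the chunks the merge table knows and degrades to single characters for the rest:

```python
# "lowdown" never appeared in training: "low" is reused, the rest
# falls back to single-character tokens.
print(encode_word("lowdown", merges))  # ['low', 'd', 'o', 'w', 'n']
```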

Funnily enough, even with all of those problems - it’s still the best we’ve got.