Why?

Let’s say we have an RNN as a language model (or an LLM). We ask it to translate some French, and it produces two candidate translations (MT1 and MT2). How do we evaluate them?

Suppose MT1 is the better translation. We need to design a metric that would score MT1 > MT2.

BLEU (Bilingual Evaluation Understudy) Metric - Example

Count the n-grams that overlap between the machine translation and one or more human reference translations. The more overlap, across multiple n-gram sizes, the better.

MT: “The the the the the the the a”
Human Generated Reference 1: “The cat is on the mat”
Human Generated Reference 2: “There is a cat on the mat”

Steps:

  1. Get the n-gram precision for each n-gram size. We get this by:
    • N-gram precision p_n = (count of n-grams in the MT that also appear in a reference, with each n-gram’s count clipped to its maximum count in any single reference) / (total number of n-grams in the MT)
  2. Get the geometric mean of the precision scores.
  3. Multiply by a brevity penalty for being too short: BP = 1 if the MT is longer than the reference, otherwise exp(1 - reference length / MT length). A minimal code sketch of these steps follows this list.
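
As a concrete illustration, here is a minimal Python sketch of these three steps. The function names, the uniform n-gram weights, and the use of the shortest reference length in the brevity penalty are choices made for the sketch, not something fixed by the notes (real implementations vary, e.g. using the reference length closest to the candidate).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram's count is clipped
    to the maximum number of times it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n clipped precisions, multiplied by a
    brevity penalty for candidates shorter than the shortest reference."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        geo_mean = 0.0
    else:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c = len(candidate)
    r = min(len(ref) for ref in references)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean
```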

‘the’ appears 7 times in MT
‘the’ appears 2 times in Reference 1
‘the’ appears 1 time in Reference 2
‘a’ appears 1 time in MT
‘a’ appears 0 times in Reference 1
‘a’ appears 1 time in Reference 2

Unigram precision: (2 + 1) / (7 + 1) = 3 / 8 — ‘the’ is clipped to 2 (its maximum count in a single reference) and ‘a’ is clipped to 1, while the denominator is the 8 unigrams in the MT output.
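
We can check this worked example with a few lines of Python (lower-casing and whitespace tokenisation are simplifying assumptions made here):

```python
from collections import Counter

mt = "the the the the the the the a".split()
ref1 = "the cat is on the mat".split()
ref2 = "there is a cat on the mat".split()

mt_counts = Counter(mt)  # {'the': 7, 'a': 1}
# Clip each MT word's count to its max count in any single reference.
clipped = sum(
    min(count, max(Counter(ref1)[w], Counter(ref2)[w]))
    for w, count in mt_counts.items()
)
print(clipped, "/", len(mt))  # 3 / 8
```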

Weaknesses:

  • Not great for paraphrases of the translation.
  • Doesn’t penalise hallucinations much.
  • Unreliable for creative tasks - for a prompt like “Generate a scary novel about Edinburgh”, human-written references would differ too much from each other (and from a good output) for n-gram overlap to be meaningful.