Why?

Let’s say we have an RNN as a language model (or LLM). We get it to translate some French, producing two candidate pieces of text (MT1 and MT2). How do we evaluate them?

We need to design a metric which would score MT1 > MT2 (assuming MT1 is the better translation).

BLEU (Bilingual Evaluation Understudy) Metric - Example

Count the n-grams that the machine translation (MT) shares with one or more human reference translations, clipping each n-gram's count at the maximum number of times it appears in any single reference.

MT: “The the the the the the the a”

Human Generated Reference 1: “The cat is on the mat”

Human Generated Reference 2: “There is a cat on the mat”

‘the’ appears 7 times in MT
‘the’ appears 2 times in Reference 1
‘the’ appears 1 time in Reference 2
‘a’ appears 1 time in MT
‘a’ appears 0 times in Reference 1
‘a’ appears 1 time in Reference 2

Modified (clipped) unigram precision: ‘the’ contributes min(7, 2) = 2 and ‘a’ contributes min(1, 1) = 1, over the 8 MT words: (2 + 1) / (7 + 1) = 3 / 8
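A minimal Python sketch of this clipped-count calculation, just to make the arithmetic concrete. It only computes the modified unigram precision, not the full BLEU score (which also combines higher-order n-gram precisions and a brevity penalty):

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Modified (clipped) unigram precision: each candidate word is
    credited at most as many times as it appears in any one reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = [Counter(ref.lower().split()) for ref in references]

    clipped = 0
    for word, count in cand_counts.items():
        # Counter returns 0 for words absent from a reference
        max_ref = max(rc[word] for rc in ref_counts)
        clipped += min(count, max_ref)

    return clipped / sum(cand_counts.values())

mt = "The the the the the the the a"
refs = ["The cat is on the mat", "There is a cat on the mat"]
print(clipped_unigram_precision(mt, refs))  # 0.375, i.e. 3/8
```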

Weaknesses:

  • Penalises valid paraphrases of the reference translations, since they share few n-grams.
  • Doesn’t penalise hallucinated content much.
  • Unreliable for creative tasks, e.g. “Generate a scary novel about Edinburgh”: human reference texts would vary too much for n-gram overlap to be meaningful.