Why?
Let’s say we have an RNN as a language model (or an LLM). We get it to translate some French, producing two candidate pieces of text (MT1 and MT2). How do we evaluate them?
Suppose MT1 is the better translation. We need to design a metric that would score MT1 > MT2.
BLEU (Bilingual Evaluation Understudy) Metric - Example
Count the overlapping n-grams between the MT output and the references. The more overlap, across multiple n-gram orders, the better.
MT: “The the the the the the the a”
Human Generated Reference 1: “The cat is on the mat”
Human Generated Reference 2: “There is a cat on the mat”
Steps:
- We get the n-gram precision for each n-gram order. This is the number of n-grams in the MT output that also appear in a reference, divided by the total number of n-grams in the MT output, with each n-gram’s count clipped to the maximum number of times it appears in any single reference (see the formulas after this list).
- Get the geometric mean of the precision scores.
- Multiply by a brevity penalty if the MT output is too short.
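Putting these steps together, the standard BLEU formulation is (here $\mathrm{Count}_{\mathrm{clip}}$ caps an n-gram’s count at the maximum number of times it appears in any single reference):

$$
p_n = \frac{\sum_{\text{n-gram} \in \text{MT}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{MT}} \mathrm{Count}(\text{n-gram})}
$$

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

where $c$ is the length of the MT output, $r$ is the reference length, and $N$ is the largest n-gram order used (usually 4).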
‘the’ appears 7 times in MT
‘the’ appears 2 times in Reference 1
‘the’ appears 1 time in Reference 2
‘a’ appears 1 time in MT
‘a’ appears 0 times in Reference 1
‘a’ appears 1 time in Reference 2
Clipped unigram precision: (2 + 1) / (7 + 1) = 3 / 8. Each MT token’s count in the numerator is clipped to the maximum count in any single reference (2 for ‘the’, 1 for ‘a’); the denominator is the total number of tokens in the MT output (8).
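To make the 3/8 concrete, here is a minimal Python sketch of clipped unigram precision; the function name and the lowercased whitespace tokenisation are illustrative assumptions, not part of any particular library:

```python
from collections import Counter

def clipped_unigram_precision(mt, references):
    """Modified (clipped) unigram precision: each MT token's count is
    capped at the maximum number of times it appears in any single reference."""
    mt_counts = Counter(mt.lower().split())
    ref_counts = [Counter(ref.lower().split()) for ref in references]
    # Sum the clipped counts over the MT tokens.
    clipped = sum(
        min(count, max(rc[token] for rc in ref_counts))
        for token, count in mt_counts.items()
    )
    # Divide by the total number of tokens in the MT output.
    return clipped / sum(mt_counts.values())

mt = "The the the the the the the a"
refs = ["The cat is on the mat", "There is a cat on the mat"]
print(clipped_unigram_precision(mt, refs))  # 0.375, i.e. 3/8
```

The same clipping logic applies for higher n-gram orders; BLEU then combines the per-order precisions via the geometric mean and the brevity penalty above.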
Weaknesses:
- Doesn’t handle paraphrases well: a correct translation worded differently from the references gets a low score.
- Doesn’t penalise hallucinations much.
- Unreliable for creative tasks - “Generate a scary novel about Edinburgh” - the human-written references would deviate from each other too much for n-gram overlap to be meaningful.