Why?
Let’s say we have an RNN as a language model (or an LLM). We get it to translate some French into two candidate pieces of text (MT1 and MT2). How do we evaluate them?
We need to design a metric that scores MT1 > MT2.
BLEU (Bilingual Evaluation Understudy) Metric - Example
Count the n-grams that overlap between the machine translation (MT) and the human references.
MT: “The the the the the the the a”
Human Generated Reference 1: “The cat is on the mat”
Human Generated Reference 2: “There is a cat on the mat”
- ‘the’ appears 7 times in MT, 2 times in Reference 1, and 1 time in Reference 2.
- ‘a’ appears 1 time in MT, 0 times in Reference 1, and 1 time in Reference 2.
Each n-gram’s count in the MT is clipped to the maximum number of times it appears in any single reference: ‘the’ is clipped from 7 down to 2, and ‘a’ stays at 1.
Unigram precision: (2 + 1) / (7 + 1) = 3 / 8
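A minimal Python sketch of this clipped (modified) unigram precision, reproducing the 3/8 above (the function name `clipped_ngram_precision` is my own, not from any library):

```python
from collections import Counter

def clipped_ngram_precision(candidate, references, n=1):
    """Clipped (modified) n-gram precision, the core of BLEU.

    Each n-gram's count in the candidate is clipped to the maximum
    number of times it appears in any single reference.
    """
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate, n)

    # For each n-gram, take the max count across all references (the "clip").
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

mt = "The the the the the the the a"
refs = ["The cat is on the mat", "There is a cat on the mat"]
print(clipped_ngram_precision(mt, refs))  # 0.375 == 3/8
```

Note that without clipping, plain precision would give 8/8 = 1.0 for this degenerate MT output; the clip is what makes the metric punish n-gram repetition.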
Weaknesses:
- Harsh on valid paraphrases of the translation, since they may share few n-grams with the references.
- Doesn’t penalise hallucinations much.
- Unreliable for creative tasks (“Generate a scary novel about Edinburgh”): human-written references would diverge too much from one another for n-gram overlap to be meaningful.