Why?
Let’s say we have an RNN as a language model (or an LLM). We get it to translate some French, producing two candidate pieces of text (MT1 and MT2). How do we evaluate them?
Suppose MT1 is the better translation. We need to design a metric that would score MT1 > MT2.
BLEU (Bilingual Evaluation Understudy) Metric - Example
Count the overlapping n-grams between the MT output and the references. The more overlap, across multiple n-gram orders, the better.
MT: “The the the the the the the a”
Human Generated Reference 1: “The cat is on the mat”
Human Generated Reference 2: “There is a cat on the mat”
Steps:
- We get the n-gram precision for each n-gram order. This is the number of n-grams in the MT output that also appear in a reference, divided by the total number of n-grams in the MT output, with each n-gram’s count clipped to the maximum number of times it appears in any single reference (see the formulas after this list).
- Get the geometric mean of the precision scores.
- Multiply by a brevity penalty if the MT output is too short.
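Putting these steps together, the standard BLEU formulation is (here $\mathrm{Count}_{\mathrm{clip}}$ caps an n-gram’s count at the maximum number of times it appears in any single reference):

$$
p_n = \frac{\sum_{\text{n-gram} \in \text{MT}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{MT}} \mathrm{Count}(\text{n-gram})}
$$

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

where $c$ is the length of the MT output, $r$ is the reference length, and $N$ is the largest n-gram order used (usually 4).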
‘the’ appears 7 times in MT
‘the’ appears 2 times in Reference 1
‘the’ appears 1 time in Reference 2
‘a’ appears 1 time in MT
‘a’ appears 0 times in Reference 1
‘a’ appears 1 time in Reference 2
Clipped unigram precision: (2 + 1) / (7 + 1) = 3 / 8. Each MT token’s count in the numerator is clipped to the maximum count in any single reference (2 for ‘the’, 1 for ‘a’); the denominator is the total number of tokens in the MT output (8).
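To make the 3/8 concrete, here is a minimal Python sketch of clipped unigram precision; the function name and the lowercased whitespace tokenisation are illustrative assumptions, not part of any particular library:

```python
from collections import Counter

def clipped_unigram_precision(mt, references):
    """Modified (clipped) unigram precision: each MT token's count is
    capped at the maximum number of times it appears in any single reference."""
    mt_counts = Counter(mt.lower().split())
    ref_counts = [Counter(ref.lower().split()) for ref in references]
    # Sum the clipped counts over the MT tokens.
    clipped = sum(
        min(count, max(rc[token] for rc in ref_counts))
        for token, count in mt_counts.items()
    )
    # Divide by the total number of tokens in the MT output.
    return clipped / sum(mt_counts.values())

mt = "The the the the the the the a"
refs = ["The cat is on the mat", "There is a cat on the mat"]
print(clipped_unigram_precision(mt, refs))  # 0.375, i.e. 3/8
```

The same clipping logic applies for higher n-gram orders; BLEU then combines the per-order precisions via the geometric mean and the brevity penalty above.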
Weaknesses:
- Doesn’t handle paraphrases well: a correct translation worded differently from the references gets a low score.
- Doesn’t penalise hallucinations much.
- Unreliable for creative tasks - “Generate a scary novel about Edinburgh” - the human-written references would deviate from each other too much for n-gram overlap to be meaningful.