BLEU and ROUGE

Learn how BLEU and ROUGE evaluate LLM outputs by matching n-grams to reference texts, and why they often fall short in real-world generative tasks.

Top AI and tech companies now expect candidates to understand language model evaluation metrics beyond perplexity, which was covered previously. Metrics like BLEU and ROUGE come up frequently in interviews because they assess the quality of generated outputs rather than how well a model predicts the next token.

Interviewers want to see whether candidates grasp what BLEU and ROUGE measure, when they are appropriate to use, and where they break down, especially in open-ended tasks like conversational AI. Discussing this well demonstrates an understanding of the difference between internal model confidence (perplexity) and output quality (BLEU/ROUGE).

Strong candidates explain how to choose the right metric for each task and critically assess their strengths and weaknesses, showing they don’t apply BLEU/ROUGE everywhere. While perplexity indicates how well a model learns language patterns, BLEU and ROUGE are crucial for evaluating the relevance and usefulness of actual model outputs in real-world applications.

What exactly is BLEU?

BLEU (Bilingual Evaluation Understudy) is an automatic metric originally designed for evaluating machine translation. Intuitively, BLEU scores a candidate translation by checking how many n-grams (contiguous word sequences) it shares with one or more human reference translations. More formally, BLEU computes n-gram precision: for each n (typically 1 ≤ n ≤ 4), it counts the fraction of n-grams in the model's output that also appear in the reference(s), with counts clipped so a repeated n-gram cannot be credited more times than it occurs in the reference. These precisions are then combined across n-gram orders, typically via a geometric mean. BLEU also includes a brevity penalty: if the generated translation is too short compared to the reference, BLEU penalizes it to avoid "cheating" by omitting content.
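To make the mechanics concrete, here is a minimal sketch of BLEU for a single candidate and a single reference, with uniform weights and the standard brevity penalty. The function names and the example sentences are illustrative, not taken from any particular library; production code would normally rely on an established implementation such as sacrebleu or NLTK's sentence_bleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total > 0 else 0.0

def bleu(candidate, reference, max_n=4):
    """BLEU sketch: geometric mean of 1..max_n clipped precisions,
    multiplied by a brevity penalty (single reference, uniform weights)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses to 0 without smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

candidate = "the quick brown fox jumps over the lazy dog".split()
reference = "the quick brown fox jumped over the lazy dog".split()
print(f"BLEU: {bleu(candidate, reference):.3f}")
```

Note that short outputs often have no matching 4-grams at all, which drives the geometric mean to zero; this is why practical implementations apply smoothing at the sentence level or compute BLEU over an entire corpus.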
