BLEU and ROUGE

Learn how BLEU and ROUGE evaluate LLM outputs by matching n-grams to reference texts, and why they often fall short in real-world generative tasks.

Top AI and tech companies now expect candidates to understand language model evaluation metrics beyond perplexity, which was covered previously. Metrics like BLEU and ROUGE come up frequently in interviews because they assess the quality of generated outputs rather than how well a model predicts the next token.

Interviewers want to see whether candidates grasp what BLEU and ROUGE measure, when they are appropriate to use, and where they break down, especially in open-ended tasks like conversational AI. Discussing this well demonstrates an understanding of the difference between internal model confidence (perplexity) and output quality (BLEU/ROUGE).

Strong candidates explain how to choose the right metric for each task and critically assess their strengths and weaknesses, showing they don’t apply BLEU/ROUGE everywhere. While perplexity indicates how well a model learns language patterns, BLEU and ROUGE are crucial for evaluating the relevance and usefulness of actual model outputs in real-world applications.

What exactly is BLEU?

BLEU (Bilingual Evaluation Understudy) is an automatic metric originally designed for evaluating machine translation. Intuitively, BLEU scores a candidate translation by checking how many n-grams (contiguous word sequences) it shares with one or more human reference translations. More formally, BLEU computes n-gram precision: for each n (typically 1 ≤ n ≤ 4), it counts the fraction of n-grams in the model's output that also appear in the reference(s), with counts clipped so a repeated n-gram cannot be credited more times than it occurs in the reference. These precisions are then combined across n-gram orders, typically via a geometric mean. BLEU also includes a brevity penalty: if the generated translation is too short compared to the reference, BLEU penalizes it to avoid "cheating" by omitting content.
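To make the mechanics concrete, here is a minimal sketch of BLEU for a single candidate and a single reference, with uniform weights and the standard brevity penalty. The function names and the example sentences are illustrative, not taken from any particular library; production code would normally rely on an established implementation such as sacrebleu or NLTK's sentence_bleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total > 0 else 0.0

def bleu(candidate, reference, max_n=4):
    """BLEU sketch: geometric mean of 1..max_n clipped precisions,
    multiplied by a brevity penalty (single reference, uniform weights)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses to 0 without smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

candidate = "the quick brown fox jumps over the lazy dog".split()
reference = "the quick brown fox jumped over the lazy dog".split()
print(f"BLEU: {bleu(candidate, reference):.3f}")
```

Note that short outputs often have no matching 4-grams at all, which drives the geometric mean to zero; this is why practical implementations apply smoothing at the sentence level or compute BLEU over an entire corpus.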
