Elo Rating Systems for LLMs

Learn how the Elo rating system transforms pairwise human judgments into dynamic, scalable rankings for evaluating large language models.

The Elo rating system was originally developed for ranking players in competitive games like chess. It assigns each player a numerical rating that rises or falls based on game outcomes. In chess, for example, a newcomer might start around 1200 points and gain or lose points after each match, depending on whether they win or lose and how strong their opponent was. Over time, Elo ratings reflect players’ relative skill levels—a higher rating means the player is expected to win more often against lower-rated opponents.
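
Concretely, the standard Elo update has two parts: an expected score derived from the rating difference, and an adjustment proportional to how far the actual result deviated from that expectation. The K-factor below is a common convention (chess federations use values such as 16 or 32), not a fixed constant:

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)
```

Here, R_A and R_B are the players' current ratings, S_A is 1 for a win, 0.5 for a draw, and 0 for a loss, and E_A is the expected score. For example, if a 1200-rated newcomer beats a 1400-rated opponent with K = 32, their expected score is about 0.24, so they gain roughly 24 points.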

More recently, this system has been adapted to evaluate LLMs by treating model comparisons like games; companies and research groups use it to see how their models perform on real user prompts, complementing traditional benchmarks. Instead of chess players, we have language models; instead of a chess match, we have a pairwise comparison of their answers. Two models (say, Model A and Model B) are given the same prompt, and a human judge decides which model’s response is better. The preferred model is treated as the winner of that “game.” Just as Elo updates a chess player’s rating after a match, we update each model’s Elo score after the comparison. Over many such comparisons, the Elo scores rank the models from strongest to weakest. This approach has become popular for open-ended LLM evaluation, where direct automatic metrics fall short: crowdsourced pairwise voting combined with Elo provides an intuitive and scalable way to build a leaderboard of models.
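
To make the mechanics concrete, here is a minimal sketch in Python that maintains Elo ratings for a handful of models and updates them from a stream of pairwise human votes. The model names, the starting rating of 1000, and K = 32 are illustrative assumptions rather than values from any particular leaderboard.

```python
# Minimal Elo leaderboard over pairwise LLM comparisons (illustrative sketch).

K = 32            # update step size; a common but arbitrary choice
INITIAL = 1000.0  # starting rating for every model (assumed, not standardized)

ratings = {}  # model name -> current Elo rating


def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def record_vote(model_a: str, model_b: str, score_a: float) -> None:
    """Update both models' ratings after one human judgment.

    score_a is 1.0 if the judge preferred model_a, 0.0 if they preferred
    model_b, and 0.5 for a tie.
    """
    r_a = ratings.setdefault(model_a, INITIAL)
    r_b = ratings.setdefault(model_b, INITIAL)
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + K * (score_a - e_a)
    ratings[model_b] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))


# Hypothetical judgments: (model A, model B, outcome for A).
votes = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
    ("model-z", "model-x", 0.0),
]

for a, b, s in votes:
    record_vote(a, b, s)

# Leaderboard: highest-rated model first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model:10s} {rating:7.1f}")
```

One practical caveat: online Elo updates depend on the order in which comparisons arrive, so production leaderboards often recompute ratings over the full comparison history rather than relying on a single online pass, though the core pairwise update is the same.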
