Beam Search

Learn how beam search works in the context of large language model (LLM) decoding, why it’s a widely used strategy for generating coherent text, and how to identify and fix its limitations.

“What is beam search in the context of LLM decoding, and what are its limitations?” This is a commonly asked question because decoding strategies are crucial for generating high-quality text with large language models (LLMs). Top AI companies—from research-focused labs like OpenAI and Anthropic to industry teams at Meta AI, Hugging Face, and startups like Mistral AI—care about how their models produce text. Interviewers at these organizations want to ensure you understand not just how to train models, but also how to guide a model’s output. Decoding algorithms like beam search directly impact the coherence, diversity, and quality of an LLM’s generated text, making this a favorite topic in technical interviews.

You’ll likely encounter this question when interviewing for AI/ML engineer roles that involve fine-tuning or deploying language models. By asking about beam search, the interviewer tests a few things: your grasp of LLM internals, your understanding of sequence generation trade-offs, and whether you know how to improve or modify decoding strategies for better results. In other words, they want to see whether you can reason about why an LLM might generate dull or repetitive answers and how to fix that.

Before diving into beam search, it helps to understand what “search” means in LLM decoding. We’ll cover this in detail in the next section, including how greedy, beam, and sampling strategies differ in exploring the space of possible sequences. Let’s get started!

What is search in LLMs?

When an LLM generates text, it does so one token at a time, and at each step it outputs a probability distribution over all possible next tokens. Deciding which token to pick from that distribution is decoding, or “search,” because we are effectively searching for the best sequence the model can produce. You can imagine the generation process as a tree of possibilities: the root is the start of your sequence, each tree level represents the next token position, and each branch is a possible token. The job of a decoding strategy is to traverse this tree in a smart way to build a complete sequence.

Greedy search is the simplest strategy: take the highest-probability token at each step. Greedy search is fast and easy—but it’s shortsighted. Picking the locally most likely token can lead to a suboptimal overall sequence. For example, maybe the second-best token at one stage would lead to a much better sentence later. Greedy search “commits” to a path without considering alternatives, much like a person who always takes the seemingly best immediate move in a game, potentially missing a winning long-term strategy. Technically, greedy decoding doesn’t always produce the globally most likely sequence under the model’s probability distribution. It often results in generic or repetitive text because the model always chooses the obvious high-probability continuations.
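To make this concrete, here is a minimal greedy decoder over a toy model. The token probabilities below are invented for illustration (they are not from any real LLM); the dict simply stands in for querying a model’s softmax output:

```python
# Hypothetical toy next-token model: maps a context (tuple of tokens
# generated so far) to a probability distribution over the next token.
TOY_MODEL = {
    (): {"the": 0.5, "a": 0.4, "cats": 0.1},
    ("the",): {"cat": 0.4, "dog": 0.35, "bird": 0.25},
    ("a",): {"dog": 0.9, "cat": 0.1},
    ("the", "cat"): {"<EOS>": 1.0},
    ("the", "dog"): {"<EOS>": 1.0},
    ("the", "bird"): {"<EOS>": 1.0},
    ("a", "dog"): {"<EOS>": 1.0},
    ("a", "cat"): {"<EOS>": 1.0},
    ("cats",): {"<EOS>": 1.0},
}

def greedy_decode(model, max_len=10):
    """Take the single highest-probability token at every step."""
    seq = ()
    for _ in range(max_len):
        dist = model[seq]
        token = max(dist, key=dist.get)  # locally best choice only
        if token == "<EOS>":
            break
        seq = seq + (token,)
    return seq

print(greedy_decode(TOY_MODEL))  # ('the', 'cat'), total prob 0.5 * 0.4 = 0.20
```

Note that in this toy distribution the sequence “a dog” has a higher total probability (0.4 * 0.9 = 0.36), but greedy commits to “the” at step one and never considers it.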

At the other extreme, one could attempt exhaustive search, evaluating every possible sequence to find the absolute top-probability one. But this is utterly infeasible for language generation: if your vocabulary has 50,000 tokens and you generate even a 20-token sentence, the number of possible sequences is astronomically high (50,000^20, on the order of 10^94). Exhaustive search is exponentially expensive in sequence length. Thus, practical decoding requires a heuristic search that balances quality and computational cost. This is where beam search comes in—a clever algorithm that lies between greedy and exhaustive search.
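A quick back-of-the-envelope check of that count:

```python
vocab_size = 50_000   # a typical LLM vocabulary size
length = 20           # a short, 20-token sentence
num_sequences = vocab_size ** length
print(f"{num_sequences:.4g}")  # 9.537e+93 candidate sequences
```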

What is beam search?

Beam search is a popular decoding algorithm that improves greedy search by keeping multiple possibilities open at each step. The core idea is that instead of following just one path down the generation tree, we maintain K parallel candidates (where K is called the beam width). Instead of choosing the best next token at each generation step, we look at the top K tokens and branch out. We temporarily explore up to K different continuations for each partial sequence. We then pick the K best resulting sequences (by overall probability) and discard the rest. In effect, beam search expands a beam of the most promising hypotheses at each step, hence the name.

Here’s how beam search works step by step:

  1. Start with the special beginning-of-sequence token (let’s call it <SOS>). This is our initial sequence. The score (overall probability) of this sequence is 1 (or log score 0, since log(1)=0).

  2. Use the model to get probabilities for the next token. Instead of picking just one token, pick the top K tokens. This results in K one-token sequences (each sequence is <SOS> followed by one of these top tokens). We now have K hypotheses in our beam.

  3. For each hypothesis in the beam, ask the model for the probabilities of the next token after that sequence. Each hypothesis branches into new candidates by appending one of its top tokens. If the beam width is K and each hypothesis contributes its top K tokens, this yields up to K^2 candidate sequences; we then keep the K sequences with the highest total scores overall (not K per hypothesis) as the new beam.

  4. The score of a sequence is typically the product of the probabilities of each token in the sequence (or, equivalently, the sum of log probabilities). For practical implementation, we sum log-probabilities to avoid numerical underflow. The “best” sequences have the highest cumulative probability so far.

  5. Continue expanding this way—at each step, expand each beam hypothesis by its top tokens, then prune back to the top K sequences—until you reach a stopping condition. The search usually stops when either (a) an end-of-sequence token (<EOS>) has been produced for all top sequences or (b) a maximum length is reached. Often, beam search continues until each of the K sequences has ended with <EOS> (or hit the maximum length), then returns the highest-scoring completed sequence as the final output.
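The steps above can be sketched as a short function. The `model` dict and its probabilities are hypothetical stand-ins for querying a real LLM; the start-of-sequence context is represented by the empty tuple:

```python
import math

def beam_search(model, k=2, max_len=10):
    # Each hypothesis is (sequence, cumulative log-probability).
    # Start from the empty sequence with log score 0, i.e., probability 1.
    beam = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis in the beam by every candidate token,
        # summing log-probabilities to avoid numerical underflow.
        candidates = []
        for seq, score in beam:
            for token, p in model[seq].items():
                candidates.append((seq + (token,), score + math.log(p)))
        # Prune: keep only the k highest-scoring expansions overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, score in candidates:
            if seq[-1] == "<EOS>":
                finished.append((seq, score))   # hypothesis is complete
            elif len(beam) < k:
                beam.append((seq, score))
        if not beam:  # every surviving hypothesis has ended
            break
    finished.extend(beam)  # include hypotheses cut off at max_len
    best, _ = max(finished, key=lambda c: c[1])
    return best[:-1] if best[-1] == "<EOS>" else best

# Toy probabilities (invented for illustration): greedy would commit to
# 'the' and output 'the cat' (prob 0.20), but a beam of width 2 keeps
# 'a' alive and finds 'a dog' (prob 0.36).
toy = {
    (): {"the": 0.5, "a": 0.4, "cats": 0.1},
    ("the",): {"cat": 0.4, "dog": 0.35, "bird": 0.25},
    ("a",): {"dog": 0.9, "cat": 0.1},
    ("the", "cat"): {"<EOS>": 1.0},
    ("the", "dog"): {"<EOS>": 1.0},
    ("the", "bird"): {"<EOS>": 1.0},
    ("a", "dog"): {"<EOS>": 1.0},
    ("a", "cat"): {"<EOS>": 1.0},
    ("cats",): {"<EOS>": 1.0},
}
print(beam_search(toy, k=2))  # ('a', 'dog')
```

With `k=1` this reduces to greedy search and returns `('the', 'cat')`; a real implementation would also typically take only the top K tokens per hypothesis and apply a length penalty, which this sketch omits.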

By keeping multiple candidates, beam search can “look ahead” and find sequences that greedy search would miss. For example, suppose our model is writing a sentence and at some position the most probable next word is “the” (according to its learned probabilities), but the second-most probable word “a” actually leads to a more fluent sentence overall. Greedy would pick “the” and perhaps get stuck in a less optimal continuation. Beam search with K≥2 would consider both “the” and “a” as candidates at that step. It might find that the continuation after “a” yields a higher total probability in the end, and thus output the sentence with “a,” which is better English in context. In this way, beam search balances exploration and exploitation: it doesn’t commit to one path, but it doesn’t explore everything—just the top few options at each point.
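As noted in step 4 above, implementations sum log-probabilities rather than multiplying raw probabilities. A small sketch of the numerical reason:

```python
import math

# Multiplying hundreds of per-token probabilities underflows to 0.0 in
# floating point, while the equivalent sum of logs stays representable.
probs = [0.1] * 400              # 400 tokens, each with probability 0.1

product = 1.0
for p in probs:
    product *= p
print(product)                   # 0.0, underflowed below float range

log_score = sum(math.log(p) for p in probs)
print(log_score)                 # approx. -921.03, i.e., log(10**-400)
```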
