Positional Encodings

Learn how positional encodings enable Transformers to understand word order by injecting sequence information using absolute or relative methods.

Interviewers at top AI labs often ask about positional encodings in Transformers because the question probes a fundamental understanding of how sequence models work. The Transformer architecture, the basis for models like BERT and GPT, uses neither recurrence nor convolution. That means it has no built-in notion of word order, unlike an RNN, which processes tokens one at a time in sequence. Without an extra signal, a Transformer would treat a sentence as a “bag of words.” For example, the sentences “John likes cats” and “Cats like John” would look identical to the model, even though their meanings are very different. Positional encodings are the mechanism for injecting order information into the Transformer.

Interviewers want to know whether you understand why this is necessary and how the two main approaches, absolute and relative positional encodings, differ. A strong answer explains that positional encoding tells the Transformer where each token sits in the sequence (allowing it to distinguish the first word from the second, and so on) and shows awareness of the trade-offs and modern variants of the idea. A clear response demonstrates that you grasp both the intuition and the mechanics. The interviewer is checking: Can you articulate why a parallel-attention model needs positional information? Can you explain how absolute encodings (like the original sinusoidal scheme) work versus relative encodings (which capture the distances between tokens)? Can you discuss why one might choose one method over the other in practice?
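To make that contrast concrete, here is a minimal NumPy sketch (not from the original article; the function names are illustrative) of the sinusoidal absolute encoding from “Attention Is All You Need” next to the pairwise-offset matrix that relative schemes operate on:

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). One fixed vector per position."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

def relative_distances(seq_len: int) -> np.ndarray:
    """Relative schemes instead look at the offset j - i between token pairs,
    e.g. turning it into a learned bias added to the attention logits."""
    idx = np.arange(seq_len)
    return idx[None, :] - idx[:, None]                 # (seq_len, seq_len) offset matrix

print(sinusoidal_encoding(4, 8).round(2))   # a unique vector for each absolute position
print(relative_distances(4))                # the distances a relative method would encode
```

The key difference is visible in the shapes: absolute methods attach one vector per position to the input, while relative methods work with a position-pair quantity inside attention itself.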

A top candidate will also mention variations (learned vs. fixed embeddings, rotary or bias-based methods) and relate them to real tasks. Finally, even though new architectures such as Mixture-of-Experts, RWKV (an RNN that trains in parallel like a Transformer while performing well as an LLM), and Mamba are emerging, understanding Transformers is still critical. Many LLMs still use Transformers (some with relative encodings, as in T5, others with absolute encodings, as in GPT and BERT), and new models still need some way to capture sequence order.

What exactly is positional encoding?

Transformers process all tokens in parallel through self-attention, so they have no inherent information about sequence order. Positional encoding is the extra signal we add to tell the model where each word sits in the sequence. In practice, we assign each position in the input (1st word, 2nd word, etc.) a unique vector and add it to the token’s word embedding. You can think of it as giving each word a timestamp or a coordinate. For example, consider the French sentence “Je suis étudiant” (“I am a student”). Before feeding it into a Transformer encoder, we take the embedding of each word and add a positional vector that encodes that word’s position in the sentence.
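Here is a minimal sketch of that “embedding plus positional vector” step for the same sentence. The random embedding table and the choice of d_model = 8 are illustrative assumptions, not real model weights:

```python
import numpy as np

d_model = 8
tokens = ["Je", "suis", "étudiant"]

# Stand-in for a learned embedding table (random values, assumption for the demo).
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(len(tokens), d_model))

# Sinusoidal positional vectors, one per position (same formula as in the sketch above).
pos = np.arange(len(tokens))[:, None]
dims = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, dims / d_model)
pos_encoding = np.zeros((len(tokens), d_model))
pos_encoding[:, 0::2] = np.sin(angles)
pos_encoding[:, 1::2] = np.cos(angles)

# The encoder input is the element-wise sum: each row now carries both
# "what the token is" (embedding) and "where it is" (positional vector).
encoder_input = word_embeddings + pos_encoding
print(encoder_input.shape)   # (3, 8)
```

Because the positional vectors differ from position to position, identical words at different positions produce different inputs, which is exactly what lets attention distinguish them.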
