The Transformer Architecture

Learn about the inner workings of LLMs.

Let’s start with a basic question: How can a computer understand and generate text? Over the years, we’ve relied on various neural network structures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to tackle language problems. Then a new architecture arrived: the transformer. It revolutionized the field so dramatically that most cutting-edge large language models (LLMs) today, including GPT, BERT, and T5, are built with some variation of the transformer.

That said, these models do not all use the same transformer layout:

  • GPT (Generative Pre-trained Transformer) is a decoder-only model.

  • BERT is an encoder-only model.

  • T5 (and many other text-to-text models) still employs a full encoder-decoder approach.

The original transformer architecture, visualized

In older RNN-based architectures, words were fed into the network one at a time, which slowed down training and inference. Transformers instead process all the tokens of an input sequence (roughly, its words or word pieces) in parallel, which makes them much faster to train and easier to scale. But if we process everything at once, how do we preserve the order of the words? Positional embeddings solve this by adding a small “position” vector to each token’s embedding. The original transformer used fixed sinusoidal vectors for this, while models such as BERT and GPT learn them during training. You can think of them as little tags that tell the model, “This is word #1, this is word #2,” and so on.
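To make the position-tagging idea concrete, here is a minimal sketch in plain Python. It assumes lookup-table embeddings whose values are random stand-ins for trained weights. The names `make_table` and `embed_with_positions`, along with the toy sizes, are hypothetical and chosen only for illustration; real models use much larger tables and a tensor library.

```python
import random

def make_table(num_rows, dim, seed):
    """Build a num_rows x dim table of small random values.

    Assumption: random numbers stand in for weights a real model would learn.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed_with_positions(token_ids, token_table, position_table):
    """Add the vector for position i to the embedding of the token at position i."""
    if len(token_ids) > len(position_table):
        raise ValueError("sequence is longer than the position table")
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical toy sizes: 10 known tokens, sequences up to 8 long, 4 numbers per vector.
VOCAB_SIZE, MAX_LEN, DIM = 10, 8, 4
token_table = make_table(VOCAB_SIZE, DIM, seed=0)
position_table = make_table(MAX_LEN, DIM, seed=1)

# Token id 3 appears twice, at positions 0 and 2.
vectors = embed_with_positions([3, 7, 3], token_table, position_table)

# The two copies of token 3 share a token embedding but receive different
# position vectors, so the model sees them as distinct inputs.
print(vectors[0] != vectors[2])
```

Because the position vector is simply added to the token vector, the whole sequence can still be processed in parallel; the order information travels inside each vector rather than in the order of computation.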

One of the most powerful features of transformers is the attention mechanism, especially self-attention. This mechanism helps the model figure out which parts of a sentence matter most in relation to each ...
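As a rough illustration of how self-attention weighs tokens against each other, the sketch below implements scaled dot-product attention in plain Python. It is a deliberate simplification: each token vector serves as its own query, key, and value, whereas real transformers first pass the inputs through learned projection matrices. The function names and the tiny two-dimensional vectors are hypothetical, made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries = keys = values = the input vectors.

    Assumption: no learned projections, unlike a real transformer layer.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token (including itself).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "tokens": the first two point in similar directions, the third does not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. In this toy input, the first token gives more weight to the second token than to the third, because their vectors are more alike.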
