Multi-Head Self-Attention
Learn how transformers capture multiple relationships in parallel.
We saw how self-attention finds relationships between words in a sequence — for example, linking “love” more strongly with “I” and “you” than with “Hello.”
But here’s the thing: a single self-attention layer focuses on one set of relationships at a time. Language is richer than that.
A sentence might have:
- Grammatical dependencies (“I” → “love”) 
- Semantic connections (“love” ↔ “you”) 
- Positional cues (“Hello” at the start indicates a greeting) 
A single attention “head” might latch onto one of these, but we want our model to notice all of them at once.
Why multiple heads?
Multi-head self-attention runs several self-attention operations in parallel, each with its own learnable projection of Queries, Keys, and Values.
Think of it as giving the model multiple sets of eyes — one head might pay attention to subject–verb links, another to nearby words, another to long-distance context.
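To make this concrete, here is a minimal NumPy sketch of the scaled dot-product attention that each head runs internally. The function names `softmax` and `scaled_dot_product_attention` are our own for illustration, not part of a specific library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors
```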
Here’s the process (sketched in code right after the list):
- Project the input embeddings into multiple smaller spaces — one set for each head. 
- Apply self-attention independently in each head. 
- Concatenate the outputs from all heads. 
- Project back into the original embedding size. 
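The four steps map directly onto code. Below is a hedged NumPy sketch that builds on `scaled_dot_product_attention` above; the projection matrices are initialized randomly here purely for illustration, whereas a real transformer learns them during training.

```python
def multi_head_self_attention(X, num_heads, seed=0):
    """X: (seq_len, d_model) input embeddings. Returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must split evenly across heads"
    d_k = d_model // num_heads

    # Randomly initialized projections stand in for learned parameters.
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(num_heads, d_model, d_k))
    W_k = rng.normal(size=(num_heads, d_model, d_k))
    W_v = rng.normal(size=(num_heads, d_model, d_k))
    W_o = rng.normal(size=(num_heads * d_k, d_model))

    head_outputs = []
    for i in range(num_heads):
        # 1. Project the input into this head's smaller subspace.
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]
        # 2. Apply self-attention independently in each head.
        head_outputs.append(scaled_dot_product_attention(Q, K, V))

    # 3. Concatenate the outputs from all heads: (seq_len, num_heads * d_k).
    concat = np.concatenate(head_outputs, axis=-1)
    # 4. Project back into the original embedding size.
    return concat @ W_o
```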
Mathematically:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O
$$

where each head is:

$$
\text{head}_i = \text{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)
$$

Here, $W_i^Q$, $W_i^K$, and $W_i^V$ are the learnable projection matrices that map the input into head $i$'s smaller subspace, and $W^O$ is the output projection that maps the concatenated heads back to the original embedding size.
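As a quick sanity check under the same assumptions (a toy 4-token sequence with `d_model = 8` and 2 heads), the sketch above preserves the input shape, just as the $W^O$ projection does in the formula:

```python
X = np.random.default_rng(42).normal(size=(4, 8))  # 4 tokens, d_model = 8
out = multi_head_self_attention(X, num_heads=2)
print(out.shape)  # (4, 8): same shape as the input, ready for the next layer
```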