Multi-Head Self-Attention
Learn how transformers capture multiple relationships in parallel.
We saw how self-attention finds relationships between words in a sequence — for example, linking “love” more strongly with “I” and “you” than with “Hello.”
But here’s the thing: a single self-attention layer focuses on one set of relationships at a time. Language is richer than that.
A sentence might have:
- Grammatical dependencies (“I” → “love”) 
- Semantic connections (“love” ↔ “you”) 
- Positional cues (“Hello” at the start indicates a greeting) 
A single attention “head” might latch onto one of these, but we want our model to notice all of them at once.
Why multiple heads?
Multi-head self-attention runs several self-attention operations in parallel, each with its own learnable projection of Queries, Keys, and Values.
Think of it as giving the model multiple sets of eyes — one head might pay attention to subject–verb links, another to nearby words, another to long-distance context.
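To make this concrete, here is a minimal NumPy sketch of the scaled dot-product attention that each head runs internally. The function names `softmax` and `scaled_dot_product_attention` are our own for illustration, not part of a specific library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors
```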
Here’s the process (sketched in code right after the list):
- Project the input embeddings into multiple smaller spaces — one set for each head. 
- Apply self-attention independently in each head. 
- Concatenate the outputs from all heads. 
- Project back into the original embedding size. 
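The four steps map directly onto code. Below is a hedged NumPy sketch that builds on `scaled_dot_product_attention` above; the projection matrices are initialized randomly here purely for illustration, whereas a real transformer learns them during training.

```python
def multi_head_self_attention(X, num_heads, seed=0):
    """X: (seq_len, d_model) input embeddings. Returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must split evenly across heads"
    d_k = d_model // num_heads

    # Randomly initialized projections stand in for learned parameters.
    rng = np.random.default_rng(seed)
    W_q = rng.normal(size=(num_heads, d_model, d_k))
    W_k = rng.normal(size=(num_heads, d_model, d_k))
    W_v = rng.normal(size=(num_heads, d_model, d_k))
    W_o = rng.normal(size=(num_heads * d_k, d_model))

    head_outputs = []
    for i in range(num_heads):
        # 1. Project the input into this head's smaller subspace.
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]
        # 2. Apply self-attention independently in each head.
        head_outputs.append(scaled_dot_product_attention(Q, K, V))

    # 3. Concatenate the outputs from all heads: (seq_len, num_heads * d_k).
    concat = np.concatenate(head_outputs, axis=-1)
    # 4. Project back into the original embedding size.
    return concat @ W_o
```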
Mathematically:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O
$$

where each head is:

$$
\text{head}_i = \text{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)
$$

Here, $W_i^Q$, $W_i^K$, and $W_i^V$ are the learnable projection matrices that map the input into head $i$'s smaller subspace, and $W^O$ is the output projection that maps the concatenated heads back to the original embedding size.
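As a quick sanity check under the same assumptions (a toy 4-token sequence with `d_model = 8` and 2 heads), the sketch above preserves the input shape, just as the $W^O$ projection does in the formula:

```python
X = np.random.default_rng(42).normal(size=(4, 8))  # 4 tokens, d_model = 8
out = multi_head_self_attention(X, num_heads=2)
print(out.shape)  # (4, 8): same shape as the input, ready for the next layer
```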