The Illustrated Attention via Einstein Summation

An introduction to einsum through the operations of the attention mechanism.

This blog lays the groundwork for a series of deep-dive articles on transformers. We briefly introduce Einstein summation (einsum), also known as the generalized tensor product, which provides a convenient framework for thinking about how tensors interact. With einsum notation, we can see what each operation does without worrying about implementation details such as which axes to transpose or permute. If you have not encountered it before, it may take some time to become comfortable with, but it can change how you think about tensor operations and make them much easier to reason about in the long run. For a more detailed write-up on einsum, check out Einsum Is All You Need.

Notation

This section explains the notation that will be used in the following discussion.

Tensor Operations

In this section, we develop an intuition for what different einsum operations mean; this will pay off later when we dig into the attention mechanism. We will see that many familiar operations, such as matrix multiplication and the dot product, can be described neatly with einsum.
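As a quick illustration with NumPy's einsum (the array shapes below are arbitrary), the dot product and matrix multiplication look like this:

```python
import numpy as np

# Dot product: the shared axis i is summed over, leaving a scalar.
a = np.random.rand(4)
b = np.random.rand(4)
dot = np.einsum("i,i->", a, b)       # equivalent to a @ b

# Matrix multiplication: the shared axis j is summed over.
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.einsum("ij,jk->ik", A, B)     # equivalent to A @ B

assert np.allclose(dot, a @ b)
assert np.allclose(C, A @ B)
```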

Einsum

We will use the notation \(C = \langle A, B\rangle: \langle \text{shape}_A, \text{shape}_B \rangle \to \text{shape}_C\) to denote the Einstein summation of \(A\) and \(B\), where \(\text{shape}_A\) and \(\text{shape}_B\) are the shapes of the inputs and \(\text{shape}_C\) is the shape of the result.
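One natural way to read this notation, consistent with the usual einsum convention that repeated indices absent from the output are summed over, is to write ordinary matrix multiplication as

\[
C = \langle A, B \rangle: \langle (i, j), (j, k) \rangle \to (i, k),
\qquad
C_{ik} = \sum_{j} A_{ij} B_{jk}.
\]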

Einsum Examples
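Here are a few more common operations written as np.einsum calls; the shapes are chosen purely for illustration:

```python
import numpy as np

x = np.random.rand(2, 3)       # shape (i, j)
y = np.random.rand(2, 3)       # shape (i, j)
M = np.random.rand(2, 5, 3)    # shape (batch, m, d)
N = np.random.rand(2, 3, 7)    # shape (batch, d, k)

outer    = np.einsum("i,j->ij", x[0], y[0])    # outer product of two vectors
row_sum  = np.einsum("ij->i", x)               # sum over axis j
hadamard = np.einsum("ij,ij->ij", x, y)        # elementwise product, nothing summed
batched  = np.einsum("bmd,bdk->bmk", M, N)     # batched matrix multiplication
```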

Multi-Head Attention

For a detailed understanding of the GPT architecture, I recommend The Illustrated GPT-2, The GPT Architecture on a Napkin, and Let’s build GPT: from scratch, in code, spelled out.

We describe attention in two stages. Given inputs with batch size \(b\) and \(m\) tokens, we first perform the context computation to obtain the key and value tensors; these are then reused during incremental decoding, where one new token at a time attends to the stored context.

Figure 1: Attention via Einsum

Context Computation
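A minimal sketch of the context computation in NumPy, assuming \(h\) heads, model width \(d\), per-head width \(k\), and key/value projection weights w_k and w_v of shape (h, d, k); the variable names and shapes are illustrative, not taken from the figure:

```python
import numpy as np

b, m, d = 2, 6, 16      # batch size, number of context tokens, model width
h, k = 4, 8             # number of heads, per-head width

x   = np.random.rand(b, m, d)    # input token representations
w_k = np.random.rand(h, d, k)    # key projection, one (d, k) matrix per head
w_v = np.random.rand(h, d, k)    # value projection, one (d, k) matrix per head

# Project every token into per-head key and value spaces:
# <(b, m, d), (h, d, k)> -> (b, h, m, k), contracting the shared model axis d.
K = np.einsum("bmd,hdk->bhmk", x, w_k)
V = np.einsum("bmd,hdk->bhmk", x, w_v)

print(K.shape, V.shape)    # (2, 4, 6, 8) (2, 4, 6, 8)
```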

Incremental Decoding
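And a matching sketch of a single incremental decoding step: one new token is projected to a per-head query, scored against the cached keys, and used to take a weighted sum of the cached values. This is plain scaled dot-product attention with illustrative projection weights w_q and w_o; it is a sketch of the standard computation, not necessarily the exact formulation behind Figure 1.

```python
import numpy as np

b, m, d = 2, 6, 16      # batch size, cached context length, model width
h, k = 4, 8             # number of heads, per-head width

# Cached keys and values produced by the context computation above.
K = np.random.rand(b, h, m, k)
V = np.random.rand(b, h, m, k)

x_new = np.random.rand(b, d)       # the single token being decoded
w_q   = np.random.rand(h, d, k)    # query projection
w_o   = np.random.rand(h, k, d)    # output projection back to model width

# Project the new token into per-head query space: <(b, d), (h, d, k)> -> (b, h, k).
q = np.einsum("bd,hdk->bhk", x_new, w_q)

# Attention scores against all m cached keys: <(b, h, k), (b, h, m, k)> -> (b, h, m).
scores = np.einsum("bhk,bhmk->bhm", q, K) / np.sqrt(k)

# Softmax over the token axis m.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Weighted sum of cached values: <(b, h, m), (b, h, m, k)> -> (b, h, k).
attended = np.einsum("bhm,bhmk->bhk", weights, V)

# Merge heads and project back to model width: <(b, h, k), (h, k, d)> -> (b, d).
out = np.einsum("bhk,hkd->bd", attended, w_o)
print(out.shape)    # (2, 16)
```

Note that every step is a single einsum whose index string spells out exactly which axes are contracted, which is the whole appeal of the notation.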