In transformers there are two kinds of attention: self-attention and cross-attention. Cross-attention is the part on the left-hand side of the transformer diagram below that is often not talked about, because it is not in GPTs.

![[Pasted image 20250413154851.png]]

**Self-Attention**

The usual "next token" prediction task performed by GPTs (such as ChatGPT) only uses the right-hand side of the image (self-attention). These models are also known as "decoder-only" architectures. Self-attention operates over a single sequence: each token "looks at" (attends to) every token that comes before it in the sequence and updates its hidden state. We will call this sequence the "output" sequence.

**Cross-Attention**

Cross-attention (used in sequence-to-sequence models) introduces a new sequence (which we will call the "input" sequence) that is separate from the sequence you are trying to generate. This input sequence is encoded with a bidirectional transformer encoder (such as BERT). Cross-attention then lets the output sequence attend to the encoded tokens of the input sequence: the output sequence provides the "query" matrix, while the input sequence provides the "key" and "value" matrices.

### You might be wondering: *I can replicate cross-attention with a purely decoder-only architecture by just stuffing everything into the prompt. When and why should I use cross-attention?*

Reasons to use cross-attention:

- You want to cleanly separate the input from the output, and the input stays fixed over the generation task
- You want to save compute by encoding the long input only once
- You want more flexibility and modularity to swap out the encoder
- You want a form of long-term memory

# Step-by-Step Example

## STEP 0: Inputs

**Encoder "input" text:** "The quick brown fox jumped over the lazy dog"

**Decoder "output" initial prompt:** "Summarise this story for me"

## **STEP 1: Tokenize both sequences**

Each sentence gets turned into tokens. For simplicity, let’s assume subword tokenization results in:

- **Input tokens (encoder)**: [The, quick, brown, fox, jumped, over, the, lazy, dog] → 9 tokens
- **Target tokens (decoder)**: [Summarise, this, story, for, me] → 5 tokens (all available at once during teacher forcing, or decoded one by one during generation)

## **STEP 2: Encode the input sequence**

Each input token is embedded and passed through the **encoder** (i.e., the “document processor”). The encoder outputs a **sequence of hidden states**, one per token:

`Encoder output = [h_enc₁, h_enc₂, ..., h_enc₉] ← shape: (9, d_model)`

These will later be used as **keys** and **values** in cross-attention.
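To make the shapes concrete, here is a minimal PyTorch sketch of STEPS 0–2. The token ids, `d_model = 16`, `vocab_size = 100`, and the tiny two-layer encoder are toy assumptions standing in for a real subword tokenizer and a BERT-style encoder; the only point is that the input is encoded once into a `(9, d_model)` block of hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16          # toy model width (assumption; real models use 512+)
vocab_size = 100      # toy vocabulary size (assumption)

# Pretend tokenizer output for "The quick brown fox jumped over the lazy dog":
# 9 arbitrary token ids standing in for a real subword tokenizer.
input_ids = torch.tensor([[4, 17, 32, 8, 55, 21, 4, 61, 9]])   # shape: (1, 9)

# Embed the input tokens.
embed = nn.Embedding(vocab_size, d_model)

# A small bidirectional encoder stack (no causal mask, so every token sees every other).
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# One forward pass over the whole input: this is done ONCE and then reused.
encoder_output = encoder(embed(input_ids))                      # shape: (1, 9, d_model)

print(encoder_output.shape)   # torch.Size([1, 9, 16]) -> [h_enc1, ..., h_enc9]
```

In a real encoder-decoder model this encoder pass happens once per input document, no matter how many output tokens are generated afterwards, which is where the "encode the long input only once" saving comes from.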
## **🔄 STEP 3: Start decoding (self + cross attention)**

Now we start generating the output sequence, _one token at a time_.

Let’s say we’re at position 4 in the decoder, and the model has already generated:

> "Summarise this story"

So the decoder’s input so far is: [Summarise, this, story]

### **🔹 Self-Attention (within decoder)**

The decoder first performs **masked self-attention** over its own past tokens:

- `"Summarise" attends only to itself`
- `"this" attends to "Summarise" (and itself)`
- `"story" attends to "Summarise", "this" (and itself)`

This gives contextualized hidden states like:

`Decoder hidden = [h_dec₁, h_dec₂, h_dec₃]`

These are used to:

- Predict the next token ("for")
- Serve as **queries** in cross-attention

### **🔸 Cross-Attention (decoder to encoder)**

Each decoder hidden state (h_dec₁, h_dec₂, h_dec₃) now attends over the full encoder output:

- `h_dec₁ attends to all 9 tokens in the input: [h_enc₁, ..., h_enc₉]`
- `h_dec₂ attends to all 9 tokens in the input`
- `h_dec₃ attends to all 9 tokens in the input`

**Cross-attention outputs** are then combined with the self-attention outputs to produce the final decoder representation at each position.

## **🧾 STEP 4: Output prediction**

Finally, the decoder’s latest hidden state (for “story”) goes through a linear layer + softmax to predict the **next token**:

> Model predicts "for" with high probability.

Then you feed "for" into the decoder and repeat.
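And a matching sketch of STEPS 3–4, continuing the code above (it reuses `encoder_output`, `d_model`, and `vocab_size`). The decoder token ids are again made up, the layers are untrained, and the residual connections, layer norms, and feed-forward blocks of a real decoder layer are left out; the point is only the Q/K/V wiring: masked self-attention takes queries, keys, and values from the decoder states, while cross-attention takes its queries from the decoder and its keys and values from the encoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Continuing the sketch above: encoder_output has shape (1, 9, d_model).

# Decoder input so far: toy ids standing in for [Summarise, this, story].
decoder_ids = torch.tensor([[7, 12, 40]])                       # shape: (1, 3)
dec_embed = nn.Embedding(vocab_size, d_model)
h = dec_embed(decoder_ids)                                      # shape: (1, 3, d_model)

# --- Masked self-attention: the decoder attends to its own past tokens ---
self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)  # True = blocked (future positions)
h_self, _ = self_attn(h, h, h, attn_mask=causal_mask)           # shape: (1, 3, d_model)

# --- Cross-attention: decoder states are queries, encoder outputs are keys/values ---
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
h_cross, cross_weights = cross_attn(
    query=h_self,                 # (1, 3, d_model)  <- h_dec1..h_dec3
    key=encoder_output,           # (1, 9, d_model)  <- h_enc1..h_enc9
    value=encoder_output,
)
print(cross_weights.shape)        # (1, 3, 9): each decoder position attends over all 9 input tokens

# --- STEP 4: predict the next token from the latest position ("story") ---
to_vocab = nn.Linear(d_model, vocab_size)
logits = to_vocab(h_cross[:, -1, :])                            # (1, vocab_size)
next_token_probs = F.softmax(logits, dim=-1)
next_token_id = next_token_probs.argmax(dim=-1)                 # in a trained model, the id for "for"
```

Since nothing here is trained, the predicted id is arbitrary; in a trained model this is where "for" would come out with high probability, and the new token would be appended to the decoder input for the next step.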