Mingwu Zheng
The phenomenon known as Attention Sink was first observed in the StreamingLLM paper, which reported that deeper transformer layers increasingly concentrate their attention on the first token. However, that paper did not explain why this happens.
Currently, no paper has provided a fully convincing explanation for why attention sink occurs. An empirical study from last year titled "When Attention Sink Emerges" (referred to as “Empirical View” below) analyzed several patterns related to the phenomenon.
Here, we propose a hypothesis to explain all experimental phenomena related to Attention Sink:
Hypothesis: Attention Sink arises because transformers inherently require a Context-Aware Identity Layer. In other words, an attention block needs the ability to leave its input unchanged (an identity mapping) when the context calls for it.
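To pin down what “identity” means here, a minimal sketch in standard residual-stream notation (this formulation is mine, not taken from either paper): the block's output at a position with input $x$ is

$$
x_{\text{out}} = x + \mathrm{Attn}(x), \qquad \mathrm{Attn}(x) = \sum_i a_i v_i, \quad a_i \ge 0,\ \ \sum_i a_i = 1,
$$

where the $a_i$ are softmax attention weights and the $v_i$ are value vectors. The block therefore acts as an identity exactly when its attention output satisfies $\mathrm{Attn}(x) \approx 0$; and since the weights must sum to one, the simplest way to get a near-zero output is to place the attention mass on a value vector that is itself near zero.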
Figure 2 from https://arxiv.org/pdf/2410.10781
Observations in "Empirical View" indicate that the first token's value vector has a very small norm. Specifically, the first row, third column figure from the referenced experiment clearly shows this. Even when attended, the first token contributes very little to the final result. This raises the question: Why would the model attend to a token with near-zero value?
To achieve an identity transformation via attention blocks: