Mingwu Zheng
The phenomenon known as Attention Sink was first observed in the StreamingLLM paper, which reported that deeper transformer layers increasingly concentrate their attention on the first token. However, that paper did not explain why this happens.
Currently, no paper has provided a fully convincing explanation for why attention sink occurs. An empirical study from last year titled "When Attention Sink Emerges" (referred to as “Empirical View” below) analyzed several patterns related to the phenomenon.
Here, we propose a hypothesis to explain all experimental phenomena related to Attention Sink:
Hypothesis: Attention Sink arises because transformers inherently require a Context-Aware Identity Layer. In other words, an attention block needs the ability to leave its input unchanged (an identity mapping) when the context calls for it.
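To pin down what “identity” means here, a minimal sketch in standard residual-stream notation (this formulation is mine, not taken from either paper): the block's output at a position with input $x$ is

$$
x_{\text{out}} = x + \mathrm{Attn}(x), \qquad \mathrm{Attn}(x) = \sum_i a_i v_i, \quad a_i \ge 0,\ \ \sum_i a_i = 1,
$$

where the $a_i$ are softmax attention weights and the $v_i$ are value vectors. The block therefore acts as an identity exactly when its attention output satisfies $\mathrm{Attn}(x) \approx 0$; and since the weights must sum to one, the simplest way to get a near-zero output is to place the attention mass on a value vector that is itself near zero.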
Figure 2 from https://arxiv.org/pdf/2410.10781
Observations in "Empirical View" indicate that the first token's value vector has a very small norm. Specifically, the first row, third column figure from the referenced experiment clearly shows this. Even when attended, the first token contributes very little to the final result. This raises the question: Why would the model attend to a token with near-zero value?
To achieve an identity transformation via attention blocks: