
Awesome summary by someone who actually read and understood the paper.

> One way of increasing the density of the attention layers is to add more context. They show that simply prepending any token as context to a question makes the LLM perform better. Adding relevant context makes it even better.

Right, I think a more intuitive way to think about this is to define density as the number of _edges_ in the self-attention graph connecting tokens. Or, put more simply: the number of times a token had some meaningful connection to another token, divided by the number of tokens (there's a rough sketch of this at the end of the comment). So tokens which actually relate to one another and provide information are good, and non sequitur tokens shouldn't help, except that you say

> They show that simply prepending any token as context to a question makes the LLM perform better.

I think this is not quite right. What they found was:

> pre-pending the question at hand with any type of token does increase the intrinsic dimension at the first layer

> however, this increase is not necessarily correlated with the reasoning capability of the model

but it is only

> when the pre-pended tokens lead to an increase in the intrinsic dimension at the *final layer* of the model, the reasoning capabilities of the LLM improve significantly.

(emphasis mine)
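For what it's worth, here's a minimal sketch of the informal "density" notion I mentioned above, just to make it concrete. The `threshold` and the idea of binarizing attention weights into "edges" are my own assumptions for illustration; the paper itself measures intrinsic dimension of the representations, not this count.

```python
import numpy as np

def attention_density(attn, threshold=0.05):
    """Informal 'density' of a self-attention map: edges per token.

    attn: (n_tokens, n_tokens) attention weights for one head/layer,
          rows summing to 1 (post-softmax).
    threshold: weight above which a token pair counts as an 'edge'
               (arbitrary choice, not from the paper).
    """
    n = attn.shape[0]
    # Count token pairs whose attention weight clears the threshold,
    # ignoring the trivial self-edges on the diagonal.
    edges = (np.count_nonzero(attn > threshold)
             - np.count_nonzero(np.diag(attn) > threshold))
    return edges / n

# Toy example: in practice you would average this over heads/layers and
# compare a question with vs. without relevant prepended context.
rng = np.random.default_rng(0)
toy = rng.dirichlet(np.ones(10), size=10)   # fake 10x10 attention map
print(attention_density(toy))
```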
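And to make "intrinsic dimension at a layer" concrete: it's estimated from the hidden states at that layer. I don't recall exactly which estimator the paper uses, but a common choice (an assumption on my part) is the TwoNN estimator of Facco et al., which only needs each point's two nearest neighbors:

```python
import numpy as np
from scipy.spatial.distance import cdist

def two_nn_intrinsic_dimension(hidden_states):
    """TwoNN estimate of intrinsic dimension (Facco et al., 2017).

    hidden_states: (n_tokens, d_model) activations from one layer.
    Returns the MLE d_hat = n / sum(log(r2 / r1)), where r1 and r2 are
    each point's first- and second-nearest-neighbor distances.
    """
    dists = cdist(hidden_states, hidden_states)
    np.fill_diagonal(dists, np.inf)       # ignore self-distances
    sorted_d = np.sort(dists, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]
    mu = r2 / r1
    return len(mu) / np.sum(np.log(mu))

# Comparing this value at the first vs. final layer, with and without
# prepended tokens, is roughly the comparison the quoted passage describes.
```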




Thanks, good catch, got distracted by the editing flaws at the end there (they rewrote a section without removing the old one).



