> Any time you see one of those giant context window LLMs, you need to be asking what heuristics they added, what is getting correlated, and what is not getting correlated.
Exactly. The paper doesn't even contain any experiments with context windows over 32K tokens. Presumably that's because the model doesn't really attend to tokens beyond that range at all. In practice it's just a 32K attention window with some "theoretical" opportunity for information to propagate a bit further than that.
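To make the "32K window" point concrete: a minimal sketch of a sliding-window causal attention mask (an illustration of the general technique, not the paper's actual mechanism). Each query position can only attend to the most recent `window` key positions; anything older is simply invisible to direct attention, and longer-range "reach" only happens indirectly, layer by layer.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean causal mask: query i may attend to key j only if
    j <= i (causality) and j is within the last `window` positions."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

# Toy numbers: with a window of 32, token 100 sees only keys 69..100.
mask = sliding_window_mask(128, 32)
print(int(mask[100].sum()))   # 32 keys directly visible
print(bool(mask[100, 50]))    # False: key 50 is outside the window
```

With a stack of L such layers, information can in principle travel up to L × window positions, which is the "theoretical" longer reach mentioned above; direct attention never crosses the window boundary.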