"This is more computationally efficient than performing a full content-based lookup across an entire memory buffer for each step in the future, and could be one step towards drastically increasing the context-length available for making a prediction."
Is this how they get a context window of 10 million tokens? Or are they referring to even longer context windows in the future?
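For intuition on the cost gap the quote is pointing at: a full content-based lookup scores the query against every slot in the memory buffer at each step (O(M·d)), while restricting attention to a small retrieved subset is O(k·d). The sketch below is purely illustrative; the sizes, names, and the top-k selection (standing in for whatever approximate index the paper actually uses) are all assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, k = 64, 10_000, 32  # feature dim, memory size, retrieved subset (illustrative)

memory = rng.standard_normal((M, d))  # hypothetical memory buffer
query = rng.standard_normal(d)

# Full content-based lookup: score every memory slot each step -> O(M * d)
scores_full = memory @ query

# Cheaper alternative: attend only over a small retrieved subset -> O(k * d).
# Here top-k of the full scores stands in for an approximate retrieval index;
# a real system would avoid computing scores_full at all.
topk = np.argpartition(scores_full, -k)[-k:]
scores_sparse = memory[topk] @ query

print(scores_full.shape, scores_sparse.shape)
```

The per-step saving (M versus k scored slots) is what makes much longer effective contexts plausible, since the full buffer no longer has to be touched on every prediction step.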