I would imagine that the "attention span" (context window) of LLMs could get longer over time as more resources are dedicated to them.
e.g. we are seeing the equivalent of movies that are 5 minutes long b/c they were hand animated. Once we move to computer animated movies, it becomes a lot easier to generate an entire film.
I agree they will get longer. ChatGPT's (GPT-3.5) context window is 2x larger than GPT-3's: 8192 tokens vs. 4096.
The problem is that in the existing transformer architecture, attention scales as O(N^2) in the context length. Making the context window 10x larger means roughly 100x more memory and compute.
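To make the quadratic cost concrete, here's a toy NumPy sketch (not any particular model's code) showing that the attention score matrix alone is N×N, so doubling the context quadruples that allocation:

```python
import numpy as np

def attention_scores(q, k):
    # q, k: (N, d) arrays; the score matrix is (N, N),
    # so memory for it grows with the square of the context length N.
    return q @ k.T / np.sqrt(q.shape[-1])

for n in (1024, 2048, 4096):
    q = np.random.randn(n, 64).astype(np.float32)
    scores = attention_scores(q, q)
    print(n, scores.nbytes / 1e6, "MB")  # ~4 MB -> ~16 MB -> ~64 MB
```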
We'll either need a new architecture that improves upon the basic transformer, or just wait for Moore's law to paper over the problem for the scales we care about.
In the short term, you can also use the basic transformer with a combination of other techniques to find the relevant things to put into the context window. For instance, I ask "Does Harry Potter know the foobaricus spell?" and an external system uses a more traditional search technique to find all sentences relevant to the query in the novels, plus maybe a few-paragraph summary of each novel, then feeds that ~1 page of data to GPT to answer the question. A sketch of that flow is below.
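Here's a rough sketch of that retrieve-then-prompt idea. `search_corpus()` and `ask_llm()` are hypothetical stand-ins for a real search index (e.g. BM25 or embeddings) and a real LLM API call, not a known system's internals:

```python
def answer_with_retrieval(question, corpus, summaries, search_corpus, ask_llm):
    # Pull the handful of sentences most relevant to the question.
    relevant = search_corpus(corpus, query=question, top_k=20)
    # Assemble ~1 page of context: per-novel summaries plus search hits.
    context = "\n".join(summaries + relevant)
    prompt = (
        "Using only the context below, answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```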
This is speculation based on a few longer chats I've had, but I think ChatGPT does some text summarization (similar to the method used to name your chats) to fit more into the token window.
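If it does, it might look something like the rolling summarization below. This is purely a guess; `summarize()` and `count_tokens()` are hypothetical helpers, not a known ChatGPT internal:

```python
def fit_history(history, limit, summarize, count_tokens):
    # Once the transcript outgrows the window, replace the oldest
    # turns with a short summary and keep the recent turns verbatim.
    while count_tokens(history) > limit and len(history) > 2:
        summary = summarize(history[:2])
        history = [summary] + history[2:]
    return history
```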