
It's not. It can do in-context learning, which Markov chains cannot do.



It is a Markov Chain on the state space {Tokens}^CtxWindow.
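To spell that out, here is a toy sketch (all names and sizes here are invented) of why the sampling loop is a Markov chain once you take the whole window as the state: the distribution of the next state depends only on the current window, never on tokens that have already fallen out of it.

    # Toy sketch: autoregressive sampling as a Markov chain on {tokens}^CTX.
    # next_token_weights is a hypothetical stand-in for the model; it only ever
    # sees the current window, so the next state depends on the current state alone.
    import random

    VOCAB, CTX = 10, 4   # toy vocabulary size and context window

    def next_token_weights(window):
        # stand-in for an LLM forward pass: any fixed function of the window
        return [hash((window, t)) % 100 + 1 for t in range(VOCAB)]

    def step(window):
        # one application of the transition kernel P(next window | current window)
        tok = random.choices(range(VOCAB), weights=next_token_weights(window))[0]
        return window[1:] + (tok,)   # shift: the new state is again a point in {tokens}^CTX

    state = (0,) * CTX
    for _ in range(5):
        state = step(state)          # the update reads nothing but `state`
    print(state)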


I don't think that's clear at all.

https://arxiv.org/abs/2212.10559 argues that an LLM is implicitly doing gradient descent on the context window at inference time (rough sketch of the mechanism below).

If it's learning relationships between concepts at runtime based on information in the context window, then it seems about as useful to say it is a Markov chain as it is to say that a human is a Markov chain. Perhaps we are, but the "current state" is unmeasurably complex.
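As I understand it, the duality that paper leans on is easy to reproduce in a toy setting: with the softmax dropped (the paper's linear-attention approximation), attending over the context is the same as applying to the query a weight matrix built from outer products of the context's values and keys, which has the same form as accumulated gradient-descent updates on a linear layer. A minimal numpy sketch; all names and dimensions here are made up:

    # Sketch of the linear-attention / weight-update duality in a toy setting.
    import numpy as np

    d, n = 8, 5                                  # toy feature size and context length
    rng = np.random.default_rng(0)
    K = rng.normal(size=(n, d))                  # keys from the in-context examples
    V = rng.normal(size=(n, d))                  # values from the in-context examples
    q = rng.normal(size=d)                       # the query token

    # linear attention (softmax removed): sum_i v_i * (k_i . q)
    attn_out = V.T @ (K @ q)

    # the same thing, read as "apply a context-built weight update to q"
    delta_W = sum(np.outer(V[i], K[i]) for i in range(n))   # sum_i v_i k_i^T
    assert np.allclose(attn_out, delta_W @ q)

    # Gradient descent on a linear layer y = W x also produces updates of this
    # shape: a sum of outer products (error_i, x_i). That structural match is
    # the sense in which the context acts like training data at inference time.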


Well, all the information it learns at runtime is encoded in the context window. I don't feel like {tokens}^ctxWindow is unmeasurably complex. I think one should see a transformer as a stochastic computer operating on its memory. If you modelled a computer as a stochastic process, would you take the state space to consist of the most recent instruction, or of the whole memory of the computer?
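To push the computer analogy, whether a process looks Markov depends entirely on what you pick as the state. A toy sketch (everything here is invented) where the full memory is a Markov state but the stream of emitted symbols on its own is not:

    # Toy machine with one hidden register: the register is a Markov state,
    # the emitted symbols alone are not.
    import random

    def run(steps=10):
        acc = 0                                        # the machine's whole memory
        out = []
        for _ in range(steps):
            acc = (acc + random.choice([1, 2])) % 4    # next memory depends only on memory
            out.append("A" if acc < 2 else "B")        # emitted symbol is a lossy view of memory
        return out

    print(run())
    # Knowing only the last symbol does not pin down the next one's distribution
    # (acc = 0 and acc = 1 both emit "A" but behave differently on the next step);
    # knowing the register does. Same point for an LLM: the whole window is the
    # memory, a single recent token is not.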


GPT-4 has a token window of up to 32K tokens. I don't think GPT-4's vocabulary size has been released, but GPT-3's is around 50K. I guess yes, the complexity is technically measurable, but it does seem pretty large!
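Back-of-the-envelope, assuming a GPT-3-sized vocabulary (50,257 BPE tokens) and the 32K window:

    # Rough size of the state space {tokens}^ctxWindow under assumed numbers.
    import math

    vocab, ctx = 50_257, 32_768        # assumed: GPT-3-sized vocab, 32K window
    digits = ctx * math.log10(vocab)   # log10 of vocab ** ctx
    print(f"roughly 10^{digits:,.0f} possible contexts")

So measurable in principle, but that's a state space whose size has about 150,000 digits; nobody is writing down that transition matrix the way you would for a classical Markov text generator.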



