They're evaluated in the context of a fixed window of the most recent tokens; for many models that window is a power of two such as 1024, 2048, or 4096 tokens. The tokens (words, word fragments, and sometimes punctuation) are represented by integer IDs, so the window of recent tokens is compact enough to be cached cheaply. And next-token selection can only pick from the model's fixed vocabulary, with most of the probability mass landing on the handful of grammatically plausible continuations. That's only one such optimization; there could also be optimizations around the likelihood of certain words appearing given certain previous tokens, and so on.
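To make the integer-ID and window ideas concrete, here's a minimal sketch in Python. The vocabulary, `encode`, and `context_window` are made-up names for illustration, not any real tokenizer's API, and real vocabularies are subword-based and far larger:

```python
# Toy vocabulary: each token maps to a small integer ID.
TOY_VOCAB = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}

def encode(tokens, vocab=TOY_VOCAB):
    """Turn a list of word/punctuation tokens into integer IDs."""
    return [vocab[t] for t in tokens]

def context_window(token_ids, window=2048):
    """Keep only the most recent `window` IDs -- a list of small
    integers like this is exactly the kind of thing that caches well."""
    return token_ids[-window:]

ids = encode(["the", "cat", "sat", "on", "the", "mat", "."])
print(context_window(ids, window=4))  # [3, 0, 4, 5]
```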
But, yes, tokens are chosen one at a time based on the previous content, similar to earlier auto-completion algorithms.
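To make the auto-completion analogy concrete, here's a toy greedy decoder built on a hypothetical bigram table. A real model conditions on the whole window at once rather than just the last token, but the choose-one-token-then-append loop has the same shape (`BIGRAM`, `next_token`, and `generate` are illustrative names, not a real API):

```python
# Toy "model": how likely each token ID is to follow the previous one,
# using the same toy vocabulary as above (0="the", 1="cat", ...).
BIGRAM = {
    0: {1: 0.6, 2: 0.4},   # after "the": "cat" is most likely
    1: {2: 0.9},           # after "cat": "sat"
    2: {3: 0.8},           # after "sat": "on"
    3: {0: 0.7, 4: 0.3},   # after "on": "the"
    4: {5: 0.9},           # after "mat": "."
    5: {0: 0.5},           # after ".": start again with "the"
}

def next_token(context_ids):
    """Greedily pick the highest-scoring continuation of the last token."""
    options = BIGRAM.get(context_ids[-1], {0: 1.0})
    return max(options, key=options.get)

def generate(prompt_ids, n_new=5):
    ids = list(prompt_ids)
    for _ in range(n_new):
        ids.append(next_token(ids))  # choose one token, append, repeat
    return ids

print(generate([0]))  # [0, 1, 2, 3, 0, 1] -> "the cat sat on the cat"
```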