Moreover “snapping” the hidden state to a token is akin to quantization. It’s lossy. By staying in latent space the model can “reason” at “full resolution” without discretization noise.
Sometimes discretization introduces interesting behavior though. Compare for example the logistic map and it's chaotic regime with the simplicity of the logistic ODE. Another example would be quantum mechanics compared to classical mechanics and determinism. The Poincare Conjecture was only interesting for n=3 due to too much connectivity in higher dimensions. Wouldn't it be interesting if consciousness only arose in such a discretized form, a case of incidental complexity and chaos introduced as the result of topological non-triviality from quantization?
Don't forget, non-linearity is fundamental to the whole process, otherwise you'd just have one large linear transformation. Maybe there's a similar role for discretization? :shrug:
Useful information about conceptual relationships and procedure can be captured in the LM head, so there is also potential lossiness when short-circuiting it.