Please don't act like you "know how it works" when you obviously don't.
The issue is not whether the model "thinks in tokens" internally. The model is forced, at the final sampling/decoding step, to convert its latent state back into tokens, one token at a time.
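To make that concrete, here is a minimal sketch of greedy autoregressive decoding. The `model` here is a hypothetical stand-in (a callable returning next-token logits), not any particular library's API; the point is that whatever happens in the hidden states, the only thing that ever leaves the model is one discrete token ID per step.

```python
import torch

def greedy_decode(model, prompt_ids, max_new_tokens=50, eos_id=0):
    # `model` is assumed to map a (1, seq_len) tensor of token IDs
    # to (1, seq_len, vocab_size) logits -- a hypothetical interface.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]  # logits over the vocab
        next_id = int(torch.argmax(logits))         # latent collapses to ONE token ID
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```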
The models are fully capable of understanding the premise that they should "output a 5-7-5 syllable haiku", but from the perspective of a model trying to count its own syllables, this is not possible: its vocabulary is tokenized in such a way that not only does it lack direct phonetic information in its training data, it also has no analogue for how humans count syllables (feeling the jaw drop on each one). Models can't reliably reason about the number of characters, or even tokens, in their own reply for exactly the same reason.
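As a quick illustration, here is a sketch using OpenAI's tiktoken library (assuming `pip install tiktoken`; the example word is arbitrary). The subword pieces a model actually sees are determined by the BPE merge table, not by English phonology, so neither the token count nor the piece boundaries tell it anything about syllables:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era BPE vocabulary

word = "refrigerator"  # five syllables: re-frig-er-a-tor
ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]

# The split points are a property of the BPE vocabulary, and will
# generally not line up with syllable boundaries.
print(ids)     # a short list of integer IDs
print(pieces)  # subword pieces, not syllables
```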
The person you're replying to is broadly right, and you are broadly wrong. The internal format does not matter when the final decoding step forces everything back through tokenization. Please actually use these systems rather than pontificating about them online.