> This is striking. If true, why not try to ignore whitespace and punctuation?
It is initially, but thinking about it some more, there's a lot of information packed in whitespace and punctuation choice.
Scriptio continua may have worked because the few readers who lived back then expected it to encode some form of legal or religious prose, but even then they could learn things from the overall shape of the document. LLMs are working in a much richer domain of document types, but the only thing they can "see" is a stream of tokens. There's no spatial or geometric data attached there. So whitespace and punctuation are the only things an LLM has to tell apart inputs whose words are otherwise identical. Such as:
(see: other) -- vs -- {see: other}
One being likely a text fragment, the other likely a piece of code.
Or how spacing may imply Markdown or YAML being used. Or how it may imply a list. Or a poem. Or a song. Or a specific writing style, such as "lol im a casual who not care bout comms" vs. "I am a distinguished professor, about to retire. Elites like us put two spaces after full stop."
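To make the "(see: other)" vs "{see: other}" point concrete, here's a rough Python sketch. The `strip_ws_and_punct` helper is a made-up name for illustration, and it works on characters rather than real tokenizer tokens, so treat it as a toy and not as how any actual LLM pipeline behaves:

```python
import string

def strip_ws_and_punct(text: str) -> str:
    """Hypothetical helper: drop every ASCII whitespace and punctuation character."""
    drop = set(string.whitespace) | set(string.punctuation)
    return "".join(ch for ch in text if ch not in drop)

prose = "(see: other)"  # reads like a text fragment
code = "{see: other}"   # reads like a piece of code

# Before stripping, the two inputs are distinguishable.
print(prose == code)

# After stripping, both collapse to the same string, so the
# prose-vs-code hint carried by the brackets is gone.
print(strip_ws_and_punct(prose))
print(strip_ws_and_punct(code))
print(strip_ws_and_punct(prose) == strip_ws_and_punct(code))
```

Once both inputs collapse to the same string, nothing downstream can recover which one was the prose and which one was the code.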