
> This is striking. If true, why not try to ignore whitespace and punctuation?

It is initially, but thinking about it some more, there's a lot of information packed in whitespace and punctuation choice.

Scriptio continua may have worked because the few readers who lived back then expected it to encode some form of legal or religious prose, but even then they could learn things from the overall shape of the document. LLMs are working in a much richer domain of document types, but the only thing they can "see" is a stream of tokens. There's no spatial or geometric data attached there. So whitespace and punctuation are the only thing an LLM has to make inferences about otherwise textually identical inputs. Such as:

  (see: other)  -- vs -- {see: other}

The first is likely a text fragment, the second likely a piece of code.

Or how spacing may imply Markdown or YAML being used. Or how it may imply a list. Or a poem. Or a song. Or a specific writing style, such as "lol im a casual who not care bout comms" vs. "I am a distinguished professor, about to retire. Elites like us put two spaces after full stop."




> the few readers who lived back then expected it to encode some form of legal or religious prose

Latin literature was extremely rich, from Cicero to Tacitus, and was certainly not limited to legal information.

Here's part of your comment with whitespace and punctuation stripped:

scriptocontinuamayhaveworkedbecausethefewreaderswholivedbackthenexpectedittoencodesomeformoflegalorreligiousprosebuteventhentheycouldlearnthingsfromtheoverallshapeofthedocumentllmsareworkinginamuchricherdomainofdocumenttypesbuttheonlythingtheycanseeisastreamoftokenstheresnospatialorgeometricdataattachedtheresowhitespaceandpunctuationaretheonlythinganllmhastomakeinferencesaboutotherwisetextuallyidenticalinputs

It's a little hard to read, but not that hard. I think one would get used to it.

Also, for creative uses of LLMs, it may be a feature, as trying to find the words could be inspiring.

I think it would be worth a try.


Now do a modern structured document with sections and bullet points and logical connectives.


    string.replace(/[\s\.\*<>\!\?,;:\-–\|"'\[\]\(\)]/g, '')
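
For anyone who wants to try it, here's a minimal sketch of how that regex could be applied. The helper name is made up for illustration, and the lowercasing step is an assumption on my part (the stripped sample upthread is all lowercase, but the regex alone doesn't do that).

```javascript
// Hypothetical helper, not from the thread: wraps the regex above.
function stripWhitespaceAndPunctuation(text) {
  // Same character class as the one-liner: whitespace plus a fixed
  // set of punctuation characters. Note that it has no curly braces.
  const pattern = /[\s\.\*<>\!\?,;:\-–\|"'\[\]\(\)]/g;
  // Assumption: lowercase afterwards to match the all-lowercase sample.
  return text.replace(pattern, '').toLowerCase();
}

const sample = "It's a little hard to read, but not that hard.";
console.log(stripWhitespaceAndPunctuation(sample));
console.log(stripWhitespaceAndPunctuation('(see: other)'));
console.log(stripWhitespaceAndPunctuation('{see: other}'));
```

One thing this makes visible: the character class covers parentheses and square brackets but not curly braces, so `{see: other}` would keep its braces while `(see: other)` would lose its parentheses. Whether that matters depends on what you want the model to still be able to tell apart.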


...there's only 1 space there though



