There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong.
The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined on "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace." Like so many other things in Unicode, the correct answer is use-case dependent.
(And for this reason, String iteration should be based on codepoints--it's the fundamental level on which Unicode works, and whatever algorithm you want to use to derive the correct answer for your purpose will be based on codepoint iteration. hsivonen's article (https://hsivonen.fi/string-length/), linked in this one, does try to explain why extended grapheme clusters are the wrong primitive to use in a language.)
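To make the difference concrete, here is a minimal Python sketch of the competing "length" answers; the cluster count assumes the third-party `grapheme` package (any UAX #29 implementation would do):

    import grapheme  # third-party package implementing UAX #29 cluster segmentation

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # 🤦🏼‍♂️: face palm + skin tone + ZWJ + male sign + VS16

    print(len(s.encode("utf-8")))            # 17 UTF-8 bytes
    print(len(s.encode("utf-16-le")) // 2)   # 7 UTF-16 code units
    print(len(s))                            # 5 code points
    print(grapheme.length(s))                # 1 extended grapheme cluster (under current UAX #29 rules)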
Agreed. And one more consideration is that (extended) grapheme cluster boundaries vary from one version of Unicode to another, and also allow for "tailoring." For example, should "อำ" be one grapheme cluster or two? It's two on Android, but one per the Unicode recommendation, which is also the behavior on Mac. So in applications where a query such as length needs to have one definitive answer which cannot change by context, counting (extended) grapheme clusters is the wrong way to go.
In fact, the name "extended" grapheme cluster should give it away. There was already a major revision to UAX #29 so that the original version is now referred to as "legacy". Your example is exactly this case: the second character, U+0E33 THAI CHARACTER SARA AM, prohibits a cluster boundary now but previously didn't [1].
> An extended grapheme cluster is largely defined on "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace."
I'm sorry, but I fail to see how "This visually displays as a single unit" could ever differ from "Display size in a monospace font" or "Thing that gets deleted when you hit backspace".
* Coding ligatures often display as a single glyph (maybe occupying a single-width character space, or maybe spread out over multiple spaces), but are composed of multiple characters. The ligature may "look" like a single character for purposes of selection and cursoring, but it can act like multiple characters when subject to backspacing.
* Similarly, I've seen keyboard interfaces for various languages (e.g., Hindi) where standard grapheme cluster rules bind together a group of code points, but the grapheme cluster was composed from multiple key presses (which typically add one code point each to the cluster). And in some such interfaces I've seen, the cluster can be decomposed by an equal number of backspace presses. I don't have a good sense of how much a monospaced Hindi font makes sense, but it's definitely a case where a "character" doesn't always act "character-like".
I've always felt ligatures that condense two or more glyphs into something that takes up the space of only one in a monospace font are going beyond what a font should handle and into the realm of what an editor should do. I have several such visual substitutions set up in my .emacs but I don't use fonts that do them on their own.
As for "display size in monospace font", emojis and CJK characters are usually two units wide, not one (although, to be honest, there's a fair amount of bugs in the Unicode properties that define this).
If you type "a", combine it with "´", then change your mind and hit backspace, you probably want to end up with "a" even through "á" was a thing "visually displayed as a single unit".
As a European, no I don't. á isn't used in my language, but my layout offers it via a dead-key-then-base-letter mechanism, and it is correctly treated as one unit when pressing backspace, anything else would feel incorrect. It would be even worse if such a thing happened for the letters that my layout offers individual buttons for (ÅÄÖ). Some languages do treat these as letters with attached modifiers, but many, including mine, treat them as indivisible letters that just happen to look similar to some others for historical reasons, and to treat them as combinations of base letters and diacritics would be completely incorrect, even if you typed them in using the dead-key-then-base-letter mechanism for some reason.
> first press a "dead key" for the diacritic mark and then the letter to apply it to.
That being exactly the way “floating diacritics” in ISO 2022 (or properly one of its Latin encodings, T.51 = ISO 6937) work, amusingly. I wonder which came first. (Yes, I know that a<BS>` came first, the ASCII spec even says that this should give you an accented character IIRC. Or perhaps it was one of the other “don’t call it ASCII” specs—ISO 646? IA5?..)
> Most European keyboard layouts have it the other way around: first press a "dead key" for the diacritic mark and then the letter to apply it to.
Which ones? At least the French and German ones don’t work like that: there is no composing, just separate keys for all the characters with diacritics that appear in the language.
Nitpicking, but most French keyboards have both: ready-made keys for "é" and the few other commonly used accented letters, and composing via dead keys by hitting either '¨' or '^'. For example, hitting '¨' then 'e' produces "ë".
For this specific example, it is actually quite pragmatic. "é" being used many orders of magnitude more often than "ë" in French, it makes sense for it to have its own key.
The nordic layout(s) offer such a mechanism to allow people to type in letters that you'll find in various other European languages, even though the extra letters used in the languages themselves (ÅÄÖÆØ) are present as their own keys. Interestingly, the Swedish layout has no dedicated é key, although é occurs in some Swedish words.
In Swedish, Å, Ä, and Ö are actual letters of the alphabet, while é is used in foreign words. Similarly, the English dieresis (e.g. in coöperate) is essentially unknown in the US and only occasionally used in England, so it doesn't give rise to dieresis characters on the keyboard.
é is commonly used in names and some words that don't feel foreign. For example, the word for idea is written idé. It seems to be an old loan from Greek.
Which French layout would that be? I've never seen a French keyboard where this is true. French is my native language. On layouts I'm familiar with, some accented letters have separate keys like é, but not all, the others are made by composing an accent key with a letter.
Press the key to the left of 1 (not on the numpad) or to the right of the Eszett (sharp S) on the German QWERTZ keyboard and you probably hit a dead key. There are dedicated keys for the German umlauts and Eszett, but the dead keys are for French accents in loan words: â, á and à, e.g. as in Café.
It's worth mentioning of course that there are no-dead-keys variants of the keyboard layout but this has been pretty much the norm on Windows since the 1990s I think.
Do the danes not have the mechanism that is found on Finnish keyboard layouts, where pressing AltGr+Ö yields Ø and AltGr+Ä yields Æ, except in reverse?
Those mappings are not universal. They are present under Linux but not on MS-Windows.
I don't know about Mac, but the layout has in the past been slightly different there from Windows also.
Interesting, it's been like a decade since I last used windows, but I had to go and check what layouts are available, since I remember having these combinations on my layout. Apparently those combinations are provided by windows in the "Finnish and Sami" layout, which provides a number of extra letters (not just ones used by the Sami languages) through AltGr+letter combinations. I must have selected that as my layout at some point while I was still using windows, possibly for the purpose of getting easier access to letters like ÆØÕ, and just forgotten it after some time.
They are there on mac too, use option-Ä to get the Æ or the other way around. What's more, it has worked like that since system 7.x times or so, it's just a good idea.
I had to change settings on Windows to get access to such a mode; once enabled, it let me readily type Spanish correctly using the key combinations as described.
But then if I type "á" directly (through, say, a mobile keyboard) and hit backspace, I'd get "a", which doesn't seem terrible but does feel a little off.
Seems like the right answer for codepoints vs graphemes, unfortunately, is dependent on the context.
In terminals there is a distinction between single-width and double-width characters (east-asian characters, in particular). E.g. the three characters
A美C
would take up the width of four ASCII monospace characters, the “美” being double-width.
Similarly, for composed characters like, say, the ligature "ff", you may want backspace to behave as if it were two "f"s (which logically it is, and what it decomposes to under NFKD normalization).
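Both points are easy to check from Python's standard library alone (a rough sketch, nothing beyond `unicodedata`):

    import unicodedata

    # Terminal width: East Asian Wide characters occupy two cells.
    for ch in "A\u7f8eC":                             # "A美C"
        print(ch, unicodedata.east_asian_width(ch))   # A: Na, 美: W, C: Na

    # The ligature U+FB00 LATIN SMALL LIGATURE FF is compatibility-equivalent to two 'f's.
    print(unicodedata.normalize("NFKD", "\ufb00"))    # 'ff' (two code points)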
The background here is that both are contained in the Japanese Shift JIS character set, and Unicode provides roundtrip compatibility. And they are in Shift JIS because the half-width katakana were in the 8-bit JIS character set [0] used with text-mode displays where all characters have the same width. To preserve screen layout, these later had to be distinguished from full-width katakana.
If I have a text with niqqudim I am going to want to think of the niqqudim differently when editing despite the fact they are entwined with the consonants.
Characters that alter their appearance to occupy one or more display units depending on the characters next to them (before and after). That would be a very crude example, but these types of characters appear all the time in human languages.
There are libraries that help with iterating both code-points and grapheme clusters...
- but are there any of them that can help decide what to do for example when pressing backspace given an input string and a cursor position? Or any other text editing behavior. This use-case-dependent behavior must have some "correct" behavior that is standardized somewhere?
Like a way to query what should be treated like a single "symbol" when selecting text?
Basically something that could help out users making simple text-editors.
There are so many bad implementations out there that do it incorrectly, so there must be some tools/libraries to help with this?
Not only for actual applications but for people making games as well where you want users to enter names, chat or other text.
Not all platforms make it easy (or possible) to embed a fully fledged text editing engine for those use-cases.
I can imagine that typing a multi-code-point character manually by hand would allow the user to undo their typing mistake by a single backspace press when they are actively typing it, but after that if you return to the symbol and press backspace that it would delete the whole symbol (grapheme cluster).
For example if you manually entered the code points for the various family combination emojis (mother, son, daughter) you could still correct it for a while - but after the fact the editor would only see it as a single symbol to be deleted with a single backspace press?
Or typing 'o' + '¨' to produce 'ö' but realizing you wanted to type 'ô', there just one backspace press would revert it to 'o' again and you could press '^' to get the 'ô'. (Not sure that is the way in which you would normally type those characters but it seems possible to do it with unicode that way).
I'd argue that you must use grapheme clusters for text editing and cursor position, because there are popular characters (like the ö you used as an example) which can be either one or two codepoints depending on the normalization choice, but the difference is invisible to the user and should not matter to the user, so any editor should behave exactly the same for ö as U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS) and ö as a sequence of U+006F (LATIN SMALL LETTER O) and U+0308 (COMBINING DIAERESIS).
Furthermore, you shouldn't assume any relationship between how Unicode constructs a combined character from codepoints and how that character is typed. Even at the level of typing you're not typing Unicode codepoints - they're just a technical standard representation of "text at rest"; Unicode codepoints do not define an input method. Depending on your language and device, a sequence of three or more keystrokes may be used to get a single codepoint, or a dedicated key on a keyboard or a virtual button may spawn a combined character of multiple codepoints as a single unit. You definitely can't assume that the "last codepoint" corresponds to the "last user action", even if you're writing a text editor - much of that can happen before your editor receives the input from e.g. the OS keyboard layout code. Your editor won't know whether I input that ö from a dedicated key, a 'chord' of the 'o' key with a modifier, or a sequence of two keystrokes (and if so, whether 'o' was the first keystroke or the second, the opposite of how the Unicode codepoints are ordered).
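A quick check of that claim in Python (cluster counts via the third-party `grapheme` package, which is an assumption here):

    import grapheme

    nfc = "\u00f6"       # ö as a single code point
    nfd = "o\u0308"      # 'o' followed by U+0308 COMBINING DIAERESIS

    print(len(nfc), len(nfd))                          # 1 2  -- code point counts differ
    print(grapheme.length(nfc), grapheme.length(nfd))  # 1 1  -- cluster counts agree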
> I'd argue that you must use grapheme clusters for text editing and cursor position
Korean packs syllables into Han-script-like squares, but they are unmistakably composed of alphabetic letters, and are both typed and erased that way (the latter may depend on system configuration), yet the NFC form has only a single codepoint per syllable (a fortiori a single grapheme cluster). Hebrew vowel markings are (reasonably) considered to be part of the grapheme cluster including their carrier letter but are nevertheless erased and deleted separately. In both of those cases, pressing backspace will erase less than pressing shift-left, backspace; that is, cursor movement and backspace boundaries are different.
There are IIRC also scripts that will have a vowel both pronounced and encoded in the codepoint stream after the syllable-initial consonant but written before it; and ones where some parts of a syllable will enclose it. I don’t even want to think how cursor movement works there.
Overall, your suggestion will work for Latin, Cyrillic, Greek, and maybe other nonfancy scripts like Armenian, Ge’ez, or Georgian, but will absolutely crash and burn when used for others.
OK, I understand that the initial sentence is too strict; however, using codepoints for text editing and cursor position is even worse. Even in your example of Korean there's a clear distinction depending on how the same character is encoded (combined NFC or not), but it should be the same to the user; and obviously, if someone inputs a Latin character with a diacritic by pressing a modifier key before the base letter, then backspace removing the diacritic (since Unicode combining marks come after the base letter) would be just ridiculous.
Backspace in general seems to be a very difficult problem because of subtly incompatible expectations depending on the context, as 'undo last input' when you're typing new text, and 'delete previous symbol' if you're editing existing text.
Some platforms (e.g., Android) have methods specifically for asking how to edit a string following a backspace. However, there's no standard Unicode algorithm to answer the question (and I strongly suspect that it's something that's actually locale-dependent to a degree).
On further reflection, probably the best starting point for string editing on backspace is to operate on codepoints, not grapheme clusters. For most written languages, the various elements that make up a character are likely to be separate codepoints. In Latin text, diacritics are generally precomposed (I mean, you can have a + diacritic as opposed to precomposed ä in theory, but the IME system is going to spit out ä anyways, even if dead keys are used). But if you have Indic characters or Hangul, the grapheme cluster algorithm is going to erroneously combine multiple characters into a single unit. The issue is that the biggest false positive for a codepoint-based algorithm is emoji, and if you're a monolingual speaker whose only exposure to complex written scripts is Unicode emoji, you're going to incorrectly generalize it for all written languages.
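For what it's worth, here is a hedged sketch of the two strategies being debated; the cluster variant assumes the third-party `grapheme` package, the code-point variant is plain slicing:

    import grapheme  # third-party UAX #29 segmentation

    def backspace_codepoint(text):
        """Delete the last code point (splits jamo/diacritics, breaks ZWJ emoji)."""
        return text[:-1]

    def backspace_grapheme(text):
        """Delete the last extended grapheme cluster (removes a whole emoji at once)."""
        clusters = list(grapheme.graphemes(text))
        return "".join(clusters[:-1])

    family = "ab\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 'ab' + family emoji (ZWJ sequence)
    print(backspace_codepoint(family))  # leaves 'ab' plus a dangling man+ZWJ+woman+ZWJ fragment
    print(backspace_grapheme(family))   # 'ab' -- the whole ZWJ sequence goes (current UAX #29 rules)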
IMHO backspace is not an undo key. Use CTRL+Z if you want to undo converting a grapheme to another grapheme with a diacritic character. Backspace should just delete it.
On the other hand, a ligature shouldn't be deleted entirely with just one backspace. It's two letters after all, just connected.
So how do we distinguish when codepoints are separate graphemes, and when they constitute a single grapheme? Based on whether they can still be recognized as separate within the glyph? On whether they combine horizontally vs vertically (along the text direction or orthogonal to it)? What about e.g. "¼" - are those 3 graphemes? What about "%" and "‰"? What about "&" (an "et" ligature)? It seems you can't run away from being arbitrary…
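Unicode itself doesn't draw that line consistently either; the stdlib `unicodedata` module shows which of these carry a decomposition and which are atomic (a quick illustration, not an argument either way):

    import unicodedata

    for ch in "\u00bc%&\u00f6":   # ¼ % & ö
        print(ch, repr(unicodedata.decomposition(ch)))
    # ¼ -> '<fraction> 0031 2044 0034'  (compatibility: the three code points 1 ⁄ 4)
    # % -> ''                           (atomic, no decomposition)
    # & -> ''                           (the historical 'et' ligature is atomic too)
    # ö -> '006F 0308'                  (canonical: o + combining diaeresis)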
> Or typing 'o' + '¨' to produce 'ö' but realizing you wanted to type 'ô', there just one backspace press would revert it to 'o' again and you could press '^' to get the 'ô'.
This is a good example because in German I would expect 'o' + '¨' + <delete> to leave no character at all while in French I would expect 'e' + '`' + <delete> to leave the e behind because in my mind it was a typo.
The rendering of Brahmic- and Arabic-derived scripts makes these choices even more interesting.
Same for a French keyboard with éèàù which are all typed using one key. But even êôûæœäëïöü, all typed using at least two keys, if not 3 with a compose key (from memory, I'm using a phone). Everybody is used to the way it has been working on all OSes.
I realize that the editor would have to be the component that keeps track of how the character was entered for this to work. If you made the character from a single keypress, it would only make sense that backspace also undid the entire character. Only if you created the character from multiple keypresses would it make sense to "undo" only part of it with backspace (at least until you move away from the character).
> make sense to "undo" only part of it with backspace
I'm not sure that ever really makes sense: it would be a misnomer if "backspace" didn't bring you "back" some amount of horizontal "space," I reckon. This logic holds up not only for cases like ö and emoji (where I'd hope the whole grapheme disappears), but also for scenarios like if one types <f><i> and an <fi> ligature appears, where I'd hope only the <i> disappears: that's fine because you are still going back some space.
If the key ever gets repurposed from "backspace" to "undo" then I would agree that it should step back to the previous state with as much granularity as possible.
“Backspace” is already a misnomer: the original intention was for it to move the caret one position back, the other way around from the usual space, thus enabling overstriking (whence also the characters _ ^ ` , unheard of before the typewriter age). You can still see this used by the troff|less internals of man, which encode underline and bold as _<BS>a and a<BS>a respectively.
The text already contains the data about how the character was entered (more specifically, what codepoints it's made of), unless it has been normalized somehow. It seems like a tarpit to me to properly specify these behaviours in a way that won't be an annoyance/surprise to a lot of people.
It would probably expose the implementation too much, but if you wanted combined characters to be split apart, I would expect "ö" to be removed in one go, whereas an "o" with a combining diaeresis (no idea if it's called that, I got the name from the code chart) would take two backspaces.
The mark with two dots can refer to various things.
Dieresis (a Greek word we use in English) is a phenomenon where two vowels don't merge together to make a new sound: co-operate, coöperate, or cooperate. You could also shove a glottal stop in there, but this is much less extreme. The dieresis marker isn't used much in the USA. Another common example is naïve.
Umlaut (a German word) is the opposite phenomenon: the first vowel sound is inflected by the following one, so in German Apfel (apple, pronounced UPfell) becomes Aepfel (apples, EPPfell), more commonly written Äpfel.
People usually refer to the marks by the term in their own language (umlaut, dieresis, caron, accent…). I say Umlaut in German (which doesn't mark the phenomenon, you just have to know) and dieresis in English, though I'm talking about the same symbol.
But diacritics are used in no systematic way across different languages anyway (this is true of the letters themselves too: letters like S, W, V and Z have different pronunciations in English and German, for example). People just needed ways to say "the sound here is similar to the sound of this letter but not really the same" or "when you say this letter also do something else" (like, say, add stress for languages that have that).
English mostly dispenses with them, sometimes adding a letter (g vs gh) but mostly having a very casual relationship with the sound at all. You simply have to know that cooperate isn’t pronounced like chicken coop. I personally think that’s a good thing.
Pragmatically, to get from "ö" to "o" one would delete "ö" (hopefully by pressing backspace) and then type "o"; deleting the combining marks isn't actually useful.
Definitely agree with that! I use a US kbd (incl on phone) no matter what language I’m writing in. A little annoying but switching kbd layouts is more disruptive for me.
In French, è is a single character issued by a single keypress on a French keyboard, like e, or +. (Note that A is shift+a). Why should it need two backspaces? If you press e+` well you have e`, not è.
I am assuming that means "on a French keyboard", not "in French". I have a US keyboard and live in Canada... Every now and then it thinks I'm typing French, and the keyboard indeed behaves in a way where some vowel plus some quotation mark gives me some other character (that I don't need :)
That would feel very strange to me. The Canadian layout probably behaves differently, but à is a modifier letter, not two letters. I’d expect backspace to remove the whole letter, including the accent.
Behavior that depends on whether you edited something else in between, or that depends on timing, is just bad. Either always backspace grapheme clusters, or else backspace characters, possibly NFC-normalized. I could also imagine having something like Shift+Backspace to backspace NFKD-normalized characters when normal Backspace deletes grapheme clusters.
As for selection and cursor movement, grapheme clusters would seem to be the correct choice. Same for Delete. An editor may also support an “exploded” view of separate characters (like WordPerfect Reveal Codes) where you manipulate individual characters.
Everybody loves to debate what "character" means but nobody ever consults the standard. In the Unicode Standard a "character" is an abstract unit of textual data identified by a code point. The standard never refers to graphemes as "characters" but rather as "user-perceived characters", a distinction which the article omits.
I'm not Korean but seeing that said of the Hangul example definitely made me pause - I doubt Koreans think of that example as a single grapheme (open to correction), though it is an excellent example all the same since it demonstrates the complexity of defining "units" consistently across language.
It reminds me a little of Open Street Map's inconsistent administrative hierarchies ("states", "countries", "counties", etc. being represented at different administrative "levels" in their hierarchy for each geographical area), and how that hinders consistency in styling- font size, zoom levels, etc. being generally applied by level.
As a native Korean, I can confirm that "각" is perceived as a single character. But the example itself is bad anyway because everyone uses the precomposed form U+AC01 instead of U+1100 U+1161 U+11A8 (they are canonically equivalent). This is clearer when you also consider the compatibility representation "ㄱㅏㄱ" U+3131 U+314F U+3131, which relates to "각" through compatibility decompositions (NFKC or NFKD) but is perceived as three atomic characters in general.
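A quick check of those equivalences in Python (the cluster counts assume the third-party `grapheme` package):

    import unicodedata
    import grapheme

    precomposed = "\uac01"               # 각 as one code point (U+AC01)
    conjoining  = "\u1100\u1161\u11a8"   # conjoining jamo U+1100 U+1161 U+11A8
    compat      = "\u3131\u314f\u3131"   # compatibility jamo ㄱㅏㄱ (U+3131 U+314F U+3131)

    print(unicodedata.normalize("NFC", conjoining) == precomposed)   # True -- canonically equivalent
    print(grapheme.length(precomposed), grapheme.length(conjoining), grapheme.length(compat))
    # 1 1 3 -- the compatibility jamo really do read as three separate characters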
My impression before was it would be considered a single "entity" (not the same as a roman-alphabetic word, but not a character either) containing 3 characters.
In that case, it sounds like `length` on Unicode strings simply shouldn't exist, since there is no obvious right answer for it. Instead there should be `codepointCount`, `graphemeCount`, etc.
There are basically 2 places where programmers mostly want the "length" of a string:
1. To save storage space or avoid pathological input, they want to limit the "length" of text input fields. E.g., not allow a name to be 4 KB long
2. To fit something on screen
Developers mostly used to western languages can approximate both with "number of letters", but the correct answers are
For 1. Limit by bytes to avoid people building infinite zalgo characters, but be intelligent about it - don't just crop the byte array without taking graphemes into account (a sketch follows below).
For 2. This sucks, especially for the web, but the only really correct answer here is to render it and check.
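For point 1, a hedged sketch of "limit by bytes but respect graphemes", assuming the third-party `grapheme` package:

    import grapheme

    def truncate_to_bytes(text, max_bytes):
        """Keep as many whole grapheme clusters as fit in max_bytes of UTF-8."""
        out, used = [], 0
        for cluster in grapheme.graphemes(text):
            size = len(cluster.encode("utf-8"))
            if used + size > max_bytes:
                break
            out.append(cluster)
            used += size
        return "".join(out)

    print(truncate_to_bytes("na\u00efve \U0001F468\u200D\U0001F469\u200D\U0001F467", 8))
    # 'naïve ' -- stops before the family emoji instead of splitting it mid-sequence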
Unrelated to this post, but you suggested (https://news.ycombinator.com/item?id=37381390) I use your company hydraulic.dev for my Electron build. I ultimately gave up on it, then someone from node-gyp commented on an issue I opened about it and provided the solution:
Just wanted to let you know in case it's a gotcha you might not be aware of that might help you out if you run into similar problems with some of your customer builds.
Thanks for the tip! Sorry to hear you gave up on it, it'd be really appreciated if you could email me with some info as to what problems you hit. We're improving our Electron support at the moment (adding ASAR support and so on), so if there's low hanging fruit it'd be good to know where to look.
The bug you linked to is a bit confusing, it seems to be a bug in electron-builder (or node-gyp), not Conveyor?
You're absolutely correct! `length` is ambiguous - you shouldn't have a `time` argument in a `sleep` function; you should have `milliseconds` and/or `seconds` etc.
The parallels of string length with the phrase "How long is a piece of string?"[0] make this apparent/amusing. I'm sure I'm not the first person to think that.
The `Duration` being a type that implements an interface removing the ambiguity? Like a `DateTime` object does? I think it might be useful to have a function returning a collection of information about text, how many unicode points, how many grapheme clusters, how many syllables, vowels, consonants, special characters… But for performance reasons you probably want separate functions that give you just one of these.
DateTime, Instant, and Duration are totally different things. I believe I was thinking of a class I remember seeing added to java at one point.
Time is probably one of the few things in programming that's even more cursed than strings.
String iteration should be based on whatever you want to iterate on - bytes, codepoints, grapheme clusters, words or paragraphs. There's no reason to privilege any one of these, and Swift doesn't do this.
"Length" is a meaningless query because of this, but you might want to default to whatever approximates width in a UI label, so that's grapheme clusters. Using codepoints mostly means you wish you were doing bytes.
The length/count property was added after people asked for it, but it wasn't originally in the String revamp, and it provides iterators for all of the above. .count also only claims to be O(n) to discourage using it.
Is there a canonical source for this part, by the way? Xi copied the logic from Android[1] (per the issue you linked downthread), which is reasonable given its heritage but seems suboptimal generally, and I vaguely remember that CLDR had something to say about this too, but I don’t know if there’s any sort of consensus on this problem that’s actually written down anywhere.
> And for this reason, String iteration should be based on codepoints
Why not offer both and be clear about it? Rather than just "length", why not call them code points? The Python docs for "len" which can be called on a unicode string say "Return the length (the number of items) of an object.". It doesn't look like a clear and easy to use API to me.
If you insist that `len` shouldn't be defined on strings, and that the default iterator should be undefined in Python, then:
    for c in "Hello":
        pass
should throw an exception. Also
    if word[0] == 'H':
        pass
should throw an exception.
This would have been an extremely controversial suggestion when python3 came out to say the least.
Codepoints are a natural way of defining Unicode strings in Python, and it mostly works the way you expect once you give it a bit of thought. They are lower level than, say, grapheme clusters, but more well-defined, and they provide the proper primitives for dealing with all use cases.
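Concretely, Python's `len`, indexing, and iteration all operate on code points; the other views have to be asked for explicitly:

    s = "cafe\u0301"                 # 'cafe' + U+0301 COMBINING ACUTE ACCENT, renders as "café"

    print(len(s))                     # 5 code points
    print(s[4] == "\u0301")           # True -- indexing sees the combining mark on its own
    print(len(s.encode("utf-8")))     # 6 bytes
    print([hex(ord(c)) for c in s])   # iteration yields one code point at a time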
> for this reason, String iteration should be based on codepoints
This leads you to the problem where you'll get different results iterating over
n a ï v e
vs
n a ̈ i v e
And I can't see how that's ever going to be a useful outcome.
If you normalize everything first, then you can sidestep this to some degree, but then in effect your normalization has turned codepoint iteration into grapheme iteration for most common Latin-script text characters.
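A sketch of that sidestep with the stdlib `unicodedata` module: NFC makes the two spellings iterate identically for Latin text, though it does nothing for emoji or Indic clusters:

    import unicodedata

    composed   = "na\u00efve"    # precomposed ï
    decomposed = "nai\u0308ve"   # 'i' + U+0308 COMBINING DIAERESIS

    print(len(composed), len(decomposed))          # 5 6
    nfc = unicodedata.normalize("NFC", decomposed)
    print(nfc == composed, len(nfc))               # True 5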
This is quite a good write up. An answer to one of the author's questions:
> Why does the fi ligature even have its own code point? No idea.
One of the principles of Unicode is round-trip compatibility. That is, you should be able to read in a file encoded with some obsolete coding system and write it out again properly. Maybe frob it a bit with your Unicode-based tools first. This is a good principle, though less useful today.
So the fi ligature was in a legacy encoding system and thus must be in Unicode. That's also why things like digits with a circle around them exist: they were in some old Japanese character set. Nowadays we might compose them with some zwj or even just leave them to some higher level formatting (my preference).
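Both of those compatibility characters still carry their decompositions; a quick check with the stdlib `unicodedata` module:

    import unicodedata

    print(unicodedata.name("\ufb01"), "->", unicodedata.normalize("NFKC", "\ufb01"))
    # LATIN SMALL LIGATURE FI -> fi
    print(unicodedata.name("\u2460"), "->", unicodedata.normalize("NFKC", "\u2460"))
    # CIRCLED DIGIT ONE -> 1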
This implies that they're obsolete, but they're not -- they're still in very common use today. You can type them in Japanese by typing まる (maru, circle) and the number, then pick it out of the IME menu. Some IMEs will bring them up if you just type the number and go to the menu, too. :)
> So the fi ligature was in a legacy encoding system and thus must be in Unicode.
Most of the precomposed Latin ligatures are from EBCDIC code pages. People in the ancient mainframe era wanted nice typesetting too, but computer fonts with ligature support were a much later invention.
You can see fi and several others directly in EBCDIC code page 361:
Thanks. Some alphabets have precomposed ligatures that aren't really letters, like old German alphabets with tz, ch, ss (I only know how to type the last one, ß, because the others have died out over the last hundred years).
Actually, in German (at least) ä, ö and ü really are ligatures for ae, oe, and ue -- scribes started to write the E's on their sides above the base letters, and over time the superscript "E"s became dots or dashes. Often they are described the other way around: "you can type oe if you can't type ö." That's what my kid was told in school!
But Ö and ß aren't really part of the alphabet in German, while, say, in Swedish, ä and ö became actual letters of the alphabet. English got W that way too.
That sounds a bit false to me. The Umlaute (ä, ö, ü) and the "Eszett" ß are actually part of the German alphabet [1]. Also, it is kinda weird to describe them as ligatures of the original letters and the diaeresis, because while this is what they started out as a long time ago, they are just their own letters now (as opposed to "real" stylistic ligatures like combining fi into one glyph). The advice your kid was given, that they can be replaced with ae, oe and ue, is correct - it is a replacement nowadays.
C'mon, that page is highly technical, really just listing the letters or glyphs used for forming printed text. In reality, if you walk into any first grade classroom you see the letters A-Z with pictures (Apfel, Bär, usw.) and then after the end, maybe around the corner, what, Öl? I can't even remember. When the kids recite the letters they don't recite äöüß. TBH I really only remember this because when kiddo was that age the Neue Rechtschreibung transition was mid-process and the parents were angrily divided, so I was kinda paying attention.
Also, though it's hardly authoritative, my kids' school taught English through immersion from grade 1 too, and both German and English teachers said "same alphabet".
As bmicraft pointed out, even in that wikipedia chart those inflected letters are spaced apart from the others. Yes, they are letterforms, but not part of the "alphabet" -- they don't even have a sorting like the Swedish Ä or W do.
And you can switch in running text from using the marker for umlaut (dots or bar, not semantically dieresis) or a normal "e" without anyone blinking. There's no problem reading a Swiss book even though ß refuses to cross the border. Though I personally prefer to read Äpfel and Bär rather than Aepfel and Baer, really, they are the same.
> When the kids recite the letters they don't recite äöüß.
On a somehow related side note, I read that "&", which is derived from a ligature of the Latin conjunction "Et" (meaning "and"), was named "ampersand" in English as a mondegreen for "and per se and" as it used to be placed at the end of the English alphabet recitation.
They sure are letters, but they aren't generally thought of as being in the alphabet (which seems to be why they are just kinda tacked on after a space on Wikipedia) and they get ordered as if they were just the base letter (mostly).
Letters which are sorted separately from what we'd think of as the base characters in English (they appear at the end of the alphabet, as W X Y Z Æ Ø Å, with C often omitted in Norwegian).
By contrast, my French dictionary has énorme nestled between enorgueillir and enquérir. (Looking for an example does underscore some of the patterns in the language: page after page of ét~ with only a few et~ and one êt~ among them; pages of ex~ with no éx~ at all.)
Similarly in Swedish, W was not considered a letter but just a variant of V, so in phone books etc all the W names were mixed in with V names. This was changed in 2006 due to an increase in English loanwords.
You sent me on an enjoyable wild goose chase, but it appears that only ß and ff are in Unicode: tz, ch, ck have to be handled completely in rendering.
I have my music teacher's German schoolbook from around 1915 and it lists them all as letters (the whole book is in Fraktur). I have various old books in Fraktur and once could read them. I imagine that if I sat down and tried to read one it would come back, but at the moment I have to think a little to read just the titles!
The circled digits as code points are very nice to have precisely because they are available in applications that don't support them otherwise... which is actually most of the software I can think of (Notepad, Apple Notes, chat applications, most websites, etc).
My point was that, had they not been legacy characters (or had RT compatibility been disregarded) Unicode could still have supported them as composed characters. Though I personally still feel they are a kind of ligature or graphic, but luckily for everyone else I’m not the dictator of the world :-).
We should be careful: someone on HN could write a proposal that they should be considered precomposed forms that should also have an un-composed sequence… so there could in future be not just 1-in-a-circle but 1 ZWJ circle and circle ZWJ 1, all considered the same… I can imagine some HN readers being pranksters like that.
Can you write them with iOS keyboard? Or when you say Apple Notes and chat apps you just mean from desktop?
Edit ①: seems the answer is not with the default iOS keyboard, but possible to paste it and perhaps possible with a third party keyboard that I'm not keen on trying (unless I hear of a keyboard that's both genuinely useful / better than default, and that doesn't send keystrokes to the developer - though I can't remember if the latter is even a risk on iOS, better go search about that next..)
Speaking of third party keyboards, I’m still upset about what happened to Nintype[0]. I’ve never ever been able to type faster on mobile than with its intuitive hybrid input style of sliding and tapping, paired with AI that was actually good. It used to be quite performant, fully customizable, and it worked beautifully as a replacement for default on jailbroken iOS.
Today, it’s buggy $5 abandonware that only makes me sad when I am reminded of it.
EDIT: Here[1] is a blog post that claims it's still the best keyboard in 2023. I actually might give it another shot... Not holding my breath though.
EDIT: Looks like another dedicated fan has actually taken it upon themself to revive the project, under the new name Keyboard71[2].
I’m really considering repurchasing (I definitely owned it previously, no idea what happened), can you describe specifically what the main bugs are for you? I’d be happy if I could use it solely for occasionally writing long notes, not as a replacement for all text inputs.
Really not looking to burn another $5, I’d greatly appreciate any thoughts/concerns at all.
You can type ① with the UniChar keyboard app on iOS. It at least claims it doesn’t transmit information. As it’s only useful for special characters I don’t worry because I can’t use it for normal typing anyway.
Anytime tonsky's site gets posted here, I'm reminded by how awful it is, which is ironic given his UI/UX background. The site's lightmode is a blinding saturated yellow, and if you switch into darkmode, it's an even less readable "cute" flashlight js trick. I don't know why he thought this was a good idea. Thank god for Firefox reader mode.
I'm having a hard time reconciling "he knows what he is doing" with him making his site practically unusable without a reader mode, which by the way, not every browser supports (especially on mobile).
Don't even think of switching on the dark (night) mode with that attitude! :D
I really enjoyed the tongue in cheek design. I think every modern browser either allows you to turn on reader mode (especially on mobile) or just turn off CSS. This particular article works excellently even in w3m.
Fine, sure. Cute - turn on reader mode. Now the images that are supposed to be sitting over yellow background are dark gray images over a slightly less dark gray background.
The decision to design a serious (read: not-tongue-in-cheek) topic with these "quirky" tricks sucks, JMHO.
> Don't even think of switching on the dark (night) mode with that attitude! :D
:D
> I really enjoyed the tongue in cheek design. I think every modern browser either allows you to turn on reader mode (especially on mobile) or just turn off CSS. This particular article works excellently even in w3m.
Firefox Focus on mobile does not have Reader Mode.
It's deeply ironic that an article about dealing with text properly has images which are part of the article text and yet have no alt-text, rendering parts of the article unreadable in reader mode if the server is slow.
It is obviously a joke (and a good one, I dare say). The fact that people seem to take it seriously says something about the contemporary state of webdesign :)
2. I don't think it's a very good joke to post long-form content on your blog with the expectation that it's basically unreadable without a reader mode.
> The fact that people seem to take it seriously says something about the contemporary state of webdesign :)
Mind expanding and what it says exactly about contemporary web design?
Whether I take it seriously or not doesn't change the fact that it's still damn hard to read anything.
I can agree that most modern web design is bad. I can also agree that the web design on tonsky's site is bad, but OK, I acknowledge that it is intentionally bad; so bad that it's unreadable. I had myself a chuckle now, and next time I see a link to tonsky's site, I'll click on it, chuckle, and immediately leave.
Works like a normal website with JavaScript disabled. I didn't even know it did fancy junk until reading the comments here. NoScript saves the day again! I don't know how people can browse the web without it.
I never understood how people can browse the web WITH IT. Even 10 years ago, and today more than ever, basically every website needs JS to work properly. I basically never come across a page where I have the urge to disable JS. I have a large set of adblock lists active that also helps get rid of cookie banners and other shit.
I cannot imagine manually approving JS for every site. And doing the inverse, keeping NoScript installed just to deny one website a year, does not seem worth it for me either. In that case I could also just use an adblock rule to block that specific script, or all .js files from the domain, I guess. So I really do not need NoScript.
Many sites don't need JS at all, like the OP of this thread for example. For a lot of sites, disabling JS actually gives a better experience than leaving it enabled, again like the OP of this thread. It's a trade-off, but I find most uses of JS are so bad it's worth putting up with whitelisting. For example, I don't see cookie pop-ups, I don't see videos, disabling JS kills most of those stupid sticky headers that web designers love so much, and whatever too-clever crap the OP of this thread was doing is completely bypassed. The web is so, so much better with JS off by default.
For those sites that do need JS, NoScript's whitelist feature makes it quick & easy to fix. The first time I visit a new site, if it is obviously broken, then I whitelist the main domain. If that doesn't fix it, then I whitelist a couple likely-looking domains (often sites import JS from similar domains, or from common library domains). That's enough to get probably 90+% of websites working, while still leaving most garbage JS disabled. The remaining ~10% of websites that need a dozen domains whitelisted are probably not worth visiting anyway, so I just move on at that point. Or NoScript even lets you temp-whitelist everything for a given tab and just put up with the misery to get whatever I need from that one site. Since the whitelist persists forever, and I don't visit hundreds of different websites every day, after some time it becomes pretty rare that I need to whitelist more than one or two things per day.
You maintain an adblock blacklist, I maintain a NoScript whitelist. Not so different :)
By default the script for the page itself is whitelisted, it is just the third party scripts that are blocked. This works fairly often, but there are a few sites that you can also globally unblock because they provide value. One example is mathjax, used to format equations on many pages.
It is some time ago since I last used it, but I found that too many websites that I want to read require Javascript to even show you the main body of text, or a reasonable layout. Is that different now?
No. Just whitelist the main domain for sites that are obviously broken. Then try one or two likely subdomains if that's not enough. In the rare cases where it still wants more crap enabled, then it's usually not worth the effort, close tab and move on to something else. As you build up a whitelist over time, it becomes pretty rare that you need to interact with it more than a couple times per day. Yeah, it takes some effort, but it's worth it to nuke cookie banners and sticky headers and videos and all the other crap people do with JS.
I already have that routine with uBlock Origin. I don't think NoScript offers all of uBO's functionality, and I certainly won't do the same dance for two extensions, but I'll look into uBO's abilities to specifically block JS.
Makes sense! I use both, uBO just does its thing and I never interact with it. NoScript handles blocking & whitelisting javascript. It's totally possible uBO has a similar feature and I just don't know about it.
I've been drawing circles for over a minute now and no one has joined me yet, so I conclude those movements are random rather than made by intelligent beings. :)
I did the same for a while while I was reading. From another comment, the position only seems to update once a second, so it'll be hard for someone to notice your movements.
It's quite possibly the worst web page presentation I've come across in a long time - aside from the fact it looks like some bug has caused my OS to leave a random trail of mouse pointers all over the screen, some of them even move around, making me doubt my sanity when I'm quite sure I'm holding the mouse still.
And the less said about the colours the better. There's no way I was going to put up with that long enough to read all the text on it.
Good times. If you click on the sun switch the entire UI gets zeroed out and you get to use on:hover mouse shtick to read the UI through a fuzzy radius. Is Yoko Ono designing websites now?
> If you can't handle refreshing or merely clicking it again, that's you having a problem, not the site having a problem.
No, it's the site's problem. The contrast between the blinding radioactive yellow background and the font is eye straining and doesn't meet the WCAG standards for accessible text. And the dark mode is unusable. The joke would be funnier if there was a real dark mode or if the light mode was readable.
You said the joke made you leave. The normal color scheme is not part of the joke, and I'm sorry it hurts your eyes. I won't try to defend the eye hurting.
> doesn't meet the WCAG standards for accessible text.
Oh, which part of the standards? When I punch #000000 on #FDDB29 into contrast checkers I get good results.
The single quote isn't as good but all the rest has those colors.
I'm using the Firefox Accessibility Tools. But you're right, I mistook the accessibility warning for the quote/header text for the body text. #000 > #FDDB29 does pass unfortunately.
It's cute, and provides a hint of human connection that is otherwise absent on the web "hey, another human is reading this too!" which you probably know but something about seeing the pointer move makes it feel real.
Probably not the greatest during a hacker news hug of death, but if I read that article some other time and saw one of the moving pointers, I would think it was really cool.
Have you ever read with other people, like in school or a book club, or been somewhere that there were other people around? It's an interesting move by the author; the loneliness epidemic hasn't gone unnoticed.
Too bad the Linux and the Mac pointer look so similar. But when you give them different background colors, it becomes more obvious which platform dominates, like:
> People are not limited to a single locale. For example, I can read and write English (USA), English (UK), German, and Russian. Which locale should I set my computer to?
Ideally - the "English-World" locale is supposedly meant for us, cosmopolitans. It's included with Windows 10 and 11.
Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux where locales appear to be less easy to customize than in Windows. Windows always offered a handy configuration dialog to granularly tweak your locale choosing what measures system you prefer, whether your weeks begin on sundays or mondays and even define your preferred date-time format templates fully manually.
A less-spoken-about problem is Windows' system-wide setting for the default legacy codepage. I happen to use single-language legacy (non-Unicode) apps made by people from a number of very different countries. Some apps (e.g. I can remember the Intel UHD Windows driver config app) even use this setting (ignoring the system locale and system UI language) to detect your language and render their whole UI in it.
> English (USA), English (UK)
This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us, the presence of a huge number of these (don't forget en-AU, en-TT, en-ZW, etc. - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.
By the way, I wonder how string capitalization and comparison functions manage to work on the computers of people who use both English and Turkish actively (the Turkish locale distinguishes between dotted İ and dotless I).
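The Turkish-I problem in a nutshell: Python's built-in casing is locale-independent, so locale-correct results need ICU. The PyICU calls below are an assumption about that binding's API, not something the parent comment describes:

    import icu  # third-party PyICU binding

    print("ISTANBUL".lower())
    # 'istanbul' -- dotted i, not what Turkish casing rules say
    print(str(icu.UnicodeString("ISTANBUL").toLower(icu.Locale("tr"))))
    # 'ıstanbul' -- I lowercases to dotless ı in Turkish
    print(str(icu.UnicodeString("istanbul").toUpper(icu.Locale("tr"))))
    # 'İSTANBUL' -- i uppercases to dotted İ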
As an Irish person, while we have en_IE which is great (and solves most of the problems you list re: Euro-centric defaults + English), I'd still quite like to have an even more broad / trans-language / "cosmopolitan" locale to use.
I mainly type in English but occasionally other languages - I use a combination of Mac & Linux - macOS has an (off-by-default but enable-able) lang-changer icon in the tray that is handy enough, but still annoying to have to toggle. Linux is much worse.
Mac also has quite a nice long-press-to-select-special-character feature that at least makes for accessible (if not efficient) typing in multiple languages while using an English locale. Mobile keyboards pioneered this (& Android's current one even does simultaneous multi-lang autocomplete, though it severely hurts accuracy).
---
> I doubt many English speakers care to distinguish between English dialects.
I think you'll find the opposite to be true. US English spellings & conventions are quite a departure from other dialects, so typing fluidly & naturally in any non-US dialect is going to net you a world of autocorrect pain in en_US. To the extent it renders many potentially essential spelling & grammar checkers completely unusable.
I write in multiple languages daily on Linux, including English, Russian, and Chinese. Switching keyboards (at least with gnome) is a simple super-space.
While in my default (English) layout, it is easy enough to add accents and other characters using the compose key (right alt). So right-alt+'+a = á or right-alt+"+u = ü. I much prefer this over the long press as I can do it quickly and seamlessly without having to wait on feedback. Granted, it is not as discoverable, but once you are comfortable, it is in my opinion a better system.
I can 2nd this as an American who now resides in Europe. My first laptop I brought with me, and was defaulted to en_US, but my replacement is en_GB (Apple doesn't have en_NL, for good reason).
I don't find it "unusable", though. I could change it back to en_US, but it has actually been interesting to see all of my American spellings flagged by autocorrect. Each time I write authorize instead of authorise it is an act of stubborn group affinity!
> US English spellings & conventions are quite a departure from other dialects.
As far as the written, formal language is concerned, English really has only three dialects: US American, Canadian, and everywhere else. There are some other subtle differences (such as "robots" for traffic lights in South Africa, or "minerals" for fizzy drinks in Ireland¹), but that's pretty much it.
¹ Yes, this isn't just slang in Ireland: the formal, pre-recorded announcements on trains use it: "A trolley service will operate to your seat, serving tea, coffee, minerals and snacks." The corresponding Irish announcement renders it mianraí. Food service on trains stopped during covid and has not yet resumed, so I'm working from distant memory now.
> As far as the written, formal language is concerned, English really has only three dialects
This is true, but I don't see why the "formal" qualifier is needed here :) There are much more than 3 dialects of English, both written & spoken.
Especially there's a fair few extremely common notable differences in (casual, written) Irish English: the word "amn't" (among other less common contractions), the alternative present tense of the verb "to be" (i.e. "do be"), various regional plurals of "you", and - perhaps the most common - prepositional pronouns, etc. etc.
Well, quite. If we include any one or more of the following three categories — formal spoken language, informal spoken language, informal written language — then there's definitely far more than three dialects of English. But formal spoken language really has only the three.
I guess it's a question as to how many varieties of spelling you want to make available as "translations" in software (e.g. color vs colour, tire vs tyre).
There's plenty of regional variants just within the US, but "en_us" covers the whole country.
That's a fair point - even in tiny tiny Ireland there's many regional dialects, with larger countries there'll typically be far more.
I guess the simple answer to that is: how much interest is there in maintenance. I don't think there's any compelling reason not to create something: if there's insufficient interest in maintenance that's an imperfect but reasonable proxy for utility.
I'm not aware of any maintained en_US_Xyz languages but it might be pretty cool if someone started. There's precedent in a few other languages, like no_NO_NY, zh_Hans_HK, etc.
> I doubt many English speakers care to distinguish between English dialects
It's worthwhile purely for the sake of autocorrect/typo highlighting in text-editing software. I don't miss the days of spelling a word correctly in my version of English but still being stuck with the visual noise of red highlighting up and down the document because it doesn't conform to US English.
Yeah I'd rather not have my British English dialect seen as second-class in a world of American English ideally which is what having a red document full of 'errors' implies in those sorts of situations.
It's sometimes not a trivial distinction either, for example I've heard of cases where surprised British redditors have found themselves banned from American subreddits for being homophobic when they were actually talking innocently enough about cigarettes!
I would think a lot of mods, who are either Highly Online Americans or their weirdo equivalents in other countries, are well aware of the UK usage, but simply expect Brits to give it up in order to avoid offending Americans and the global Reddit community that largely takes American-style sensitivity as its orthodoxy. And considering that Reddit corporate feels that anything that could stir up such outrage is bad for business, mods of popular subreddits may well feel pressured to come down hard on these matters.
It doesn't matter if you use UK or US spelling you are wrong. I wish we would adopt the international phonetic alphabet I might have a chance of spelling things correctly.
> I doubt many English speakers care to distinguish between English dialects
I think you'd be surprised how many english (UK) people will get pissed off when their spell-checker starts removing the "u" from colour or flavour, or how many English (US) people get pissed off when the spellchecker starts suggesting random "u"s to words.
Additionally, locale isn't just about language. English (US) vs English (UK) decides whether your dates get formatted DD-MM-YY or MM-DD-YY, whether your numbers have the thousands separated by commas or spaces, and a host of other localization considerations with a lot more significance than just the dialect of English.
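As a small illustration of that non-spelling side, assuming the third-party Babel package (exact output depends on the CLDR data it ships with):

    from datetime import date
    from babel.dates import format_date
    from babel.numbers import format_decimal

    d = date(2023, 4, 5)
    print(format_date(d, format="short", locale="en_US"))  # e.g. 4/5/23
    print(format_date(d, format="short", locale="en_GB"))  # e.g. 05/04/2023
    print(format_decimal(1234567.89, locale="en_US"))      # 1,234,567.89
    print(format_decimal(1234567.89, locale="de_DE"))      # 1.234.567,89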
I worked for BP for a while (well, as a contracted coder) and I got quite used to the UK spell check correcting everything to its idiom. Everything seemed wrong once I returned to a world that dismissed the value of the letter 'U' and preferred the letter 'Z' over 'S'. I also missed the normalizing of drinking beer at lunch.
> Also missed the normalizing of drinking beer at lunch.
Perhaps you're an old-timer? I worked in the city in the early 80s; lunch in the pub was routine, and sometimes required. By the end of the 80s, that was at best frowned on. Over the last 20 years, having alcohol on your breath after lunch would have been a disciplinary issue, unless you were entertaining a client, at least in the places I worked.
It's likely because I was in the exploration frontier of Alaska. This was the late nineties, probably an operation run by people who worked in the city in the '80s and who could continue the old ways far away from the social glare of the head office. :D
> I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time
> I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.
Well, you just explained what this plethora of options is about. It's not just about how you spell flavor/flavour. It's a lot of different defaults for how you expect your OS to present information to you. Default paper size, but also how to write date and time, does the week start on Monday, Sunday, or something else, etc.
> Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux where locales appear to be less easy to customize than in Windows. Windows always offered a handy configuration dialog to granularly tweak your locale, choosing what measurement system you prefer, whether your weeks begin on Sundays or Mondays, and even define your preferred date-time format templates fully manually.
There's the English (Denmark) locale for that on some platforms.
I write daily in US English, Australian English, and Austrian German. Most of the time, a specific document is in one dialect/language or another: not mixed, although sometimes that's not true.
I can understand that the conflation of spelling, word choices, time and date formatting, default paper sizes, measurement units, etc, etc, is convenient, and works a lot of the time, but it really doesn't work for me at all.
That said, I appreciate that I occupy a very small niche.
> I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects.
Most people in the UK care - a population nearly twice that of California, and larger than the native speakers of any non-top-20 language. If you care enough to support e.g. Italian you should support en_GB.
> This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects.
While that is generally (though not always) true, I would assume it's really a stand-in for the much more relevant zh locales.
It is also rather relevant to es locales (American Spanish has diverged quite a bit from European Spanish, hence the creation of es-419), definitely French (Canadian French, and to a lesser extent Belgian and Swiss), and German (because of Swiss German). And it might be relevant for ko if North Korea ever stops being what it is.
Unix-style locales (set by env vars) are flexible and can be set per app (see the sketch below). Android and iOS have recently added per-app locale support too, IIRC. Windows locale settings are global and some require a reboot.
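For what it's worth, here's a minimal sketch of the Unix-style lookup order (LC_ALL, then the specific LC_* category, then LANG); it uses only the standard library, and the locale_for helper is made up for illustration:

    use std::env;

    // POSIX-style precedence: LC_ALL overrides everything, then the specific
    // LC_* category (e.g. LC_TIME), then LANG, falling back to the "C" locale.
    fn locale_for(category: &str) -> String {
        env::var("LC_ALL")
            .or_else(|_| env::var(category))
            .or_else(|_| env::var("LANG"))
            .unwrap_or_else(|_| "C".to_string())
    }

    fn main() {
        // Because these are plain environment variables they can be set per process:
        //     LC_TIME=de_DE.UTF-8 ./my_app
        println!("time locale: {}", locale_for("LC_TIME"));
    }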
> This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.
I definitely do. The biggest difference, as everyone else has pointed out, is the US vs UK spellings.
Realistically, though, beyond that, country is a poor indicator for everything else. I want to use DD/MM/YYYY date format in English, but DD.MM.YYYY date format in German. I want to use $1,000 in English, but 1.000 $ in German. This isn't dependent on the country I live in, it's dependent on a combination of a country and a language - that could be the country I'm living in, or the country I grew up in (mostly US date format vs not), and it's either the language I'm actively typing in, or the language of the document I'm reading, or the language I'm thinking in (but a computer can't exactly handle that).
Trying to guess the correct combination is tricky, especially if a document is in two languages (e.g., a quotation), and users are lazy and won't switch their IME unless they have to.
What this means in slightly more practical terms is that setting a single "locale" for my device doesn't make sense; rather, I should be able to choose a locale per language (or possibly spelling preferences by language and formatting options by language as separate choices). I'd then pick a language to use the device in, and it would use that language's locale and tell apps that this language is the preferred language. If an app doesn't provide my preferred language, it should pull the preferred locale from settings for a language it does support, otherwise use the default set by the developers. For some apps it's a bit more complex, particularly if I'm creating content. GMail or Office would be two good examples, where the UI language might be in English, but the emails or documents are in German, or a combination of German and English.
Even then, I'm sure there are people who need something even more flexible than that.
At the moment, if I set my language to English but my Country to Germany on my iPhone, for example, things occasionally get confused. My UK banking app, for example, pulled the decimal separator from my locale settings for a while and then refused to work because "£9,79" (or whatever it was) isn't a valid amount of money, and I couldn't see a way to fix that without switching my Country in the phone settings. I imagine they fixed it by ignoring my configured locale and always using en-GB, thus defeating the whole point of a locale in the first place.
So yeah, these days it's fairly common to not have a single "locale" that you work in - it's quite possible to want to use two or more but nothing is really set up to handle that well.
> many Chinese, Japanese, and Korean logograms that are written very differently get assigned the same code point
This leads to absolutely horrendous rendering of Chinese filenames in Windows if the system locale isn’t Chinese. The characters seem to be rendered in some variant of MS Gothic and it’s very obviously a mix of Chinese and Japanese glyphs (of somewhat different sizes and/or stroke widths IIRC). I think the Chinese locale avoids the issue by using Microsoft YaHei UI.
It's tracking every visitors' cursor and sharing it with every other visitor.
Why would a frontend developer demonstrate their ability to do frontend programming on their personal, not altogether super-serious blog? I meant that rhetorically but it's a flex. I agree, not the best design in the world if you're catering for particular needs, but simple and fun enough. You should check out dark mode.
In that vein, I think it's okay if we let people have fun. That might not work for everyone, but why should we let perfect be the worst enemy of fun?
because it shows that they don't understand important design aspects
while it doesn't really show off their technical skills, because it could be some plugin or copy-pasted code; only someone who looks at the code would know better. But if someone cares enough about you to look at your code, you don't need to show off that skill on your normal web site and can have a separate tech demo.
> okay if we let people have fun
yes, people having fun is always fine, especially if you don't care whether anyone ever reads your blog or looks at it for whatever reason (e.g. hiring)
but the moment you want people to look at it for whatever reason then there is tension
i.e. people don't get hired to have fun
and if you want others to read your blog you probably shouldn't assault them with constant distractions
Creative expression allows us to push ourselves, both in what we think we can do, and often the technical aspects about how we do it too. Even if the idea doesn't stick, you've tried something new.
In a world of Tailwinds and Bootstraps and the same five templates copied again and again and again, let's celebrate the people willing to push things and learn from their inevitable but ultimately valuable mistakes. And let's have some fun along the way.
Not every website, even a technical one, needs to have an eye towards professional advancement. Sometimes they're just for fun. I welcome it, as it's a thing that gets more rare on the web as time goes by.
I assume the creator didn't anticipate this many readers at the same time, and having one or two other cursors on the page does sound fun and not too distracting. They should probably limit the number of other cursors displayed to something sensible.
I stopped in the middle of reading the post just for this. It was so distracting I was unable to focus on the text. It's a fun gimmick, but the result is that someone who wanted to read the post, stopped in the middle.
It's revenge against anyone with certain kinds of visual impairments and/or concentration issues, because the ex-spouse of the author, who turned out to be a terrible person, had such issues.
(sarcasm try 2)
It's revenge against anyone using JS on the net with the author trying to subtle hint that JS is bad.
(realistic)
It's probably one of:
- the website is a static view of some collaborative tool which has that functionality built in by default
- some form of well-intended but not-well-working functionality added to the site as some form of school/study project, in which case I'm worried about the author suffering unexpectedly much higher costs due to it ending up on HN ...
Hi, author here. In case you really want to know: no, it’s custom-made and works exactly as intended. There are two main reasons:
1. Fun. Modern internet is boring, most blog posts are just black text on white background. Hard to remember where you read what. And you can’t really have fun without breaking some expectations.
2. Sense of community. Internet is a lonely place, and I don’t necessarily like that. I like the feeling of “right now, someone else reading the same thing as I do”. It’s human presence transferred over the network.
I understand not everybody might like it. Some people just like when things are "normal" and everything is the same. Others might not like the feeling of human presence. For those, I'm not hiding my content; reader mode is one click away, and I make sure it works very well.
As for “unexpectedly ended up on HN”, it’s not at all unexpected. Practically every one of my general topic articles ends up here. It’s so predictable I rely on HN to be my comment section.
2. I only understood that it was actual other people's mouse cursors when I read that here. So it didn't really engender a sense of community, although after some time I did think you were very good at modelling actual human mouse movements. Now that I know it, it's pretty neat though.
The cursors will only be a problem during front page HN traffic. And the opt-out for people who care is reader mode / disable js / static mirror. Not sure if there's any better way to appease the fun-havers and the plain content preferrers at the same time. Maybe a "hide cursors" button on screen? I, for one, had a delightful moment poking other cursors.
I don't know what you people are talking about. I'm just glad I always browse with Javascript turned off. If you didn't see the writing on the wall and permanently turn Javascript off around 2006, you have no right to complain about anything.
Meanwhile, ironic irony is ironic: "Hey, idiots! Learn to use Unicode already! Usability and stuff! Oh, btw, here is some extremely annoying Javascript pollution on your screen because we are all still children, right? Har har! Pranks are so kewl!!!1!"
> The only modern language that gets it right is Swift:
I disagree.
What the "right" thing is, is use-case dependent.
For UI it's glyph-based, kinda; more precisely, some good-enough abstraction over render width, for which glyphs are not always good enough but are also the best you can get without adding a ton of complexity.
But for pretty much every other use-case you want storage byte size.
I mean in the UI you care about the length of a string because there is limited width to render a string.
But everywhere else you care about it because of (memory) resource limitations and costs in various ways. Whether that is for bandwidth cost, storage cost, number of network packets, efficient index-ability, etc. etc. In rare cases it's about being able to type it, but then it's often US-ASCII only, too.
Raku has distinct .chars, .codes and .bytes that you can use depending on the use case. And if you try to use .length it complains, asking you to use one of the other options to clarify your intent.
my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
say emoji; #Will print the character
say emoji.chars; # 1 because one character
say emoji.codes; # 5 because five code points
say emoji.encode('UTF8').bytes; # 17 because encoded utf8
say emoji.encode('UTF16').bytes; # 14 because encoded utf16
Rust was given as one of the examples and Rust's .len() behaviour is chosen based on three very reasonable concerns:
1. They want the String type to be available to embedded use-cases, where it's not reasonable to require the embedding of the quite large unicode tables needed to identify grapheme boundaries. (String is defined in the `alloc` module, which you can use in addition to `core` if your target has a heap allocator. It's just re-exported via `std`.)
2. They have a policy of not baking stuff that is defined by politics/fiat (eg. unicode codepoint assignments) into stuff that requires a compiler update to change. (Which is also why the standard library has no timezone handling.)
3. People need a convenient way to know how much memory/disk space to allocate to store a string verbatim. (Rust's `String` is just a newtype wrapper around `Vec<u8>` with restricted construction and added helper functions.)
That's why .len() counts bytes in Rust.
Just like with timezone definitions, Rust has a de facto standard place to find a grapheme-wise iterator... the unicode-segmentation crate.
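To make the three notions of "length" concrete, here's a minimal sketch assuming the unicode-segmentation crate mentioned above:

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{0301}"; // "e" + U+0301 COMBINING ACUTE ACCENT, i.e. a decomposed "é"

        println!("{}", s.len());                   // 3: UTF-8 bytes, which is what .len() counts
        println!("{}", s.chars().count());         // 2: Unicode scalar values (code points)
        println!("{}", s.graphemes(true).count()); // 1: extended grapheme cluster
    }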
Swift made an effort to handle grapheme clusters but severely over-complicated strings by exposing performance details to users. Look at the complex SO answers to what should be simple questions, like finding a substring: https://news.ycombinator.com/item?id=32325511 , many of which changed several times between Swift versions
I was working on an app in Swift that needed full emoji support once. The team ended up writing our own string lib that stores things as an array of single-character Swift strings.
You still see all the answers from old versions sitting around, often at the top. Part of it is because of how often they changed such fundamental things. String length changed 3 times. Every other language figured these things out before the initial non-beta release.
Also, realized "needed full emoji support" sounds silly. It needed to do a lot of string manipulation, with extended grapheme clusters in mind, mainly for the purpose of emojis.
Arguably, you don’t need any (default) length at all, just different views or iterators. When designing a string type today, I wouldn’t add any single distinguished length method.
Swift's string type has many different views, like UTF-8, UTF-16, Unicode scalars, etc., so if you want to count the bytes or cut at a specific byte you still can.
as in, they should be things you can just use by default without thinking about it
as Swift is deeply rooted in UI design, having a default of glyphs makes sense
and as Rust is deeply rooted in Unix server and systems programming, UTF-8 bytes make a lot of sense
though the moment your language becomes more general purpose you could argue having any default is wrong and it should have multiple more explicit methods.
There was no string.length in Swift for a while. Then they added one that just does what the user expects, get the number of grapheme clusters. If a user figures out that this isn't what they want, they can go use the other length method.
The most common reason I can think of a server wanting the string length is because it's enforcing a character limit in some field, in which case it does what the user expects. You aren't often manually managing memory in Swift. On top of that, Swift on server is probably rare to begin with.
> For example, é (a single grapheme) is encoded in Unicode as e (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute Accent). Two code points!
It's a poor and misleading example, for it is definitely not how 'é' is encoded in 99.999% of all the text written in, say, French out there (French is the language where 'é' is the most common).
'é' is U+00E9, one codepoint, definitely not two.
Now you could say: but it is also the two codepoints one. But that's precisely what makes Unicode the complete, total and utter clusterfuck that it is.
And hence even an article explaining what every programmer should know about Unicode cannot even get the most basic example right. Which is honestly quite ironic.
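For reference, a minimal Rust sketch of the two representations being argued about; it assumes the third-party unicode-normalization crate:

    use unicode_normalization::UnicodeNormalization;

    fn main() {
        let precomposed = "\u{00E9}";  // é as a single code point, U+00E9 (NFC)
        let decomposed  = "e\u{0301}"; // e + combining acute accent (NFD)

        assert_ne!(precomposed, decomposed); // the raw strings differ byte for byte
        assert_eq!(decomposed.nfc().collect::<String>(), precomposed); // but they normalize
        assert_eq!(precomposed.nfd().collect::<String>(), decomposed); // to the same text
    }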
> Unicode the complete, total and utter clusterfuck that it is.
Yikes, does it really deserve that much derision? They’re trying to standardize all written human language here. I think they’ve done a fantastic job. Pre-Unicode you had to worry about what code page a document had, and computers from different countries couldn’t interoperate. The work the consortium does is hugely important, and every decision has extremely complex tradeoffs. Composed characters makes a lot of sense, and there’s a lot of case to be made that it was the right call. The attitude of “this one thing I don’t like makes the whole thing a complete clusterfuck” is something I wish fewer engineers would have.
> Yikes, does it really deserve that much derision?
To my simple mind it had one job: allocate every grapheme a number (code point). Had it done that, the half of the article warning you about the difficulty of iterating and modifying code points would have disappeared.
But I guess it had a 2nd job: create a way of representing those numbers. The obvious way, u32, was difficult for ASCII users to swallow as it quadrupled the space used for a string. The solution we settled on, UTF-8, didn't come from Unicode (or ISO). It came from Ken Thompson (the Ken Thompson, who created B, the predecessor of C), when he tried to make something workable for C.
Unicode was the entity that ballsed up both of those tasks. It was a fork of ISO 10646. Its main contribution over 10646 was UCS-2 - i.e. 16 bits per character. That decision was so bad it had to be abandoned. Later they introduced grapheme clusters rather than allocating a separate code point for each variant. I have no idea why, as it makes the programmer's task far harder. Maybe they ran out of code points. How could they possibly run out of code points, given u32 has 4 billion of them and UTF-8 could potentially have more? Because they had to kludge their way around the UCS-2 mistake to create UTF-16, with its limit of about 1.1 million.
Which leads us to the one thing in the article I disagree with:
> The only downside of UTF-16 is that everything else is UTF-8, so it requires conversion every time a string is read from the network or from disk.
No, that's not the only downside. There is one more: UCS-2 / UTF-16 has the endianness problem. A 16 bit value needs two bytes to represent it, and you can write two bytes to storage in two ways - little endian or big endian. They didn't specify, so the same string can have two different representations on disk. They added the infamous BOM markers to distinguish between them.
I could go on, but colour me singularly unimpressed with this mob.
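A small sketch of that endianness point, using only the standard library; the same code units come out in different byte orders, which is exactly what the BOM exists to disambiguate:

    fn main() {
        let s = "A"; // U+0041
        let le: Vec<u8> = s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect();
        let be: Vec<u8> = s.encode_utf16().flat_map(|u| u.to_be_bytes()).collect();
        println!("{:02X?}", le); // [41, 00]  (UTF-16LE)
        println!("{:02X?}", be); // [00, 41]  (UTF-16BE)
    }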
It never was. The earlier draft of ISO/IEC 10646 bears absolutely no resemblance to the current 10646 and Unicode (for example, the first character in ISO/IEC DIS 10646:1990 was 0x20202020, which I believe is mapped to a space, U+0020). Unicode had a much better design compared to 10646, so the final 10646 was retrofitted to Unicode instead.
> It's main contribution over 10646 was UCS-2 - ie 16 bits per character. That decision was so bad it had to be abandoned.
UCS-2 was already in 10646 in the draft stage. It had an even worse mechanism than surrogate pairs: escape sequences from ISO/IEC 2022 to switch groups and planes (upper 16 bits of code point). The standardized UCS-2 doesn't have them because of the merger of then-16-bit Unicode.
> Later they introduced the grapheme clusters rather than allocating a separate code point for each variant.
That sounds as if Unicode initially allocated separate code points for each variant. They didn't, or rather couldn't. An easy example is a Latin character with combining marks. There are a lot of combining marks in existence, some even defined before Unicode (yes, they're not a Unicode invention!), and sometimes a single character can have multiple marks. So Unicode only gave separate code points for compatibility, and otherwise resorted to the normalization mechanism that understands how to handle such cases.
The concept of grapheme cluster naturally arises from the existence of normalization. Not in the strictest sense, but it can be thought of as a closed set over normalization and concatenation, so that it roughly matches user-perceived characters. So grapheme clusters were already there; only the precise algorithm was specified later.
> To my simple mind it had one job: allocate every grapheme a number (code point).
You can easily have more than 10M code points in this way. The current set of Hangul syllables, precomposed or not, is 125 * 95 * 138 = 1,638,750 characters. Latin characters with at most 3 combining marks (known to exist in the wild) would be probably in the same order of magnitude. Maybe now you can try, thanks to the computing power and all the information, but in 1990? Fat chance.
> Maybe they ran out of code points. How could they possibly run out of code points, given U32 has 4 billion of them and UTF-8 could potentially have more?
For the last 20 years the rate was about 2,700 new code points per year. It would take more than 200 years to fill all the other unassigned planes at this rate. And most "new" code points (in quantity) are for rare or ancient Han characters, which are technically unbounded but strongly bounded by existing ancient works and the scholarly work needed to uncover them. I doubt there remain more than 100,000 potentially encodable Han characters.
> You can easily have more than 10M code points in this way. The current set of Hangul syllables, precomposed or not, is 125 * 95 * 138 = 1,638,750 characters. Latin characters with at most 3 combining marks (known to exist in the wild) would be probably in the same order of magnitude. Maybe now you can try, thanks to the computing power and all the information, but in 1990? Fat chance.
It can be made to work both ways. The current situation pushes the handling of composition onto the application programmer. Every time he wants to index into an array of characters, maybe to handle backspace or the user pressing arrow keys, he's forced to handle composition. But your average programmer tasked with gathering some information from the web or creating an accounting package doesn't care about this stuff, so he's going to stuff it up 10 times out of 10. That's why the sorts of problems illustrated in the original article are legion today.
The alternative is 10M code points as you say. But it doesn't have to be 10M real code points. Somewhere down the software stack, a piece of software could say "oh, this code point represents a composed grapheme, I'll break it down into its parts". In fact "break it down into its parts" might mean turning it into the exact representation we have now.
The difference between the two alternatives is who has to do the work. With the composition approach, the font rendering library has it slightly easier but the application writer has to do more work. In the 10M code points approach, the application writer's job has been made easier, at the expense of the font rendering library coder, whose job has become harder.
It seems pretty obvious to me which of those two approaches wins. There are literally orders of magnitude more end user applications than there are font rendering libraries out there, and what's more, the font rendering library programmers are far more likely to be very concerned about doing grapheme clustering right. So if you took the 10M code point approach, you would have saved the planet a lot of code, and got a better result to boot.
As for the rest - your correction that 10646 was the source of many of these problems is appreciated. But that doesn't alter the fact that from a programmer's perspective the spec is far harder to implement at the business end than it should be. The problems started with UCS-2 and compounded from there. And as a consequence, we have a large number of font rendering bugs we could have escaped had the spec been done differently.
I appreciate your reply, which I never expected in this situation.
If my understanding is correct, your thesis is that Unicode should be hidden from application programmers as much as possible, much like GC hides memory management, so to say. Not that Unicode is bad or even shouldn't exist at all, but that it has to be abstracted away. This is much more reasonable than most (quote-unquote) Unicode criticisms indeed. I'm not sure whether this is possible in the near future however, for reasons I'll work out here.
----
From the perspective of API consumers, most if not all programming languages have a suboptimal design for human text. In fact the type name "string" itself is inappropriate; its name comes from the assumption that a human text is a string of symbols, which is not incorrect but not helpful either. A proper "human text" type (or a collection of them) should ideally be able to do the following:
- An ability to contain additional linguistic information like locales, grammatical genders or numbers if possible. Some can be guessed, some can be retrieved from external contexts (e.g. HTTP `Accept-Language`), some have to be retrieved with consent. Any text operation should retain this information if the corresponding text is also retained.
- A language-aware formatter (a rough sketch of the idea follows this list). For example `"Total: ${n} files"` should automatically change "files" to "file" when `n` is 1. Moreover, `"Total: ${n} ${objectName}"` should do the same if `objectName` is an English text "files". (This is why every text should retain linguistic information!) Of course the format text should be translatable (say, to "파일 총 ${n}개" in Korean) and that shouldn't change the original code.
- Proper textual isolation. If my text is composed of multiple scripts or languages, they should not affect each other in any way, and should be displayed in the best way possible. For example a missing font should not give broken boxes; either the font should be downloaded on demand, or a note about the missing font should be shown instead. Inserting RTL text into LTR text should not flip either of them (unless it is required by the surrounding languages). Basically, no surprises even if you don't know about them.
- Situation-aware alternatives. Even after formatting, a long text that doesn't fit into the UI should be shortened in a way that preserves as much information as possible. For example the text "Nice to meet you, ImagineAVeryLongUserNameHere!" will be cut into "Nice to meet you, ImagineAVery..." today, but one should be able to turn this into "Hello, ImagineAVeryLongUserNam..." from the formatting layer.
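Here is the rough sketch promised above for the formatter idea. It is purely hypothetical: the Locale enum and files_label helper are made up for illustration, and a real implementation would defer to CLDR plural rules (e.g. via ICU) rather than hard-coding them:

    enum Locale { En, Ko }

    // Chooses the wording per locale instead of naively appending "s".
    fn files_label(locale: &Locale, n: u64) -> String {
        match locale {
            // English distinguishes "one" from "other".
            Locale::En => format!("Total: {} file{}", n, if n == 1 { "" } else { "s" }),
            // Korean does not inflect the noun for number.
            Locale::Ko => format!("파일 총 {}개", n),
        }
    }

    fn main() {
        println!("{}", files_label(&Locale::En, 1)); // Total: 1 file
        println!("{}", files_label(&Locale::Ko, 3)); // 파일 총 3개
    }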
None of these operations actually concern Unicode, but they are incredibly hard---if not impossible---to build. Most of them are at best fuzzily defined or often undefinable. So we are left with a number of localization and internationalization libraries which are ignored by most developers, to say the least. A mere "string" type is the norm, a program has no idea about the text and proceeds with faulty assumptions, and users are so accustomed to bad text handling that they don't even expect much. If enough users complain there is a chance of improvements, but even that is done on a case-by-case basis.
---
Unicode sits at the level much lower than what I've imagined before. It is not even a component to build the human text. It is a component to build a string, that can be somehow used to build the human text if one is very careful. Most human text operations can't be done with Unicode alone.
For example, people argue that the number of "characters" in a single Emoji sequence (say, one mentioned in https://hsivonen.fi/string-length/) should be 1 and that other answers don't make sense. This is meaningless because it will appear as a number of broken boxes if emoji fonts are not installed anyway (but not five, because it contains two default-ignorable code points). It matters what the number of "characters" is used for, and that's the whole point of the linked article. And the definition of user-perceived characters does vary over locales, so you can't count them without linguistic information anyway.
You may still argue that Unicode algorithms are designed for the human text encoded in strings. That's a very nuanced argument, because one can also argue that they are the best effort approximation of human text operations for strings, in which case they are not the human text operations themselves. For example many languages have a case conversion operation over strings, with a varying degree of Unicode conformance (ASCII-only, simple fold, full fold, locale-dependent fold, title case, ...). But the case conversion itself is not the human text operation! Even assuming bicameral scripts, some texts are never capitalized (e.g. "McDonald" frequently capitalizes to "McDONALD", not "MCDONALD"). The human text operation, here full capitalization, needs much more than the Unicode case conversion algorithm.
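A tiny illustration of that gap, using only Rust's standard library (whose to_uppercase applies the Unicode case mapping):

    fn main() {
        let name = "McDonald";

        // The Unicode case-conversion operation that standard libraries expose:
        assert_eq!(name.to_uppercase(), "MCDONALD");

        // The typographic convention mentioned above ("McDONALD") needs extra
        // knowledge that plain case mapping does not have, e.g. special-casing
        // prefixes like "Mc"; that is the "human text" operation.
    }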
Given this, it is a misguided effort to make strings more aligned with the human text, because it is not possible at all. As people frequently mistake strings as human texts however, the second best thing is to get rid of any string operation. Swift almost did this but retained a default grapheme view---I think it is actually worse given the instability of (extended) grapheme clusters over time, but also understand why they had to do that. [1] The third best thing is probably to stress that a string is not a human text, in the same way that a floating point number is not a real number. And the original post, in spite of some errors, did a good enough job in this regard.
[1] There is also a precedent of Raku's NFG (which dynamically allocates a negative code point for new grapheme clusters seen), but this is more or less an optimization of the graphemes view. The current Unicode has an infinite number of distinct grapheme clusters by design.
> You may still argue that Unicode algorithms are designed for the human text encoded in strings. That's a very nuanced argument, because one can also argue that they are the best effort approximation of human text operations for strings, in which case they are not the human text operations themselves. For example many languages have a case conversion operation over strings,
That is ... very nuanced indeed. Case folding does look to be a difficult issue in its own right, but it's not something I do a lot, and besides, even 20 years ago it was deferred to a library function (str.lower() or whatever the language provides).
The issue the original article correctly says every app trips over is neither nuanced, nor uncommon. It's boring stuff like handling backspace or left arrow cursor movement - stuff programmers have to do all the time, and that is difficult to put in a library. As a consequence they get it wrong over and over again. Unicode representing a grapheme with a single code point would fix that.
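For the backspace case specifically, here's a minimal sketch of grapheme-aware deletion, assuming the unicode-segmentation crate:

    use unicode_segmentation::UnicodeSegmentation;

    // Delete the last user-perceived character, not the last code point.
    fn backspace(s: &str) -> &str {
        match s.grapheme_indices(true).last() {
            Some((idx, _)) => &s[..idx],
            None => s,
        }
    }

    fn main() {
        let s = "ae\u{0301}"; // "a" followed by a decomposed "é"
        assert_eq!(backspace(s), "a"); // removes the whole grapheme cluster
        // Popping a single code point instead would leave a bare "ae".
    }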
As you say I'm not sure there is a simple change you could make to Unicode that renders case folding or the other problems you describe easy. To me that's a strong hint it's not the right place to address those problems.
> Unicode the complete, total and utter clusterfuck that it is.
I'm inclined to agree - they took a big problem and made it into a big & complex problem.
I propose a fix - immediately deprecate all multi-codepoint graphemes and provide single-codepoint alternatives. That we should need to normalise something as basic as our text leaves soooo much room for problems. And the idea that some graphemes magically encode colour (emojis) ... :rolling_eyes_emoji:
Unicode being an utter cluster fuck is my takeaway from this as well. Are there any alternatives to Unicode? Maybe something which has a single code point per grapheme and uses UTF-32 for the encoding. Or anything else saner than what we have today?
Next time read the whole article before accusing the author of incompetence!
However, the author could have added a small note, e.g. "(Unicode normalization will be covered in a later section.)", to prevent knowledgeable readers from rage quitting :)
I tried to read the article since it seemed interesting. After exactly 30 seconds of trying I had to leave the page. Impossible to read more than two sentences with all those pointers moving around - and for someone with ADHD it's even more difficult. Sorry, but I couldn't make it :(
Not everyone runs Linux, and not every browser has a reader mode. This should not be the solution. There should definitely be an option to disable all these features, especially the dark mode toggle; that one's a fun premise, but horrific for usability.
A real question is why IBM, Apple, and Microsoft poured millions into developing the unicode standard instead of treating character encoding like file formats as a venue for competition.
IBM and Apple in the early 1990s combined in Taligent to try to beat MS NT, but failed. But a lot of internationalization came out of that and was made open, at the perfect time for Java to adopt it.
Interestingly it wasn't just CJK but Thai language variants that drove much of the flexibility in early unicode, largely because some early developers took a fancy to it.
When you look at the actual variety in written languages, Unicode grapheme/code-point/byte seems rather elegant.
We're in the early days of term vectors, small floats, and differentiable numerics (not to mention big integers). Are lessons from the history of unicode relevant?
Another Unicode article that mentions Swift, but not Raku :(
Raku's Str type has a `.chars` method that counts graphemes. It has a separate `.codes` method to count codepoints. It also can do O(1) string indexing at the grapheme level.
That Zalgo "word" example is counted as 4 chars, and the different comparisons of "Å" are all True in Raku.
You can argue about the merits of its approach (indeed several commenters here disagree that graphemes are the "one true way" to count characters), but it feels lacking to not at least _mention_ Raku when talking about how different programming languages handle Unicode.
> The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.
It's best to avoid making overly-general claims like this. There are plenty of situations that warrant operating on code points, and it's likely that software trying and failing to make sense of grapheme clusters will result in a worse screwup. Codepoints are probably the best default. For example, it probably makes the most sense for programming languages to define strings as arrays of code points, and not characters or 16-bit chunks of an encoding, or whatever.
> There are plenty of situations that warrant operating on code points
Absolutely correct. All algorithms defined by the Unicode Standard and its technical reports operate on the code point. All 90+ character properties defined by the standard are queried for with the code point. The article omits this information and ironically links to the grapheme cluster break rules which operate on code points.
The article doesn't say not to use code points, it says you should not be iterating on them.
Very rarely will you be implementing those algorithms. And if you're looking at character properties, the article says you should be looking at multiple together, which is correct.
> And if you're looking at character properties, the article says you should be looking at multiple together, which is correct.
I don't see where the article mentions Unicode character properties [1]. These properties are assigned to individual characters, not groups of characters or grapheme clusters.
> Very rarely will you be implementing those algorithms.
True, but character properties are frequently used, i.e. every time you parse text and call a character classification function like "isDigit" or "isControl" provided by your standard library you are in fact querying a Unicode character property.
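A small Rust illustration of that: these standard-library classification methods are per-code-point Unicode property lookups.

    fn main() {
        assert!('٣'.is_numeric());            // U+0663 ARABIC-INDIC DIGIT THREE (category Nd)
        assert!('\u{0007}'.is_control());     // BEL (category Cc)
        assert!(!'\u{0301}'.is_alphabetic()); // COMBINING ACUTE ACCENT (category Mn)
    }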
> These properties are assigned to individual characters, not groups of characters or grapheme clusters.
But you need to deal with the whole cluster. You can't just look at the properties on a single combining character and know what to do with it.
If the article's saying to iterate one cluster at a time, then if you're doing properties a direct consequence is that you should be looking at the properties of specific code points per cluster or all of them.
The Unicode Standard does not specify how character properties should be extracted from a grapheme cluster. Programming languages that define "character" to mean grapheme cluster (like Swift) need to establish their own ad-hoc rules.
As others have pointed out in this thread, the article is full of the author's own personal opinions.
The author suggests iterating text as grapheme clusters, but fails to consider that this breaks tokenizers, e.g. a tokenizer for a comma-separated list [1] won't see the comma as "just a comma" if the value after it begins with a combining character.
If some tokenizer of a comma-separated list treats the comma (I'm assuming any 0x2C byte) as "just a comma" even if the value after it begins with a combining character, that's a broken, buggy tokenizer, and one that can potentially be exploited by providing some specifically crafted unicode data in a single field that then causes the tokenizer to misinterpret field boundaries. If you combine a character with something, that's not the same character anymore - it's not equal to that, it's not that separator anymore, and you can't tell that unless/until you look at the following codepoints. If the combined character isn't valid, then either the message should be discarded as invalid or the character replaced with U+FFFD, the Replacement Character, but it should definitely not be interpreted as "just a comma" simply because one part of some broken character matches the ASCII code for a comma.
If anything, your example is an illustration of why it's dangerous to iterate over codepoints and not graphemes. Unless you're explicitly transforming encodings to/from Unicode, anything that processes text (not the encoding, but actual text content, like tokenizers do) should work with graphemes as the basic atomic indivisible unit.
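A small sketch of that comma example, assuming the unicode-segmentation crate: at the grapheme level there is no standalone comma to tokenize on, while a code-unit-level split would still cut there:

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "a,\u{0301}b"; // "a", then "," + COMBINING ACUTE ACCENT, then "b"

        let graphemes: Vec<&str> = s.graphemes(true).collect();
        assert_eq!(graphemes, ["a", ",\u{0301}", "b"]); // the comma is inside a cluster

        assert_eq!(s.split(',').count(), 2); // a code-unit-level splitter still cuts here
    }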
> The Unicode Standard does not specify how character properties should be extracted from a grapheme cluster. Programming languages that define "character" to mean grapheme cluster (like Swift) need to establish their own ad-hoc rules.
Right. Which means not just iterating by code point.
> The author suggests iterating text as grapheme clusters, but fails to consider that this breaks tokenizers, e.g. a tokenizer for a comma-separated list [1] won't see the comma as "just a comma" if the value after it begins with a combining character.
I don't think they're talking about tokenizers. It's a general purpose rule.
Also I would argue that a CSV file with non-attached combining characters doesn't qualify as "text".
Sometimes editing wants to go inside clusters but that's not code-point based either.
I'd say that in a big majority of situations, code that is indexing an array with code points is either treating the indexes as opaque pointers or is doing something wrong.
This is pretty good. One thing I would add is to mention that Unicode defines algorithms for bidirectional text, collation (sorting order), line breaking and other text segmentation (words and sentences, besides grapheme clusters). The main point here is to know that there are specifications one should take into account when topics like that come up, instead of just inventing your own algorithm.
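For instance, UAX #29 word segmentation is available off the shelf rather than something to reinvent; a minimal sketch assuming the unicode-segmentation crate:

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "can't stop won't stop";
        let words: Vec<&str> = s.unicode_words().collect();
        assert_eq!(words, ["can't", "stop", "won't", "stop"]); // apostrophes stay inside words
    }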
> Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.
This is doubly wrong.
First, it conflates languages and writing systems. Malay and English use the same writing system but are different languages. American Sign Language is a language, but it has no standard or widely-adopted writing system. Hakka is a language, but Hakka speakers normally write in Modern Standard Mandarin, a different language.
Second, it's not the case that Unicode aims to encode all writing systems. For example, there are many hobbyist neographies (constructed writing systems) which will not be included in Unicode.
> Second, it's not that case that Unicode aims to encode all writing systems. For example, there are many hobbyist neographies (constructed writing systems) which will not be included in Unicode.
Doesn't the "private use space" technically satisfy this?
Unicode is a total mess. In a sane system, "extended grapheme clusters" would equal "codepoints" and it wouldn't make a difference for 99% of languages. Now we ended up with grapheme clusters, normalization, decomposition, composition, Zalgo text, etc. But instead of deprecating this nonsense, Unicode doubled down with composed Emojis.
The writing systems were already like this when we got them. Unicode's "total mess" mostly just reflects that. Of course it would be convenient for you, the programmer, if the users wanted the software to do whatever was easiest for you, but obviously they want what's easiest for them, not you.
because the current mess means all their old stuff still works. ASCII is good so long as you only need English (or any other Latin-script language without the various accents), which was good enough for a long time - and ASCII was also carefully designed to make programming easier - flip one bit to change lower/uppercase for example, but there are more things it makes easy. By the time we realized we actually care about the rest of the world it was too late to make a nice system.
The real world never respected your artificial ASCII limitations, so that part never worked because people always needed more. But the original comment states that composition is a source of mess, not the ASCII having the same code point, and that's the puzzling part
you can't not handle Devanagari, Tamil (or like half the scripts across the Indian subcontinent and Oceania) or Hangul. Even the IPA, used by linguists every day, would be particularly bad to deal with if we couldn't write things like /á̤/, and some languages already don't have precomposed diacritics for all letters (like ǿ), so the idea of a world with only precomposed letter forms is more of an exponential explosion in the character set
> so the idea of a world with only precomposed letter forms is more of a exponential explosion in the character set
"Exponential explosion" is really putting it too strong; it's perfectly possible to just add ǿ and á̤ and a bunch of other things. The combinations aren't infinite here.
The problem with e.g. Latin script isn't necessarily that combining characters exist, but that there's two ways to represent many things. That really is just a "mess": use either one system or the other, but not both. Hangul has similar problems.
Devanagari doesn't have any precomposed characters AFAIK, so that's fine.
That's really the "mess": it's a hodgepodge of different systems, and you can't even know which system to use a lot of the time because it's not organised ("look it up in a large database"), and even taking in to account historical legacy I don't think it really needed to be like this (or is even an unfixable problem today, strictly speaking).
At least they deprecated ligatures like st and fl, although recently I did see ij being used in the wild.
They certainly are. Languages are a creative space driven by the human imagination. Give people enough time and they'll build new combinations for fun or for profit or for research or for trying to capture a spoken word/tone poem in just the right sort of exciting way. You may frown on "Zalgo text" [1] (and it is terrible for accessibility), but it speaks to a creative mood or three.
The growing combinatorial explosion in Unicode's emoji space isn't an accident or something unique to emoji, but a characteristic that emoji are just as much a creative language as everything else Unicode encodes. The biggest difference is that it is a living language with a lot of visible creative work happening in contemporary writing as opposed to a language some monks centuries ago decided was "good enough" and school teachers long ago locked some of the creative tools in the figurative closets to keep their curriculum simpler and their days with fewer headaches.
Well, in theory it's infinite, but in reality it's not of course.
We've got about 150K codepoints assigned, leaving us with roughly 950K unassigned. There's a truly massive amount of headroom.
To be honest I think this argument is rather too abstract to be of any real use: if it's a theoretical problem that will never occur in reality then all I can say is: <shrug-emoji>.
But like I said: I'm not "against" combining marks, purely in principle it's probably better, I'm mostly against two systems co-existing. In reality it's too late to change the world to decomposed (for Latin, Cyrillic, some others) because most text is already precomposed, so we should go all-in on precomposed for those. With our 950K unassigned codepoints we've got space for literally thousands of years to come.
Also this is a problem that's inherent in computers: on paper you can write anything, but computers necessarily restrict that creativity. If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used, never mind entirely new characters or marks. Unicode won't add it until it sees usage, so this gives us a bit of a catch-22 with the only option being mucking about with special fonts that use private-use (hoping it won't conflict with something else).
The Unicode committees have addressed this for scripts such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred and that the decomposition canonical forms are generally the safest for interoperability and operations such as collation (sorting) and case folding (case-insensitive matching).
Unicode can't get rid of the many precombined characters for a huge number of backward compatibility reasons (including compatibility with ancient Mainframe encodings such as EBCDIC which existed before computer fonts had ligature support), but they've certainly done what they can to suggest the "normal" forms in this decade should "prefer" the decomposed combinations.
> If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used
This is where emoji as a living language actually shines a living example: It's certainly possible to encode your mark today as a ZWJ sequence, say «e ZWJ %», though you might want to consider for further disambiguation/intent-marking adding a non-emoji variation selector such as Variation Selector 1 (U+FE00) to mark it as "Basic Latin"-like or "Mathematical Symbol"-like. You can probably get away with prototyping that in a font stack of your choosing using simple ligature tools (no need for private-use encodings). A ZWJ sequence like that in theory doesn't even "need" to ever be standardized in Unicode if you are okay with the visual fallback to something like "e%" in fonts following Unicode standard fallback (and maybe a lot of applications confused by the non-recommended grapheme cluster). That said, because of emoji the process for filing new proposals for "Recommended ZWJ Sequences" is among the simplest Unicode proposals you can make. It's not entirely as Catch-22 on "needs to have seen enough usage in written documents" as some of the other encoding proposals.
Of course, all of that is theory and practice is always weirder and harder than theory. Unicode encoding truly living languages like emoji is a blessing and it does enable language "creativity" that was missing for a couple of decades in Unicode processes and thinking.
> The Unicode committees have addressed this for languages such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred
Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) uses pre-composed. Also AFAIK just about everyone just ignores that recommendation.
I suppose "e ZWJ %" is a bit better than Private Use as it will appear as "e%" if you don't have font support, but the fundamental problem of "won't work unless you spend effort" remains. For a specific niche (math, language study, something else) that's okay, but for "casual" usage: not so much. "Ship font with the document" like PDF and webfonts do is an option, but also has downsides and won't work in a lot of contexts, and still requires extra effort from the author.
I'm not saying it's completely impossible, but it's certainly harder than it used to be, arguably much harder. I could coin a new word right here and now (although my imagination is failing to provide a humorous example at this moment) and if people like it, it will see usage. In the 1960s version of HN we would have exchanged these things over written letters, and it would have been trivial to propose an "e with % on top" too, but now we need to resort to clunky phrases like this (even with typewriters you could manually amend things, if you really wanted to).
Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today. Granted, it doesn't see that much use, but I do encounter it in the wild on occasion and some people like it (I personally don't actually, but I don't want to prevent other people from using it).
None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.
> Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) uses pre-composed.
It shouldn't matter what's in the wild in documents. That's why we have normalization algorithms and normalization forms. Unicode was built for the ugly reality of backwards compatibility and that you can't control how people in the past wrote. These precomposed characters largely predate Unicode and were a problem before Unicode. Unicode won in part because it met other encodings where they were rather than where they wished they would be. It made sure that mappings from older encodings could be (mostly) one-to-one with respect to code points in the original. It didn't quite achieve that in some cases, but it did for, say, all of EBCDIC.
Unicode was never in the position to fix the past, they had to live with that.
> This is a classic "reality should adjust to the standard" type of thinking.
Not really. The Unicode standard suggests the normal/canonical forms and very well documented algorithms (including directly in source code in the Unicode committee-maintained/approved ICU libraries) to take everything seen in the wilds of reality and convert them to a normal form. It's not asking reality to adjust to the standard, it is asking developers to adjust to the algorithms for cleanly dealing with the ugly reality.
> Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today.
Posted to HN several times has been the well documented proposal process from start to finish (it succeeded) of getting common and somewhat less common power symbols encoded in Unicode. It's a committee process. It certainly takes committee time. But it isn't "impossible" to navigate and is certainly higher than "little chance" if you've got the gumption to document what you want to see encoded and push the proposal through the committee process.
Certainly the Unicode committee picked up a reputation for being hard to work with in the early oughts when the consortium was still fighting the internal battles over UCS-2 being "good enough" and had concerns about opening the "Astral Plane". Now that the astral plane is open and UTF-16 exists, the committee's attitude is considered to be much better, even if its reputation hasn't yet shifted from those bad old days.
> None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.
Computers do anything we program them to do and in general people find a way regardless of the restrictions and creative limitations that get programmed. I've seen MS Paint drawn symbols embedded in Word documents because the author couldn't find the symbol they needed or it didn't quite exist. It's hard to use such creative problem solving in HN's text boxes, but that from some viewpoints is just as much a creative deficiency in HN's design. It's not an "inherent" problem to computers. When it is a problem they pay us software developers to fix it. (If we need to fix it by writing a proposal to a standards committee such as the Unicode Consortium, that is in our power and one of our rights as developers. Standards don't just bind in one-direction, they also form an agreement of cooperation in the other.)
The thing with normalization is that it's not free, and especially for embedded use cases people seem quite opposed to this. IIRC it requires about ~100K of binary size, ~20K of memory, and some non-zero number of CPU cycles. This is negligible for your desktop computer, but for embedded use cases this matters (or so I've been told).
This comes up in specifications that have a broad range of use cases; when I was involved in this my idea was to just spec things so that there's only one allowed form; you'll still need a small-ish table for this, but that's fine. But that's currently hard because for a few newer Latin-adjacent alphabets some letters cannot be represented without a combining character.
So then you have either the "accept that two things which seem visually similar are not identical" (meh) or "exclude embedded use cases" (meh).
I never really found a good way to unify these use cases. I've seen this come up a few times in various contexts over the years.
> Posted to HN several times has been the well documented proposal process from start to finish (it succeeded) of getting common and somewhat less common power symbols encoded in Unicode.
Would this work for an entirely new symbol I invent today? It's not really the Unicode people that are "difficult" here as such, they just ask for demonstrated usage, which is entirely reasonable, and that's hard to get (or: harder than it was before computers), especially for casual usage. I'm sure that if some country adopts/invents a new script today, as seems to be happening in West Africa in recent years, the Unicode people are more than amenable to working with that, but "I just like ‽" is a rather different type of thing.
> Would this work for an entirely new symbol I invent today? It's not really the Unicode people that are "difficult" here as such, they just ask for demonstrated usage, which is entirely reasonable, and that's hard to get (or: harder than it was before computers) especially for casual usage.
Sure, they want demonstrated usage inline in the flow of text, as textual elements, as opposed to purely iconography or design elements (because such things are outside Unicode's remit, modulo some old Wingdings encoded for compatibility reasons and the fine line that emoji walk between expressive text and useful iconography). But at this point (again in contrast to the UCS-2/no-Astral-Plane days) the committees don't seem to care how a symbol is mocked up (do it on a chalkboard, do it in Paint, do it in LaTeX drawing commands, whatever gets the point across) or how "casual" or infrequent the usage is, so long as you can state the case for "this is a text element" (not an icon!) used in living creative language expression. There are more "provenance" requirements for dead languages, where they'll want some number of academic citations, but for living languages they've become flexible (no hard requirements) about the number of examples they need from the wild and where those are sourced from. Showing it in old classic documents/manuals/books, for instance, helps the case greatly, but the committees today no longer seem limited to just those kinds of evidence. "I just like it" is obviously not a rock-solid proposal/defense to bring to a committee (any committee, really), but that doesn't mean it's impossible for the committee to be swayed by someone making a strong enough "I just like it" case, if they demonstrate well enough why they like it, how they use it, and how they think other people will use it (and how those uses aren't just iconography/decorative elements but useful in the inline context of textual language).
The problem is not that you need character composition for some writing systems. It's that there are no rules that would help with everything having an unique representation internally.
Even "put the code points forming the composed character in descending numerical order" would be better than nothing. If it was there from the start.
However, the Unicode committee is too busy adding new emojis to make their standard sane.
There are rules for that: Unicode has standards (not only formal, but easily usable in most software libraries) for canonical forms that will collapse all the variations to a single representation.
But, of course, Unicode can't define that the standard covers only the canonical forms, and couldn't have done so from the start, because it needed backwards compatibility with various pre-Unicode encodings that had mutually incompatible principles, so it needed support for both composed and decomposed versions of the same characters.
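For anyone who hasn't seen it in practice, here's a minimal sketch of what "collapse to a canonical form" looks like, using Python's standard unicodedata module (the variable names are just illustrative):
import unicodedata
composed = "\u00e9"        # é as one precomposed code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                       # False: different code point sequences
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True: same canonical form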
Not just emojis, in general I believe Unicode has just said they're not going to add new pre-composed characters and that using combining characters is the Right Way™ to do things (well, the only way for newer scripts).
One of the downsides of writing down specifications is that they tend to attract people with Very Strong Opinions on the One And Only Right Way and will argue it to no end, and essentially "win" the argument just by sheer verbosity and persistence.
That's certainly what I've seen happen in a few cases, and is what happens on e.g. Wikipedia as well at times.
But yeah, emojis are even worse. Some things can look rather different depending on which invisible variation selector is present [1]. We've got tons and tons of unassigned codepoints and we need to resort to these tricks to save a few of them?
Firefighter is "(man|woman|person) + ZWJ + firetruck". Clever, I guess. Construction worker is "Construction worker (+ ZWJ + (male sign|female sign))?" (absence is gender-neutral). Why are there 2 systems to encode this? Sigh...
All of this is too clever by a mile.
[1]: HN will strip stuff, but try something like:
echo $'↔\ufe0f ↔\ufe0e'
May not display correctly in terminal, but can xclip it to a browser – screenshot: https://imgur.com/a/iFmBDQk
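To make the composition point concrete, here's a rough Python sketch that prints the code point sequences behind the two schemes; the specific sequences follow Unicode's published emoji ZWJ data, but check emoji-zwj-sequences.txt if in doubt:
woman_firefighter = "\U0001F469\u200D\U0001F692"     # WOMAN + ZWJ + FIRE ENGINE
woman_construction = "\U0001F477\u200D\u2640\uFE0F"  # CONSTRUCTION WORKER + ZWJ + FEMALE SIGN + VS16
for s in (woman_firefighter, woman_construction):
    print(" ".join(f"U+{ord(c):04X}" for c in s))
# U+1F469 U+200D U+1F692
# U+1F477 U+200D U+2640 U+FE0F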
This is NOT a complaint about the fact that they added diversity as such; in principle I'm all for that. It's just that few seem to actually use these emojis, and both in terms of code and UI it all gets pretty complex; there are 98 combinations to choose from here.
I don't really get why <heart> or <kissing lips> or <kissing face> isn't enough. That's actually what most people seem to use anyway, because who finds it convenient to pick all the correct genders and skin tones from the UI for both people?
Less than that, since a default skin color can be set in most apps. I'm sure setting a gender will come soon, so the entire first part of that emoji can be auto-guessed. Then it's just showing the other options in the UI. Really, all of this is UI design; even with the 98 combinations you can still display it as 4 or 5 options you drill down through.
> who finds it convenient to pick all the correct genders and skin tones from the UI for both people?
I just checked, and searching "kissing" in my iOS emoji keyboard inside Messenger showed just four of the emojis you're describing - defaulting both skin tones to my settings and then the four M/F pairings. Plus some unrelated kissing emojis like the cat kissing.
But that's kind of wrong, no? The entire point is that you can choose both sides individually. What if you set it to black and want to kiss some white bloke?
If anything that only underscores my point that it's too complex and that no one is using them (certainly not as intended anyway).
That seems like a lot of effort when you could have sent <kissing-lips>, <kissing-cat>, <kissing-face>, <heart>, or any number of other emojis, which is what my point was.
I feel it's the same as with any long-standing computer system we have today: it was designed as more and more of the world came online, with all the growing pains that entailed. Could it be built from scratch today better? Yes. Will it? No. I suspect it will be around long after we are all dead. Same with IPv4 :V
For most software it doesn't really matter either.
I've written unicode-aware software for over a decade, doing a wide variety of programs, and I've never had to bother with all that mess.
If I'm parsing strings I'm looking for stuff in the 7-bit ASCII range which maps neatly onto the Unicode representations, and so I just need to take care to preserve the rest.
The only trouble I've had is that a lot of programmers haven't learned, or don't get, that text encoding is a thing and that it needs to be handled.
So they'll hand me an XML they claim is UTF-8 encoded, except that XML header was just copypasta and the actual XML document is encoded in some other system encoding like Windows-1252. Or worse, a mix of both.
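For what it's worth, the defensive decoding this forces on you usually ends up looking something like this Python sketch (a crude heuristic, not a real charset detector; the function name is mine):
def decode_lenient(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")            # trust the declared encoding first
    except UnicodeDecodeError:
        return raw.decode("windows-1252", errors="replace")  # then assume legacy bytes
print(decode_lenient("Noël".encode("utf-8")))         # Noël
print(decode_lenient("Noël".encode("windows-1252")))  # Noël, via the fallback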
Honestly I like ipv4 better than v6. I like having a NAT and easy addresses like 192.168.1.3 instead of fe80::210:5aff:feaa:20a2. They didn't need to mess with those things just to expand the address space, like how utf8 didn't require remapping ASCII.
IPv4.1 should have just had 40 bits, to be written like 999.999.999.999. (I know this wouldn't have actually had much effect; nobody is going to add new routes in the middle of "class A" spaces that already existed, so it would just give those that already had IP addresses more IP addresses. Additionally, people really abuse decimal addresses in horrifying ways; for example, Fios steals 192.168.1.100-192.168.1.150 for its TV service, and that range doesn't really correspond to anything that you can mask off in binary. It only makes sense in decimal, which is not what any underlying machinery uses. They should have given themselves a /26 or something. You get 3 of the four /26 blocks for yourself (modulo the broadcast and gateway address), and they get 1 for TV.)
Having it actually be decimal might've been nice, but at this point people are used to the 1-254 range, and I think the least jarring addition of extra bits would be to simply extend it for the addresses that need them (and not for the ones that don't). So you could have 123.444.3.254 or longer like 123.444.3.254.12.43.
With something as large as an end-user language format for input, this is a change we ourselves cannot make, just as with using another calendar for dates. Just because I want to use the year-2002023 calendar with 29.5 days per month doesn't make it useful to others, or to myself really.
The problems with Unicode are mostly to do with internal inconsistencies and churn, problems that usually only affect programmers.
1. Different ways to encode the same visually indistinguishable set of characters as code points leading to normal forms, text that compares unequal even when it appears to be identical, the disastrous "grapheme clusters" concept and so on.
2. Many different ways to encode the same sequence of code points as bytes. Not only UTF-32/16/8 but also curiosities like "modified UTF-8".
3. Emoji. A fractal of disasters:
3.a. Updates frequently. Neither Unicode nor software in general was built on the assumption that something as basic as the alphabet changes every year. If you send someone an emoji, can their device draw it? Who knows! In practice this means messaging apps can't rely on the OS system fonts or text handling libraries anymore which is a drastic regression in basic functionality.
3.b. (Ab)uses composition so much it's practically a small programming language, e.g. flags are composed of the two-letter country code spelled using special characters, people are represented as a generic person plus a skin color patch, families are represented using composed individual people, etc.
3.c. Meaning of a character is theoretically specified but can subtly depend on the font used, e.g. people use a fruit emoji in visual puns because of how it looks specifically on Apple devices, so a "sentence" can make no sense if it's rendered with a different font.
3.d. Unbounded in scope. There's no reason the Unicode committee won't just keep adding new pictograms forever.
3.e. Encoded beyond the BMP which in theory every correct program should handle but in practice some don't because nobody except a few academics used characters beyond it much until emoji came along.
3.f. Disagreement over single vs double width chars, can only know this via hard-coded tables, matters for terminals and code editors.
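To make 3.f concrete, here's a rough Python sketch of the table lookup involved; it deliberately ignores combining marks, emoji, and ambiguous-width characters, which is why real implementations (e.g. the wcwidth library) carry much bigger tables:
import unicodedata
def columns(s: str) -> int:
    # Count East Asian Wide/Fullwidth characters as two terminal columns.
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in s)
print(len("文字"), columns("文字"))  # 2 code points, 4 columns
print(len("abc"), columns("abc"))    # 3 code points, 3 columns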
Some of these can potentially be cleaned up outside of the Unicode consortium in backwards compatible ways. You could have a programming language that automatically normalized strings to fully composed form when deserializing from bytes, and then automatically folded semantically identical code points together (this would be a small efficiency win for some languages too). You could campaign to build a consensus around a specific normal form, like how UTF-8 gained consensus as a transfer encoding. You could also define a fork of Unicode (using private use areas?) that allocates a single code point to the characters that are unnecessarily using composition today but don't yet have one and then just subset out the concept of composition entirely.
Emoji are a big problem. It's tempting to say that these should not be encoded as characters at all. Instead there could be a set of code points that define bounds that contain a tiny binary subset of SVG, enough to recreate the Apple pixel art somewhat closely. Emoji would always be transmitted as inlined vector art. Text rendering libraries would call out to a little renderer for each encoded glyph, using a fast fingerprinting algorithm to deduplicate the bytes to an internal notion of a character. To avoid wire bloat, text can simply be compressed with a pre-agreed zstd or Brotli dictionary that contains whatever images happen to be popular in the wild. At a stroke this would avoid backwards compat problems with new emoji, enabling programs working with text to be upgraded once and then never again, eliminate all the ridiculous political committee bike-shedding over what gets added, let apps go back to using system text support and get rid of the bajillion edge cases that emoji have spewed all over the infrastructure.
It's unnecessary complexity and a security nightmare. Have you ever tried to implement Unicode normalization? A single bug in your code and malformed text can crash your application or worse.
That's tricky, for sure. My 'workaround' has long been converting codepoints into byte sequences and creating a character dictionary from that. Based on the source corpus, this dictionary can be further expanded/compressed and used for downstream processing.
But precomposing all the potential combinations is less sane than the current mess (and you can outlaw Zalgo in the standard if you think it's a serious issue)
Also, the % should measure people, not languages; that would greatly decrease the imaginary 99%.
They kind of have to, don't they? Otherwise we'll become space-limited way too fast? Especially with how quickly new emojis are being made, and all their variants.
Prior to this article, I knew graphemes were a thing and that proper unicode software is supposed to count those instead of bytes or code points.
I didn't know that unicode changes the definition of grapheme in backwards incompatible fashion annually, so software which works by grapheme count is probably inconsistent with other software using a different version of the standard anyway.
I'm therefore going to continue counting bytes. And comparing by memcmp. If the bytes look like unicode to some reader, fine. Opaque string as far as my software is concerned.
The point is that a byte focus will often frustrate users.
e.g., a TUI with columns will have to truncate "long" strings in each column, and that truncation and column-separator arrangement really should be grapheme aware.
e.g., a string search (for a name, let's say) should find Noël regardless of whether the user input ë via composing characters or the pre-composed version.
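One way to get the Noël case right, sketched in Python: normalize (and, for case-insensitive search, casefold) both the query and the text before comparing. The helper name is mine:
import unicodedata
def contains(haystack: str, needle: str) -> bool:
    norm = lambda s: unicodedata.normalize("NFC", s).casefold()
    return norm(needle) in norm(haystack)
print(contains("Noe\u0308l arrived", "No\u00ebl"))  # True, despite different code point sequences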
> I didn't know that unicode changes the definition of grapheme in backwards incompatible fashion annually, so software which works by grapheme count is probably inconsistent with other software using a different version of the standard anyway.
This is EXACTLY why Rust's standard library is blind to graphemes. Support for the case where your company requires a specially certified toolchain that lags five years behind Rust upstream is an explicit goal that they address by breaking the stuff that changes quickly out into minimal crates that can be audited, updated at a quicker pace without requiring toolchain updates, and which have the option to continue to support older compiler versions indefinitely.
> Before comparing strings or searching for a substring, normalize!
...and learn about the TR39 Skeleton Algorithm for Unicode Confusables. Far too few people writing spam-handling code know about that thing.
(Basically, it generates matching keys from arbitrary strings so that visually similar characters compare identical, so those Disqus/Facebook/etc. spam messages promoting things like BITCO1N pump-and-dumps or using esoteric Unicode characters to advertise work-from-home scams will be wasting their time trying to disguise their words.)
...and since it's based on a tabular plaintext definition file, you can write a simple parser and algorithm to work it in reverse and generate sample spam exploiting that approach if you want.
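For the curious, a hedged sketch of that idea in Python, assuming the current confusables.txt layout (hex source ; hex target sequence ; type, with '#' comments and a leading BOM); it follows the TR39 skeleton steps loosely and is not production code:
import unicodedata
def load_confusables(path):
    table = {}
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            source, target, _kind = (field.strip() for field in line.split(";"))
            table[chr(int(source, 16))] = "".join(chr(int(cp, 16)) for cp in target.split())
    return table
def skeleton(s, table):
    nfd = unicodedata.normalize("NFD", s)
    mapped = "".join(table.get(c, c) for c in nfd)
    return unicodedata.normalize("NFD", mapped)
# e.g. skeleton("BITCO1N", t) == skeleton("ВІТСО1N", t) for the Cyrillic/Latin
# look-alikes covered by the data file.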
I think you mean Microsoft Windows's Joliet extensions to ISO9660 which, by the way, use UCS-2, not UTF-16. (Try generating an ISO on Linux (eg. using K3b) with the Joliet option enabled and watch as filenames with emoji outside the Basic Multilingual Plane cause the process to fail.)
The base ISO9660 filesystem uses bytewise-encoded filenames.
But not all normalizations are done to fight spam, not all of them should be interested in visual similarity.
I normalize strings in searches not because of bad intents but because for all user related purposes "Comunicações" and "Comunicações" are the same, their different encodings being more of an accident.
*nod* ...and stemming is that taken to a greater extreme.
I was just pointing out that Unicode itself has various forms of normalization and normalization-adjacent functionality that people are far too unaware of.
I wondered about how to do simple text centering / spacing justification, given that string lengths don't match up with human-perceived characters, as in 'Café' (Python's len('Café') returns 5, even though we see four letters).
(apparently the article talks about this however the blog post is largely unreadable due to dozens of animated arrow pointers jumping all over the screen)
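For the centering question above: counting extended grapheme clusters instead of code points gets you the "four letters" answer. A Python sketch using the third-party regex module, whose \X pattern matches one cluster (the width of wide or emoji clusters is a separate problem):
import regex, unicodedata
s = "Cafe\u0301"                              # 'Café' with a combining acute accent
print(len(s))                                 # 5 code points
print(len(regex.findall(r"\X", s)))           # 4 grapheme clusters
print(len(unicodedata.normalize("NFC", s)))   # also 4, but only because a precomposed é exists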
Just had this come up at work --- needed a checkbox in Microsoft Word --- oddly the solution to entering it was to use the numeric keypad, hold down the alt key and then type out 128504 which yielded a check mark when the Arial font was selected _and_ unlike Insert Symbol and other techniques didn't change the font to Segoe UI Symbol or some other font with that symbol.
Oddly, even though the Word UI indicated it was Arial, exporting to a PDF and inspecting that revealed that Segoe UI Symbol was being used.
As I've noted in the past, "If typography was easy, Microsoft Word wouldn't be the foetid mess which it is."
That's unrelated to Unicode. The checkmark symbol just isn't in the Arial font, so Word falls back to a font that has it - Segoe UI. You've found a bug where Word still thinks it's Arial. But this is something that would have happened no matter what encoding you chose for your characters.
I don't know this for a fact, but it's possible that the text run is logically considered to be Arial and the fallback could be handled as just a rendering step, rather than being encoded in the document. Doing it that way could allow the text to render on different versions of Arial, some of which do have a checkbox char, at the risk of the appearance and layout changing depending on which fonts are installed.
Just a nitpick because the page says: "Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers." but of course unicode is only relevant to written languages as opposed to spoken languages (and signed languages)
I wish that was the only thing wrong with that page
“- You CAN’T determine the length of the string by counting bytes.
- You CAN’T randomly jump into the middle of the string and start reading.
- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.”
One of the things I had to get used to when learning the programming language Janet is that strings are just plain byte sequences, unaware of any encoding. So when I call `length` on a string of one character that is represented by 2 bytes in UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar issues occur when trying to take a substring, as mentioned by the author.
As much as I love the approach Janet took here (it feels clean and simple and works well with their built-in PEGs), it is a bit annoying to work with outside of the ASCII range. Fortunately, there are libraries that can deal with this issue (e.g. https://github.com/andrewchambers/janet-utf8), but I wish they would support conversion to/from UTF-8 out of the box, since I generally like Janet very much.
One interesting thing I learned from the article is that a character's byte length can always be determined from the prefix of its first byte. I always wondered how you would recognize/separate a Unicode character in a Janet string, since it may be 1-4 bytes long, but I guess this is the answer.
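Yes, that's the trick. A Python sketch of walking raw UTF-8 bytes (the way Janet stores strings), using only the lead byte's prefix to find character boundaries; it does no validation of malformed or overlong sequences:
def utf8_chunks(raw: bytes):
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:    n = 1            # 0xxxxxxx: ASCII
        elif b < 0xC0:  raise ValueError("continuation byte at character start")
        elif b < 0xE0:  n = 2            # 110xxxxx
        elif b < 0xF0:  n = 3            # 1110xxxx
        else:           n = 4            # 11110xxx
        yield raw[i:i + n]
        i += n
print(list(utf8_chunks("a\u00e4\u20ac\U0001F600".encode("utf-8"))))
# [b'a', b'\xc3\xa4', b'\xe2\x82\xac', b'\xf0\x9f\x98\x80']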
Pretty clearly, "every software developer" doesn't need to understand Unicode with this level of familiarity, much like "every programmer" doesn't need to know the full contents of the 114 page Drepper paper. For example, I work on a GUID-addressed object store. Everything is in term of bytes and 128-bit UUIDs. Unicode is irrelevant to everyone on my team, and most adjacent teams. There is lots of software like this.
There seem to be quite large segments of developers working on functionality that "handles text" as immutable whole blobs, in which case one really doesn't have to know anything about Unicode.
However, as soon as you want to look into the text contents of the objects of that object store and handle parts of it, even in the very simplest way (e.g. checking whether the stored object contains some character, or whether two text messages stored in that object store are the same) then you can't treat them as bytes anymore, and all the concerns listed in this article suddenly become relevant for your team.
Glad I'm not the only one who was irked by this, and I do need to know a lot about Unicode for my job!
I believe there actually are topics that every software developer ought to know something about, but this isn't one of them. My list would be things more like, the difference between a constant-time algorithm and a quadratic-time one.
Every programmer doesn't have to remember everything in this article, but they should remember that Unicode (or any text system in the wild) is actually complex, so they should research as needed.
There is another 'modern' language that does utf8 right and has done it right for a long time. I know it's mostly fallen out of favour, but we're still out here: Perl.
$ perl -wle 'use utf8; print length("");'
1
Without use utf8:
$ perl -wle 'print length("");'
3
It's funny: after Perl fell out of favour, is when it got all its best stuff. It's still my preferred language for just about everything.
Please don't refer to codepoints as characters. Some are, some are not; it isn't a useful or informative approximation, it's just wrong. Unicode is a table that assigns unique numbers (codepoints) to entries, most of which are characters. ZWJ is not a character at all, while extended grapheme clusters made of several codepoints are.
Does anyone know how to write a function (preferably in swift) to remove emoji? This is surprisingly hard (if the string can be any language, like English or Chinese).
There have been multiple attempts on Stack Overflow, but they're all missing some of them, as Unicode is so complex.
I haven't tried it, but use libicu (ICU). Split the text into graphemes and remove anything starting with codepoints that have the Zsye (emoji) script. There should be Swift bindings.
> Among them is assigning the same code point to glyphs that are supposed to look differently, like Cyrillic Lowercase K and Bulgarian Lowercase K (both are U+043A).
This is nonsense; Bulgaria has been using the Cyrillic alphabet since its creation in … Bulgaria!
What you’ve shown is two different fonts, and both renderings are perfectly fine in Bulgaria.
A quick war-story on this: We had a system which was taking web-user input for human names, and then some of it had to be sanitized for a crappy third-party system. However some of the names were getting mangled in unexpected ways.
One of the (multiple) issues was that we were sometimes entirely dropping accented characters even when a good alternative existed. This occurred when we were getting "é" (U+00E9) instead of "é" (U+0065 U+0301), a regular letter E plus a special accent modifier. By forcing the second form (D normalization) we were able to strip "just the accents" and avoid excessively-wrong names.
Going further with K+D normalization, weird stuff like "⑧" (letter 8 in a circle) becomes a regular number 8.
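The general shape of that fix, sketched in Python (not the original code): decompose, drop the combining marks, and use NFKD when you also want compatibility characters like circled digits folded:
import unicodedata
def strip_marks(s: str, form: str = "NFD") -> str:
    decomposed = unicodedata.normalize(form, s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
print(strip_marks("José"))             # Jose
print(strip_marks("⑧", form="NFKD"))   # 8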
(Among other things, it points out that doing that to non-Latin text is liable to change pronunciations and meanings in other languages. For example, some languages use diacritics for voiced/unvoiced indication where your "normalization" could do things like "tick→dick" or "did→tit".)
(Did you ever notice that? B/P, D/T, V/F, G/K, J/CH, and Z/S form voiced/unvoiced pairs that could have been indicated with a single letter and a diacritic. Same mouth behaviour. It's just a question of whether you engage your vocal cords.)
> The only modern language that gets it right is Swift:
arguably not true:
julia> using Unicode
# for some reason HN doesn't allow emoji
julia> graphemes(" ")
length-1 GraphemeIterator{String} for " "
help?> graphemes
search: graphemes
graphemes(s::AbstractString) -> GraphemeIterator
Return an iterator over substrings of s that correspond to the extended graphemes in the string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as single characters, even though they may contain more than one codepoint; for example a letter combined
with an accent mark is a single grapheme.)
This kind of assertiveness leads to garbage like C++ still not supporting UTF8 properly in 2023. My name contains diacritics. I am so, so, so tired of trying to work around information systems - not just web frontends - designed by people who don't care or worse, don't want to care.
"Web" programmers can care all they want about Unicode, but if the backend people didn't deal properly with text encoding, then something will break no matter what.
> There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough.
You are right. It's not a frontend/backend issue. It's a "for human" vs "not for human" issues. Personal names must be treated in an international-friendly manner.
>> There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough.
>
> Name one
Joel himself described an example:
> It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
The content of a webpage must be expressible in every supported language, but the HTTP protocol need not be. And it would make no sense at all to add internationalization to machine-to-machine protocols, where ASCII is enough and has been enough for decades.
And if someone complains that ASCII only supports English, well... suck it up! I'm Italian and work in French, still I hate when a colleague sneaks in a comment not in English. Professional software development happens in English.
> The content of a webpage is required to be expressed in every supported language, but the HTTP protocol must not. And it would make no sense at all to add internationalization to intra-machines protocol, where ASCII is enough and has been enough for decades.
I guess no URLs with funny characters then. "GET /profile/renée" => 500 error, woohoo.
> And if someone complains that ASCII only supports English, well... suck it up! I'm Italian and work in French, still I hate when a colleague sneaks in a comment not in English. Professional software development happens in English.
Get over yourself, a lot of professional development happens in languages other than English.
> I guess no URLs with funny characters then. "GET /profile/renée" => 500 error, woohoo.
That's not really a slam dunk. Lots of sites don't let you have your name in the URL at all, and the average person's experience is that their name would be taken by someone else before they signed up.
Why? I'm not obligated to take sides here, and just because I agree with your overall point doesn't mean I can't point out problems with your argument.
In the metaphor, it's someone telling you your aim is off.
I can name one.
At my job we do the kind of embedded programming where encoders inside machines send data to each other, like reading optical sensors and sending bits indicating state to other controllers.
We absolutely do not "need" to know about Unicode, outside of interest about other realms.
> What do you mean by "work"? That you can store arbitrary bytes in a string? That's a pretty low bar.
That's all that's needed for a backend language.
The backend does not need to understand, or even acknowledge the existence, of grapheme clusters. Because the frontend is already having to understand all of this, it should be normalising any multi-codepoint ambiguous cluster anyway.
Well, proper Unicode support affects pretty much any area handling data about, used by, or created by, humans. That’s a pretty broad scope, and certainly wider than just web software.
Not being able to support non-latin scripts sounds more like a limitation than a feature to me, although of course in many contexts it’s not in any individual organizations power to overcome it.
This is a lot more than the minimum that every software dev must know about Unicode. Even if you only do web frontends, you will do fine not knowing most of this. Still a nice read, though.
It occurs to me that a canonical semantic representation of all known (extracted) language concepts would be useful too.
Now that we have multi-language LLMs, it would be an interesting challenge to create/design a canonical representation for a minimum number of base concepts, their relations, and orthogonal "voice" modifiers, extracted from the latent representations of an LLM across a whole training set, over all training languages.
While the best LLMs still have complex reasoning issues, their understanding of concepts and voice at the sentence level is highly intuitive and accurate. So the design process could be automated.
The result would be a human language agnostic, cross-culture concept inclusive, regularized & normalized (relatively speaking) semantic language. Call it SEMANTICODE.
We need to get this right, using one standard LLM lineage, before the Unicode people create a super-standard that spans 150 different LLMs and 150 different latent spaces! :O
Stability between updates would be guaranteed by including SEMANTICODE as a non-human language in the training of future LLMs. Perhaps including a (highly) pre-normalized semantic artificial language would dramatically speed up and reduce the parameter count needed for future multi-language training?*
Then LLMs could use SEMANTICODE to talk to each other more reliably, efficiently, and with greater concept specificity than any of our single languages.
It sounds like a generic length function in Unicode in 2023 is no longer a good idea. These articles complaining about the variety of lengths in Unicode are annoying at this point. Pretty much all of them can be summed up as, "Well, it depends." And, that isn't wrong. But nerds love to argue until they are blue in the face about the One Correct Answer. Sheesh.
The author seems to hate people with concentration issues and/or various visual ailments.
That collaboration tools show the moving mouse cursors of other participants, even when they aren't needed or wanted, is already pretty bad; why bring it to a website?
This seems like good feedback but it could really be phrased more constructively. I doubt the author “hates” any such thing and you know it too. “Didn’t design with such in mind”, sure. You can do better.
>It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
Well, I can find 30 comments in this HN thread alone stating that it's false for all practical purposes, from people who have more experience with Unicode than me.
Compared to the ancient world of EBCDIC versus ASCII versus various ISO standards versus country-defined encodings versus Extended EBCDIC code pages versus Extended ASCII code pages which varied depending on operating system, nearest flag pole, network adapter, time of day, etc…: Unicode will forever be a simpler walk in the park.
Its complexity is a relief compared to where we've been. It's definitely not simple, but it will forever be far simpler than what our grandmothers had to work with if they were writing international software.
TL;DR: Unicode is complicated because some non-Latin writing systems are complicated and those non-Latin writing systems account for over a quarter of the world's population. (They're either majority or present in India, Indonesia, Pakistan, Bangladesh, the Philippines, etc.)
We need, desperately and without question, two Unicode symbols for bold and italic.
These are part of language and should not be an optional proprietary add on that can be skipped or deleted from text. We've been using the two "formats" to convey important information since the sixteenth century!!!
It boggles my mind that we can give flesh tone to emojis, yet not mark a word as bold or italic. It makes zero sense. Especially how easy it would be to implement. It would work exactly the same way: Letters following the mark would be formatted as bold or italic until a space character or equivalent.
"the definition of graphemes changes from version to version"
In what twisted reality did someone think this a good idea?
Doesn't it go against the whole premise of everyone in the world agreeing on how to represent a meaningful unit of text?
"What’s sad for us is that the rules defining grapheme clusters change every year as well. What is considered a sequence of two or three separate code points today might become a grapheme cluster tomorrow! There’s no way to know! Or prepare!"
"Even worse, different versions of your own app might be running on different Unicode standards and report different string lengths!"
You could also have it in O(1), just store and maintain it as you usually store the length in bytes or code units. If you had all your string operations like substring work with grapheme clusters by default, which might arguably make sense quite often, then that could actually be a good decision. It might even make sense to maintain a list with pointers to each grapheme cluster or of all the grapheme cluster lengths together with the actual string data. Or maybe not, would probably depend heavily on the workload.
> Another unfortunate example of locale dependence is the Unicode handling of dotless i in the Turkish language.
This isn't quite Unicode's fault, as the alternative would be to have two codepoints each for `i` and `I`, one pair for the Latin versions and one for the Turkish versions, and that would be very annoying too.
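For concreteness, here's what the locale-blind default looks like in Python; getting the Turkish behaviour right needs a locale-aware library such as ICU:
print("I".lower())       # 'i'  - right for English, wrong for Turkish (which expects dotless 'ı', U+0131)
print("i".upper())       # 'I'  - Turkish expects 'İ', U+0130
print("\u0130".lower())  # LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to 'i' + COMBINING DOT ABOVE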
Whereas the Russian/Bulgarian situation is different. There used to be language tags in Unicode for that, but IIRC they got deprecated, and maybe they'll have to get undeprecated.
I'm always gonna point out these overly broad titles assuming "every software developer" is some kind of internetty web dev type. I'm a game dev; I try to never touch strings at all, they are a nightmare data type. Strings in a game are like graphics or audio assets: your game might read them and show them to the player, but they should never come anywhere near your code or even be manipulated by it. I don't need to know any of that stuff about Unicode.
They are logically equal (that is, they represent the same text in an abstract way), but computing this equality in practice is expensive, because you first need to normalize the strings then compare.
Most languages, when comparing strings, skip the normalization and just compare string bytes as is (or, if the string is interned, compare just the pointer)
You can easily do the comparison dynamically by checking for combining marks and then doing the proper lookup. No need to normalize everything, or even store the normalized variant. Though in a filesystem or username lookup you would only store it normalized.
> I just not sure why they put in the "Angstrom symbol" to begin with.
Frequently, the answer to this is "some obscure character set had this as a distinct symbol." In this case, blame the Japanese: https://en.wikipedia.org/wiki/JIS_X_0208
Yeah, this problem has led me to avoid using emojis. I can't be sure that the meaning I intended is the one being depicted on the recipient's machine.
> The only modern language that gets it right is Swift:
Apple did a fairly good job with unicode string handling starting in Cocoa and Objective-C, by providing methods to get the number of code points and/or bytes:
I feel that this support of both character count and buffer size in bytes is probably the way to go. But Python 3 went wrong by trying to abstract it away with encodings that have unintuitive pitfalls that broke compatibility with Python 2:
There's also the normalization issue. Apple goofed (IMHO) when they used NFD in HFS+ filenames while everyone else went with NFC, but fixed that in APFS:
Well, there is a new fact that I learned and immediately hated.
The fuck were authors thinking...
I am now firmly convinced people developing unicode hate developers. I suspected it before just due to how messy it was (same character having different encodings ? Really ? Fuck you), but this cements it.
Yeah this is a big problem for me right now trying to pick fonts and characters for CJK. I have a bunch of bugs to fix that will require sending the locale down to the text itemization code.
Well C is locale dependent. And one does not break backwards compatibility with C for fear of badness. So naturally Unicode must be locale dependent too.
> Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.
Hmm? I thought some code points combine to create a character. Even accented latin ones can be like that.
I am torn between supporting all languages (which easily leaks into supporting emojis) versus just using the 90~ Latin characters as the lingua franca.
Look, I would love to be able to read/write Sanskrit, Arabic, Chinese, Japanese etc and share those content and have everyone render and see the same thing. The problem is that I feel like most of these are:
1. a kind of an open problem
2. very subjective
3. very, very subjective, as what you see is mostly dictated by the implementation (fonts)
For example, why does the gun emoji look like a water gun? Why does the skull-and-crossbones symbol look so benign? In fact, it is often used as a meme (see deadass :skull:). Why is the basmala a single "character"?
In my opinion, people should just learn how to use kaomoji. Granted, kaomoji rely on a lot more than the Latin characters, but they are at least artful, skillful, and a natural extension of the "actual" languages.
> inb4 languages evolves
Yes, but it mostly happens naturally. I feel like what happens today mostly happens at the whim of a few passionate people in the standard.
> I am torn between supporting all languages (which easily leaks into supporting emojis) versus just using the 90~ Latin characters as the lingua franca.
I don't want to support emoji either (and, I don't want emoji on my computer), although in some cases, if it is really necessary to be supported, they could be implemented just as text characters instead of as colourful emoji, anyways.
For many purposes (e.g. computer codes) ASCII is good enough (and actually even can be better since it avoids the security problems of using Unicode). (Sometimes, character sets other than ASCII can be used, e.g. APL character set for APL programming.)
> Look, I would love to be able to read/write Sanskrit, Arabic, Chinese, Japanese etc
I also would, but Unicode is bad enough that I would use other ways of doing such a thing when possible (even writing my own programs, etc). (If a program insists on Unicode, I might just use ASCII only anyways, or write my own program)
Not everyone necessarily needs to see the same thing (if it is text, rather than a picture of the text); a suitable character set for that language can be used (with fixed pitch if necessary, etc.), and the reader's computer can auto-select a suitable font according to the reader's preference.
So, I prefer to support all languages (where applicable; sometimes it isn't), without using Unicode.
> “I know, I’ll use a library to do strlen()!” — nobody, ever.
The standard library provided by languages like C, C++ is a library. Features like character strings are present and it's a totally reasonable expectation for the length to give you the cluster count.
No, for C and C++, which are close to the hardware, it's totally reasonable to expect strlen() to give you the byte count. You don't allocate memory for buffers based on the cluster count.
If you want cluster count, call a different function.
>That gives us a space of about 1.1 million code points.
About 170,000, or 15%, are currently defined. An additional 11% are reserved for private use. The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.
One of my favorite interview questions is simply "What's the difference between Unicode and utf-8?" I feel like that's pretty mandatory knowledge for any specialty but it doesn't get answered correctly very often.
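For anyone who wants the distinction in concrete terms, a minimal Python illustration: the code point is the number Unicode assigns; UTF-8 is just one way of serializing that number to bytes:
print(hex(ord("é")))            # 0xe9 - the Unicode code point U+00E9
print("é".encode("utf-8"))      # b'\xc3\xa9' - its two-byte UTF-8 encoding
print("é".encode("utf-16-le"))  # b'\xe9\x00' - the same code point in UTF-16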
I once bought an O'Reilly book on encoding. It was like 2000 pages. I never read it, that was about 15 years ago. My take away is that encoding is really complex and I just kind of pray it works which most of the time it does.
The number of grapheme clusters in a string depends on the font being used. The length of a string should be the number of code points, because that is not font-specific.
Better yet, there shouldn't be a function called length.
What's the point of having a separate codepoint for the Angstrom if it's specified to normalize back to the regular "capital A with ring above" codepoint anyway?
Honestly, I wouldn't have thought that would be an issue to the Unicode folks. They have already allowed things (emoji) that have no place being in the standard, as they aren't even text.
According to a sibling to what you replied to, it's because the shapes of the glyphs are still under copyright by known-litigious rightsholders and the Unicode consortium doesn't want to subject font authors to that.
That's a higher bar than having seen it, I think. I also had to look it up, but as soon as I saw the images in Wikipedia I knew that it's from Lord of the Rings.
The problem with Tengwar (and Klingon) is the problem with a lot of pop culture right now: copyright. The Tolkien Estate still exists and still litigiously upholds what it can of their copyright terms. CBS Viacom (Paramount) still claim a copyright interest in all the written forms of Klingon.
Copyright is not technically violated simply by encoding the characters into a plane such as one of Unicode's, that's an easy open and shut fair use, but Unicode principals have stated they don't want to pass on the copyright burden to font authors either, which would be sued if they tried to paint some of those characters. (Why encode something that fonts aren't allowed to produce?) That should also be fair use, but the law is complicated and copyright still so often today leans in favor of the Estates and major Corporations rather than fair use and the public commons.
(ETA: I'm hugely in favor that "conlang", constructed language, scripts such as these should be encoded by Unicode. I wish someday we fix the copyright problems of them.)
The ConScript Unicode Registry is a volunteer project to coordinate the assignment of code points in the Unicode Private Use Areas (PUA). Why does tengwar have to be in the PUA, why not make it a first-class charset? It's not just a minor conlang a small group of geeks invented on a weekend, it's a well-established piece of the modern culture, isn't it?
Extended ascii is a bodge, and requires you to set a code page to pick the right set of "extended" characters. Unicode is also a superset of ascii though, so that sentence is right, on a technicality.
My anglophone Canadian brother's name is André. Even if you're fine with alienating the ~50% of the world population using non-latin writing systems, probably best to at least stick to the stuff covered by the latin1 legacy encoding.
It references previous famous blogposts. Much like "$THING considered harmful" titles. You'd also have a hard time working with computers in the modern day without running into unicode.
The .graphemes() method in Rust's unicode-segmentation crate takes an is_extended boolean as an argument and, if you set it to false, you're iterating legacy grapheme clusters.
TL;DR: They had a transient glitch in their network switch and, because Windows uses UTF-16 when sending remote event logs over the wire, whenever it dropped a single byte, it had the effect of swapping the endianness of the messages, resulting in scary Chinese text in the logs.
You could get the same effect by naively applying byte-wise processing to UTF-16 or UTF-32, or having an off-by-one error.
UTF-8 is self-synchronizing so one-byte errors like that only lose you one character, rather than corrupting the entire stream going forward.
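You can reproduce the difference with a quick Python sketch: drop one byte from the middle of each encoding and decode leniently. UTF-8 loses a single character, while UTF-16 shifts every following code unit and the tail turns into CJK-range garbage, much like the logs described above:
text = "abc\u00e9def"
for enc in ("utf-8", "utf-16-le"):
    raw = text.encode(enc)
    damaged = raw[:4] + raw[5:]   # drop a single byte mid-stream
    print(enc, damaged.decode(enc, errors="replace"))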
It looks like that because Unicode is trying to solve a problem that everyone thinks is easy until they uncover the true extent of encoding human languages.
Surrogate pairs were new to Unicode 2.0. Unicode 1.0 didn't anticipate the need for more than 65,536 code points (who would ever need more?); the main perceived threat to that limit having been resolved by Han unification.
The E0000-E007F block is the "Tags" block, which is used for flag emojis.
But there is not a code for each flag. Instead there is a code for each ASCII character.
A flag sequence is formed from U+1F3F4 (Black Flag), followed by at least two tags that form a country/region code, and then U+E007F (End tag).
So, yes this is weird, because the emoji is dependent on the decoder. It was made this way to keep Unicode independent of geopolitics.
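A small Python sketch of those mechanics; note that in practice tag sequences are used for subdivision flags such as Scotland, while plain country flags are built from pairs of regional indicator symbols instead:
TAG_BASE = 0xE0000             # the tag characters mirror ASCII at U+E00xx
CANCEL_TAG = "\U000E007F"
def subdivision_flag(code: str) -> str:
    return "\U0001F3F4" + "".join(chr(TAG_BASE + ord(c)) for c in code) + CANCEL_TAG
scotland = subdivision_flag("gbsct")
print(" ".join(f"U+{ord(c):05X}" for c in scotland))
# U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F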
If just for the fact that it annoys people, I love the mouse cursor idea. But I also find it technically interesting. Is some kind of consent legally needed per GDPR or something for this? It for sure is tracking, literally. And a website has to ask to set cookies ...
Cookie consent is only necessary if you're sharing it with others (eg. ad networks, Google Analytics, etc.) or using it for "non-essential" functions (again, stuff like analytics). Sites just don't want the general public to realize that.
As for the mouse cursors, I don't think they qualify as personal information under the GDPR, but IANAL.
It's needed when you collect the data. Cursor position is being used directly to make the service "function", though, even if the function it enables is pretty novel. This is entirely debatable, and probably matters less than showing where your cursor is in a Google Doc.
Regardless, I don't think it matters since the author is not in the EU.
I'm sorry but the website design is extremely distracting. The mouse pointers at least are easy to delete with the inspector; The background color is not the best choice for reading material, but the inexcusable part is the width of the content.
This content must be really awesome for someone to go through the trouble of interacting with such a site.
I don't want to be too full of myself here, but I'm a very skilled and highly paid backend software engineer who knows roughly nothing about unicode (I google what I need when a file seems f'd up), and it's never been a problem for me.
I'm sure the article is good but the title is nonsense.
The title is definitely nonsense. The reality is that for most people, they will never need to know the gritty details of how to encode or decode UTF-8. The article is interesting, but I was pretty put off with how the author led with such a hyperbolic (and untrue) claim.