
> The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.

It's best to avoid making overly general claims like this. There are plenty of situations that warrant operating on code points, and it's likely that software trying and failing to make sense of grapheme clusters will result in a worse screwup. Code points are probably the best default. For example, it probably makes the most sense for programming languages to define strings as arrays of code points, and not characters or 16-bit chunks or an encoding, or whatever.
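
To make the "code points vs. 16-bit chunks vs. an encoding" distinction concrete, here is a minimal sketch in Python, whose `str` type happens to be indexed by code point. The helper name `sizes` is made up for this illustration.

```python
def sizes(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for a string."""
    code_points = len(s)                          # Python str is a sequence of code points
    utf8_bytes = len(s.encode("utf-8"))           # size in one particular encoding
    utf16_units = len(s.encode("utf-16-le")) // 2 # number of 16-bit chunks
    return code_points, utf8_bytes, utf16_units

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displays as one character.
print(sizes("e\u0301"))      # (2, 3, 2)

# U+1F600 GRINNING FACE: one code point, but a surrogate pair in UTF-16.
print(sizes("\U0001F600"))   # (1, 4, 2)
```

None of the three counts equals the number of user-perceived characters in both cases, which is the gap the quoted article is pointing at.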




> There are plenty of situations that warrant operating on code points

Absolutely correct. All algorithms defined by the Unicode Standard and its technical reports operate on code points. All 90+ character properties defined by the standard are queried by code point. The article omits this information and, ironically, links to the grapheme cluster break rules, which themselves operate on code points.


The article doesn't say not to use code points; it says you should not be iterating on them.

Very rarely will you be implementing those algorithms. And if you're looking at character properties, the article says you should be looking at multiple together, which is correct.


> And if you're looking at character properties, the article says you should be looking at multiple together, which is correct.

I don't see where the article mentions Unicode character properties [1]. These properties are assigned to individual characters, not groups of characters or grapheme clusters.

> Very rarely will you be implementing those algorithms.

True, but character properties are frequently used: every time you parse text and call a character classification function like "isDigit" or "isControl" provided by your standard library, you are in fact querying a Unicode character property.

[1] https://unicode.org/reports/tr44/#Properties
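
As a small illustration of per-code-point property lookups, here is a sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for this example; the standard library calls are real.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one code point."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

# An ASCII digit, an Arabic-Indic digit, a comma, and a combining accent.
for ch in "7\u0663,\u0301":
    print(f"U+{ord(ch):04X}", describe(ch), ch.isdigit())
```

Every lookup takes a single code point as its argument; a classification function such as `str.isdigit` is answered from those same per-code-point properties.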


> These properties are assigned to individual characters, not groups of characters or grapheme clusters.

But you need to deal with the whole cluster. You can't just look at the properties on a single combining character and know what to do with it.

If the article is saying to iterate one cluster at a time, then a direct consequence for property lookups is that you should be looking at the properties of specific code points within each cluster, or of all of them.
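
To show what "specific code points per cluster or all of them" can mean in practice, here is a sketch of two possible rules for asking "is this cluster a digit?". Both rules and both function names are hypothetical, chosen only for illustration; neither is taken from the Unicode Standard or from any particular language.

```python
import unicodedata

def is_digit_by_base(cluster):
    """Hypothetical rule A: only the first code point of the cluster decides."""
    # Assumes a non-empty cluster.
    return unicodedata.category(cluster[0]) == "Nd"

def is_digit_by_all(cluster):
    """Hypothetical rule B: every code point in the cluster must be a decimal digit."""
    return all(unicodedata.category(ch) == "Nd" for ch in cluster)

# "1" followed by U+20E3 COMBINING ENCLOSING KEYCAP.
keycap = "1\u20E3"
print(is_digit_by_base(keycap))  # True
print(is_digit_by_all(keycap))   # False
```

The two rules disagree on the same cluster, so whoever exposes properties at the cluster level has to pick one.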


The Unicode Standard does not specify how character properties should be extracted from a grapheme cluster. Programming languages that define "character" to mean grapheme cluster (like Swift) need to establish their own ad-hoc rules.

As others have pointed out in this thread, the article is full of the author's own personal opinions. The author suggests iterating text as grapheme clusters, but fails to consider that this breaks tokenizers, e.g. a tokenizer for a comma-separated list [1] won't see the comma as "just a comma" if the value after it begins with a combining character.

[1] https://en.wikipedia.org/wiki/Comma-separated_values
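
A minimal sketch of the input in question, assuming a naive tokenizer that splits on the comma code point. The function name `split_fields` and the sample line are made up for this example.

```python
import unicodedata

def split_fields(line):
    """Split on the comma code point and flag fields that begin with a combining mark."""
    fields = line.split(",")
    starts_with_mark = [
        bool(f) and unicodedata.combining(f[0]) != 0 for f in fields
    ]
    return fields, starts_with_mark

# The second value begins with U+0301 COMBINING ACUTE ACCENT.
line = "a,\u0301b,c"
fields, flags = split_fields(line)
print(len(fields))  # 3
print(flags)        # [False, True, False]
```

Iterating by code point yields three fields, the second of which starts with a bare combining mark. Iterating by grapheme cluster would instead see the first comma and the accent as one unit.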


If some tokenizer of a comma-separated list treats the comma (I'm assuming any 0x2C byte) as "just a comma" even if the value after it begins with a combining character, that's a broken, buggy tokenizer, and one that can potentially be exploited by providing some specifically crafted Unicode data in a single field that then causes the tokenizer to misinterpret field boundaries. If you combine a character with something, that's not the same character anymore: it's not equal to that, it's not that separator anymore, and you can't tell that unless/until you look at the following codepoints. If the combined character isn't valid, then either the message should be discarded as invalid or the character replaced with U+FFFD, the Replacement Character, but it should definitely not be interpreted as "just a comma" simply because one part of some broken character matches the ASCII code for a comma.

If anything, your example is an illustration of why it's dangerous to iterate over codepoints and not graphemes. Unless you're explicitly transforming encodings to/from Unicode, anything that processes text (not the encoding, but actual text content, like tokenizers do) should work with graphemes as the basic atomic indivisible unit.
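
For comparison, here is a sketch of a tokenizer that treats only a bare comma as a separator. It assumes a deliberately simplified notion of a cluster (a code point plus any following code points in the mark categories Mn, Mc, Me), which is only an approximation of extended grapheme clusters and ignores Hangul, emoji ZWJ sequences, regional indicators, and so on. Both function names are hypothetical. It keeps the combined comma inside the field rather than rejecting the input or substituting U+FFFD.

```python
import unicodedata

def clusters(text):
    """Approximate grapheme clusters: a code point plus any following marks.

    Simplification for illustration only, not the full UAX #29 algorithm.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch   # attach the mark to the preceding cluster
        else:
            out.append(ch)  # start a new cluster
    return out

def split_on_bare_commas(text):
    """Split only where a whole cluster is exactly ','."""
    fields, current = [], []
    for cluster in clusters(text):
        if cluster == ",":
            fields.append("".join(current))
            current = []
        else:
            current.append(cluster)
    fields.append("".join(current))
    return fields

print(len(split_on_bare_commas("a,\u0301b,c")))  # 2
print(split_on_bare_commas("a,b,c"))             # ['a', 'b', 'c']
```

With this approach the comma that carries a combining accent is part of the first field, so the crafted input cannot shift the field boundaries.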


> The Unicode Standard does not specify how character properties should be extracted from a grapheme cluster. Programming languages that define "character" to mean grapheme cluster (like Swift) need to establish their own ad-hoc rules.

Right. Which means not just iterating by code point.

> The author suggests iterating text as grapheme clusters, but fails to consider that this breaks tokenizers, e.g. a tokenizer for a comma-separated list [1] won't see the comma as "just a comma" if the value after it begins with a combining character.

I don't think they're talking about tokenizers. It's a general purpose rule.

Also I would argue that a CSV file with non-attached combining characters doesn't qualify as "text".


Situations such as?

Sometimes editing wants to go inside clusters but that's not code-point based either.

I'd say that in a big majority of situations, code that is indexing an array with code points is either treating the indexes as opaque pointers or is doing something wrong.



