
Except almost everyone always means #2. No one asked for strings to be ruined in this way, and this kind of pedantry has caused untold frustration for developers who just want their strings to work properly.

If you must expose the underlying byte array do so via a sane access function that returns a typed array.

As for “string length in pixels”, that has absolutely nothing to do with the string itself as that’s determined in the UI layer that ingests the string.




  > Except almost everyone always means #2.
Until the string has to be stored in a database. Or transmitted over HTTP. Or copy-pasted in Windows running Autohotkey. Or stored in a logfile. Or used to authenticate. Or used to authorize. Or used by a human to self-identify. Or encoded. Or encrypted. Or used in an element on a web page. Or sent in an email to 12,000,000 users, some of whom might read it on a Windows 2000 box running Nutscrape. Or sent to a vendor in China. Or sent to a client in Israel. Or sent in an SMS message to 12,000,000 users, some of whom might read it on a Nokia 3310. Or sent to my exwife.


The English-speaking world has developed an intuition about strings, thanks to ASCII, that simply fails when it comes to Unicode, and that basically explains a lot of these pitfalls.

String length as defined in #2 is also fairly complex for some languages such as Hindi. There are symbols in Hindi which are not characters and can never exist as a character on their own, but when placed next to a character they create a new character. So when you type them out on a keyboard you have to hit two keys, but only one character appears on screen. Unicode, too, represents this as two separate characters, but to the human eye it is one.

त + या = त्या

The following code will print 4:

console.log("त्या".length);


"symbols in Hindi which are not characters and can never exist as their own character but when placed next to a character they create a new character"

a.k.a. 'ligatures', as in f+f+i -> U+fb03 'ffi'


I would consider ligatures a text rendering concept, which allows for but is distinct from the linguistic concept described by GP.

Edit: to further illustrate my point, in the ligatures I'm familiar with (including the ones in your link), the component characters exist standalone and can be used on their own, unlike GP's example.


In the example "Straße", the ß is, in fact, derived from an ancient ligature for sz. Old German fonts often had s as ſ, and z as ʒ. This ſʒ eventually became ß.

We (completely?) lost ſ and ʒ over the years, but ß was here to stay. Its usage changed heavily over time (replacing ss instead of sz), I think for the last time in the 90s (https://en.wikipedia.org/wiki/German_orthography_reform_of_1...), where we changed when to use ß and when ss.

So while we do replace ß with ss if we uppercase or have no ß available on the keyboard, no one would ever replace ß by sz (or even ſʒ) today, unless for artistic or traditional reasons.

Many people uppercase ß with lowercase ß or, for various reasons, an uppercase B. I have yet to see a real world example of an uppercase ẞ, it does not seem to exist outside of the internet. For example, "Straße" could be seen capitalized in the wild as STRAßE, STRASSE, STRABE; with Unicode it could also be STRAẞE. It would not be capitalized with sz (STRASZE) or even ſʒ (STRAſƷE – there is no uppercase ſ) – at least not in Germany. In Austria, sz seems to be an option.

So, for most ligatures I would agree with you, but specifically ß is one of those ligatures I would call an outlier, at least in Germany.

P.S.: Maybe the ampersand (&), which is derived from ligatures of the Latin "et", sometimes has similar problems, although on a different level, since it replaces a whole word. However, I have seen it being used as part of "etc.", as in "&c." (https://en.wiktionary.org/wiki/%26c.), so your point might also hold.

P.P.S.: I wonder why the uppercasing in the original post did not use ẞ, but I guess it is because of the rules in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.... (link taken from the feed). The Wikipedia entry says we adopted the capital ẞ in 2017 (but it has been part of Unicode since 2008). It also states that the replacement SZ should be used if the meaning would otherwise get lost (e.g. "in Maßen" vs. "in Massen" would both be "IN MASSEN" but mean either "in moderate amounts" or "in masses", forcing the first to be capitalized as MASZEN). I doubt any programming language or library handles this. I would not have even handled it myself in a manual setting, as it is such an extreme edge case. And when I read it, I would stumble over it.


Swift handles this really well,

"त्या".count // 1

"त्या".unicodeScalars.count // 4

"त्या".utf8.count // 12

JavaScript's minimal library is of course not great, but there are libraries which can help, e.g. grapheme-splitter, although it's not language-aware by design, so in this instance it'll return 2.

graphemeSplitter.countGraphemes("त्या") // 2


We even already had something like this in pure ASCII: "a\bc" has "length" 3 but appears as one glyph when printed (assuming your terminal interprets backspace).


This made me think of Hangul, when not using the precomposed block. What's the string length of 한글?


In the Rakudo compiler for Raku, which I just tried, the "chars" count using the default EGC (extended grapheme cluster) counting is 2.
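
For comparison, a quick Swift sketch of the same question (the jamo sequence below is an assumed NFD-style decomposition of 한글, so the exact numbers are illustrative):

  let decomposed = "\u{1112}\u{1161}\u{11AB}\u{1100}\u{1173}\u{11AF}" // leading/vowel/trailing jamo, twice
  decomposed.count                // 2  -- extended grapheme clusters (UAX #29)
  decomposed.unicodeScalars.count // 6  -- code points
  decomposed.utf8.count           // 18 -- bytes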


Or sorted! That’s its own special hell.


Or compared! How did I even forget about that. There is no form of normalization that covers all use cases.

Full text search... Oh I want to cry...


Except for SMS, #2 should work fine for all of those uses. And the reason you'd need a different measure for SMS is because SMS is bad.

And half the things on your list are just places that Unicode itself might fail, regardless of what the length is? That has nothing to do with how you count, because the number you get won't matter.


>, #2 should work fine for all of those uses.

No, #2-length is a higher level of abstraction, meant for display to humans, e.g. calculating pixel widths for sizing a column in a GUI table.

I don't think you understood the lower level of abstraction in the examples that gp (dotancohen) laid out. (Encrypting/compressing/transmitting/etc strings.) In those cases, you have to know the exact count of bytes and not the composed graphemes that collapse to a human-visible "length".

In other words, the following example code to allocate a text buffer array would be wrong:

  char *t = malloc(length_type_2); // buggy code with an undersized buffer
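
A sketch of the same pitfall in Swift terms (using the Hindi example from upthread; the variable names are illustrative): the number a byte buffer, cipher, or wire format needs is the UTF-8 count, not the human-visible count.

  let s = "त्या"                  // assumed example string
  let visibleLength = s.count   // 1  -- grapheme clusters (length #2)
  let byteLength = s.utf8.count // 12 -- what storage/transmission actually needs
  let buffer = Array(s.utf8)    // size/copy by encoded bytes, not visible length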


>is a higher-level of abstraction to display for humans

Am I missing something here? Who do you think is actually writing this stuff? Aliens?

I just want the first 5 characters of an input string, or to split a string by ":" or something.

I have plenty of apps where none of my strings _ever_ touch a database and are purely used for UI. I'll ask this again: Why break strings _globally_ to solve some arcane <0.0001% edge case instead of fixing the interface at the edges for that use case?

All this mindset has done is cause countless hours of harm/frustration in the developer community.


It's causing frustration because it's breaking people's preconceived notions about how text is simple. Previously, the frustration was usually borne by the end users of apps written by people with those preconceived notions. And there are far more end users than developers, so...


>Am I missing something here? [...] I have plenty of apps where none of my strings _ever_ touch a database and are purely used for UI.

Yes, the part you were missing is that I was continuing a discussion about a specific technical point brought up by the parents, Dylan16807 and dotancohen. The fact that you manipulate a lot of strings without databases or persistence/transmission across other i/o boundaries is not relevant to the context of my reply. To restate that context... Dylan16807's claim that length_type_#2 (counting human-visible grapheme clusters) is all the information needed for i/o boundaries is incorrect and will lead to buggy code.

With that aside, I do understand your complaint in the following:

>I just want the first 5 characters of an input string, or to split a string by ":" or something.

This is a separate issue about ergonomics of syntax and has been debated often. A previous subthread from 6 years ago had the same debate: https://news.ycombinator.com/item?id=10519999

In that thread, some commenters (tigeba, cookiecaper) had the same complaint as you, that Swift makes it hard to do simple tasks with strings (e.g. IndexOf()). The other commenters (pilif, mikeash) responded that Swift makes the high-level & low-level abstractions of Unicode more explicit. Similar philosophy in this Quora answer by a Swift compiler contributor highlighting the tradeoffs of programmers not being aware if string tasks are O(1) fast -vs- O(n) slow:

https://www.quora.com/Why-is-string-manipulation-so-difficul...

And yes, your complaint is shared by many... By making the string API explicitly confront "graphemes -vs- codeunits -vs- codepoints -vs- bytes", you do get clunky, cumbersome syntax such as "string.substringFromIndex(string.startIndex.advancedBy(1))" as in the Q&A answer:

https://stackoverflow.com/questions/2503436/how-to-check-if-...

I suppose Swift could have designed the string API to your philosophy. I assume the language designers anticipated it would have led to more bugs. The tradeoff was:

- "cumbersome syntax with less bugs"

...or...

- "easier traditional syntax (like simple days of ASCII) with more hidden bugs and/or naive performance gotchas"


Manually allocating a buffer wasn't on the list of things I replied to, and you shouldn't be manually allocating buffers that way in the first place.


The main reason I care about the length of strings is to limit storage/memory/transmission size. 1) and to a lesser degree 4) achieve that (max 4 utf8 bytes per codepoint).

2) comes with a lot of complexity, so I'd only use 2) in a few places where string lengths need to be meaningful for humans and when stepping through a string in the UI.

* One (extended) grapheme cluster can require a lot of storage (I'm not sure if it's even bounded at all), so it's unsuitable for length limitations (see the sketch below)

* Needs knowledge of big and regularly updated unicode tables. So if you use it in an API chances are high that both sides will interpret it differently, so it's unsuitable for API use.
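
A minimal Swift sketch of the "not bounded" point (the 1,000 combining marks are an arbitrary choice): a single extended grapheme cluster can be made as large as you like while the #2 length stays at 1.

  let zalgo = "e" + String(repeating: "\u{0301}", count: 1_000) // "e" plus many combining acutes
  zalgo.count                // 1     -- one grapheme cluster
  zalgo.unicodeScalars.count // 1_001 -- code points
  zalgo.utf8.count           // 2_001 -- bytes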


> One (extended) grapheme cluster can require a lot of storage (I'm not sure if it's even bounded at all), so it's unsuitable for length limitations

You could take inspiration from one of the annexes and count no more than 31 code points together.

> Needs knowledge of big and regularly updated unicode tables. So if you use it in an API chances are high that both sides will interpret it differently, so it's unsuitable for API use.

It depends on what you're doing. That could cause problems, but so could counting bytes. Users will not be thrilled when their character limit varies wildly based on which characters they use. Computers aren't going to be thrilled either when replacing one character with another changes the length. Or, you know, when they can't uppercase a string.


Easy enough to normalize the terminology for length to mean character count and size to mean byte capacity.


So you mean grapheme clusters or code points? Do you want to count zero-width characters? More specifically what are you trying to do?


Grapheme clusters and no. Let’s see if anyone has a different interpretation or unsupported use-case.


That makes length operation require full parsing of the text according to Unicode rules. Hope you have an up to date set of these!

Codepoints are easy, as are bytes. Graphemes are much harder.

So, what's the length of a sole non-breaking joiner codepoint? (U+2060, yes it's allowed)


I’m really not an expert but is there anything particular with this one?

Looking here the answer looks like “0” for all contexts to me.

https://codepoints.net/U+2060

And well, I don’t think anyone’s arguing that a “length” function needs to be exposed by the standard library of any language (I saw that Swift does btw).

Aside from byte count, it's the only interpretation of "length" that I think makes sense for users (like I mentioned above, validating that a name has at least length 1, for example).


> And well, I don’t think anyone’s arguing that a “length” function needs to be exposed by the standard library of any language

that's literally the core premise of the original post


...is it?


One problem with grapheme clusters as length is that you can't do math with it. In other words, string A (of length 1) concatenated with string B (of length 1) might result in string C that also has length 1. Meanwhile, concatenating them in the reverse order might make a string D with length 2.

This won't matter for most applications but it's something you may easily overlook when assuming grapheme clusters are enough for everything.
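
A hedged Swift illustration of that asymmetry (assuming "e" and a combining acute accent as the two strings):

  let a = "e"        // count == 1
  let b = "\u{0301}" // combining acute accent; on its own, count == 1
  (a + b).count      // 1 -- "é": the accent joins the preceding cluster
  (b + a).count      // 2 -- a lone accent cluster, then "e"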


  > Grapheme clusters and no. Let’s see if anyone has a different interpretation or unsupported use-case.
Here's one. How many characters is this: שלום

And how about this: שַלוֹם

Notice that the former is lacking the diacritics of the latter. Is that more characters?

Maybe a better example: ال vs ﻻ. Those are both Alif-Lamm, and no that is not a ligature. The former is often normalized to the latter, but could be stored separately. And does adding Hamza ء add a character? Does it depend if it's at the end of the word?


I completely lack the cultural context to tell if this is trivial or not.

Naive heuristic: look at the font rendering on my Mac:

ء : 1 (but it sounds like it can be 0 depending on context)

ﻻ : 1

ال : 2

שלום : 4

שַלוֹם : 4

Do you see any issues?

In Japanese, which I am more familiar with, maybe it’s similar to ゙, which as you can see here has width 1 but is mostly used as a voicing sound mark as in turning か (ka) into が (ga), thus 0 (or more properly, as part of a cluster with length 1)

The use-case I’m getting at is answering the question “does this render anything visible (including white space)”? I think it’s a reasonable answer on the whole falsehood-programmers-believe-about-names validation deal (after trimming leading and trailing whitespace).

It’s the most intuitive and practically useful interpretation of “length” for arbitrary Unicode as text. The fact that it’s hard, complex, inelegant and arguably inconsistent does not change that.


> Naive heuristic:

Unicode has a fully spelled out and testable algorithm that covers this. There's no need to make anything up on the spot.

You don't need cultural context. It's not trivial. The algorithm is six printed pages.

No individual part of it is particularly hard, but every language has its own weird stuff to add.

I feel strongly that programmers should implement Unicode case folding with tests at least once, the way they should implement sorting and network connections and so on.

.

> I think it’s a reasonable answer on the whole falsehood-programmers-believe-about-names validation deal

All these "false things programmers believe about X" documents? Go through Unicode just one time, you know 80% of all of them.

I get that you like the framing you just made up, but there's an official one that the world's language nerds have been arguing over for 20 years, including having most of the population actually using it.

I suggest you consider that it might be pretty well developed by now.

.

> The use-case I’m getting at is answering the question “does this render anything visible (including white space)”?

This is not the appropriate approach, as white space can be rendered by non-characters, and characters can render nothing and still be characters under Unicode rules.

.

> It’s the most intuitive and practically useful interpretation of “length” for arbitrary Unicode as text.

It also fails for Mongol, math symbols, music symbols, the zero-width space, the conjoiner system, et cetera.

.

> The fact that it’s hard, complex, inelegant and arguably inconsistent

Fortunately, the real approach is easy, complex, inelegant, and wholly consistent. It's also well documented, and works reliably between programming languages that aren't Perl.

At any rate, you're off in another thread instructing me that the linguists are wrong with their centuries of focus because of things you thought up on the spot, and that Unicode is wrong because it counts spaces when counting characters, so, I think I'm going to disconnect now.

Thanks for understanding. Have a good day.


  > Do you see any issues?
The only real issue that I see is that ﻻ is considered to be two letters, even if it is a single codepoint or grapheme.

Regarding שַלוֹם I personally would consider it to be four letters, but the culture associated with that script (my culture) enjoys dissecting even the shape of the lines of the letters to look for meaning, I'm willing to bet that I'll find someone who will argue it to be six for the context he needs it to be. That's why I mentioned it.

  > In Japanese, which I am more familiar with, maybe it’s similar to ゙,
Yes, for the Hebrew example, I believe so. If you're really interested, it would be more akin to our voicing mark dagesh, which e.g. turns V ב into B בּ and F פ into P פּ.‎


Wait'll you find out that whether ß is one letter or two varies based on which brand of German you're drinking.

Programmers should understand that it doesn't matter if they think the thousands of years of language rules are irrational; Unicode either handles them or it's wrong.

Unicode doesn't exist to standardize the world's languages to a representation that programmers enjoy. It exists to encode all the world's languages *as they are*.

Whether you are sympathetic to the rules of foreign languages isn't super relevant in practice.


  > Unicode doesn't exist to standardize the world's languages to a
  > representation that programmers enjoy. It exists to encode all
  > the world's languages *as they are*.
Thank you for this terrific quote that I'm going to have to upset managers and clients with. This is the eloquence that I've been striving to express for a decade.


Very kind words. Thank you.


But who cares? This is like small potatoes.

If you give a Japanese writer 140 characters, they can already encode double the amount of information an English writer can. Non-storage “character”-based lengths have always been a poor estimation of information encoding so worrying about some missing graphemes feels like you’re missing the bigger problem.


> So you mean grapheme clusters or code points?

I mean character count. The unicode standard defines that as separate from and meaningfully different than grapheme clusters or codepoints.

The confusion you're relying on isn't real.

.

> Do you want to count zero-width characters?

Are they characters? Yes? Then I want to count them.

.

> More specifically what are you trying to do?

(narrows eyes)

I want to count characters. You're trying to make that sound confusing, but it really isn't.

I don't care if you think it's a "zero width" character. Zero width space is often not actually zero width in programmers' fonts, and almost every font has at least a dozen of these wrong.

I don't care about whatever other special moves you think you have, either.

This is actually very simple.

The unicode standard has something called a "character count." It is more work than the grapheme cluster count. The grapheme cluster count doesn't honor removals, replacements, substitutions, and it does something different in case folding.

The people trying super hard to show how technically apt they are at the difficulties in Unicode are just repeating things they've heard other people say.

The actual unicode standard makes this straightforward, and has these two terms separated already.

If you genuinely want to understand this, read Unicode 14 chapter 2, "general structure." It's about 50 pages. You can probably get away with just reading 2.2.3.

It is three pages long.

It's called "characters, not glyphs," because 𝐞𝐯𝐞𝐧 𝐭𝐡𝐞 𝐚𝐮𝐭𝐡𝐨𝐫𝐬 𝐨𝐟 𝐭𝐡𝐞 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝐬𝐭𝐚𝐧𝐝𝐚𝐫𝐝 𝐰𝐚𝐧𝐭 𝐩𝐞𝐨𝐩𝐥𝐞 𝐭𝐨 𝐬𝐭𝐨𝐩 𝐩𝐫𝐞𝐭𝐞𝐧𝐝𝐢𝐧𝐠 𝐭𝐡𝐚𝐭 "𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫" 𝐦𝐞𝐚𝐧𝐬 𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 𝐨𝐭𝐡𝐞𝐫 𝐭𝐡𝐚𝐧 𝐚 𝐟𝐮𝐥𝐥𝐲 𝐚𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐞𝐝 𝐬𝐞𝐫𝐢𝐞𝐬.

The word "character" is well defined in Unicode. If you think it means anything other than point 2, 𝒚𝒐𝒖 𝒂𝒓𝒆 𝒔𝒊𝒎𝒑𝒍𝒚 𝒊𝒏𝒄𝒐𝒓𝒓𝒆𝒄𝒕, 𝒏𝒐𝒕 𝒑𝒍𝒚𝒊𝒏𝒈 𝒂 𝒅𝒆𝒆𝒑 𝒖𝒏𝒅𝒆𝒓𝒔𝒕𝒂𝒏𝒅𝒊𝒏𝒈 𝒐𝒇 𝒄𝒉𝒂𝒓𝒂𝒄𝒕𝒆𝒓 𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈 𝒕𝒐𝒑𝒊𝒄𝒔.

All you need is on pages 15 and 16. Go ahead.

.

I want every choice to be made in accord with the Unicode standard. Every technicality you guys are trying to bring up was handled 20 years ago.

These words are actually well defined in the context of Unicode, and they're non-confusing in any other context. If you struggle with this, it is by choice.

Size means byte count. Length means character count. No, it doesn't matter if you incorrectly pull technical terminology like "grapheme clusters" and "code points," because I don't mean either of those. I mean 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫 𝐜𝐨𝐮𝐧𝐭.

If you have the sequence `capital a`, `zero width joiner`, `joining umlaut`, `special character`, `emoji face`, `skin color modifier`, you have:

1. Six codepoints

2. Three grapheme clusters

3. Four characters

Please wait until you can explain why before continuing to attempt to teach technicalities, friend. It doesn't work the way you claim.

This might help.

https://www.unicode.org/versions/Unicode14.0.0/UnicodeStanda...

Here, let's save you some time on some other technicalities that aren't.

1. If you write a Unicode modifying character after something that cannot be modified - like adding an umlaut to a zero-width joiner - then the umlaut will be considered a full character, but also discarded. Should it be counted? According to the Unicode standard, yes.

2. If you write a zero-width anything, and it's a character, should it be counted? Yes.

3. If you write something that is a grapheme cluster, but your font or renderer doesn't support it, so it renders as two characters (for example, they added couple emoji in Unicode 10 and gay couples in Unicode 11, so phones that were released in between would render a two-man couple as two individual men), should that be counted as a single character as it's written, or a double character as it's rendered? Single.

4. If you're in a font that includes language variants - for example, fancy Swiss German fonts sometimes separate the Eszett (ß) into two distinct s characters that are flowed separately - should that be calculated as one character or two? Two, it turns out.

5. If a character pair is kerned into a single symbol, like an English cursive capital F into a lowercase e, should that be counted as one character or two? Two.

There's a whole chapter of these. If you really enjoy going through them, stop quizzing me and just read it.

These questions have been answered for literal decades. Just go read the standard already.


You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.

I’m not 100% sure what context characters even apply in on a computer other than interest sake.

Invisible/zero-width characters are not interesting when editing text, and character count doesn’t correlate with size, therefore there’s no canonical length.


> You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.

The other things have different labels than "length," which you've already been told.

.

> Invisible/zero-width characters are not interesting when editing text

Well, tell Unicode they're wrong, then, I guess.

Have a good day.


To be clear I appreciate you sharing the official characters concept definition, and I do think it’s valuable.


  > These words are actually well defined in the context of Unicode, and
  > they're non-confusing in any other context.
John, you've set right so many misconceptions here, and taught much. I appreciate that. However, unfortunately, this sentence I must disagree with. Software developers are not reading the spec, and the terms therefore 𝒂𝒓𝒆 confusing them.

As with security issues, I see no simple solution to get developers to familiarize themselves with even the basics before writing code, nor a way to get PMs to vet and hire such developers. Until the software development profession becomes accredited like engineering or law, Unicode and security issues (and accessibility, and robustness, and dataloss, and performance, and maintainability issues) will continue to plague the industry.


Something isn't confusing merely because people fail to try.

It's pretty easy to swap your spark plugs but most people never learn how. That doesn't mean it's secretly hard, though.

Any half-competent programmer who sat down and tried to learn it would succeed immediately.


Just wanted to say thnx.

This was informative.


No problem, boblem.


A bold decision, to use the term 'easy' in a discussion of unicode.


However in this case it seems a justified one, having read your outstanding rebuttal to a sibling comment. Much appreciated.


"Character count" is not a well-defined concept because "character" is not a well-defined concept in general.


It absolutely is, unless you allow xenophilia to overcomplicate the issue. You don't need to completely bork your string implementation to support every language ever concocted by mankind. No one is coding in cuneiform.


"I don't know or care what a character is. You're a character. Now limit this field to exactly 50 letters. Not 51. And not 49."

And that client doesn't consider spaces nor punctuation as letters. Numbers count, though. Thankfully at the time emojis were not a consideration.


> No one asked for strings to be ruined in this way

Except for the, what, 80% of the world's population who use languages that can't be written with ASCII.


According to the CIA, 4.8% of the world's population speaks English as a native language and further references show 13% of the entire global population can speak English at one level or another [0]. For reference, the USA is 4.2% of the global population.

The real statement is that no one asked a few self-chosen individuals who have never travelled beyond their own valley to ruin text handling in computers like the American Standard Code for Information Interchange has.

[0] https://www.reference.com/world-view/percentage-world-speaks-english-859e211be5634567


It seems unfortunate now, but I'd argue it was pretty reasonable at the time.

Unicode is the best attempt I know of to actually account for all of the various types of weirdness in every language on Earth. It's a huge and complex project, requiring a ton of resources at all levels, and generates a ton of confusion and edge cases, as we can see by various discussions in this whole comments section. Meanwhile, when all of the tech world was being built, computers were so slow and memory-constrained that it was considered quite reasonable to do things like represent years as 2 chars to save a little space. In a world like that, it seems like a bit much to ask anyone to spin up a project to properly account for every language in the world and figure out how to represent them and get the computers of the time to actually do it reasonably fast.

ASCII is a pretty ugly hack by any measure, but who could have made something actually better at the time? Not many individual projects could reasonably do more than try to tack on one or two other languages with more ugly hackery. Probably best to go with the English-only hack and kick the can down the road for a real solution.

I don't think anyone could have pulled together the resources to build Unicode until computers were effective and ubiquitous enough that people in all nations and cultures were clamoring to use them in their native languages, and computers would have to be fast and powerful enough to actually handle all the data and edge cases as well.


> The real statement is that no one asked a few self-chosen individual who have never travelled beyond their own valley to ruin text handling in computers like the American Standard Code for Information Interchange has.

That much is objectively false. Language designers made a choice to use it; they could’ve used other systems.

Also LBJ mandated that all systems used by the federal government use ASCII starting in 1969. Arguably that tied language designers hands, since easily the largest customer for computers had chosen ASCII.


>no one asked a few self-chosen individual who have never travelled beyond their own valley to ruin text handling in computers like the American Standard Code for Information Interchange has.

Are you seriously mad that the people who invented computers chose to program in their native language, and built tooling that was designed around said language?


That's a rather uncharitable take on ASCII when you consider the context. Going all the way back to the roots of encodings, it was a question of how much you could squeeze into N bits of data, because bits were expensive. In countries where non-Latin alphabets were used, they used similar encoding schemes with those alphabets (and for those variants, Latin was not supported!).


The primary problem is language/library designers/users believing there must be one true canonical meaning of the word „length“ like you just did, and that „length“ would be the best name for the given interface.

In database or more subtly various filesystems code the notion of bytes or codepoints might be more relevant.

By the way, what about ASCII control characters? Does carriage return have some intrinsic or clearly well defined notion of „length“ to you?

What about digraphs like ij in Dutch? Are they a singular grapheme cluster? Is this locale dependent? Do you have all scripts and cultures in mind?


A CR is a space-type character. A string containing it has a length of 1.


Whitespace is the term.

And some clients expect that whitespace is not included in string length. "I asked to put 50 letters in this box, why can I only put 42?" would not be an unexpected complaint when working with clients. Even if you manage to convey that spaces are something funny called "characters", they might not understand that newlines are characters as well. Or emojis.


Credit card numbers come to mind: printed, they are often grouped into blocks of four digits separated by whitespace, e.g. "5432 7890 6543 4365", and now try to copy-paste this into a form field of "length" 16.

Ok, that's more of a usability issue and many front end developers seem to be rather disconnected from the real world. Phone number entry is an even worse case, but I digress ...
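
A minimal sketch of the forgiving approach (assuming Swift on the client and a digits-only card number; the 16 is illustrative): strip the separators before checking the length.

  let pasted = "5432 7890 6543 4365"
  let digits = pasted.filter { $0.isNumber } // "5432789065434365"
  let plausible = digits.count == 16         // true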


The UK Government (at least those based in GDS) has noted it (https://design-system.service.gov.uk/patterns/payment-card-d...), but some definitely are not good here. Also, hyphens (or dashes) aren't popular in the US but are (somewhat) popular in the UK!


Riddle me this: name one algorithm that requires #2 string length. I'll name a way that it doesn't matter.

The reason this isn't an issue is because it's nearly impossible to think of a legit use case.


Since you ask for an example: moving the cursor in a terminal-based text editor requires #2 string length. That said I understand your point about focusing on what is actually needed.


Moving the cursor doesn't require string length, it requires finding the new cursor position. You can compute #2 length by moving the cursor one grapheme cluster at a time until you hit the end, but if you only need to move the cursor once, the length is irrelevant. (Also, cursor positions in a terminal-based text editor don't necessarily correspond to graphemes.)


It’s pretty crucial to the game Hangman.


Not length directly in itself, but the abstraction seems applicable to text editing controls.


That’s both #2 and #3, because Unicode characters can have various widths including zero. Gotta find out how far to move the cursor.


The Unicode character doesn't really have an intrinsic width; the font will define one (and I think a font could define a glyph even for things like zero-width spaces). But as soon as you want to display text you'll get into all sorts of fun problems and string length is most likely never applicable in any way. At that point the Unicode string becomes a sequence of glyphs from a font with their respective placements and there's no guaranteed relationship between the number of glyphs and the original number of code points, anyway.


“Full name” has a minimum length of 1.


Even for this you probably wouldn't want to use #2, because some grapheme clusters are composed to create ligatures and wouldn't be used outside of composing to create a ligature, so they shouldn't ever form a valid string by themselves. But by the #2 analysis this would read as "length 1".


Can you think of a way to do stricter while still not rejecting valid names?

I don’t have one.

(Except the airline model of “only A-Z” but it seems silly to stay there just because people could slip ligatures through the form)


No. In fixed width settings, it’s usually very important to distinguish 0-width characters (I’ll use “characters” for grapheme clusters), 1-width characters, and 2-width characters (e.g. CJK characters and emojis). See wcwidth(3).


Very much so. And one may add that the implementation of UAX#11 "East Asian Width"[1] was done in a myopic, backwards-oriented way in that it only distinguishes between multiples 0, 1 and 2 of a 'character cell' (conceptually equivalent to the width one needs in monospace typesetting to fit in a Latin letter). There are many Unicode glyphs that would need 3 or more units to fit into a given text.

* [1] https://www.unicode.org/reports/tr11/tr11-39.html


0-width characters should get a deprecation warning in Language 2.0, along with silent letters and ones that change pronunciation based on the characters surrounding them.


Honestly I mostly only want #1 on that list, not #2, since most of my stuff is fairly low level (systems programming, network stuff, etc.). So that’s not the best generalisation.


That's funny. As I read this I kept thinking "#1 is obviously the most relevant except for layout and rendering, where #3 is the important one."

But that's because I was thinking about allocation, storage, etc. For logic traversing, comparing, and transforming strings #4 is certainly more useful.

Oh, but you said #2? I guess that is important since it's what the user will most easily notice.


> Except almost everyone always means #2

Only those who think that the graphical part of an information system represents the whole of software development.


> No one asked for strings to be ruined in this way

They only got ruined for people who could not realize that the world is bigger than their backyard.


> Except almost everyone always means #2.

I don't think this is true. Certainly there is no string-length library I'm aware of that handles it that way. The usual default these days (correct or not) is #4 -- length is the number of unicode code points.


In Swift a character is a grapheme cluster, and so the length (count) of a string is in fact the number of grapheme clusters it contains.

The number of codepoints (which is almost always useless but what some languages return) is available through the unicodeScalars view.

The number of code units (what most languages actually return, and which can at least be argued to be useful) is available through the corresponding encoded views (utf8 and utf16 properties)


The more I learn about swift, the more I like it.


I do think #4 is worse than #2. There is extremely little useful information one can get by knowing the code points but not the grapheme clusters, and even for text editing these are confusing - if I move the cursor right starting before è, I definitely don't want to end up between e and `.


That’s what Rust does too, in the standard library offering #1 and #4, UTF-8 bytes and code points. Because they named a code point type a char, though, folks continue to assume a code point is a grapheme cluster. Which it frequently is, and that makes things worse.


Only in some languages.


Some languages that had the misfortune to be developed while people thought UCS-2 would be enough define string length as the number of UTF-16 code units, which is not even on the list because it's not a useful number.
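
A quick Swift sketch of how that count relates to the others, taking U+1F641 as the example:

  let face = "\u{1F641}"    // 🙁
  face.count                // 1 -- grapheme cluster
  face.unicodeScalars.count // 1 -- code point / scalar value
  face.utf16.count          // 2 -- UTF-16 code units (a surrogate pair)
  face.utf8.count           // 4 -- bytes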


Yeah, and most languages/libraries which predate Unicode define string length as the number of bytes (#1). That's probably the most common interpretation, actually. But most new implementations count code points, in my experience. I believe this is also the Unicode recommendation when context doesn't determine a different algorithm to read. Been a while since I read those recommendations though, so I could be wrong.


My experience is that most new stuff is deliberately choosing and even exposing UTF-8 these days, counting in code units (which, in UTF-8, are equivalent to bytes).

I’d say that not much counts in code points, and in fact that it’s unequivocally bad to do so, posing a fairly significant compatibility and security hazard (it’s a smuggling vector) if handled even slightly incautiously. Python is the only thing I can think of that actually does count in code points: because its strings are not even potentially ill-formed Unicode strings, but sequences of Unicode code points ('\ud83d\ude41' and '\U0001f641' are both allowed, but different—though concerning the security hazard, it protects you in most places, by making encoding explicit and having UTF codecs decline to encode surrogates). String representation, both internal and public, is a thing Python 3 royally blundered on, and although they fixed as much as they could somewhere around 3.4, they can’t fix most of it without breaking compatibility.

JavaScript is a fair example of one that mostly counts in UTF-16 code units instead (though strings can be malformed Unicode, containing unpaired surrogates). Take a U+1F641 and you get '\ud83d\ude41', and most techniques of looking at the string will look at it by UTF-16 code units—but I’d maintain that it’s incorrect to call it code points, because U+D83D has no distinct identity in the string like it can in Python, and other techniques of looking at the string will prevent you from seeing such surrogates.

It would have been better for Python to have real Unicode strings (that is, exclude the surrogate range) and counted in scalar values instead. Better still to have gone all in on UTF-8 rather than optimising for code point access which costs you a lot of performance in the general case while speeding up something that roughly no one should be doing anyway.

(I firmly believe that UTF-16 is the worst thing to ever happen to Unicode. Surrogates are a menace that were introduced for what hindsight shows extremely clearly were bad reasons, and which I think should have been fairly obviously bad reasons even when they standardised it, though UTF-8 did come about two years too late when you consider development pipelines.)


It's not just about optimizing Unicode access. It's also because system libraries on many common platforms (Windows, macOS) use UTF-16, so if you always store in UTF-8 internally, you have to convert back and forth every time you cross that boundary.


Most languages that seem to use UTF-16 code units are actually mixed-representation ASCII/UTF-16, because UTF-16’s memory cost is too high. I think all major browsers and JavaScript engines are (though the Servo project has shown UTF-16 isn’t necessary, coining and using WTF-8), and Swift was until they migrated to pure UTF-8 in 5.0 (though whether it was from the first release, I don’t know—Swift has significantly changed its string representation several times; https://www.swift.org/blog/utf8-string/ gives details and figures). Python is mixed-representation ASCII/UTF-16/UTF-32!

So in practice, a very significant fraction of the seemingly-UTF–16 places were already needing to allocate and reencode for UTF-16 API calls.

UTF-16 is definitely on the way out, a legacy matter to be avoided in anything new. I can’t comment on macOS UTF-16ness, but if you’re targeting recent Windows (version 1903 onwards for best results, I think) you can use UTF-8 everywhere: Microsoft has backed away from the UTF-16 -W functions, and now actively recommends using code page 65001 (UTF-8) and the -A functions <https://docs.microsoft.com/en-us/windows/apps/design/globali...>—two full decades after I think they should have done it, but at least they’re doing it now. Not sure how much programming language library code may have migrated yet, since in most cases the -W paths may still be needed for older platforms (I’m not sure at what point code page 65001 was dependable; I know it was pretty terrible in Command Prompt in Windows 7, but I’m not sure what was at fault there, whether cmd.exe, conhost.exe or the Console API Kernel32 functions).

Remember also that just about everything outside your programming language and some system and GUI libraries will be using ASCII or UTF-8, including almost all network or disk I/O, so if you use UTF-16 internally you may well need to do at least as much reencoding to UTF-8 as you would have the other way round. Certainly it varies by your use case, but the current consensus that I’ve seen is very strongly in favour of using UTF-8 internally instead of mixed representations, and fairly strongly in favour of using UTF-8 instead of UTF-16 as the internal representation, even if you’ll be interacting with lots of UTF-16 stuff.


I dunno about most languages - both JVM and CLR use UTF-16 for internal representation, for example, hence every language that primarily targets those does that also; and that's a huge slice of the market right there.

Regarding Windows, the page you've linked to doesn't recommend using CP65001 and the -A functions over -W ones. It just says that if you already have code written with UTF-8 in mind, then this is an easy way to port it to modern Windows, because now you actually fully control the codepage for your app (whereas previously it was a user setting, exposed in the UI even, so you couldn't rely on it). But, internally, everything is still UTF-16, so far as I know, and all the -A functions basically just convert to that and call the -W variant in turn. Indeed, that very page states that "Windows operates natively in UTF-16"!

FWIW I personally hate UTF-16 with a passion and want to see it die sooner rather than later - not only it's an ugly hack, but it's a hack that's all about keeping doing the Wrong Thing easy. I just don't think that it'll happen all that fast, so for now, accommodations must be made. IMO Python has the right idea in principle by allowing multiple internal encodings for strings, but not exposing them in the public API even for native code.


Neither the JVM nor the CLR guarantee UTF-16 representation.

From Java 9 onwards, the JVM defaults to using compact strings, which means mixed ISO-8859-1/UTF-16. The command line argument -XX:-CompactStrings disables that.

CLR, I don’t know. But presuming it’s still pure UTF-16, it could still change that as it’s an implementation detail.

(As for UTF-16, not only is it an ugly hack, it’s a hack that ruined Unicode for all the other transformation formats.)

I don’t think Python’s approach was at all sane. The root problem is they made strings sequences of Unicode code points rather than of Unicode scalar values or even UTF-16 code units. (I have a vague recollection of reading some years back that during the py3k endeavour they didn’t have or consult with any Unicode experts, and realise with hindsight that what they went with is terrible.) This bad foundation just breaks everything, so that they couldn’t switch to a sane internal representation. I described the current internal representation as mixed ASCII/UTF-16/UTF-32, but having gone back and read PEP 393 now (implemented in Python 3.3), I’d forgotten just how hideous it is: mixed Latin-1/UCS-2/UCS-4, plus extra bits and data assigned to things like whether it’s ASCII, and its UTF-8 length… sometimes. It ends up fiendishly complex in their endeavour to make it more consistent across narrow architectures and use less memory, and is typically a fair bit slower than what they had before.

Many languages have had an undefined internal representation, and it’s fairly consistently caused them at least some grief when they want to change it, because people too often inadvertently depended on at least the performance characteristics of the internal representation.

By comparison, Rust strings have been transparent UTF-8 from the start—having admittedly the benefit of starting later than most, so that UTF-8 being the best sane choice was clear—which appropriately guides people away from doing bad things by API, except for the existence of code-unit-wise indexing via string[index] and string.len(), which I’m not overly enamoured of (such indexing is essentially discontinuous in the presence of multibyte characters, panicking on accessing the middle of a scalar value, making it too easy to mistake for code point or scalar value indexing). You know what you’re dealing with, and it’s roughly the simplest possible thing and very sane, and you can optimise for that. And Rust can’t change its string representation because it’s public rather than implementation detail.


> I believe this is also the Unicode recommendation when context doesn't determine a different algorithm to read.

Except that emojis are universally two "characters", even those that are encoded as several codepoints. Also, non-composite Korean jamo versus composited jamo.


Like this: “:)” ?

Japanese kana also count as two characters. Which they largely are when romanized, on average. Korean isn’t identical but the information density is approximately the same. Good enough to approximate as such and have a consistent rule.



