Until the string has to be stored in a database. Or transmitted over HTTP. Or copy-pasted on Windows running AutoHotkey. Or stored in a logfile. Or used to authenticate. Or used to authorize. Or used by a human to self-identify. Or encoded. Or encrypted. Or used in an element on a web page. Or sent in an email to 12,000,000 users, some of whom might read it on a Windows 2000 box running Nutscrape. Or sent to a vendor in China. Or sent to a client in Israel. Or sent in an SMS message to 12,000,000 users, some of whom might read it on a Nokia 3310. Or sent to my ex-wife.
The English-speaking world has developed an intuition about strings, shaped by ASCII, that simply fails when it comes to Unicode, and that basically explains a lot of these pitfalls.
String length under definition #2 is also fairly complex for some languages, such as Hindi. There are some signs in Hindi which are not characters in their own right and can never stand alone, but which, when placed next to a character, create a new character. So when you type them out on a keyboard you have to hit two keys, but only one character appears on screen. Unicode, too, represents this as two separate code points, but to the human eye it is one character.
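To make that concrete, here's a minimal Swift sketch (Swift's default count is grapheme clusters; the Devanagari dependent vowel sign ि is one of the marks described above):

    let s = "\u{0915}\u{093F}"   // क + dependent vowel sign ि, rendered together as कि
    s.unicodeScalars.count       // 2 code points (two keystrokes)
    s.count                      // 1 grapheme cluster, one visible character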
I would consider ligatures a text rendering concept, which allows for but is distinct from the linguistic concept described by GP.
Edit: to further illustrate my point, in the ligatures I'm familiar with (including the ones in your link), the component characters exist standalone and can be used on their own, unlike GP's example.
In the example "Straße", the ß is, in fact, derived from an ancient ligature for sz.
Old German fonts often had s as ſ, and z as ʒ. This ſʒ eventually became ß.
We (completely?) lost ſ and ʒ over the years, but ß was here to stay.
Its usage changed heavily over time (replacing ss instead of sz), I think most recently in the 90s (https://en.wikipedia.org/wiki/German_orthography_reform_of_1...), when we changed when to use ß and when to use ss.
So while we do replace ß with ss if we uppercase or have no ß available on the keyboard, no one would replace ß with sz (or even ſʒ) today, except for artistic or traditional reasons.
Many people uppercase ß as a lowercase ß or, for various reasons, as an uppercase B. I have yet to see a real-world example of an uppercase ẞ; it does not seem to exist outside of the internet.
For example, "Straße" could be seen capitalized in the wild as STRAßE, STRASSE, or STRABE; with Unicode it could also be STRAẞE. It would not be capitalized with sz (STRASZE) or even ſʒ (STRAſƷE – there is no uppercase ſ), at least not in Germany. In Austria, sz seems to be an option.
So, for most ligatures I would agree with you, but specifically ß is one of those ligatures I would call an outlier, at least in Germany.
P.S.: Maybe the ampersand (&), which is derived from ligatures of the Latin "et", sometimes has similar problems, although on a different level, since it replaces a whole word. However, I have seen it being used as part of "etc.", as in "&c." (https://en.wiktionary.org/wiki/%26c.), so your point might also hold.
P.P.S.: I wonder why the uppercasing in the original post did not use ẞ, but I guess it is because of the rules in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.... (link taken from the feed). The Wikipedia entry says we adopted the capital ẞ in 2017 (though it has been part of Unicode since 2008). It also states that the replacement SZ should be used if the meaning would otherwise get lost (e.g. "in Maßen" vs. "in Massen" would both become "IN MASSEN" but mean either "in moderate amounts" or "in masses", forcing the first to be capitalized as MASZEN). I doubt any programming language or library handles this. I would not have handled it myself in a manual setting either, as it is such an extreme edge case. And when I read it, I would stumble over it.
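For what it's worth, a small Swift sketch of what the default Unicode casing rules do with ß (this only shows the default SpecialCasing behaviour, not the ẞ or SZ variants discussed above):

    "Straße".uppercased()   // "STRASSE": ß becomes SS under the default casing rules
    "Straße".count          // 6
    "STRASSE".count         // 7: uppercasing changed the length
    "STRASSE".lowercased()  // "strasse": the round trip does not restore ß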
JavaScript's minimal library is of course not great, but there are libraries which can help, e.g. grapheme-splitter, although it's not language-aware by design, so in this instance it'll return 2.
We even already had something like this in pure ASCII: "a\bc" has "length" 3 but appears as one glyph when printed (assuming your terminal interprets backspace).
Except for SMS, #2 should work fine for all of those uses. And the reason you'd need a different measure for SMS is because SMS is bad.
And half the things on your list are just places that Unicode itself might fail, regardless of what the length is? That has nothing to do with how you count, because the number you get won't matter.
No, #2-length is a higher level of abstraction, meant for display to humans. E.g. calculating pixel widths for sizing a column in a GUI table, etc.
I don't think you understood the lower level of abstraction in the examples that gp (dotancohen) laid out. (Encrypting/compressing/transmitting/etc. strings.) In those cases, you have to know the exact count of bytes and not the composed graphemes that collapse to a human-visible "length".
In other words, the following example code to allocate a text buffer array would be wrong:
char *t = malloc(length_type_2); // buggy: the grapheme-cluster count can be far smaller than the UTF-8 byte count, so the buffer is undersized
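A quick Swift sketch of why that allocation comes up short (length_type_2 here is the grapheme-cluster count, while the buffer needs the byte count):

    let s = "Straße"
    s.count              // 6 grapheme clusters ("length type 2")
    s.utf8.count         // 7 UTF-8 bytes: already more than 6, before any terminator

    let e = "e\u{0301}"  // "é" written as e + combining acute accent
    e.count              // 1 grapheme cluster
    e.utf8.count         // 3 bytes: a single cluster can need several bytes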
>is a higher-level of abstraction to display for humans
Am I missing something here? Who do you think is actually writing this stuff? Aliens?
I just want the first 5 characters of an input string, or to split a string by ":" or something.
I have plenty of apps where none of my strings _ever_ touch a database and are purely used for UI. I'll ask this again: Why break strings _globally_ to solve some arcane <0.0001% edge case instead of fixing the interface at the edges for that use case?
All this mindset has done is cause countless hours of harm/frustration in the developer community.
It's causing frustration because it's breaking people's preconceived notions about how text is simple. Previously, the frustration usually fell on the end users of apps written by people with those preconceived notions. And there are far more end users than developers, so...
>Am I missing something here? [...] I have plenty of apps where none of my strings _ever_ touch a database and are purely used for UI.
Yes, the part you were missing is that I was continuing a discussion about a specific technical point brought up by the parent comments from Dylan16807 and dotancohen. The fact that you manipulate a lot of strings without databases or persistence/transmission across other i/o boundaries is not relevant to the context of my reply. To restate that context... Dylan16807's claim that length_type_#2 (counting human-visible grapheme clusters) is all the information needed for i/o boundaries is incorrect and will lead to buggy code.
With that aside, I do understand your complaint in the following:
>I just want the first 5 characters of an input string, or to split a string by ":" or something.
This is a separate issue about ergonomics of syntax and has been debated often. A previous subthread from 6 years ago had the same debate: https://news.ycombinator.com/item?id=10519999
In that thread, some commenters (tigeba, cookiecaper) had the same complaint as you, that Swift makes it hard to do simple tasks with strings (e.g. IndexOf()). The other commenters (pilif, mikeash) respond that Swift makes the high-level and low-level abstractions of Unicode more explicit. A similar philosophy shows up in a Quora answer by a Swift compiler contributor, highlighting the tradeoff of programmers not being aware whether string tasks are O(1) fast or O(n) slow.
And yes, your complaint is shared by many... By making the string API explicitly confront "graphemes vs code units vs code points vs bytes", you do get clunky, cumbersome syntax such as "string.substringFromIndex(string.startIndex.advancedBy(1))", as in the Q&A answer.
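(That spelling is the old Swift 2 API. For comparison, here is a sketch of roughly the same operations in current Swift, including the "first 5 characters" and "split on ':'" tasks from the earlier comment; still explicit about indices, but less painful.)

    let s = "héllo:world"
    let firstFive = String(s.prefix(5))    // "héllo": the first 5 grapheme clusters
    let parts = s.split(separator: ":")    // ["héllo", "world"]
    let tail = String(s[s.index(s.startIndex, offsetBy: 1)...])  // everything after the first cluster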
I suppose Swift could have designed the string API to your philosophy. I assume the language designers anticipated it would have led to more bugs. The tradeoff was:
- "cumbersome syntax with less bugs"
...or...
- "easier traditional syntax (like simple days of ASCII) with more hidden bugs and/or naive performance gotchas"
The main reason I care about the length of strings is to limit storage/memory/transmission size. Definitions 1) and, to a lesser degree, 4) achieve that (max 4 UTF-8 bytes per code point).
2) comes with a lot of complexity, so I'd only use 2) in the few places where string lengths need to be meaningful for humans, and when stepping through a string in the UI. Two reasons:
* One (extended) grapheme cluster can require a lot of storage (in principle it isn't bounded at all, since combining marks can be stacked indefinitely), so it's unsuitable for length limitations. See the sketch after this list.
* It needs knowledge of big and regularly updated Unicode tables, so if you use it in an API, chances are high that the two sides will interpret it differently, which makes it unsuitable for API use.
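To make the storage point concrete, a small Swift sketch (the family emoji is just one convenient example; long runs of stacked combining marks behave the same way):

    // U+1F468 ZWJ U+1F469 ZWJ U+1F467 ZWJ U+1F466 (man, woman, girl, boy joined by ZWJ)
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
    family.count                  // 1 grapheme cluster
    family.unicodeScalars.count   // 7 code points
    family.utf8.count             // 25 bytes for a "length 1" string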
> One (extended) grapheme cluster can require a lot of storage (I'm not sure if it's even bounded at all), so it's unsuitable for length limitations
You could take inspiration from one of the annexes and count no more than 31 code points together.
> Needs knowledge of big and regularly updated unicode tables. So if you use it in an API chances are high that both sides will interpret it differently, so it's unsuitable for API use.
It depends on what you're doing. That could cause problems, but so could counting bytes. Users will not be thrilled when their character limit varies wildly based on which characters they use. Computers aren't going to be thrilled either when replacing one character with another changes the length. Or, you know, when they can't uppercase a string.
And well, I don’t think anyone’s arguing that a “length” function needs to be exposed by the standard library of any language (I saw that Swift does btw).
Besides byte count, it's the only interpretation of "length" that I think makes sense for users (like I mentioned above, validating that a name has at least length 1, for example).
One problem with grapheme clusters as length is that you can't do math with it. In other words, string A (of length 1) concatenated with string B (of length 1) might result in string C that also has length 1. Meanwhile, concatenating them in the reverse order might make a string D with length 2.
This won't matter for most applications but it's something you may easily overlook when assuming grapheme clusters are enough for everything.
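A quick Swift illustration of that non-additivity (the combining acute accent is the whole trick here):

    let a = "e"
    let b = "\u{0301}"   // COMBINING ACUTE ACCENT, a degenerate cluster on its own
    a.count              // 1
    b.count              // 1
    (a + b).count        // 1: "é", the accent attaches to the preceding "e"
    (b + a).count        // 2: the accent has no base to attach to, then "e"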
> Grapheme clusters and no. Let’s see if anyone has a different interpretation or unsupported use-case.
Here's one. How many characters is this: שלום
And how about this: שַלוֹם
Notice that the former is lacking the diacritics of the latter. Is that more characters?
Maybe a better example: ال vs ﻻ. Those are both Alif-Lamm, and no that is not a ligature. The former is often normalized to the latter, but could be stored separately. And does adding Hamza ء add a character? Does it depend if it's at the end of the word?
I completely lack the cultural context to tell if this is trivial or not.
Naive heuristic: look at the font rendering on my Mac:
ء : 1 (but it sounds like it can be 0 depending on context)
ﻻ : 1
ال : 2
שלום : 4
שַלוֹם : 4
Do you see any issues?
In Japanese, which I am more familiar with, maybe it’s similar to ゙ (the dakuten), which as you can see here renders with width 1 on its own, but is mostly used as a voicing mark, as in turning か (ka) into が (ga); thus 0 (or, more properly, counted as part of a cluster with length 1).
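In Swift terms (a small sketch; U+3099 is the combining form of that voicing mark):

    let s = "\u{304B}\u{3099}"   // か + combining dakuten, rendered as が
    s.unicodeScalars.count       // 2 code points
    s.count                      // 1: the mark counts as part of the cluster, not on its own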
The use-case I’m getting at is answering the question “does this render anything visible (including white space)”? I think it’s a reasonable answer on the whole falsehood-programmers-believe-about-names validation deal (after trimming leading and trailing whitespace).
It’s the most intuitive and practically useful interpretation of “length” for arbitrary Unicode as text. The fact that it’s hard, complex, inelegant and arguably inconsistent does not change that.
Unicode has a fully spelled out and testable algorithm that covers this. There's no need to make anything up on the spot.
You don't need cultural context. It's not trivial. The algorithm is six printed pages.
No individual part of it is particularly hard, but every language has its own weird stuff to add.
I feel strongly that programmers should implement Unicode case folding with tests at least once, the way they should implement sorting and network connections and so on.
.
> I think it’s a reasonable answer on the whole falsehood-programmers-believe-about-names validation deal
All these "false things programmers believe about X" documents? Go through Unicode just one time, you know 80% of all of them.
I get that you like the framing you just made up, but there's an official one that the world's language nerds have been arguing over for 20 years, including having most of the population actually using it.
I suggest you consider that it might be pretty well developed by now.
.
> The use-case I’m getting at is answering the question “does this render anything visible (including white space)”?
This is not the appropriate approach, as white space can be rendered by non-characters, and characters can render nothing and still be characters under Unicode rules.
.
> It’s the most intuitive and practically useful interpretation of “length” for arbitrary Unicode as text.
It also fails for Mongol, math symbols, music symbols, the zero-width space, the conjoiner system, et cetera.
.
> The fact that it’s hard, complex, inelegant and arguably inconsistent
Fortunately, the real approach is easy, complex, inelegant, and wholly consistent. It's also well documented, and works reliably between programming languages that aren't Perl.
At any rate, you're off in another thread instructing me that the linguists are wrong with their centuries of focus because of things you thought up on the spot, and that Unicode is wrong because it counts spaces when counting characters, so, I think I'm going to disconnect now.
The only real issue that I see is that ﻻ is considered to be two letters, even if it is a single codepoint or grapheme.
Regarding שַלוֹם I personally would consider it to be four letters, but the culture associated with that script (my culture) enjoys dissecting even the shape of the lines of the letters to look for meaning, I'm willing to bet that I'll find someone who will argue it to be six for the context he needs it to be. That's why I mentioned it.
> In Japanese, which I am more familiar with, maybe it’s similar to ゙,
Yes, for the Hebrew example, I believe so. If you're really interested, it would be more akin to our voicing mark dagesh, which e.g. turns V ב into B בּ and F פ into P פּ.
Wait'll you find out that whether ß is one letter or two varies based on which brand of German you're drinking.
Programmers should understand that it doesn't matter if they think the thousands of years of language rules are irrational; Unicode either handles them or it's wrong.
Unicode doesn't exist to standardize the world's languages to a representation that programmers enjoy. It exists to encode all the world's languages *as they are*.
Whether you are sympathetic to the rules of foreign languages isn't super relevant in practice.
> Unicode doesn't exist to standardize the world's languages to a
> representation that programmers enjoy. It exists to encode all
> the world's languages *as they are*.
Thank you for this terrific quote that I'm going to have to upset managers and clients with. This is the eloquence that I've been striving to express for a decade.
If you give a Japanese writer 140 characters, they can already encode double the amount of information an English writer can. Non-storage “character”-based lengths have always been a poor estimation of information encoding so worrying about some missing graphemes feels like you’re missing the bigger problem.
I mean character count. The Unicode standard defines that as separate from, and meaningfully different from, grapheme clusters or code points.
The confusion you're relying on isn't real.
.
> Do you want to count zero-width characters?
Are they characters? Yes? Then I want to count them.
.
> More specifically what are you trying to do?
(narrows eyes)
I want to count characters. You're trying to make that sound confusing, but it really isn't.
I don't care if you think it's a "zero width" character. Zero width space is often not actually zero width in programmers' fonts, and almost every font has at least a dozen of these wrong.
I don't care about whatever other special moves you think you have, either.
This is actually very simple.
The Unicode standard has something called a "character count." It is more work than the grapheme cluster count. The grapheme cluster count doesn't honor removals, replacements, or substitutions, and it does something different in case folding.
The people trying super hard to show how technically apt they are at the difficulties in Unicode are just repeating things they've heard other people say.
The actual Unicode standard makes this straightforward, and has these two terms separated already.
If you genuinely want to understand this, read Unicode 14 chapter 2, "General Structure." It's about 50 pages. You can probably get away with just reading 2.2.3.
It is three pages long.
It's called "characters, not glyphs," because 𝐞𝐯𝐞𝐧 𝐭𝐡𝐞 𝐚𝐮𝐭𝐡𝐨𝐫𝐬 𝐨𝐟 𝐭𝐡𝐞 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝐬𝐭𝐚𝐧𝐝𝐚𝐫𝐝 𝐰𝐚𝐧𝐭 𝐩𝐞𝐨𝐩𝐥𝐞 𝐭𝐨 𝐬𝐭𝐨𝐩 𝐩𝐫𝐞𝐭𝐞𝐧𝐝𝐢𝐧𝐠 𝐭𝐡𝐚𝐭 "𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫" 𝐦𝐞𝐚𝐧𝐬 𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 𝐨𝐭𝐡𝐞𝐫 𝐭𝐡𝐚𝐧 𝐚 𝐟𝐮𝐥𝐥𝐲 𝐚𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐞𝐝 𝐬𝐞𝐫𝐢𝐞𝐬.
The word "character" is well defined in Unicode. If you think it means anything other than point 2, 𝒚𝒐𝒖 𝒂𝒓𝒆 𝒔𝒊𝒎𝒑𝒍𝒚 𝒊𝒏𝒄𝒐𝒓𝒓𝒆𝒄𝒕, 𝒏𝒐𝒕 𝒑𝒍𝒚𝒊𝒏𝒈 𝒂 𝒅𝒆𝒆𝒑 𝒖𝒏𝒅𝒆𝒓𝒔𝒕𝒂𝒏𝒅𝒊𝒏𝒈 𝒐𝒇 𝒄𝒉𝒂𝒓𝒂𝒄𝒕𝒆𝒓 𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈 𝒕𝒐𝒑𝒊𝒄𝒔.
All you need is on pages 15 and 16. Go ahead.
.
I want every choice to be made in accord with the Unicode standard. Every technicality you guys are trying to bring up was handled 20 years ago.
These words are actually well defined in the context of Unicode, and they're non-confusing in any other context. If you struggle with this, it is by choice.
Size means byte count. Length means character count. No, it doesn't matter if you incorrectly pull technical terminology like "grapheme clusters" and "code points," because I don't mean either of those. I mean 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫 𝐜𝐨𝐮𝐧𝐭.
If you have the sequence `capital a`, `zero width joiner`, `joining umlaut`, `special character`, `emoji face`, `skin color modifier`, you have:
1. Six codepoints
2. Three grapheme clusters
3. Four characters
Please wait until you can explain why before continuing to attempt to teach technicalities, friend. It doesn't work the way you claim.
Here, let's save you some time on some other technicalities that aren't.
1. If you write a Unicode modifying character after something that cannot be modified - like adding an umlaut to a zero width joiner - then the umlaut will be considered a full character, but also discarded. Should it be counted? According to the Unicode standard, yes.
2. If you write a zero width anything, and it's a character, should it be counted? Yes.
3. If you write something that is a grapheme cluster, but your font or renderer doesn't support it, so it renders as two characters (for example, they added couple emoji in Unicode 10 and gay couples in Unicode 11, so phones that were released in between would render a two-man couple as two individual men), should that be counted as a single character as it's written, or a double character as it's rendered? Single.
4. If you're in a font that includes language variants (for example, fancy Swiss German fonts sometimes separate the Eszett into two distinct S characters that are flowed separately), should that be calculated as one character or two? Two, it turns out.
5. If a character pair is kerned into a single symbol, like an English cursive capital F into a lowercase e, should that be counted as one character or two? Two.
There's a whole chapter of these. If you really enjoy going through them, stop quizzing me and just read it.
These questions have been answered for literal decades. Just go read the standard already.
You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.
I’m not 100% sure what context characters even apply in on a computer, other than for interest’s sake.
Invisible/zero-width characters are not interesting when editing text, and character count doesn’t correlate with size, therefore there’s no canonical length.
> You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.
The other things have different labels than "length," which you've already been told.
.
> Invisible/zero-width characters are not interesting when editing text
> These words are actually well defined in the context of Unicode, and
> they're non-confusing in any other context.
John, you've set right so many misconceptions here, and taught much. I appreciate that. However, unfortunately, this sentence I must disagree with. Software developers are not reading the spec, and the terms therefore 𝒂𝒓𝒆 confusing them.
As with security issues, I see no simple solution to get developers to familiarize themselves with even the basics before writing code, nor a way to get PMs to vet and hire such developers. Until the software development profession becomes accredited like engineering or law, Unicode and security issues (and accessibility, and robustness, and dataloss, and performance, and maintainability issues) will continue to plague the industry.
It absolutely is, unless you allow xenophilia to overcomplicate the issue. You don't need to completely bork your string implementation to support every language ever concocted by mankind. No one is coding in cuneiform.