String-matching is really scary in Unicode, especially since the exact form of the string matters with respect to composition — and that’s before you even consider that some characters just plain look like others or even are the same glyph. And strings can contain things like zero-width spaces that look like nothing at all.
Sure, there are recommended practices, but there have been enough mistakes already (or lazy programmers) that it is hard to be confident that any string with “interesting” symbols in it is exactly what it appears to be. And there have been security problems stemming from the fact that many interfaces assume the user knows exactly what they’re reading, when often even the programmer doesn’t.
It almost sounds like there could be a lot of benefit from a "homophone check" system, but for Unicode glyphs (perhaps with a variable amount of "closeness"), built into Unicode-handling libraries.
Like how "е" looks identical to "e", which looks close to "ė" if you aren't careful, which might be mistaken for "é" in smaller fonts, even though all four letters are different Unicode code points.
Being able to say "ɡrеɡ" is the same as "greg", even though three of the four characters are actually different, would be extremely useful in some cases and extremely incorrect in others. So giving the developer a "native", easy way to say how "exact" their checks need to be might go a long way toward not only making this problem more obvious but also toward forcing them to be explicit about what they are checking for.
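A rough sketch of what that could look like, in Python, with a tiny hand-rolled confusables table (the real data lives in Unicode's UTS #39 confusables list; the table, function names, and strict/loose switch here are all made up for illustration):

    import unicodedata

    # Tiny illustrative subset of a confusables table; see UTS #39 for the real thing.
    CONFUSABLES = {
        "\u0435": "e",  # CYRILLIC SMALL LETTER IE
        "\u0430": "a",  # CYRILLIC SMALL LETTER A
        "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
    }

    def skeleton(s: str, *, loose: bool = False) -> str:
        """Map each character to a representative of its look-alike class."""
        out = []
        for ch in unicodedata.normalize("NFD", s):
            if loose and unicodedata.combining(ch):
                continue  # in "loose" mode, accents are ignored entirely
            out.append(CONFUSABLES.get(ch, ch))
        return unicodedata.normalize("NFC", "".join(out))

    def looks_like(a: str, b: str, *, loose: bool = False) -> bool:
        return skeleton(a, loose=loose) == skeleton(b, loose=loose)

    spoofed = "\u0261r\u0435\u0261"                 # renders as "greg"
    print(looks_like(spoofed, "greg"))              # True: only confusables differ
    print(looks_like("café", "cafe"))               # False: é is not a look-alike of e
    print(looks_like("café", "cafe", loose=True))   # True: accents ignored in loose mode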
There's also NFKC_Casefold, which is a technique used by RFC 5892 (among others) to limit the characters allowable in a domain name. The problem is that it also disallows 'A', because Casefold(NFKC('A')) != 'A'. I'm sure that's equally annoying in other languages. And in any event it makes it problematic for uses like parsing URLs out of free-form Unicode text.
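You can see the issue with a rough approximation of that stability check (just stdlib Python here, not a real IDNA implementation):

    import unicodedata

    def survives_mapping(ch: str) -> bool:
        # Rough sketch of the RFC 5892 idea: a character is only allowed if it
        # comes through NFKC plus case folding unchanged.
        return unicodedata.normalize("NFKC", ch).casefold() == ch

    print(survives_mapping("A"))  # False: 'A' folds to 'a', so it fails the check
    print(survives_mapping("a"))  # True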
Unicode specifications are incredibly thorough and well thought-out. The problem is that the Unicode spec isn't shippable software. It's not an implementation.
And there's no singular implementation. Worse, nobody uses any particular implementation the same way, and rarely to its fullest extent. Compounding the problems, so much code is _proprietary_. You have no way to verify and track how such code will behave, so interoperability is difficult. For example, good luck trying to reproduce the behavior of Outlook, Mail.app, and gmail.com in terms of how each will highlight URLs in free-form text.
The only saving grace appears to be that the rest of the world, I assume, has grown accustomed to how broken American software is in terms of dealing with I18N issues. And Americans remain blissfully naive. I keep waiting for the other shoe to drop; when managers will finally crack the whip at the behest of international customers and demand that engineers begin taking I18N seriously. But it hasn't happened yet. I've been waiting almost 15 years, accumulating skills and best practices that my employers don't seem to value very much. Oh well....
It's also font-specific. Is т a homoglyph of T or m, for example? There isn't really a good way to solve this because restricting systems to only use ASCII (which also has homoglyphs, e.g. 0/O, 1/l, I/l, ...) is very user-unfriendly.
I got to write an address database matcher once covering all of Scandinavia and Poland.
For suggest functions, it turns out people fully expect to be able to type some kind of ASCII normalisation and still get a match, e.g. the address has ż in it, but people want to type a plain z.
And the rules for this are not entirely obvious. A Swede would totally expect to be able to write ö instead of the Norwegian ø when doing routing across the border.
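Something in the spirit of this (the extra fold table and names are invented for illustration) shows why the rules aren't obvious: ż decomposes, so stripping combining marks is enough, but ø has no decomposition and needs an explicit entry, and folding both ö and ø to plain o is what lets the Swede's spelling match:

    import unicodedata

    # Letters like ø, æ, ł have no canonical decomposition, so stripping
    # combining marks alone won't fold them; they need explicit entries.
    EXTRA_FOLDS = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ł": "l", "Ł": "L"}

    def ascii_fold(s: str) -> str:
        s = "".join(EXTRA_FOLDS.get(ch, ch) for ch in s)
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(ascii_fold("Łódź"))    # Lodz
    print(ascii_fold("Tromsø"))  # Tromso
    print(ascii_fold("Malmö"))   # Malmo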
т is a homoglyph of T—because one could mistake one for the other. They're in a visual equivalence-class. That doesn't mean that you should normalize т into T, though. Those are separate considerations.
If you were granting e.g. domain names, or usernames, you'd be able to map each character in the test string to its homoglyph equivalence-class, and then ask whether anyone has previously registered a name using that sequence of equivalence-class values. So someone's registration of "тhe" would preclude registering "the", and vice-versa; but when you normalized "тhe", you'd still get "mhe".
Of course, to use such a system properly, you'd have to keep the original registered variant of the name around and use it in URL slugs and the like (even if that means resorting to punycode), rather than trying to "canonicalize" the person's provided name through a normalization algorithm. Because they have "[the equivalence class of т]he", not "mhe"; someone else has "mhe".
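A toy version of that registration scheme (the one-entry confusables table and the choice of "t" as the class representative are arbitrary placeholders; for collision checking it only matters that the representative is consistent, and the original form is what gets stored):

    # Collisions are checked on the equivalence-class "skeleton", but the
    # originally submitted string is what gets stored and displayed.
    CONFUSABLES = {"\u0442": "t"}  # CYRILLIC SMALL LETTER TE, one entry for illustration

    def skeleton(name: str) -> str:
        return "".join(CONFUSABLES.get(ch, ch) for ch in name.lower())

    registry: dict[str, str] = {}  # skeleton -> original registered form

    def register(name: str) -> bool:
        key = skeleton(name)
        if key in registry:
            return False           # a look-alike is already taken
        registry[key] = name       # keep the original variant for display/slugs
        return True

    print(register("\u0442he"))    # True  ("тhe")
    print(register("the"))         # False: collides with "тhe"'s class
    print(registry)                # {'the': 'тhe'} -- original form preserved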
> т is a homoglyph of T—because one could mistake one for the other.
I believe gp is talking about the font. In some fonts (especially italic/cursive), the letter "т" looks like "m", and nothing like "T" -- so it's really hard to say with which one it's "visually equivalent".
I was just trying to think of a way to actually deal with this that leaves it up to the developer to decide what they should be matching.
Even in those languages you still might want to treat an OCR'd "c" as "ç" in some cases, or you might want to treat them as identical when moving from a system that only accepted ASCII in the past to one that is fully Unicode compliant. And even in English there are situations where "ç" should not be treated as "c" (like URL resolution or other scenarios where an exact match is needed).
It's not technically correct, but it's "more correct". And giving the developer the ability to determine where along that "scale" of exactness they want to be might help.
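One way to make that scale explicit is to force the caller to name the comparison level; a hypothetical sketch (the level names are made up):

    import unicodedata

    def equivalent(a: str, b: str, level: str = "exact") -> bool:
        """Compare two strings at an explicitly chosen strictness.

        "exact":     code points must match as given
        "canonical": composed and decomposed forms match (NFC)
        "base":      accents and case are ignored, so "ç" matches "c"
        """
        def strip(s: str) -> str:
            return "".join(
                ch for ch in unicodedata.normalize("NFD", s)
                if not unicodedata.combining(ch)
            )

        if level == "exact":
            return a == b
        if level == "canonical":
            return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
        if level == "base":
            return strip(a).casefold() == strip(b).casefold()
        raise ValueError(f"unknown level: {level}")

    print(equivalent("façade", "facade"))                # False: exact by default
    print(equivalent("façade", "facade", level="base"))  # True: ç treated as c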
C# has a great set of options for that and seems like a good candidate for implementing such a system: most string manipulation functions take a culture parameter or an enum saying to use the current culture.
I wouldn't think so, because the languages I know of using ç do not use č and vice versa. There are definitely pairs with "plain" C in a bunch of those.
Such a check would be very useful for checking domains. Right now, a malicious attacker could register google.com, but with one of the characters replaced by a (nearly) identical-looking Unicode symbol. Then the attacker could use HTTPS and create a login page identical to Gmail's. Now all it takes to get someone's credentials is a link to mail.google.com.
It would be nice if the authorities handling domain name registration could forbid domains that look too much like each other.
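To make the attack concrete, the two strings below can render identically in many fonts but are different code point sequences (the spoofed domain is a made-up example):

    import unicodedata

    legit = "google.com"
    spoof = "g\u043e\u043egle.com"   # Cyrillic о in place of both Latin o's

    print(legit == spoof)            # False, even though they can look identical
    for ch in sorted(set(spoof) - set(legit)):
        print(hex(ord(ch)), unicodedata.name(ch))   # 0x43e CYRILLIC SMALL LETTER O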
It would clearly help for the system we have now but the real solution is to push for stricter authentication across the board. As convenient as URL strings can be, we need E-mail clients and other tools to be able to force at least a 2nd layer of authentication (e.g. E-mail claims link is from domain #1; user must counter by selecting from a list of sites actually visited previously, and E-mail client refuses to open link if they don’t match). You could imagine much more elaborate solutions too based on certificates, etc.
I don't think that particular solution would be good from a user experience point of view, but it is indeed a nice idea to check links against domains that you have received emails from (and that are not deleted or in the spam folder).
However, there are ways around this too. I think the fundamental mistake was to allow (all?) Unicode strings as URLs. That said, I can't come up with an elegant solution on the spot (since restricting this to ASCII would be unfair and impractical).
Isn't this basically what PRECIS (RFC 7564) is about? There are open source implementations of that, like golang.org/x/text/secure/precis for Go (including the predefined profiles for e.g. usernames) or Unicode::Precis for Perl.
In the search engine context, the problem to be solved is that both French and English speakers are likely to type [cafe] and not [café] -- the French speaker because they might be on an English keyboard, or because they know it's not ambiguous.
In the search space, therefore, when you index the word 'café', you also index 'cafe' with a smaller weight. And when you see the query [café], you expand the query to ('café' OR 'cafe'-with-smaller-weight)
And you don't want to do either of these if the two words are actually different!
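A rough sketch of that index-side and query-side expansion (the weights and return shape are invented for illustration):

    import unicodedata

    def fold(term: str) -> str:
        # Accent-folded form, used as the lower-weight variant.
        return "".join(
            ch for ch in unicodedata.normalize("NFD", term)
            if not unicodedata.combining(ch)
        )

    def expand(term: str) -> list[tuple[str, float]]:
        # Used both when indexing a document term and when expanding a query term:
        # [café] becomes ('café' OR 'cafe'-with-smaller-weight).
        variants = [(term, 1.0)]
        folded = fold(term)
        if folded != term:
            variants.append((folded, 0.5))
        return variants

    print(expand("café"))   # [('café', 1.0), ('cafe', 0.5)]
    print(expand("cafe"))   # [('cafe', 1.0)] -- nothing to add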
I --THINK-- offhand, that NFKC is what you want to use when preparing a password input for processing/comparison (it's lossy, but only up to a well-defined point). I also --THINK-- that NFC is the form you want to use when retaining source glyph language distinctions.
I agree with the destructive (pre-computation/comparison) operation and that either of the NFKD or NFKC forms should be used (since they destroy non-printing differences between visually compatible characters; a more user-friendly approach).
The 'C' forms are always more condensed (accents are packed in to a single character where possible), and thus of higher entropy per input byte. It is my belief that this form is likely to be less susceptible to attacks.
The 'D' forms seem like good choices for /editors/ where the precise nature of a character might be altered by adding or removing accents. (Most human input boxes; during the input/edit process)
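Whichever form you pick, the point is to apply it consistently before comparison or hashing; e.g. for the password case (NFKC per the suggestion above, with SHA-256 standing in for a real password KDF):

    import hashlib
    import unicodedata

    def password_digest(password: str) -> str:
        # Normalize before hashing so composed and decomposed input
        # (e.g. "é" typed on different platforms) produce the same digest.
        canonical = unicodedata.normalize("NFKC", password)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    composed = "caf\u00e9"      # é as a single code point
    decomposed = "cafe\u0301"   # e + COMBINING ACUTE ACCENT

    print(composed == decomposed)                                     # False
    print(password_digest(composed) == password_digest(decomposed))   # True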
> The 'C' forms are always more condensed (accents are packed in to a single character where possible), and thus of higher entropy per input byte. It is my belief that this form is likely to be less susceptible to attacks.
What sort of attacks are you talking about?
> The 'D' forms seem like good choices for /editors/ where the precise nature of a character might be altered by adding or removing accents. (Most human input boxes; during the input/edit process)
Editors shouldn't care about C vs D. The reason being, once you've typed the grapheme cluster, it's supposed to act in an editor as if it's a single "character" regardless of whether it's made from one codepoint or several. This means that if I type é then arrow keys and the delete key will operate on it exactly the same whether it's composed or decomposed.
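For what it's worth, the underlying difference is easy to see at the code point level (stdlib Python has no grapheme-cluster iterator, so this only shows the storage view, not the editor's view):

    import unicodedata

    composed = "\u00e9"      # é as one code point
    decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

    print(len(composed), len(decomposed))                        # 1 2
    print(composed == decomposed)                                # False as raw strings
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True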
It's a huge mistake for Unicode to have two different code points having the same glyph. There should not be semantic meaning that disappears when something is printed.
We're going to be suffering for that mistake for a looong time.
Disagree. A good example of the opposite mistake is the “Turkish i” problem. Basically they have a version of I with and without a dot — for both lowercase and uppercase — so algorithms that uppercase i to I break Turkish by removing the dot. If the Turkish i were a unique code point, the algorithms would not mess it up.
Then you have the German ß (sharp S), which does not have an upper-case version. While ISO added one (for whatever reason), the official upper case is two letters: either "SS" or "SZ". So you have three different ways to upper-case ß, one of which is guaranteed to be wrong in any official context and two of which lower-case to "ss" or "sz" and not back to ß. That is one big ouch, especially for the ISO standard adding that invalid upper-case variation. Languages are messy; best not to try to transform your input text in any way.
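Both problems show up with a default, locale-unaware casing routine:

    # Locale-unaware upper-casing silently breaks Turkish:
    print("istanbul".upper())         # ISTANBUL; Turkish wants İSTANBUL (dotted capital I)

    # And ß does not round-trip through upper case:
    print("straße".upper())           # STRASSE
    print("straße".upper().lower())   # strasse, not straße
    print("\u1e9e".lower())           # ß: the capital sharp S lower-cases fine,
                                      # but nothing upper-cases back to it by default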
>It's used in typesetting sometimes, and if a character is used then it should have an encoding.
IMO there's little semantic difference so it doesn't deserve a character. We should have drawn the line between content and formatting, but it's too late and what we have now is emoji and one-use glyphs. [1]
This then makes the CJK unification decision even more perplexing. Surely Japanese characters should not be treated the same as Mandarin ones, even if they look the same?
I can't speak Japanese, only some Chinese, but I'm wondering if whether to use the (Chinese) Onyomi or (Japanese) Kunyomi pronunciation in Japanese is related in any way to whether the 山 comes first or last in the compound. If it comes last as in 富士山 "Fuji san", the grammar matches the Chinese, and so does the pronunciation ("Fushishan"). If it comes first as in 山登り "yamanoboru", the grammar is opposite to the Chinese (which would also have the 山 last, i.e. 跑山).
PS: Isn't り pronounced "ri" and る pronounced "ru"?
To some extent you're right, but the problem is much more than can be blamed on font designers. How would you distinguish a Turkish i from an English i with a font?