String-matching is really scary in Unicode, especially since the exact form of the string matters with respect to composition — and that’s before you even consider that some characters just plain look like others or even are the same glyph. And strings can contain things like zero-width spaces that look like nothing at all.
Sure, there are recommended practices, but there have been enough mistakes already (or lazy programmers) that it is hard to be confident that any string with “interesting” symbols in it is exactly what it appears to be. And there have been security problems stemming from the fact that many interfaces assume the user knows exactly what they’re reading, when often even the programmer doesn’t.
It almost sounds like there could be a lot of benefit from a "homophone check" system, but for Unicode glyphs (perhaps with a variable amount of "closeness"), built into Unicode-handling libraries.
Like how "е" looks identical to "e", which looks close to "ė" if you aren't careful, which might be mistaken for "é" in smaller fonts, even though all four letters are different Unicode code points.
Being able to say "ɡrеɡ" is the same as "greg", even though three of the four characters are actually different, would be extremely useful in some cases and extremely incorrect in others. So giving the developer a "native", easy way to say how "exact" their checks need to be might go a long way toward not only making this problem more obvious but also toward forcing them to be explicit about what they are checking for.
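A rough sketch of what that could look like, in Python, with a tiny hand-rolled confusables table (the real data lives in Unicode's UTS #39 confusables list; the table, function names, and strict/loose switch here are all made up for illustration):

    import unicodedata

    # Tiny illustrative subset of a confusables table; see UTS #39 for the real thing.
    CONFUSABLES = {
        "\u0435": "e",  # CYRILLIC SMALL LETTER IE
        "\u0430": "a",  # CYRILLIC SMALL LETTER A
        "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
    }

    def skeleton(s: str, *, loose: bool = False) -> str:
        """Map each character to a representative of its look-alike class."""
        out = []
        for ch in unicodedata.normalize("NFD", s):
            if loose and unicodedata.combining(ch):
                continue  # in "loose" mode, accents are ignored entirely
            out.append(CONFUSABLES.get(ch, ch))
        return unicodedata.normalize("NFC", "".join(out))

    def looks_like(a: str, b: str, *, loose: bool = False) -> bool:
        return skeleton(a, loose=loose) == skeleton(b, loose=loose)

    spoofed = "\u0261r\u0435\u0261"                 # renders as "greg"
    print(looks_like(spoofed, "greg"))              # True: only confusables differ
    print(looks_like("café", "cafe"))               # False: é is not a look-alike of e
    print(looks_like("café", "cafe", loose=True))   # True: accents ignored in loose mode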
There's also NFKC_Casefold, which is a technique used by RFC 5892 (among others) to limit the characters allowable in a domain name. The problem is that it also disallows 'A', because Casefold(NFKC('A')) != 'A'. I'm sure that's equally annoying in other languages. And in any event it makes it problematic for uses like parsing URLs out of free-form Unicode text.
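You can see the issue with a rough approximation of that stability check (just stdlib Python here, not a real IDNA implementation):

    import unicodedata

    def survives_mapping(ch: str) -> bool:
        # Rough sketch of the RFC 5892 idea: a character is only allowed if it
        # comes through NFKC plus case folding unchanged.
        return unicodedata.normalize("NFKC", ch).casefold() == ch

    print(survives_mapping("A"))  # False: 'A' folds to 'a', so it fails the check
    print(survives_mapping("a"))  # True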
Unicode specifications are incredibly thorough and well thought-out. The problem is that the Unicode spec isn't shippable software. It's not an implementation.
And there's no singular implementation. Worse, nobody uses any particular implementation the same way, and rarely to its fullest extent. Compounding the problems, so much code is _proprietary_. You have no way to verify and track how such code will behave, so interoperability is difficult. For example, good luck trying to reproduce the behavior of Outlook, Mail.app, and gmail.com in terms of how each will highlight URLs in free-form text.
The only saving grace appears to be that the rest of the world, I assume, has grown accustomed to how broken American software is in terms of dealing with I18N issues. And Americans remain blissfully naive. I keep waiting for the other shoe to drop; when managers will finally crack the whip at the behest of international customers and demand that engineers begin taking I18N seriously. But it hasn't happened yet. I've been waiting almost 15 years, accumulating skills and best practices that my employers don't seem to value very much. Oh well....
It's also font-specific. Is т a homoglyph of T or m, for example? There isn't really a good way to solve this because restricting systems to only use ASCII (which also has homoglyphs, e.g. 0/O, 1/l, I/l, ...) is very user-unfriendly.
I got to write an address database matcher once covering all of Scandinavia and Poland.
For suggest functions, it turns out people fully expect to be able to type some kind of ASCII normalisation and still get a match, e.g. the address has ż in it, but people want to type a plain z.
And the rules for this are not entirely obvious. A Swede would totally expect to be able to write ö instead of the Norwegian ø when doing routing across the border.
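Something in the spirit of this (the extra fold table and names are invented for illustration) shows why the rules aren't obvious: ż decomposes, so stripping combining marks is enough, but ø has no decomposition and needs an explicit entry, and folding both ö and ø to plain o is what lets the Swede's spelling match:

    import unicodedata

    # Letters like ø, æ, ł have no canonical decomposition, so stripping
    # combining marks alone won't fold them; they need explicit entries.
    EXTRA_FOLDS = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ł": "l", "Ł": "L"}

    def ascii_fold(s: str) -> str:
        s = "".join(EXTRA_FOLDS.get(ch, ch) for ch in s)
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(ascii_fold("Łódź"))    # Lodz
    print(ascii_fold("Tromsø"))  # Tromso
    print(ascii_fold("Malmö"))   # Malmo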
т is a homoglyph of T—because one could mistake one for the other. They're in a visual equivalence-class. That doesn't mean that you should normalize т into T, though. Those are separate considerations.
If you were granting e.g. domain names, or usernames, you'd be able to map each character in the test string to its homoglyph equivalence-class, and then ask whether anyone has previously registered a name using that sequence of equivalence-class values. So someone's registration of "тhe" would preclude registering "the", and vice-versa; but when you normalized "тhe", you'd still get "mhe".
Of course, to use such a system properly, you'd have to keep the original registered variant of the name around and use it in URL slugs and the like (even if that means resorting to punycode), rather than trying to "canonicalize" the person's provided name through a normalization algorithm. Because they have "[the equivalence class of т]he", not "mhe"; someone else has "mhe".
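A toy version of that registration scheme (the one-entry confusables table and the choice of "t" as the class representative are arbitrary placeholders; for collision checking it only matters that the representative is consistent, and the original form is what gets stored):

    # Collisions are checked on the equivalence-class "skeleton", but the
    # originally submitted string is what gets stored and displayed.
    CONFUSABLES = {"\u0442": "t"}  # CYRILLIC SMALL LETTER TE, one entry for illustration

    def skeleton(name: str) -> str:
        return "".join(CONFUSABLES.get(ch, ch) for ch in name.lower())

    registry: dict[str, str] = {}  # skeleton -> original registered form

    def register(name: str) -> bool:
        key = skeleton(name)
        if key in registry:
            return False           # a look-alike is already taken
        registry[key] = name       # keep the original variant for display/slugs
        return True

    print(register("\u0442he"))    # True  ("тhe")
    print(register("the"))         # False: collides with "тhe"'s class
    print(registry)                # {'the': 'тhe'} -- original form preserved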
> т is a homoglyph of T—because one could mistake one for the other.
I believe gp is talking about the font. In some fonts (especially italic/cursive), the letter "т" looks like "m", and nothing like "T" -- so it's really hard to say with which one it's "visually equivalent".
I was just trying to think of a way to actually deal with this that leaves it up to the developer to decide what they should be matching.
Even in those languages you still might want to treat an OCR'd "c" as "ç" in some cases, or you might want to treat them as identical when moving from a system that only accepted ASCII in the past to one that is fully Unicode compliant. And even in English there are situations where "ç" should not be treated as "c" (like URL resolution or other scenarios where an exact match is needed).
It's not technically correct, but it's "more correct". And giving the developer the ability to determine where along that "scale" of exactness they want to be might help.
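One way to make that scale explicit is to force the caller to name the comparison level; a hypothetical sketch (the level names are made up):

    import unicodedata

    def equivalent(a: str, b: str, level: str = "exact") -> bool:
        """Compare two strings at an explicitly chosen strictness.

        "exact":     code points must match as given
        "canonical": composed and decomposed forms match (NFC)
        "base":      accents and case are ignored, so "ç" matches "c"
        """
        def strip(s: str) -> str:
            return "".join(
                ch for ch in unicodedata.normalize("NFD", s)
                if not unicodedata.combining(ch)
            )

        if level == "exact":
            return a == b
        if level == "canonical":
            return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
        if level == "base":
            return strip(a).casefold() == strip(b).casefold()
        raise ValueError(f"unknown level: {level}")

    print(equivalent("façade", "facade"))                # False: exact by default
    print(equivalent("façade", "facade", level="base"))  # True: ç treated as c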
C# has a great set of options for that and seems like a good candidate for implementing such a system: most string manipulation functions take a culture parameter or an enum saying to use the current culture.
I wouldn't think so, because the languages I know of using ç do not use č and vice versa. There are definitely pairs with "plain" C in a bunch of those.
Such a check would be very useful for checking domains. Right now, a malicious attacker could register google.com, but with one of the characters replaced by a (nearly) identical-looking Unicode symbol. Then the attacker could use HTTPS and create a login page identical to Gmail's. Now all it takes to get someone's credentials is a link to mail.google.com.
It would be nice if the authorities handling domain name registration could forbid domains that look too much like each other.
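To make the attack concrete, the two strings below can render identically in many fonts but are different code point sequences (the spoofed domain is a made-up example):

    import unicodedata

    legit = "google.com"
    spoof = "g\u043e\u043egle.com"   # Cyrillic о in place of both Latin o's

    print(legit == spoof)            # False, even though they can look identical
    for ch in sorted(set(spoof) - set(legit)):
        print(hex(ord(ch)), unicodedata.name(ch))   # 0x43e CYRILLIC SMALL LETTER O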
It would clearly help for the system we have now but the real solution is to push for stricter authentication across the board. As convenient as URL strings can be, we need E-mail clients and other tools to be able to force at least a 2nd layer of authentication (e.g. E-mail claims link is from domain #1; user must counter by selecting from a list of sites actually visited previously, and E-mail client refuses to open link if they don’t match). You could imagine much more elaborate solutions too based on certificates, etc.
I don't think that particular solution would be good from a user experience point of view, but it is indeed a nice idea to check links against domains that you have received emails from (and that are not deleted or in the spam folder).
However, there are ways around this too. I think the fundamental mistake was to allow (all?) Unicode strings as URLs. That said, I can't come up with an elegant solution on the spot (since restricting this to ASCII would be unfair and impractical).
Isn't this basically what PRECIS (RFC 7564) is about? There are open source implementations of that, like golang.org/x/text/secure/precis for Go (including the predefined profiles for e.g. usernames) or Unicode::Precis for Perl.
In the search engine context, the problem to be solved is that both French and English speakers are likely to type [cafe] and not [café] -- the French speaker because they might be on an English keyboard, or because they know it's not ambiguous.
In the search space, therefore, when you index the word 'café', you also index 'cafe' with a smaller weight. And when you see the query [café], you expand the query to ('café' OR 'cafe'-with-smaller-weight)
And you don't want to do either of these if the two words are actually different!
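A rough sketch of that index-side and query-side expansion (the weights and return shape are invented for illustration):

    import unicodedata

    def fold(term: str) -> str:
        # Accent-folded form, used as the lower-weight variant.
        return "".join(
            ch for ch in unicodedata.normalize("NFD", term)
            if not unicodedata.combining(ch)
        )

    def expand(term: str) -> list[tuple[str, float]]:
        # Used both when indexing a document term and when expanding a query term:
        # [café] becomes ('café' OR 'cafe'-with-smaller-weight).
        variants = [(term, 1.0)]
        folded = fold(term)
        if folded != term:
            variants.append((folded, 0.5))
        return variants

    print(expand("café"))   # [('café', 1.0), ('cafe', 0.5)]
    print(expand("cafe"))   # [('cafe', 1.0)] -- nothing to add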
I --THINK-- offhand, that NFKC is what you want to use when preparing a password input for processing/comparison (it's lossy, but only up to a well-defined point). I also --THINK-- that NFC is the form you want to use when retaining source glyph language distinctions.
I agree with the destructive (pre-computation/comparison) operation and that either of the NFKD or NFKC forms should be used (since they destroy non-printing differences between visually compatible characters; a more user-friendly approach).
The 'C' forms are always more condensed (accents are packed in to a single character where possible), and thus of higher entropy per input byte. It is my belief that this form is likely to be less susceptible to attacks.
The 'D' forms seem like good choices for /editors/ where the precise nature of a character might be altered by adding or removing accents. (Most human input boxes; during the input/edit process)
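Whichever form you pick, the point is to apply it consistently before comparison or hashing; e.g. for the password case (NFKC per the suggestion above, with SHA-256 standing in for a real password KDF):

    import hashlib
    import unicodedata

    def password_digest(password: str) -> str:
        # Normalize before hashing so composed and decomposed input
        # (e.g. "é" typed on different platforms) produce the same digest.
        canonical = unicodedata.normalize("NFKC", password)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    composed = "caf\u00e9"      # é as a single code point
    decomposed = "cafe\u0301"   # e + COMBINING ACUTE ACCENT

    print(composed == decomposed)                                     # False
    print(password_digest(composed) == password_digest(decomposed))   # True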
> The 'C' forms are always more condensed (accents are packed in to a single character where possible), and thus of higher entropy per input byte. It is my belief that this form is likely to be less susceptible to attacks.
What sort of attacks are you talking about?
> The 'D' forms seem like good choices for /editors/ where the precise nature of a character might be altered by adding or removing accents. (Most human input boxes; during the input/edit process)
Editors shouldn't care about C vs D. The reason being, once you've typed the grapheme cluster, it's supposed to act in an editor as if it's a single "character" regardless of whether it's made from one codepoint or several. This means that if I type é then arrow keys and the delete key will operate on it exactly the same whether it's composed or decomposed.
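For what it's worth, the underlying difference is easy to see at the code point level (stdlib Python has no grapheme-cluster iterator, so this only shows the storage view, not the editor's view):

    import unicodedata

    composed = "\u00e9"      # é as one code point
    decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

    print(len(composed), len(decomposed))                        # 1 2
    print(composed == decomposed)                                # False as raw strings
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True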
It's a huge mistake for Unicode to have two different code points having the same glyph. There should not be semantic meaning that disappears when something is printed.
We're going to be suffering for that mistake for a looong time.
Disagree. A good example of the opposite mistake is the “Turkish i” problem. Basically they have a version of I with and without a dot — for both lowercase and uppercase — so algorithms that uppercase i to I break Turkish by removing the dot. If the Turkish i were a unique code point, the algorithms would not mess it up.
Then you have the German ß (sharp S), which does not have an upper-case version. While ISO added one (for whatever reason), the official upper case is two letters: either "SS" or "SZ". So you have three different ways to upper-case ß, one of which is guaranteed to be wrong in any official context and two of which lower-case to "ss" or "sz" and not back to ß. That is one big ouch, especially for the ISO standard adding that invalid upper-case variation. Languages are messy; best not to try to transform your input text in any way.
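Both problems show up with a default, locale-unaware casing routine:

    # Locale-unaware upper-casing silently breaks Turkish:
    print("istanbul".upper())         # ISTANBUL; Turkish wants İSTANBUL (dotted capital I)

    # And ß does not round-trip through upper case:
    print("straße".upper())           # STRASSE
    print("straße".upper().lower())   # strasse, not straße
    print("\u1e9e".lower())           # ß: the capital sharp S lower-cases fine,
                                      # but nothing upper-cases back to it by default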
>It's used in typesetting sometimes, and if a character is used then it should have an encoding.
IMO there's little semantic difference so it doesn't deserve a character. We should have drawn the line between content and formatting, but it's too late and what we have now is emoji and one-use glyphs. [1]
This then makes the CJK unification decision even more perplexing. Surely Japanese characters should not be treated the same as Mandarin ones, even if they look the same?
I can't speak Japanese, only some Chinese, but I'm wondering if whether to use the (Chinese) Onyomi or (Japanese) Kunyomi pronunciation in Japanese is related in any way to whether the 山 comes first or last in the compound. If it comes last as in 富士山 "Fuji san", the grammar matches the Chinese, and so does the pronunciation ("Fushishan"). If it comes first as in 山登り "yamanoboru", the grammar is opposite to the Chinese (which would also have the 山 last, i.e. 跑山).
PS: Isn't り pronounced "ri" and る pronounced "ru"?
To some extent you're right, but the problem is much more than can be blamed on font designers. How would you distinguish a Turkish i from an English i with a font?