This comparison is language specific. Do you group c with ç and č? In English yo...

ygra · on May 29, 2017

It's also font-specific. Is т a homoglyph of T or m, for example? There isn't really a good way to solve this because restricting systems to only use ASCII (which also has homoglyphs, e.g. 0/O, 1/l, I/l, ...) is very user-unfriendly.

algesten · on May 29, 2017

I got to write an address database matcher once covering all of scandinavia and poland.

For suggest functions, it turns out people fully expect to be able to write some sort of ascii normalisation and still get a match, i.e. the address has ż in it, but people want to type a plain z.

And the rules for this are not entirely obvious. A swede would totally expect being able to write ö instead of the norwegian ø when doing routing across the border.

derefr · on May 30, 2017

т is a homoglyph of T—because one could mistake one for the other. They're in a visual equivalence-class. That doesn't mean that you should normalize т into T, though. Those are separate considerations.

If you were granting e.g. domain names, or usernames, you'd be able to map each character in the test string to its homoglyph equivalence-class, and then ask whether anyone has previously registered a name using that sequence of equivalence-class values. So someone's registration of "тhe" would preclude registering "the", and vice-versa; but when you normalized "тhe", you'd still get "mhe".

Of course, to use such a system properly, you'd have to keep the original registered variant of the name around and use it URL slugs and the like (even if that means resorting to punycode), rather than trying to "canonicalize" the person's provided name through a normalization algorithm. Because they have "[the equivalence class of т]he", not "mhe"; someone else has "mhe".

Wyverald · on May 30, 2017

> т is a homoglyph of T—because one could mistake one for the other.

I believe gp is talking about the font. In some fonts (especially italic/cursive), the letter "т" looks like "m", and nothing like "T" -- so it's really hard to say with which one it's "visually equivalent".

irishsultan · on May 30, 2017

Look at the image on https://en.wiktionary.org/wiki/%D1%82 and you will see that it is also a homoglyph of m.

Klathmon · on May 29, 2017

I was just trying to think of a way to actually deal with this in a way that it's up to the developer to decide what they should be matching.

Even in those languages you still might want to treat an OCR of a "c" with "ç" in some cases, or you might want to treat them identical when moving from a system that only accepted ASCII in the past to one that is fully unicode compliant. And even in English there are situations where "ç" should not be treated as "c" (like in URL resolving or other scenarios where an exact match is needed).

It's not technically correct, but it's "more correct". And giving the developer the ability to determine where along that "scale" of exact-ness they want to be might help.

carlob · on May 29, 2017

And what's capital i, in English and most other countries it's I, but in Turkish it's İ, while I is uppercase ı.

_mhyx · on May 29, 2017

C# has a great set of options for setting that, it seems like a good candidate to implement such a system - most string manipulation functions take in a culture/use current culture enum.

wumpus · on May 29, 2017

Are there any word pairs that switch ç for č or vice versa?

asveikau · on May 29, 2017

I wouldn't think so, because the languages I know of using ç do not use č and vice versa. There are definitely pairs with "plain" C in a bunch of those.

The ė mentioned is distinct from e in Lithuanian.