About 20 years ago, I implemented a hash table that used binary trees for buckets. It was supposed to support Unicode strings as keys, and I used the Windows comparison functions to figure out how to compare the keys, in order to handle case insensitive lookup IIRC.
I tested the table with randomly generated strings, and was puzzled to discover expected lookups would fail with some table constructions. I narrowed it down to inconsistencies in collation.
It turns out that the collation order implemented by Windows was not transitive. You could have three code points, a, b and c, where a > b and b > c and c > a. Sort order is well defined within blocks, but there isn't necessarily a meaningful sort order across blocks; how should your Cyrillic letter compare with your Greek letter vs your Armenian letter?
Case transformations are locale-dependent. That is, in French lower case of "I" would be "i", and upper case of "i" would be "I". In Turkish, which uses largely the same letters, lower case of "I" would be "ı", and upper case of "i" would be "İ". Also, in German, upper case of "s" is "S", but upper case of "ß" would be "SS", and you have to guess what lower case of "SS" would be.
Universal case insensitivity is hard if not impossible. It's best to preserve both a "canonical case" version and the raw data in a search index, if different.
Nice! German has funnier problems, though, because a vowel with an umlaut can be represented as that vowel + e, that is, "fuer" is a legit representation of "für". How do you normalize that? Turning to one canonical representation works most of the time, but sometimes you also need letter-to-letter correspondence.
It also gets interesting with proper names. It's perfectly legitimate to transliterate "Herr Schröder" to "Herr Schroeder" (and if you don't have ö at your disposal you have to), but a proper name that starts out with the transliteration, like "Dr. Oetker" can usually not be transliterated to "ö".
In some cases that might be because the "oe" was never an "ö" in this case, even if people might have shifted to pronouncing it that way. But in other cases I imagine that it was intended and did start out as an "ö" sound, but people just decided to write the name with "oe".
Is that really legit in German? I know Finnish umlauts (ä, ö) get sometimes mangled to ae or oe, but they are definitely not valid alternative spellings nor are they pronounced even close to similar.
This brings up something that is perhaps easy to forget: the interpretation of diacritics is far from universal. The diacritic in the character 'ä' can be one of two semantically different diacritics. It can be a diaeresis, a diacritic whose function is to mark that vowel starts the next syllable rather than existing as part of a diphthong; or it can be an umlaut, whose purpose is to indicate that it is a different vowel sound altogether. As far as any charset is concerned, though, despite those very different semantics, the two things are the same diacritical mark [1].
Even beyond the issue with two concepts using the same glyph, the interpretation of the same diactrics among different languages is inconsistent. English tends to drop diacritics to the point that many people think that English doesn't use them; German uses expansion (so ä becomes ae). As you mention, some languages are incomprehensible either way, so they need to be preserved. And sorting and collation is even more fun!
[1] This does mean that Unicode's insistence that characters represent semantic differences rather than graphical differences can come across as rather arbitrary. The original purpose of Unicode was to unify different character sets together, so it preserves character differentiation that existed in antecedent charsets but tends to otherwise unify characters in practice.
Yeah, completely legit and not uncommon at all (though advances in locale support have probably made it less necessary in recent decades). As an official transliteration, documented and taught in school, it is definitely pronounces the same.
It seems like German could actually be at fault for your bogus transliteration issues in Finnish, then. Sorry about that! 8)
The thing I most remember is the surprise of intransitive collation order. I also recall implementing case insensitive lookup.
It might have been an early version of this one: http://codecentral.embarcadero.com/Item/15171 - from 2001 - but I've written quite a few hash tables over the years, and may be blending the different recollections. I never used Vista, and I'm pretty sure my experience predated Windows 7.
In the first section points 16, 22, and 23 are relevant, in the second section look at points 8, 9, 10, 11, 12, 13, and 40.
Sorting based on Unicode strings is tricky, especially if you want cross platform compatibility. It's better today but there are still a lot of edge cases to consider if you have any hope of getting the same sort twice from "interesting" data.
That's one way to implement case-insensitive lookup. OTOH, for a programming language symbol table, maybe you want to look up with case sensitivity first, then a second time with case insensitivity so you can emit a warning about the discrepancy (for a case-insensitive language) or an error with suggested spelling correction (for a case-sensitive language) and you don't want to maintain two separate hash tables for every scope.
This isn't a glibc issue; really it's a story that "character collations are not stable over time" (mentioned downthread in the article in a message quoting Unicode Technical Standard #10).
Which is a cautionary tale the solution for which is to use icu, the solution adopted by Postgres.
We ran headfirst into this issue at my company and we've actually been recommending the opposite (use the "C" locale on the database, treat collation as a render level concern).
It's easy to say "treat collation as a render level concern", but this doesn't really work efficiently when the rendering component wants to query the database using this index, does it?
That is, to do anything you want to do a locale sensitive way, such as querying the database for a given case insensitive string, or pagination, you'll need to have the DB index be locale/collation aware or else return every possible value and the renderer sort it out.
As an example, how you would you repeatedly return a range of results based on a string in a locale-aware way, e.g., to display paged results, if you defer the work to the renderer?
The only general solution I'm aware of which lets you "bypass" the DB collation is to use a locale-aware collation library like ICU to generate a binary sort key, which can be compared using plain binary comparison and storing those in the DB. This still means the overall index is locale aware, but the DB doesn't need to be aware of any collation rules: only the code that generates queries and handles the results needs to do the sort key transformation.
It means that you have a single library that does all the conversion, which you can probably control more easily, rather than delegating this to the database, where you might need to support several vendors or at least various versions (and problems can arise even within a single DB version as this postgres issue shows).
If you build a plain index over Unicode strings, chances are you're doing it wrong.
Normal DB indexes are mostly for numbers and (short) ASCII strings. (Something like canonicalized UTF-8 is an edge case.) For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch. It addresses the oddities of human-oriented texts better.
It's a pretty common use case (so common that it is usually taught in an introduction to databases) that you might want an index that allows you to efficiently search by non-ASCII strings like say, a person's last name.
I'd posit that this is not universally true. It may be true for English names.
It may be implementable for e.g. French names: a name like "François" may be stored as UTF-8 with "ç" always represented as a composite pair (or always a single character), and the application layer knows and uses this.
It must be a dubious idea for German names when you may need to see "Müller" and "Mueller" as the same name, but also keep e.g. only "Mueller" as the form referenced in legal papers.
Once you start considering anything non-ASCII, things become hairy. (Of course, if you only work for the US market and don't care about any international travelers as customers, you can continue using these introductory courses as a guidance.)
Of course it's hairy - that's kind of the point of this whole discussion!
People who use databases naturally end up wanting to store Unicode strings, and people naturally want to query them in locale-sensitive ways. Your Mueller example is a fine example of this. Databases have deep support for this, and you can also build your own compromise, so to speak, on top of the database with ICU or another library if it doesn't meet your needs.
It's wrong to claim indexes are "for" numbers and ASCII. They're for names of things, and names fill in the propositions the relational model is built on.
If you're using surrogate keys for everything, it means your tools are broken or you don't trust them. It may be the least worst available option to plaster over them, but you shouldn't accept that as a best practice.
> For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch.
Really, a reverse index, which, as its name implies, does the opposite of what you want if you're trying to index something.
Those products "work" by doing text preprocessing (stemming and such), which anyone can do, and then they toss a bunch of garbage back and let the user sort through it.
That doesn't solve the problem a database index is trying to address.
Indexes are not "for", they will happily index anything. But they are "best used for" certain things.
Tools are not broken; reality is. So you have to work around assumptions that are pertinent to blind usage of technology.
You need to think how to use a tool (such as a Unicode-capable DB text type) to get a correct result. And first you need to understand what constitutes the correct result. Else you may end up with strings that are equally correct and nominally equal, but bytewise different, so search by index fails.
So in summary it's a hard problem and requires some smarts to get right? Then we agree.
Retreating to ASCII is not the answer, and "full text search engines" don't even solve the same class of problem, such as say paginating a list of local towns by name.
The most important job of a DB is store records which largely consist of numeric (including boolean) values and strings. Strings often need to be outside the ASCII range, and even in existing cases where they aren't supported, it's a limitation of the application and the users would really prefer unicode strings.
Even if you just target English speaking people from the US or whatever, there are a variety of non-ASCII characters that come up all the time, including accented characters, various currency signs, etc.
The motivations are strong here. You want your database to be as sane as possible.
But this will make some features really hard, if you need them. Pagination based on sorted values subject to collation wouldn't be queryable, you would need to either get all the data and sort it, or query for a key and the sorted column, sort and then query for details on the displayed items.
Selecting a range would also be potentially difficult.
This has the advantage that your database operations run a heck of a lot faster, but has the potential disadvantage that primary key uniqueness may not be maintained, if you think that alternate ways of writing the same characters in unicode matters for that.
Sure but either you are talking about a fixed normalization algorithm which is not locale aware, in which case it doesn't solve the locale-specific unique key issue, or it is locale-aware and hence suffers the same problem with time-varying behavior.
Well I am not doing any of the things in the comment chain leading up to this, but probably the mention of primary key was a red herring. The GP's [1] point was that some index constraint (they mentioned PK uniqueness, but it could really be any constraint) might not be correctly maintained if the DB was not aware of the correct collation order. So from the point of view of the renderer, which is locale aware and uses locale-based collation, the DB is violating the constraints.
ICU isn’t the solution. If a bug in its collation routines is found, it likely will be fixed. You don’t want that if retrieving your data requires perfect stability of string comparisons.
The only real solution is to have your own collation routines, test the hell out of them for every release (you don’t want a compiler bug or the fixing of a compiler bug to introduce a subtle change in your collation code. The truly paranoid also worry about bug fixes in hardware), and to never stop supporting a collation, once you have ever released a product using it.
That’s what many databases do and why, for example, SQL Server has two binary collations. https://docs.microsoft.com/en-us/sql/relational-databases/co...: ”There are two types of binary collations in SQL Server; the older BIN collations and the newer BIN2 collations. In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and remaining characters are sorted according to their byte values. (Because the Intel platform is a little endian architecture, Unicode code characters are always stored byte-swapped.)”)
Or they could just drag Windows in to the world of sanity and store everything as UTF-8 (at last) instead of the 16 variant... Then endien-ness and a bunch of other bugs no longer matter.
This, BTW, is a great example why many filesystems are case sensitive -- ignoring case requires collation support, and this can change all the time. Treating filenames as opaque byte strings, on the other hand, makes the filesystems, databases and so on Always Work.
Of course, PostgreSQL is one the few programs which actually cares about collation and case-insensitivity, so it has to work the hard way.
It has always amused me that NTFS and NT Kernel are both case sensitive when it comes to filenames. But WIN32 emulates the case insensitive legacy behavior because people won't fix, or can't fix legacy code. That said applications can properly opt into POSIX semantics.
IIRC, the NTFS filesystem has its own collation tables on disk that are fixed when the volume is formatted, so its indexes are forever consistent but don’t necessarily match the ongoing changes to collations. Thus the shell has to re-sort everything before showing it, even though you’d think the filenames were already sorted by NTFS.
ICU collations are not used by default, you have to explicitly enable them, and you can't set them as the database default. And even if you use ICU collations, I'm not sure if indices actually detect if the ICU versions don't match.
Maybe they improved some of the issues in PostgreSQL 11 (haven't been following the ICU issue in detail), but I don't think this issue is fixed.
Even if you use ICU, you need to make sure you link with the same version of ICU!
Edit: Why the downvote? New versions of Unicode come out every year, and ICU collations are explicitely versioned. If you use ICU 51.1 and ICU 52.1 you are going to have the same kind of issues
ICU collations in PG are versioned, and ICU has support for accessing the different collation versions. So you might be stuck on an older collation version without explicit action, but it'll not yield wrong results.
To rely on anything non trivial provided by libc is always a risk, but here the problem is that collation is not the right tool for the job (implementing data structures were there is to compare elements). In Redis memcmp() is used, because byte-to-byte comparison is a much more stable story. This leaves the user with the problem of finding a suitable representation that is lexicographically sortable by the first bytes in a way that makes sense for the application, and that also contains, in the farest bytes, what the user really wants to return. Not viable for something higher level than Redis I guess.
It's not that simple. SQL in particular actually requires lexicographical sorting, so you need to do it somehow. The problem is relying on lexicographical sorting remaining fixed when it isn't. Per the unicode standard, lexicographical sorting rules are subject to change. So now you need a mechanism to detect/manage those changes.
Right... so properly detecting when U_UNICODE_VERSION in the replication target differs from U_UNICODE_VERSION in the source, and adjusting for it is what's needed. PostgreSQL wasn't doing this in 2014.
Exactly. I'm just not happy with the types of these defines.
Only major updates need to be tracked, so the PCRE format seems to be the best. But all others are using strings, not an integer.
Yes I understand that, that's why I said that it's not applicable to higher level systems. Btw to put my comment in context, also Redis has a similar feature to SQL DBs where it shows results in lexicographical sorting directly to the user, but it that case it just provides memcmp() byte-to-byte sorting. So what you do in order to index "Aèèé" is to apply a transformation to that string and store a transformed prefix and later the string itself, like "aeee:Aèéé" (ref: https://redis.io/commands/zrangebylex)
You’re talking about sort keys, no? That allows you to deal with collation in a higher layer (which generates the sort keys per desired collation), but delegate the actual sorting to a dumb lower layer (i.e., memcmp).
PostgreSQL has the beginnings of a solution to this problem: version tracking for collations. I think it should be extended to track versions in a more fine grained way so that it can detect and reject this scenario (and related scenarios), and I think that libc implementations should provide a way to expose the version. I have proposed this for FreeBSD libc, and I hope someone proposes something similar for glibc (maybe me eventually). I think PostgreSQL should continue to support both ICU and libc collations (I don't think it's reasonable for every piece of software to use its own collation system, or for everyone to switch to ICU, I think libc should do a slightly better job).
Postgresql 10+ has added icu as a collation provider. If you select icu as your provider this is not an issue. If you select libc as your provider than this is still an issue you need to account for.
Not just different glibc's, also Unicode versions.
I constantly have to update my towupper() function for my safelibc. musl is hopelessly behind, and glibc is not much better with Unicode support. ICU constantly has breaking changes, but at least keeps its tables up to date.
Unicode itself changes its tables every year. This year we had 6 changes there.
DB or Web clients really need to pass the unicode version in its API: like "sorted according to Unicode 11.0"
May I ask a stupid question? What's the difference between collation and encoding? Some people in this thread have suggested "use byte strings"; is it "enough" to "just" use UTF-8, like you should "just" use UTC (unless you know/Should know better)?
To get a bit into the theory: a string is a sequence of symbols drawn from some "alphabet", and we can define operations like concatenation, counting, matching and substring on it. In particular, a natural operation on strings is lexicographic comparison, whereby you compare the first characters of two strings, if they are different, the two strings compare that way, if they match, you proceed to look at the second character, etc.
Encoding translates a string from one language to another, and it's often a reversible translation. The obvious case is to translate from Unicode to binary and back. In Unicode, the alphabet is the set of all Unicode codepoints. In binary, you have a stream of octets and the alphabet is the set of integers between 0 and 255. So UTF-8 is a standard that specifies how to perform such a translation.
Since we're often going back and forth between Unicode and binary, and because real processors work with bytes, it's reasonable to call "encoding" translating a string to binary and "decoding" translating binary to a string. The math doesn't require that, though.
Collation represents a string as an expression (usually also a string) that is trivially comparable. For instance, supposed I wanted an English-friendly way to compare band names. I might map all codepoints that look like A or E to "A", then K, S and C to C, etc., discarding any accents and maybe throwing out leading junk like "The" or "A".
That's why a case-insensitive collation won't let you have Foo and FOO, because they would both collate as FOO. It also means, unlike encoding, collation is expected to be one-way.
> Collation represents a string as an expression (usually also a string) that is trivially comparable. For instance, supposed I wanted an English-friendly way to compare band names. I might map all codepoints that look like A or E to "A", then K, S and C to C, etc., discarding any accents and maybe throwing out leading junk like "The" or "A".
So how does collation algorithmic instability break an application?
Edit: What I mean is-- what are a set of reasonable assumptions a dev would make in the application layer that would be good practice on the one hand, yet also be wholly dependent on collation algorithm stability to keep from breaking?
All the assumptions that make a database index work will break.
If you think about any kind of binary tree based structure, it will have invariants like "the left child node will be less than the parent node, and the right child node will be greater than the parent node."
If those are violated just a little bit, at best data will simply disappear from the index. It's possible for updates to delete stuff, or for operations on the index to hang.
Collation is basically a fancy way of saying "sorting" and should be independent of encoding (or course, certain encodings lend themselves more naturally to certain collation algorithms). In general collation needs to be locale-aware while encoding (in Unicode) largely doesn't care.
PostgreSQL should use its own internationalization library, or ICU, but not the C library's I18N components. This would allow PG to ensure that any one version of PG also uses the same version of Unicode.
Of course if you have two applications communicating with each other that are statically linked with different version of the library you can still have inconsistency problems.
I tested the table with randomly generated strings, and was puzzled to discover expected lookups would fail with some table constructions. I narrowed it down to inconsistencies in collation.
It turns out that the collation order implemented by Windows was not transitive. You could have three code points, a, b and c, where a > b and b > c and c > a. Sort order is well defined within blocks, but there isn't necessarily a meaningful sort order across blocks; how should your Cyrillic letter compare with your Greek letter vs your Armenian letter?