Unicode is hard (shkspr.mobi)
337 points by edent on May 29, 2017 | 194 comments



> The £ is printed just fine on some parts of the receipt!

> ⨈Һ𝘢ʈ ╤ћᘓ 𝔽ᵁʗꗪ

Assuming the printer uses ESC/POS[0] (which is likely), the codepage is part of the printer's state. To change the code page, the driver sends a specific ESC command (<ESC t x>, aka <1B 74 XX>, where x/XX is the desired codepage byte; none of the available codepages is "UTF-8", incidentally), and you can change the codepage before each printed character.

So it's the driver software fucking up and either misencoding its content (most likely) or selecting the wrong codepage. The £ might be displayed correctly on the right side because it's e.g. hard-coded (properly encoded) while the product label is dynamic, and when that was added or changed no care was taken to transcode it properly. The printer absolutely doesn't care; it just maps a byte to a glyph according to the currently selected codepage.

[0] ESC because the protocol is based on proprietary ESCape codes[1], POS because the entire thing's a giant piece of shit

[1] https://en.wikipedia.org/wiki/Escape_character#ASCII_escape_...
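
For illustration, here's a minimal sketch of that codepage dance in Python. The device path and the assumption that table 0 is CP437 are mine (check the printer manual), and real drivers obviously do more than this:

    ESC = b"\x1b"

    def select_codepage(n: int) -> bytes:
        return ESC + b"t" + bytes([n])   # ESC t n : select character code table

    receipt = select_codepage(0)                 # 0 is CP437 on many Epson-likes
    receipt += "Mojito  ".encode("ascii")
    receipt += "£5.00\n".encode("cp437")         # £ is byte 0x9C in CP437
    # Sending the UTF-8 bytes (0xC2 0xA3) instead is how £ turns into
    # garbage: 0xA3 happens to be ú in CP437.

    with open("/dev/usb/lp0", "wb") as printer:  # hypothetical device path
        printer.write(receipt)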


Thirty-five years ago I started working for a company that made label printers, and one of the common technical issues I had to deal with was pound signs not printing. It's quite funny that we still have the same issues after all that time.


Otoh twenty-seven years ago I spent a lot of time messing around with termcap and/or terminfo on various customer sites trying to get their terminals to display something other than gibberish. That's no longer really a thing, thank god, although I wryly note that there is, inevitably, a relevant O'Reilly book that I would have sold my grandmother for had I known about it.


Alternatively, they might have received their item file from a system which bungled the encoding. Source: POS system developer


> none of which is "UTF8" incidentally

Why not? Shouldn't all printers made nowadays support UTF8? Probably by default?


Backward compatibility. It will refuse to die /forever/.


Also simplicity (for the printer vendor): codepages mean you just take your input byte and index into an array, while UTF-8 means you need to implement UTF-8 decoding (which is simple but not free, and now you need to deal with decoding errors) plus automatic checking/swapping of your raster maps in the printer. This is all logic which codepages foist onto the driver developer instead.
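
A toy sketch of that contrast (nothing like real firmware, just the shape of the logic):

    GLYPH_TABLE = {i: f"<glyph {i}>" for i in range(256)}   # stand-in raster map

    def render_codepage(data: bytes):
        return [GLYPH_TABLE[b] for b in data]               # one byte -> one glyph

    def render_utf8(data: bytes):
        # decode first; errors="replace" substitutes U+FFFD for malformed
        # sequences, and you still need a codepoint -> glyph lookup afterwards
        text = data.decode("utf-8", errors="replace")
        return [GLYPH_TABLE.get(ord(ch), "<missing>") for ch in text]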


Dealing with decoding errors is simple, you just substitute U+FFFD any time you hit one.


Sadly, my Oneplus 3T rendered the A and the F of your "what the fuck" as black boxes.

Mind boggling that this is still a problem today.


I believe that means it's correctly interpreting the Unicode, but there isn't a font that contains a character for that code. I think this is because the "official" Android font is patented, another layer of absurd crap that leads to many Unicode issues.


On my OP3T it is fine in Chrome and two squares in Firefox.


My name is Léon, with the acute accent on the e. I usually leave out the accent when I need to enter my name somewhere digitally, since in about 50% of the cases it's not handled correctly: it usually ends up as L'on.

Even in the travel world it goes wrong all the time. You'd expect large international travel organisations (yes, talking to you, Tui!) to be able to handle UTF-8 names, since many of their customers and locations will have special characters, but no. I was once nearly refused boarding on an airplane because the name on my ticket did not match the one in my passport...


My shipping address is "Roņu street, Liepāja", and at the other end of our city is "Rožu street, Liepāja".

If the address somehow gets mangled, as it often does, the shipping label could read "Ro?u street", and the postal office's guess at whether the address is "Roņu" or "Rožu" is as good as anyone's.

I too usually just write "Ronu street"


How about adding the zip code for disambiguation?

Because even with no characters messed up, many cities have 2 or more streets with the same name.


At least in that particular case, the postal code is the same (LV-3401) for both streets.

Coincidentally, at least here, official addresses are guaranteed to be unique: a city/town cannot have two streets with the same name (at least at any single point in time; renaming over centuries gets weird). Naming streets and assigning numbers to houses is not an arbitrary decision that the landowner can make, it is managed/coordinated centrally.


Same code for the two addresses. Liepāja is a major city by local standards but still only has some 80k people and about 15 postal codes cover the entire city. No two streets in the same city are supposed to have the same name per our national rules though.


in many parts of the world zip codes are city-wide (regardless of city size), so that wouldn't help


And some countries don't use zip-codes at all.


And at least one country (Singapore) has building-level zip codes


Ireland goes even further and each apartment in an apartment block would have its own unique code.


Something like https://map.what3words.com/broken.rare.share might be useful


Depending on the size of the city, there's a good chance these two differently-named streets are in the same postal code.


I'm more peeved that we can't just use GPS co-ordinates for an address.


At most in addition to an address. One missing or misprinted digit and your package is dropped into a Siberian forest. Sorting is also trouble unless fully automated.


If you're printing 14 digits of coordinate, it's easy to add two more as a checksum.
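
As a sketch of the idea, here's a single check digit computed with the standard Luhn scheme (the coordinate digits are made up; actually repairing digits, as suggested downthread, would need a real error-correcting code):

    def luhn_check_digit(digits: str) -> str:
        # Luhn catches every single-digit error and most adjacent transpositions
        total = 0
        for i, d in enumerate(reversed(digits)):
            n = int(d)
            if i % 2 == 0:      # double every second digit, starting from the right
                n *= 2
                if n > 9:
                    n -= 9
            total += n
        return str((10 - total % 10) % 10)

    coords = "40748817739857"                    # e.g. 40.748817, 73.9857 flattened
    print(coords + luhn_check_digit(coords))     # print this on the label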


So what do you do if the checksum doesn't match? Could be off by a meter or half an Earth. So the only recourse is to send it back. Normal addresses are far more resilient. Wrong post code? It'll get re-routed and just takes a few days longer. Wrong house number? In most cases the mailman knows the names in their area and will still deliver it. Typo in the address, name, city? In most cases it still works fine, the example of enkrs notwithstanding. Letters get mangled all the time, and mail still arrives.

Also I'm much better remembering addresses of people instead of their GPS coordinates, especially when they're close together, like in a city, where the coordinates would all be almost the same.


I specified adding multiple more so you can use the checksum to repair the number. If they get three or more digits wrong, well, they could have put the wrong city name on there too. In the realm of transcription errors, it's easy to make coordinates work. It's remembering these addresses that causes the real problems.


It's far more likely for someone to screw up writing digits than it is for them to write the wrong city name. People handle words well, but they don't handle numbers all that great.

And if they do write the wrong city, well, that's what the zip code is for.


Then there's a major earthquake and your GPS co-ordinates change.


If a major earthquake moves your GPS co-ordinates enough to change your effective address, I'd claim that you have bigger problems than getting your mail delivered. Houses aren't typically designed to change coordinates via the earth snapping beneath them.


So? Just maintain a GPS Offset Table :-P


String-matching is really scary in Unicode, especially since the exact form of the string matters with respect to composition — and that’s before you even consider that some characters just plain look like others or even are the same glyph. And strings can contain things like zero-width spaces that look like nothing at all.

Sure, there are recommended practices, but there have been enough mistakes already (or lazy programmers) that it is hard to be confident that any string with “interesting” symbols in it is exactly what it appears to be. And there have been security problems related to the fact that many interfaces expect the user to know exactly what they're reading, when often even the programmer doesn't.


It almost sounds like there could be a lot of benefit from a "homophones check" system but for Unicode glyphs (perhaps with a variable amount of "closeness") being built into Unicode handling libraries.

Like how "е" looks identical to "e", which looks close to "ė" if you aren't careful, which might be mistaken for "é" in smaller fonts, even though all 4 letters are different Unicode code points.

Being able to say "ɡrеɡ" is the same as "greg", even though 3 of the 4 characters are actually different, would be extremely useful in some cases and extremely incorrect in others. So giving the developer a "native", easy way to say how "exact" they need their checks to be might go a long way, not only toward making this problem more obvious but also toward forcing them to be explicit about what they are checking for.
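
A toy sketch of what such a check could look like; the real data for this lives in Unicode's confusables.txt (UTS #39), and the tiny table here is purely illustrative:

    # map visually confusable characters onto a shared "skeleton"
    CONFUSABLE = {
        "\u0435": "e",   # CYRILLIC SMALL LETTER IE looks like Latin e
        "\u0430": "a",   # CYRILLIC SMALL LETTER A looks like Latin a
        "\u0261": "g",   # LATIN SMALL LETTER SCRIPT G looks like Latin g
    }

    def skeleton(s: str) -> str:
        return "".join(CONFUSABLE.get(ch, ch) for ch in s)

    def looks_like(a: str, b: str) -> bool:
        return skeleton(a) == skeleton(b)

    print(looks_like("ɡrеɡ", "greg"))   # True: same skeleton, different code points
    print("ɡrеɡ" == "greg")             # False: a strict check still tells them apart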



There's also NFKC_Casefold, which is a technique used by RFC 5892 (among others) to limit the characters allowable in a domain name. The problem is that it also disallows 'A', because Casefold(NFKC('A')) != 'A'. I'm sure that's equally annoying in other languages. And in any event it makes it problematic for usages like parsing URLs from free-form Unicode text.
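
A rough approximation of that check in Python, using NFKC followed by casefold rather than the real NFKC_Casefold property:

    import unicodedata

    def survives(ch: str) -> bool:
        # the rule described above: the character must come through unchanged
        return unicodedata.normalize("NFKC", ch).casefold() == ch

    print(survives("a"))   # True
    print(survives("A"))   # False: 'A' casefolds to 'a', so it gets disallowed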

Unicode specifications are incredibly thorough and well thought-out. The problem is that the Unicode spec isn't shippable software. It's not an implementation.

And there's no singular implementation. Worse, nobody uses any particular implementation the same way, and rarely to its fullest extent. Compounding the problems, so much code is _proprietary_. You have no way to verify and track how such code will behave, so interoperability is difficult. For example, good luck trying to reproduce the behavior of Outlook, Mail.app, and gmail.com in terms of how each will highlight URLs in free-form text.

The only saving grace appears to be that the rest of the world, I assume, has grown accustomed to how broken American software is in terms of dealing with I18N issues. And Americans remain blissfully naive. I keep waiting for the other shoe to drop; when managers will finally crack the whip at the behest of international customers and demand that engineers begin taking I18N seriously. But it hasn't happened yet. I've been waiting almost 15 years, accumulating skills and best practices that my employers don't seem to value very much. Oh well....


This comparison is language-specific.

Do you group c with ç and č? In English you would. In France, Portugal, Serbia, the Baltic states or the Czech Republic you may not.


It's also font-specific. Is т a homoglyph of T or m, for example? There isn't really a good way to solve this because restricting systems to only use ASCII (which also has homoglyphs, e.g. 0/O, 1/l, I/l, ...) is very user-unfriendly.


I got to write an address database matcher once, covering all of Scandinavia and Poland.

For suggest functions, it turns out people fully expect to be able to type some sort of ASCII normalisation and still get a match, i.e. the address has ż in it, but people want to type a plain z.

And the rules for this are not entirely obvious. A Swede would totally expect to be able to write ö instead of the Norwegian ø when doing routing across the border.


т is a homoglyph of T—because one could mistake one for the other. They're in a visual equivalence-class. That doesn't mean that you should normalize т into T, though. Those are separate considerations.

If you were granting e.g. domain names, or usernames, you'd be able to map each character in the test string to its homoglyph equivalence-class, and then ask whether anyone has previously registered a name using that sequence of equivalence-class values. So someone's registration of "тhe" would preclude registering "the", and vice-versa; but when you normalized "тhe", you'd still get "mhe".

Of course, to use such a system properly, you'd have to keep the original registered variant of the name around and use it in URL slugs and the like (even if that means resorting to punycode), rather than trying to "canonicalize" the person's provided name through a normalization algorithm. Because they have "[the equivalence class of т]he", not "mhe"; someone else has "mhe".


> т is a homoglyph of T—because one could mistake one for the other.

I believe gp is talking about the font. In some fonts (especially italic/cursive), the letter "т" looks like "m", and nothing like "T" -- so it's really hard to say with which one it's "visually equivalent".


Look at the image on https://en.wiktionary.org/wiki/%D1%82 and you will see that it is also a homoglyph of m.


I was just trying to think of a way to actually deal with this in a way that it's up to the developer to decide what they should be matching.

Even in those languages you still might want to treat an OCR'd "c" as equivalent to "ç" in some cases, or you might want to treat them as identical when moving from a system that only accepted ASCII in the past to one that is fully Unicode-compliant. And even in English there are situations where "ç" should not be treated as "c" (like URL resolution or other scenarios where an exact match is needed).

It's not technically correct, but it's "more correct". And giving the developer the ability to determine where along that "scale" of exact-ness they want to be might help.


And what's the capital of i? In English and most other languages it's I, but in Turkish it's İ, while I is the uppercase of ı.
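
Python's str methods use the default, locale-independent Unicode case mappings, which shows the problem nicely (a quick demo, not Turkish-aware code):

    print("ı".upper())   # 'I'  dotless ı uppercases to I, as Turkish expects
    print("i".upper())   # 'I'  but Turkish wants 'İ' here, keeping the dot
    print("İ".lower())   # 'i' followed by U+0307 COMBINING DOT ABOVE (length 2!)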


C# has a great set of options for controlling that and seems like a good candidate for implementing such a system: most string manipulation functions take a culture parameter or a use-current-culture enum.


Are there any word pairs that switch ç for č or vice versa?


I wouldn't think so, because the languages I know of using ç do not use č and vice versa. There are definitely pairs with "plain" C in a bunch of those.

The ė mentioned is distinct from e in Lithuanian.


Such a check would be very useful for checking domains. Right now, a malicious attacker could register google.com, but with one of the characters replaced by a (nearly) identical-looking Unicode symbol. Then, the attacker could use HTTPS and create a login page identical to Gmail's. Now, all it takes to get someone's credentials is a link to mail.google.com.

It would be nice if the authorities handling the registration of domain names could forbid domains that look too much like each other.


It would clearly help for the system we have now but the real solution is to push for stricter authentication across the board. As convenient as URL strings can be, we need E-mail clients and other tools to be able to force at least a 2nd layer of authentication (e.g. E-mail claims link is from domain #1; user must counter by selecting from a list of sites actually visited previously, and E-mail client refuses to open link if they don’t match). You could imagine much more elaborate solutions too based on certificates, etc.


I don't think that particular solution would be good from a user experience point of view, but it is indeed a nice idea to filter out domains that you have received emails from (and are not deleted or in the spam folder).

However, there are ways around this too. I think the fundamental mistake was to allow (all?) Unicode strings as URLs. However, I can't come up with an elegant solution on the spot (since it would be unfair and impractical to use ASCII for this).


Search engines usually have such a system, but I've never run into one in open source.


Isn't this basically what PRECIS (RFC 7564) is about? There are open source implementations of that, like golang.org/x/text/secure/precis for Go (including the predefined profiles for e.g. usernames) or Unicode::Precis for Perl.


In the search engine context, the problem to be solved is that both French and English speakers are likely to type [cafe] and not [café] -- the French speaker because they might be on an English keyboard, or because they know it's not ambiguous.

In the search space, therefore, when you index the word 'café', you also index 'cafe' with a smaller weight. And when you see the query [café], you expand the query to ('café' OR 'cafe'-with-smaller-weight)

And you don't want to do either of these if the two words are actually different!

As an example of this in the wild, the ElasticSearch docs talk about the issue: https://www.elastic.co/guide/en/elasticsearch/guide/current/...
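
A bare-bones sketch of that folding step in Python (real engines use proper ICU folding filters; this just strips combining marks after decomposition):

    import unicodedata

    def ascii_fold(term: str) -> str:
        decomposed = unicodedata.normalize("NFD", term)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(ascii_fold("café"))    # 'cafe'  -> index this variant with a smaller weight
    print(ascii_fold("naïve"))   # 'naive'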

PRECIS appears to be aimed more at figuring out if 2 usernames are 'the same'.


> String-matching is really scary in Unicode

Isn't this why NFKC normalization exists?


Unicode has four different normalization forms. Different forms are useful for different intended outcomes.

You, as the programmer, need to understand each of them and why you want to use them.

Brief overview: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat... More technical details: http://www.unicode.org/reports/tr15/
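
A quick illustration of the forms in Python (just a demo of what they do, not a recommendation of which to use):

    import unicodedata

    s = "é"                                    # U+00E9, precomposed
    d = unicodedata.normalize("NFD", s)        # 'e' plus U+0301 combining acute
    print(len(s), len(d))                      # 1 2
    print(s == d)                              # False, despite looking identical
    print(unicodedata.normalize("NFC", d) == s)   # True

    # The K (compatibility) forms additionally fold "compatibility" characters:
    print(unicodedata.normalize("NFC", "ﬁ"))   # 'ﬁ'  (ligature kept)
    print(unicodedata.normalize("NFKC", "ﬁ"))  # 'fi' (ligature folded away, lossy)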

I --THINK-- offhand, that NFKC is what you want to use when preparing a password input for processing/comparison (it's lossless, but to a specific point). I also --THINK-- that NFC is the form you want to use when retaining source glyph language distinctions.

From the stackoverflow hits:

https://stackoverflow.com/questions/16173328/what-unicode-no...

I agree with the destructive (pre computation/comparison) operation and that either of the NFKD or NFKC forms should be used (since they destroy non-printing differences for visually compatible characters; a more user friendly approach).

The 'C' forms are always more condensed (accents are packed in to a single character where possible), and thus of higher entropy per input byte. It is my belief that this form is likely to be less susceptible to attacks.

The 'D' forms seem like good choices for /editors/ where the precise nature of a character might be altered by adding or removing accents. (Most human input boxes; during the input/edit process)


> The 'C' forms are always more condensed (accents are packed in to a single character where possible), and thus of higher entropy per input byte. It is my belief that this form is likely to be less susceptible to attacks.

What sort of attacks are you talking about?

> The 'D' forms seem like good choices for /editors/ where the precise nature of a character might be altered by adding or removing accents. (Most human input boxes; during the input/edit process)

Editors shouldn't care about C vs D. The reason being, once you've typed the grapheme cluster, it's supposed to act in an editor as if it's a single "character" regardless of whether it's made from one codepoint or several. This means that if I type é then arrow keys and the delete key will operate on it exactly the same whether it's composed or decomposed.


It's a huge mistake for Unicode to have two different code points having the same glyph. There should not be semantic meaning that disappears when something is printed.

We're going to be suffering for that mistake for a looong time.


Disagree. A good example of the opposite mistake is the “Turkish i” problem. Basically they have a version of I with and without a dot — for both lowercase and uppercase — so algorithms that uppercase i to I break Turkish by removing the dot. If the Turkish i were a unique code point, the algorithms would not mess it up.


Then you have the German ß (sharp S), which does not have an uppercase version. While ISO added one for whatever reason, the official uppercase is two letters, either "SS" or "SZ". So you have three different ways to uppercase ß: one which is guaranteed to be wrong in any official context, and two which lowercase to "ss" or "sz" and not back to ß. That is one big ouch, especially the ISO standard adding that invalid uppercase variation. Languages are messy; best not to transform your input text in any way.
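
For what it's worth, here is what the default (locale-independent) Unicode case mappings do with it, shown with Python's str methods:

    print("straße".upper())   # 'STRASSE'  ß expands to SS
    print("STRASSE".lower())  # 'strasse'  and never comes back to ß
    print("ẞ".lower())        # 'ß'        the capital form U+1E9E does round-trip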


> Then you have the german ß (sharp S) which does not have an upper case version. While ISO added one for whatever reason

It's used in typesetting sometimes, and if a character is used then it should have an encoding.


>It's used in typesetting sometimes, and if a character is used then it should have an encoding.

IMO there's little semantic difference so it doesn't deserve a character. We should have drawn the line between content and formatting, but it's too late and what we have now is emoji and one-use glyphs. [1]

[1] https://en.wikipedia.org/wiki/Multiocular_O


Uppercasing is heavily dependent on context. Even the ASCII characters are context dependent.

a+b=d+d

Shouldn't be uppercased.

It's an insoluble problem to put contextual semantic info into Unicode characters, because individual characters have no context.


This then makes the CJK unification decision even more perplexing. Surely Japanese characters should not be treated the same as Mandarin ones, even if they look the same?


>Surely Japanese characters should not be treated the same as Mandarin ones

No. In Japanese, how you read/pronounce a character depends on context. Sometimes they are the same as Chinese, sometimes not.

Take mountain (山) for example.

Using the Chinese pronunciation it is "san". 富士山 (Mount Fuji) is ふじさん "Fuji san"

Using the Japanese pronunciation it is "yama". 山登り (Mountain Climbing) is やまのぼり "yamanoboru"

(and don't call me Shirley)


I can't speak Japanese, only some Chinese, but I'm wondering if whether to use the (Chinese) Onyomi or (Japanese) Kunyomi pronunciation in Japanese is related in any way to whether the 山 comes first or last in the compound. If it comes last as in 富士山 "Fuji san", the grammar matches the Chinese, and so does the pronunciation ("Fushishan"). If it comes first as in 山登り "yamanoboru", the grammar is opposite to the Chinese (which would also have the 山 last, i.e. 跑山).

PS: Isn't り pronounced "ri" and る pronounced "ru"?


As a rule of thumb, I have learned that Onyomi is usually when a kanji is part of a compound word and Kunyomi is usually when the kanji is by itself.

Yes, I typoed that and it's too late to fix it. り is "ri", not "ru".


That's up to font designers. As an extreme example, run 'xterm -fn nil2' and you'll have a whole lot of code points having the same glyphs.


To some extent you're right, but the problem is much more than can be blamed on font designers. How would you distinguish a Turkish i from an English i with a font?


You wouldn't? You would need to know by context?


> by context

exactly!


> There should not be semantic meaning that disappears when something is printed.

I can see why you'd say that, but who decides whether it should?


The Unicode Committee. I think they lost the point of what Unicode was somewhere along the line.


Pro tip, write your name as it is in your passport on the <<<<<<< line


I've seen this a lot too; but it's not the weirdest thing I've seen on a receipt... We once ate at The Boot Room at Cheshire Oaks and when we got the bill, the numbers didn't add up! (I don't add these things up but since the things we ordered were fairly round numbers and should've been just below £20 and the bill was just over, it was obvious something was fishy).

I totalled the numbers up again and the total was exactly £1.50 less than the total shown on the bill! My wife (having no faith in my basic adding skills) pulled out her phone telling me "don't be silly" and added them up to get the same result as I had.

We asked the waiter about it, who disappeared off to get his own calculator.. He added things up, looked confused and then took it off to the manager. She then repeated the process on the calculator and also looked confused, unable to explain what had happened. They gave us £1.50 in cash, apologised and then kept the receipt (I guess they didn't want us posting that on twitter!).

To this day I've no idea what happened. You could suggest that some programmer somewhere is getting rich off this, but it seems rather unlikely to me. I'd really love to know what the cause was (and whether the manager ever reported it further up the chain; because this seems like a rather serious error to me.. how often does it happen? is it always £1.50? did the issue get found/fixed?).


I develop software algorithms for automatic processing of invoices and receipts. I have analysed hundreds of them and you'd be surprised to see how many contain errors like totals not matching the products, VAT breakdowns not matching the percentages, rounding errors etc.

In my experience this is usually because the 'financial software' systems used to create invoices are sold by companies with 90% sales people, and maybe one or two developers. There seems to be little to no quality control. No one seems to care, since 'financial software' is very lucrative anyway.

In the beginning I tried to report the faulty invoices to the suppliers, thinking that they'd immediately press the big red emergency button and fix it, but in most cases the service desk employee does not care or even understand what I am talking about. Most of them send the 'thank you for your report, we are working very hard to fix the problem' email, but never actually fix it.


I could understand if something was a few pence out and it seemed like it could be a rounding error; but exactly £1.50 on a bill of around £20 doesn't seem like it could be rounding/tax/etc.

(In the UK, all prices shown in the menus are inclusive of VAT and then the total on the bill shows how much of the total is the VAT, so both the amount you can add up yourself from the menu and the line items shown on the bill should match the total you're paying perfectly).


I've seen a register app coded in Javascript.


I doubt the floating point errors are going to make much of a difference with the numbers the average register has to deal with. Javascript can represent integers up to 9007199254740991 accurately, so if you do all your calculations in cents your register can process a little over 90 trillion dollars before things get problematic.


As dragonwriter points out, there are many, many places where non-integer math ends up taking place - taxes, discounts, coupons, three-for-ones, etc.

> if you do all your calculations in cents

That would certainly have been a good idea also, I bet.


How would a "three-for-one" action end up with floating point inaccuracies? I imagine the price for three items would tend to be dividable by three.


Floating point is intrinsically inaccurate. You can't use it to handle money.

With floating point, the assumption (x*y)/y = x simply does not hold.


Floating point numbers can accurately represent integers, so if you have all your prices in cents, you end up with (x * 3) / 3, where each number in that calculation is an integer. No inaccuracy there. Of course there is no reason for a register to actually perform this division in a three-for-one action, it can just replace (x * 3) with x, or subtract (x * 2).

I agree as much as the next person that you shouldn't represent money with floating point numbers, but I disagree that cash register software written in Javascript must automatically be incorrect (or more so that similar software in a different language). And I don't even like Javascript.


Floating point can accurately represent more than just integers. That does not mean that they can accurately perform mathematical operations on them.

I'd assume that division is implemented as multiplying with the reciprocal (because that's faster). If that's correct, then any division by 3 (or any other number that is not of the form 2^i) breaks your cash register.

Because 1/3 represented as a floating point number equals 0.33333333.. and so on but not indefinitely, 3 * 1/3 = 0.99999999.. - which would be equal to 1 if you were using real numbers. But you don't.


> I imagine the price for three items would tend to be dividable by three.

"Buy one scroodad for a dollar, get two free."

"Three widgets for $2, limited time only!"

"$5 each, or three for $10"


Or the first time you need to multiply by a non-integer, for tax or other purposes, and floating point imprecision affects which side of a rounding boundary you end up on.
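
To make the rounding-boundary point concrete, a small Python sketch (Python floats are the same IEEE 754 doubles as JavaScript numbers; 2.675 is just an illustrative value):

    from decimal import Decimal, ROUND_HALF_UP

    # A float that looks like 2.675 is actually stored slightly below it,
    # so rounding to cents lands on the wrong side of the boundary:
    print(round(2.675, 2))    # 2.67, not the 2.68 a cashier would expect

    # Decimal (or plain integer cents) keeps the arithmetic exact:
    price = Decimal("2.675")
    print(price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))   # 2.68
    assert (267 * 3) // 3 == 267    # integer cents survive a 3-for-1 round trip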


And then, in that very very rare case, the customer pays one cent more, or one cent less, and nothing bad happens? And since all amounts are a whole number of cents, you can round to an integer after each such multiplication, and the floating point imprecision doesn't slowly add up. You could even subtract some epsilon before rounding to ensure no customer is ever negatively affected by rounding.

Maybe I'm getting this all wrong, as I've never had to work with a register, nor do I have any experience with the software involved, but I hardly see the problem in accumulating a few cents of inaccuracy over time.


This is exactly the scheme from Office Space - minus a decimal shifted the wrong direction...

The problem is accumulating that inaccuracy when there are well-known, battle-tested libraries and data-types available specifically to avoid that problem. Except, maybe not as well-known as they ought to be...

People these days tend to have no idea how numbers are actually stored in bits and bytes. I'm more likely to find an English major who would be able to remember Big-Endian vs Little-Endian (albeit in the original Swiftian context) than a web developer who has any idea what that means. Forget about mantissas and exponents in floating point, or two's complement integers.


Paying taxes on your purchase is a very rare case?

Anyway, yes, your main thesis right: it's entirely possible to do this with Javascript without screwing it up. But it wasn't in this one particular case.


> Paying taxes on your purchase is a very rare case?

I think the point is that the situation where floating point error changes the result when rounded to cents is rare. Which is true, except when it is not (e.g., if you happen to have a price that tickets commonly hit that results in an error.)


So what? There are ways to do precise calculations in JS; usually you just use some library.


This one did not.


Oh man, having read this now every obsessive like me is going to be hand-adding every bill they ever get from now on. Adding up numbers is just about the only thing you'd think you could rely on even the most dunderheaded computer system to get right :(


I know; it's the sort of thing that if someone told me, I just would not believe them. This is such basic crap that the idea that some production till could get it wrong seems inconceivable.

They could have that £1.50 back if they could give me an explanation! =)


If it was off by 5%, there could be some weird reduced-rate VAT @ 5% (https://www.gov.uk/vat-rates) not being printed out separately but included in the total nonetheless.

If I remember correctly, most booze is at the full 20% rate. I can imagine a mistake like this happening at least.

Of course, if it wasn't off by 5%...


VAT is clearly marked on the receipt so it wasn't that. If it was something simple like this, I think it would've been obvious (if not to us, at least the manager working there!).

The bill was totally itemised, the total just wasn't the sum of all the numbers above it. I really wish I'd taken a picture of it now (I didn't really expect them to take it off me, I really expected them to explain that I was being a doofus) :(


I'm pretty sure the reason only some of the currency symbols aren't correct has to do with the database.

If you think about it, the item names are most likely coming from a database that just might not be in the right encoding (latin1 is still the default in MySQL I think). The symbols that do work are probably hard coded into the receipt's template, and hence don't have this problem.

Why a shop owner would store the price and currency symbol in an item's description is beyond me, but having worked in the POS world and seeing what shop owners do with their items I'd definitely believe it.


Note that the encoding that MySQL calls "latin1" (and uses as its default) is not, in fact, latin1. It is windows cp1252 except with 8 random characters swapped around. I wish I was joking.


Haha, and what mysql calls "utf8" is not, in fact, all of utf8. That's called "utf8mb4".


It also can't sort unicode correctly according to the standard UCA algorithm. The ticket for this is closed as a wontfix.


Jesus Christ.

Is there a reason for this? Or is it just another case of lol@mysql


So in 2017, whats the correct character set to use in MySQL?


"utf8mb4", though really in 2017 the correct thing is to use Postgres.


but not everyone can use postgres


Anyone can if they want to enough. Some don't want it enough, or want other things more, but not using postgres is always a choice.


Which ones? I can't find info on that.


https://dev.mysql.com/doc/refman/5.7/en/charset-we-sets.html

Looks like I misremembered - 5 rather than 8 characters. But it isn't standard cp1252 and this can matter.


That's just describing how it will handle erroneous data. If you give it cp1252 text, it will work exactly as expected. If you give it certain invalid characters, it will treat it as those code points.


There's nothing erroneous about u0081. MySQL's encoding functions are documented to behave in particular ways when a given character cannot be represented in a given encoding, and its handling of "latin1" violates that documentation unless you take into account the nonstandard extra mappings MySQL uses.


1252 does not have a character assigned to 0x81. If you store 0x81 in the database as 'latin1', then it needs to error or do something weird. If you store u0081 in the database, that's a control character in the C1 block that doesn't exist in latin1, so it needs to error or do something weird.

If it violates the documentation about invalid characters, that's a problem, but that's not latin1 being incompatible with 1252.


If mysql handled columns declared as "ascii" as utf8 that would be "compatible" in the sense you're describing. I think it would be fair to say "mysql ascii isn't actually ascii" in that case though.


I see your point, but in that case I certainly wouldn't say it mishandles ascii. Defining some undefined behavior is very different from changing existing behavior.

Plus that's a different scale of change because it's going from fixed width 7-bit to a variable width scheme.


> Why a shop owner would store the price and currency symbol in an item's description is beyond me

If you've worked in the POS world you know exactly why: bad software. It either doesn't support the use case the owner has or isn't easy to use.


Shop owners (almost) invariably just bash their price lists into a spreadsheet or a Word file. A lot of this can be reasonably blamed on their only tools being hammers, so to speak.


> Why a shop owner would store the price and currency symbol in an item's description is beyond me...

Have you never watched Pulp Fiction?

https://www.youtube.com/watch?v=zoJAc_aSM7E

This video will explain exactly why a vendor might put a price in the description field.


Indeed, latin1 is currently the default for MySQL. But it changes to utf8 in MySQL 8.0 (in development).


Hopefully to utf8_mb4 instead of the broken one ...




"The £ is printed just fine on some parts of the receipt!"

That's probably a hint that it isn't the printer's fault.

I would guess that some other system that is used to enter what's available on the menu is using CP 437 and somewhere an encoding step (CP 437 to Unicode) is missing so we get the ú character.

I wonder what character we would get if it was a "5€ cocktail" instead.


I suspect the cause is that the £-sign on the right-hand side is "hardcoded" as part of the receipt and sent correctly (using CP 437), whereas the item names probably accept Unicode input and then the printer assumes it's CP 437 (because who needs more than basic A-Z + numbers for names, right?)


Yes. I'd say this points more to an issue with how the product name was stored in a database, rather than the printer itself.

Bad collation.


Bad collation? That just changes alphabetical order, no?


    In order to maintain backwards compatibility with existing
    documents, the first 256 characters of Unicode are identical to
    ISO 8859-1 (Latin 1).
This isn't true in a useful sense. It does look like it's true in Unicode codepoint space [1] but in any specific encoding of Unicode it can't be the case because latin1 uses all 0-255 byte values. For example, in utf8 it's only an exact overlap for bytes 0-127 (7 bit ascii).

(Though maybe this means you could convert latin1 to utf-16 by interleaving null bytes with the latin1 bytes?)

[1] https://en.m.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_...
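
A quick byte-level demonstration in Python:

    print("£".encode("latin-1"))     # b'\xa3'              same value as code point U+00A3
    print("£".encode("utf-8"))       # b'\xc2\xa3'          two bytes once UTF-8 encoded
    print("é£".encode("utf-16-be"))  # b'\x00\xe9\x00\xa3'  latin-1 bytes with NULs interleaved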


> (Though maybe this means you could convert latin1 to utf-16 by interleaving null bytes with the latin1 bytes?)

Yes. In fact, things like JS JITs end up storing strings as either UTF-16 strings or Latin1 strings internally to take advantage of this fact.


JavaScript uses (used, until a recent version) UCS-2, not UTF-16!


Most JavaScript implementations have a bunch of different string types used internally, depending on what you're doing with the string. In-memory representation has no bearing on the API visible to the outside world.

And while the JavaScript APIs only allow you to deal with UCS-2, the string contents themselves are, in fact, usually UTF-16.


It is useful for hacks in languages or APIs that don't distinguish between uint8[] and unicode string types. When you need to handle binary data you can create a string with unicode codepoints 0-255 and pass it to various IO things as latin-1 to create the byte sequences you want.

And of course it also works in the other direction. To safely read binary data into unicode strings just decode as latin-1 instead of utf8 and you won't run into validation errors since all byte sequences are valid in latin-1 while not all are in utf8.
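
The round-trip trick in a few lines of Python:

    data = bytes(range(256))
    text = data.decode("latin-1")           # never raises, unlike UTF-8
    assert text.encode("latin-1") == data   # lossless round trip for any binary data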


> So ASCII gradually morphed into an 8 bit language - and that's where the problems began.

Oh sweet summer child. No, ASCII itself was a problem. Before we had 8-bit character sets, we had 7-bit character sets:

https://en.wikipedia.org/wiki/ISO_646

This is why IRC considers {|} and [\] to be lowercase and the corresponding uppercase letters, respectively: it was made by a Scandinavian, and in their character sets, some accented characters occupy the same positions as ASCII [\]{|} would.

The story of character sets is the story of evolving common subsets: ISO 646 within ASCII, ASCII within “extended ASCII” (or at least, some variants thereof), Latin-1 within the Unicode BMP, the Unicode BMP within Unicode.

Oh and by the way, before we had 7-bit character sets, we had 6-bit (e.g. IBM BCD). And before those, we had 5-bit (e.g. Baudot code). And before that, we had different telegraph codes (variations of Morse code)…


Even 5-bit Baudot-type codes were not standardized. A-Z and 0-9 are the same on all Teletype machines, but there's (at least) ITA2, USTTY, Fractions (⅛,¼,⅜,½,⅝,¾,⅞ only, for stock market use) and Weather Symbols (8 direction arrows and 4 cloud cover symbols). I own five Teletype machines and only two of them are 100% compatible.


The code that sends the price to the printer was written with currency symbols in mind, and selects the correct code page before sending the code for the £ symbol.

The code that sends the "product name" was not, and doesn't correctly translate its input to the code page that the printer is using.

When I made a homemade POS system for a bar, years ago, I ran all the printers in bitmap mode and rendered the receipts in software, to sidestep this and other problems. The performance was still acceptable, but I think the reason many POS systems don't go this route is compatibility; they have to work with many models of printer and bitmap support is not universal, and even among those printers that support it I am not sure if it is standardised.


>Each language needed its own code page. For example Greek uses 737 and Cyrillic uses 855.

Cyrillic is not a language, it's an alphabet/script. Codepage 855 was used for Cyrillic mostly in IBM documentation. In Russia codepage 866 was adopted on DOS machines, because in codepage 855 characters were not ordered alphabetically.

>Even today, on modern windows machines, typing alt+163 will default to 437 and print ú.

It's only true for machines where so called "OEM codepage" is configured as codepage 437. But in Russia it's codepage 866 by default, so typing alt+163 prints г.


That's a good point. I've updated the post to reflect that it's an alphabet. Thank you!


It's worth noting that ALT+X gives you the default OEM code page for compatibility with DOS (sigh), whereas ALT+0X gives you Unicode. So typing ALT+0163 will give you £.


>sigh whereas ALT+0X gives you Unicode. So typing ALT+0163 will give you £.

This is incorrect. It gives you an ANSI codepage. On old Windows versions it would be the default ANSI codepage; on modern Windows it's the codepage associated with your input language. So if I type ALT+0163 with an English keyboard layout I get £ from Win-1252, but the same combo after switching to Russian gives me Ј from Win-1251.

Entering numbers bigger than 255 just causes wraparound. For example, ALT+0835 also will give you £ instead of ₣.


>8859-1 defines the first 256 symbols and declares that there shall be no deviation from that. Microsoft then immediately deviates with their Windows 1252 encoding.

>Everyone hates Microsoft.

If only it was only that... Microsoft has even worse encoding schemes. The ugliest I encountered was an "encoding" based on glyph indexes in ttf files.

Conversion is a pain in that case, and is uncertain... it also leads me to not so beautiful code...

https://raw.githubusercontent.com/kakwa/libemf2svg/master/in...

Even between Microsoft products (namely Office on Mac and Office on Windows), this scheme is not handled properly (the string is incorrectly handled as a UTF-16LE string by Office on Mac).


Some of what's written here is not quite right. ASCII was developed in cooperation with ISO, ECMA (European Computer Manufacturers Association), BSI (British Standards Institution), and CCITT (International Telegraph and Telephone Consultative Committee), and it was clear from the start that there would be national/linguistic versions — this was the origin of ‘code pages’, to use the IBMism. ISO 2022 / ECMA 35 had defined the means of designating character sets (both 7-bit and 8-bit) by 1971, a decade before the IBM PC chose to ignore the standard.


In fact, the original version of ASCII even left some of the codes under 128 undefined or available for local redefinition. This is why Smalltalk uses _ for assignment (it was a left arrow on the Alto) and why some still-used encodings have a local currency symbol (e.g. ¥) in place of \.


Along with leaving codes 96–123 undefined, and significant differences in control codes, 1963 ASCII had ← and ↑ in the positions used for _ and ^ in 1967.


ISTR that a previous version of ASCII standardized the back-arrow in the underscore space. The same with up-arrow in place of caret, which is why the caret is used as an exponentiation operator in e.g., BASIC.


The receipt also has the time in 24 hour format, then a zero-padded AM/PM format a couple of lines below. Shoddy software, with no attention to detail.

In Britain, it would be easy not to notice the incorrect symbol when setting up the machine. Elsewhere in Europe, it ought to get noticed quickly — but I occasionally get receipts in Denmark where the shop's address (or even name!) is corrupted, like "SkrÉdderi, LÎvstrÉde" instead of "Skrædderi, Løvstræde".


Since the beginning of May I had some issues with NFC payments, which manifested themselves as terminals printing out receipts with error messages consisting of box-drawing characters interspersed with random letters. I thought it was caused by the card expiring in May and not allowing NFC payments in its last month, and that this was somehow a feature; only after seeing the message in the correct encoding did I realize it was user error, as I had both the old card and the unactivated new card in my wallet.


In the UK, there are plenty of computer systems that refuse to acknowledge that people or places can have non-ASCII chars in their name, and ask you to correct them... or even refuse to work, possibly because they are comparing broken, differing encodings of them. Even paying tax by debit card seems to be impossible with their chosen payment processor if the name on your card or parts of your address do not match the constraints of the English alphabet.


My Danish street address includes non-ASCII letters.

There are easy transliterations, but I input them on my British accounts to make a point. About half work correctly.


Avoiding any international characters, both when registering addresses and when inputting them in forms, ends up being the path towards hopefully being able to spend money with credit/debit cards, though. For various other non-financial forms, fuzzing the systems with what should be common enough European text and laughing as it fails is safer fun.


The salient property of all flavors of ASCII is that each character fits nicely in an 8-bit word. This word size was commonly used in computer memory at the time, and memory was very expensive.

My first programming job was writing software for the MUMPS operating system on a DEC PDP-15, which had an 18-bit word size. PDP-15 MUMPS used 6-bit ASCII (which was uppercase only) because three characters fit nicely in an 18-bit word.


The problem here is not the printer.

I'm willing to bet the problem here is that the descriptions of the items are stored in the database as ascii and not unicode.


Interesting article. Reminds me of a recent experience when I registered a few companies, one of which included R&D in the name. No problem for UK Companies House: online registration within minutes. But it has been surprising how much grief the & character causes with other systems. Banking systems refuse to accept it; they only accept a very limited set of characters for names. Should have used RnD like AirBnB. It is ridiculous, though, that gymnastics like this are still required in 2017! In the EU most banks are relaxed about account names, since they just rely on IBANs, but in places like Serbia they are annoyingly anal and reject payments if the name does not match exactly.


We don't always take the time to understand Unicode.

I wrote the following article for Node.js to try and clarify the intersection of Unicode and filesystems, especially with regard to different normalization forms, and using normalization only for purposes of comparison:

https://nodejs.org/en/docs/guides/working-with-different-fil...


Not enough love for code page 437! If we had proper support for it I wouldn't have so much trouble displaying proper smiley faces in the console. Linux, I'm looking at you.


The problem with supporting all of cp437 is that it assigns a character to every byte, including control codes.

Even in DOS this caused issues:

1) NUL is the same as space in cp437, but is all ones in many other DOS code pages. This causes strings output by some software written in C(++) to end in a black rectangle (notably in the C++ version of Turbo Vision, including the Turbo C++ IDE); the background of thin 8px-spaced lines in many TUI applications is caused by the same thing (see below for why it is rendered as thin lines).

2) DOS and language runtimes for DOS tend to ignore most control codes, but still not all of them. In particular 0x07 BEL is a useful character (often used as the dot in a selected radio button); the only way to get it on screen is to write directly into the framebuffer.

3) The MDA-style character generator (present on essentially anything but CGA) has special hardware logic for making cp437 box-drawing characters one pixel wider. This means that all "right facing" box-drawing characters have to reuse codes with this magic behavior, and you cannot use these magic slots for normal characters that are wider than 7px. (And this is the reason for the thin 8px-spaced lines.)


On the other hand, it was sometimes tremendously useful. I remember a DOS terminal emulator which could operate in "normal mode" (control codes were interpreted normally) or "diagnostic mode" (control codes were printed as their CP437 characters). Came in real handy when attempting to debug terminal output with screen-clobbering characters in it.


Character encoding is hard. Unicode is not hard, at least not that hard, certainly compared to character encoding before Unicode. Unicode is the solution, not the problem. The problem here is that something got confused about what character encoding was in use somewhere -- debugging this is hard, but the best solution is almost always "just make it a Unicode encoding, ideally UTF-8, at every stage of the pipeline you can".


I am author of a CJK language library for python called cihai (https://cihai.git-pull.com).

So as part of this, and after years, I eventually realized the only way to make a scalable tool to look up Han glyphs is to build upon UNIHAN: the Unicode Consortium's Han unification effort.

I write about Unicode and UNIHAN in my own words here: http://unihan-etl.git-pull.com/en/latest/unihan.html

The challenge with Unicode and hanzi is there are many historical and regional variants to a single source Han grapheme of the same meaning.

So, each glyph or variant gets its own codepoint, or number, reserved. In fact, this year, when Unicode 10.0 is cut, the new CJK Extension F will introduce 7,473 characters (http://unicode.org/versions/Unicode10.0.0/).

Thankfully, my only task is to make the database accessible in as friendly a way as possible. Which is actually a mammoth task: there are over 90 fields, used to denote dictionary indices, regional IRG [1] indices (the IRG being national-level workgroups that convene to add new characters), phonetics (Mandarin, Cantonese jyutping, and more).

The fields are dense. They pack in objects that are most easily split up by regular expressions. https://github.com/cihai/unihan-etl/blob/master/unihan_etl/e...

So a UNIHAN field for kHanyuPinyin (http://www.unicode.org/reports/tr38/#kHanyuPinyin):

U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī

U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng

U+5364 has two values (separated by the space); each value has a list of items on either side of the colon (:), separated by commas.
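
A rough sketch of that splitting in Python (nothing like the real unihan-etl code, and the function name is made up):

    def parse_khanyupinyin(field: str):
        """'10093.130:xī,lǔ 74609.020:lǔ,xī' -> [(locations, readings), ...]"""
        entries = []
        for value in field.split():               # values are space-separated
            locations, readings = value.split(":")
            entries.append((locations.split(","), readings.split(",")))
        return entries

    print(parse_khanyupinyin("10093.130:xī,lǔ 74609.020:lǔ,xī"))
    # [(['10093.130'], ['xī', 'lǔ']), (['74609.020'], ['lǔ', 'xī'])]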

You may wonder where this all comes from. The effort is global, but a good deal of it is thanks to people who took their time to contribute it, organizationally or personally. Take a look in the descriptions of the fields at http://www.unicode.org/reports/tr38/ for bibliographic info.

In any event, the hope is to create a successor to cjklib (https://pypi.python.org/pypi/cjklib) and have datasets for CJK available in datapackages (http://frictionlessdata.io/data-packages/). That way, sources of data are sustainable and not tied down to any one library.

[1] https://en.wikipedia.org/wiki/Ideographic_Rapporteur_Group


"The printer doesn't know which code page to use, so makes a best guess."

The printer probably uses a default code page, and that's all. BTW, Unicode is not hard. The "hard" part is reading the device manual and implementing encoding conversion properly. Also, in cases where no character selection is possible, you can usually use the printer in graphics mode.


What if, in the distant future, the actual spelling of people's surnames drifts due to normalization of this? I'd liken it to immigrants having their names transliterated to the Latin alphabet at Ellis Island, or something like that.


Łukasz Langa recently gave a PyCon talk [1] on the subject.

[1] https://www.youtube.com/watch?v=7m5JA3XaZ4k


That talk is proof as to just how difficult Unicode is in practice:

* @15:32, "UTF-32 uses the same amount of bytes for (almost) all code points" — there is no "almost" about it; UTF-32 always uses 4 octets per code point.

* There was some amount of conflation between code points and characters.

* It was implied that len() will always give you length-in-code-points in Python 3, whereas it doesn't in Python 2. In Python < 3.3, it's code units (just like it is in Python 2), which on a narrow build will be 16-bit and thus wrong for strings w/ code points outside the BMP. This particular problem wasn't solved until 3.3 with the introduction of PEP-393.

The author's main points, regarding the difference between text and how you encode it, are good.


Related PyCon talk about this: https://youtu.be/bx3NOoroV-M


> Unicode was born out of the earlier Universal Coded Character Set

Unicode was started independently and later harmonized with UCS.


Some parts of Unicode are hard, like many characters looking almost exactly alike.


Unicode isn't hard, dealing with software that doesn't use it is.


Software doesn't use it because our languages' and our systems' support for effectively dealing with this stuff is utter garbage. For example:

* The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code units — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

* Linux, and most (all?) POSIX OSs store filenames as a sequence of bytes. What human chooses a sequence of bytes to "name" their files?

* Things like "how wide will this character display as in my terminal" are either impossible, or done with heuristics. Usually, it's not done at all; most DB CLIs I've used that output tabular data will corrupt the visual if any non-ASCII is output.

(Yes, some of this is in the name of "backwards compatibility".)
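
To put some numbers on the first bullet, a small Python demo using "👍🏽" (thumbs up plus a skin-tone modifier):

    s = "\U0001F44D\U0001F3FD"                # 👍 + emoji modifier

    print(len(s))                             # 2 code points (Python 3.3+)
    print(len(s.encode("utf-16-le")) // 2)    # 4 UTF-16 code units
    print(len(s.encode("utf-8")))             # 8 UTF-8 bytes
    # As a single grapheme cluster it's 1; the stdlib can't tell you that,
    # but the third-party `regex` module can: len(regex.findall(r"\X", s))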


> * The overwhelming majority of languages don't give you code-point level iteration over strings by default (and you probably want grapheme), most opting for code units — which is what an unsigned char ptr in C containing UTF-8 data will give you. (C++, Java, C#, Python < 3.3, and JavaScript all fall in this bucket)

Note that for (let ch of str) in JavaScript iterates over the code points, not the UTF-16 code units.


TIL! (Though, note that both indexing and .length operate in code units in JS.)



The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) from 2003:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


That is not the absolute minimum. Unicode is a complex beast and oversimplification is dangerous.

When these absolute-minimum intros talk just about encoding, they mislead people into thinking that that's enough. I can't count the number of people who have read Joel's article and have the misconception that all user-perceived characters are mapped to code points. I was one of those people. Just because the ASCII and Latin-1 character sets can be mapped to code points does not mean that's how Unicode works.

At minimum every software developer must know four different levels:

* bytes,

* code points,

* combining character sequences,

* grapheme clusters, extended grapheme clusters

Joel stops at the second level. He never gets to the point where he explains how to encode user-perceived characters, or how to detect grapheme cluster boundaries in the Unicode encoding.

examples: 각 , नी , நி


Maybe the perfect is enemy of the good?

If a developer knows and understands the concept of character encoding ("It does not make sense to have a string without knowing what encoding it uses"), at least they will know how to read a string from one system and move it to a different system that expects a different encoding. They'll know that they need to call the relevant conversion routine in a specialized library that knows how to handle the conversion.

With this, maybe they won't be able to correctly build or modify their own strings directly. But being able to handle strings from an external system that produces them, and passing them to another system that consumes them, without breaking them in the process, IMHO does qualify as "the absolute minimum" they should know.


Thanks, but I'm happy to be ignorant about levels three and four. I know they exist, and that's enough for me. Something is really wrong if programmers of most applications domains have to care about that complexity.


Knowing that they exist is the minimum you must know. Knowing what you don't know is already knowledge.

Joel gives the impression that he doesn't know that he doesn't know.

Knowing that you can't break a Unicode text string, or insert text into the middle of a Unicode string, unless you know what language it uses is usually enough. They are just binary blocks you can't modify unless you have some extra info or use specific libraries.


You have to know they exist, and know when you should worry about them (and call in to the appropriate APIs)

https://manishearth.github.io/blog/2017/01/15/breaking-our-l... gives an overview of most of the different things scripts do. Being aware of those helps a lot. It also gives a brief idea of how to deal with this stuff (usually it's just calling an API)

https://manishearth.github.io/blog/2017/01/14/stop-ascribing... gives an idea of how grapheme clusters work. You don't need to know the algorithm, just the stuff around it.


What Joel covers is just fine for software developers who a) work in a language written in Latin script and b) aren't specifically responsible for internationalization of their product. Deeper issues can be left to specialists.


At minimum every software developer must also know about normalization, and must know what pre-composed and decomposed forms are (especially normal forms, NFC and NFD).

These things are things one can mostly ignore, until one can't.

Most input modes produce pre-composed (but not necessarily NFC) output. But some things will decompose (e.g., HFS+ will decompose filenames). So... if you cut-n-paste non-ASCII Unicode from an HFS+ file picker UI... you'll get into trouble if the software you paste into is unaware of these things.
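
For instance, a minimal JavaScript sketch of the precomposed/decomposed mismatch (normalize() is standard ECMAScript; the "é" is just an example):

    const precomposed = "\u00E9";     // "é" as a single code point (what most keyboards produce)
    const decomposed  = "e\u0301";    // "e" + combining acute accent (what HFS+ hands back)
    console.log(precomposed === decomposed);                                    // false
    console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC"));  // true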

Ultimately, every software developer needs:

- UTF validator (at least for UTF-8)

- UTF converters (unless only supporting UTF-8)

- case mapping (probably)

- normalization (almost certainly)

- collation (probably)

That's... not too bad.

Networking software _may_ also need:

- IDNA2008 implementation

- UTS#46 implementation
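
For a taste of that in practice: the WHATWG URL parser already applies the IDNA (UTS#46) processing to hostnames, so a minimal sketch (the domain is just an example) looks like:

    const u = new URL("http://bücher.example/");
    console.log(u.hostname);   // "xn--bcher-kva.example"  (Punycode/ACE form)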

Word processing / typesetting software also absolutely needs to know about grapheme clusters in order to determine the size of each grapheme. Also: modern fonts.


So you're basically just listing everything you know about Unicode and then stating that every software developer MUST know the same. That's bullshit, of course.


WAT? I gave a breakdown of when you might need support for specific aspects of Unicode. That is not an exhaustive list, just a list that will get 95% of developers covered in most cases. I didn't mention bi-di, for example, though I probably should have.


Unicode is not hard. What's hard is the conversion between all these different systems. That's the hard part. Unicode is simple enough to be handled flawlessly as long as you stick to Unicode for everything.


If you only need to receive, store, and send text, Unicode is easy enough and you can just treat it as a byte stream. Once you get into things like manipulating text, comparisons and searches, or displaying text, things get hairy and all kinds of fun algorithms from the various Unicode Technical References and Notes make their appearance. Those parts are the ones that increase complexity.

Also, a major reason why Unicode is large and complex is because languages and scripts are large and complex. Unless we all agree on using simple computer-friendly languages and scripts that complexity is not going to change, and the need of working with older scripts (e.g. for historians and researchers) still requires something like Unicode. Unicode is the kind of thing that emerges from a messy world, and unsurprisingly it's messy as well.


Unicode is still _way_ less hard than anything else for manipulating text. Global human written language is complicated; Unicode is a pretty ingeniously designed standard, and it's got solutions that work pretty darn well for almost any common manipulation you'd want to do. Now, everything isn't always implemented or easily accessible on every platform, and people don't always understand what to do with it -- because global human written language is complicated -- but Unicode is a pretty amazing accomplishment, quite successful in various meanings of 'successful'.


It is actually hard.

https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/

But sticking with only the parts of Unicode support that you understand/need is easy, sure.


Meh.

It's hard, because there's a lot more to learn and to do than if you stick to (say) ASCII and ignore the problems ASCII can't handle.

It's easy, because if you want to solve a sizable fraction of all the problems ASCII just gives up on, Unicode's remarkably simple.

In the eyes of a monoglot Brit who just wants the Latin alphabet and the pound sign, unicode probably seems like a lot of moving parts for such a simple goal.


Something as simple as moving the insertion point in response to an arrow key requires a big table of code point attributes and changes with every new version of Unicode. Seemingly simple questions like "how long is this string?" or "are these two strings equal?" have multiple answers and often the answer you need requires those big version-dependent tables.

I think Unicode is about as simple as it can possibly be given the complexity of human language, but that doesn't make it simple.


A Brit hoping to encode the Queen's English in ASCII is, I'm afraid, somewhat naïve. An American could, of course, be perfectly happy with the ASCII approximation of "naive", but wouldn't that be a rather barbaric solution? ;)


For anything resembling sanely typeset text you’d also want apostrophes, proper “quotes” — as well as various forms of dashes and spaces. Plus, many non-trivial texts contain words in more than one language. I’d rather not return to the times of in-band codepage switching, or embedding foreign words as images.


This is why the development of character sets requires international coördination from the beginning. :)


Yeah. And then you'll get Latin-1, because everyone using computers is in Western Europe or uses ASCII ;)


But comparing something to something else, and finding it easier, doesn't make it easy by itself.

Paraphrasing the joke about new standards: we had a problem, so we created a beautiful abstraction. Now we have more problems. One of the new problems is normalization.

It doesn't undermine the good that Unicode brought, but you can't claim to just include some unilib.h and use its functions without understanding all the Unicode quirks and its encodings, because some of the parameters, like those very normalization forms, wouldn't even make sense to you.


Wait. There are two possible cases:

1. Either you restrict yourself to the kind of text CP437/MCS/ASCII can handle (to name the three codecs in the blog post). In that case Unicode normalisation is a no-op, and you can use Unicode without understanding all its quirks.

2. Or you don't restrict the input, in which case unicode may be hard, but using CP437/MCS/ASCII will be incomparably harder.


A rocket can take you to the Moon. Is it easy to operate? Or to learn how to? To maintain it and prepare on the ground?

Not only would it be harder: you couldn't get into space at all without it, so that got comparatively easier.

Is it still all easy, though?


Unicode IS hard. It's hard because concepts that exist in ASCII don't really extend to Unicode, and many of them depend on what locale you're operating in. Things like case conversion (in Turkish, ToUpper("i") should be "İ", not "I"), comparison (where do you put é, ê, and e?), what constitutes a character, word, whitespace, what direction do you write text in, how many spaces do characters take up in the terminal, etc.
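
A quick sketch of that Turkish case, assuming a JavaScript engine with full locale data (e.g. a browser or a full-ICU Node build):

    console.log("i".toUpperCase());            // "I"  (locale-independent default)
    console.log("i".toLocaleUpperCase("tr"));  // "İ"  (U+0130, capital dotted I)
    console.log("I".toLocaleLowerCase("tr"));  // "ı"  (U+0131, small dotless i)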


Some of these problems exist even when you're limited to ASCII.

For example, in olden times, or when restricted to ASCII, the Nordic letter "å" is written "aa", but it is still sorted at the end of the alphabet — "Aarhus" will be close to the end of a list of towns.

In Welsh there are several digraphs, single letters written with two symbols. The town "Llanelli" has 6 letters in Welsh. (There are ligatures, but I don't think they're often used: Ỻaneỻi.)


Indeed, collation, case-insensitive string matching, and probably a bunch of other things must be used with an appropriate locale. That was the case before Unicode and is still the case with Unicode. The only difference is that the tables for how to do it are slightly larger now, but the operation itself isn't (much) more complex.
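
A minimal sketch of locale-dependent collation in JavaScript, using the Nordic "å" from the sibling comment (Intl.Collator is standard; the exact ordering comes from the engine's CLDR data):

    const letters = ["å", "b", "z"];
    console.log([...letters].sort(new Intl.Collator("en").compare));  // [ 'å', 'b', 'z' ]
    console.log([...letters].sort(new Intl.Collator("sv").compare));  // [ 'b', 'z', 'å' ]  (å sorts after z in Swedish)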


I would edit that to say "as long as you stick with UTF-8 for everything." Unicode defines more than one encoding: not just UTF-8, but also UTF-16 and UTF-32.


Yeah, sorry, you are right. Stick to one of the Unicode encodings.


It's hard in command-line tools like coreutils, since there is no setting (afaict) for making sure all string comparisons are normalized. So comparing files whose names use decomposed vs. precomposed glyphs is painful. E.g. with make: if the generated files use decomposed glyphs but you type precomposed glyphs into your makefile, then nothing will work despite the filenames appearing to be the same.


In ancient times we tried to build the Tower of Babel, that would reach to God and Heaven. God said "Nah," made us all speak different languages and scattered us around.

Now it looks like we're up to our old tower building ways again, except this time with computers and data. So God smirked and gave us Unicode.



