
Note that the encoding that MySQL calls "latin1" (and uses as its default) is not, in fact, latin1. It is windows cp1252 except with 8 random characters swapped around. I wish I was joking.



Haha, and what mysql calls "utf8" is not, in fact, all of utf8. That's called "utf8mb4".
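A minimal Python sketch of why that distinction matters, assuming (as the MySQL manual describes) that the legacy "utf8" charset stores at most 3 bytes per character while "utf8mb4" allows 4. The sample characters and the helper name `needs_utf8mb4` are only illustrative, and this checks encoded byte lengths in Python rather than MySQL's own behavior.

```python
# Assumption: MySQL's legacy "utf8" holds at most 3 bytes per character,
# so any character whose UTF-8 encoding is 4 bytes long needs "utf8mb4".
MAX_BYTES_LEGACY_UTF8 = 3


def needs_utf8mb4(ch: str) -> bool:
    """Return True if a single character takes more than 3 bytes in UTF-8."""
    return len(ch.encode("utf-8")) > MAX_BYTES_LEGACY_UTF8


# Illustrative samples: ASCII, Latin-1 range, rest of the BMP, outside the BMP.
samples = ["A", "é", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, length in utf8_lengths.items():
    print(repr(ch), length, "needs utf8mb4" if needs_utf8mb4(ch) else "fits")
```

Anything outside the Basic Multilingual Plane, emoji included, lands in the 4-byte case.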


It also can't sort unicode correctly according to the standard UCA algorithm. The ticket for this is closed as a wontfix.


Jesus Christ.

Is there a reason for this? Or is it just another case of lol@mysql?


So in 2017, what's the correct character set to use in MySQL?


"utf8mb4", though really in 2017 the correct thing is to use Postgres.


but not everyone can use postgres


Anyone can if they want to enough. Some don't want it enough, or want other things more, but not using postgres is always a choice.


Which ones? I can't find info on that.


https://dev.mysql.com/doc/refman/5.7/en/charset-we-sets.html

Looks like I misremembered - 5 rather than 8 characters. But it isn't standard cp1252 and this can matter.


That's just describing how it will handle erroneous data. If you give it cp1252 text, it will work exactly as expected. If you give it certain invalid characters, it will treat it as those code points.


There's nothing erroneous about U+0081. MySQL's encoding functions are documented to behave in particular ways when a given character cannot be represented in a given encoding, and its handling of "latin1" violates that documentation unless you take into account the nonstandard extra mappings MySQL uses.


cp1252 does not have a character assigned to 0x81. If you store 0x81 in the database as 'latin1', then it needs to error or do something weird. If you store U+0081 in the database, that's a control character in the C1 block that doesn't exist in cp1252, so it needs to error or do something weird.

If it violates the documentation about invalid characters, that's a problem, but that's not latin1 being incompatible with 1252.
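For reference, a small Python sketch of the bytes being argued about. It uses Python's built-in `cp1252` and `latin-1` codecs as stand-ins for the standard encodings; it doesn't touch MySQL at all, and the helper name `try_decode` is made up for illustration. The five byte values are the ones cp1252 leaves unassigned.

```python
# The five byte values that standard cp1252 leaves unassigned.
UNASSIGNED_IN_CP1252 = [0x81, 0x8D, 0x8F, 0x90, 0x9D]


def try_decode(byte_value: int, codec: str):
    """Decode a single byte with the given codec; return None if it is unmapped."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Map each byte to (cp1252 result, latin-1 result).
results = {
    b: (try_decode(b, "cp1252"), try_decode(b, "latin-1"))
    for b in UNASSIGNED_IN_CP1252
}

for b, (as_cp1252, as_latin1) in results.items():
    print(hex(b), "cp1252:", repr(as_cp1252), "latin-1:", repr(as_latin1))

# For contrast, 0x80 is assigned in cp1252 (the euro sign) but is a
# C1 control character in ISO 8859-1.
euro_byte = (try_decode(0x80, "cp1252"), try_decode(0x80, "latin-1"))
```

In Python, strict cp1252 rejects those five bytes while ISO 8859-1 maps each to the C1 control character with the same code point, which is the gap the linked MySQL page says its "latin1" fills.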


If mysql handled columns declared as "ascii" as utf8 that would be "compatible" in the sense you're describing. I think it would be fair to say "mysql ascii isn't actually ascii" in that case though.


I see your point, but in that case I certainly wouldn't say it mishandles ascii. Defining some undefined behavior is very different from changing existing behavior.

Plus that's a different scale of change because it's going from fixed width 7-bit to a variable width scheme.



