Note that the encoding that MySQL calls "latin1" (and uses as its default) is no...

0x0 · on May 29, 2017

Haha, and what MySQL calls "utf8" is not, in fact, all of UTF-8. Full UTF-8 is called "utf8mb4".
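To make the gap concrete: MySQL's "utf8" stores at most three bytes per character, so anything outside the Basic Multilingual Plane (emoji, for instance) doesn't fit. A rough Python sketch of that check follows; `fits_in_mysql_utf8` is a made-up helper name, not a MySQL or Python API, and it doesn't talk to a database.

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: it only mirrors the 3-byte-per-character limit
    of MySQL's "utf8" charset; it never contacts a server.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Characters inside the Basic Multilingual Plane need 1 to 3 bytes.
print(fits_in_mysql_utf8("héllo €"))

# U+1F600 (an emoji) needs 4 bytes, so only "utf8mb4" could hold it.
print(fits_in_mysql_utf8("hi \U0001F600"))
```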

Joeri · on May 29, 2017

It also can't sort Unicode correctly according to the standard UCA (Unicode Collation Algorithm). The ticket for this is closed as a wontfix.

s_kilk · on May 29, 2017

Jesus Christ.

Is there a reason for this? Or is it just another case of lol@mysql?

vatotemking · on May 29, 2017

So in 2017, what's the correct character set to use in MySQL?

lmm · on May 29, 2017

"utf8mb4", though really in 2017 the correct thing is to use Postgres.

vatotemking · on May 31, 2017

But not everyone can use Postgres.

lmm · on June 1, 2017

Anyone can if they want to enough. Some don't want it enough, or want other things more, but not using postgres is always a choice.

Dylan16807 · on May 30, 2017

Which ones? I can't find info on that.

lmm · on May 30, 2017

https://dev.mysql.com/doc/refman/5.7/en/charset-we-sets.html

Looks like I misremembered - 5 rather than 8 characters. But it isn't standard cp1252 and this can matter.
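If you want to see which five positions are at issue, one way is to ask Python's own cp1252 codec which byte values it refuses to decode. This is just a sketch (the function name is invented for illustration), and it shows Python's view of cp1252, not MySQL's:

```python
def cp1252_unassigned() -> list:
    """Collect the byte values that Python's cp1252 codec cannot decode."""
    holes = []
    for b in range(256):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # Windows-1252 assigns no character to this position.
            holes.append(b)
    return holes


print([hex(b) for b in cp1252_unassigned()])
```

Those are the byte values the linked page says MySQL maps to the Unicode code points with the same numbers.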

Dylan16807 · on May 30, 2017

That's just describing how it will handle erroneous data. If you give it cp1252 text, it will work exactly as expected. If you give it certain invalid characters, it will treat them as the code points with the same values.

lmm · on May 30, 2017

There's nothing erroneous about u0081. MySQL's encoding functions are documented to behave in particular ways when a given character cannot be represented in a given encoding, and its handling of "latin1" violates that documentation unless you take into account the nonstandard extra mappings MySQL uses.

Dylan16807 · on May 30, 2017

1252 does not have a character assigned to 0x81. If you store 0x81 in the database as 'latin1', then it needs to error or do something weird. If you store u0081 in the database, that's a control character in the C1 block that doesn't exist in latin1, so it needs to error or do something weird.

If it violates the documentation about invalid characters, that's a problem, but that's not latin1 being incompatible with 1252.
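For what it's worth, here is a small Python sketch of the behaviour being argued about. It assumes the linked manual page is accurate that MySQL's "latin1" maps the five unassigned cp1252 bytes to the code points with the same values. `decode_mysql_latin1` is a hypothetical helper, not anything MySQL ships.

```python
# Byte values that Windows-1252 leaves unassigned.
CP1252_UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def decode_mysql_latin1(data: bytes) -> str:
    """Decode bytes as cp1252, passing the unassigned bytes through.

    Assumption: this mirrors the mapping described on the linked MySQL
    manual page; it is an emulation, not MySQL code.
    """
    chars = []
    for b in data:
        if b in CP1252_UNASSIGNED:
            chars.append(chr(b))  # e.g. 0x81 becomes U+0081
        else:
            chars.append(bytes([b]).decode("cp1252"))
    return "".join(chars)


print(repr(decode_mysql_latin1(b"\x80")))  # an assigned cp1252 byte
print(repr(decode_mysql_latin1(b"\x81")))  # an unassigned one

# A strict cp1252 decoder errors out instead of defining the behaviour.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError as exc:
    print("strict cp1252 rejects 0x81:", exc.reason)
```

Every byte that cp1252 does assign decodes identically in both, so the two only differ on input a strict decoder would reject.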

lmm · on May 31, 2017

If MySQL handled columns declared as "ascii" as utf8 that would be "compatible" in the sense you're describing. I think it would be fair to say "MySQL ascii isn't actually ascii" in that case though.

Dylan16807 · on May 31, 2017

I see your point, but in that case I certainly wouldn't say it mishandles ascii. Defining some undefined behavior is very different from changing existing behavior.

Plus that's a different scale of change because it's going from a fixed-width 7-bit scheme to a variable-width one.