In MySQL, use “utf8mb4” instead of “utf8” (2016) (adamhooper.medium.com)
643 points by goranmoomin on Jan 12, 2022 | 351 comments



This is also a security vulnerability because MySQL/MariaDB silently truncate strings at the first invalid character. This can result in data manipulation attacks where a higher level layer validates the complete input as UTF-8 for insertion into the DB, but the database only stores half the string.

https://bugzilla.mozilla.org/show_bug.cgi?id=1253201


Silent truncation only occurs if strict sql_mode is disabled. Strict sql_mode has been the default since MySQL 5.7, released over 6 years ago. All prior versions have hit EOL support, so there is literally no supported version of MySQL which has this behavior out-of-the-box.
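
A quick way to verify what a given server is actually running with (a minimal sketch; STRICT_ALL_TABLES also counts as strict):

    -- Check what the server and your current session are actually using:
    SELECT @@GLOBAL.sql_mode, @@SESSION.sql_mode;

    -- If strict mode is missing, add it (and persist the same value in my.cnf):
    SET GLOBAL sql_mode = CONCAT(@@GLOBAL.sql_mode, ',STRICT_TRANS_TABLES');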

The topic of MySQL utf8 vs utf8mb4 was discussed in depth just six days ago on HN: https://news.ycombinator.com/item?id=29793916


That's good to know; however, I was encountering this MySQL truncation just last year in COTS software, so clearly there's a long tail. As a related example, log4j 1.x has been deprecated for years, but 100% of Atlassian's products use it, with their own maintained version.

In fact, when the log4j thing came up, we audited all our existing COTS/internal and virtually all of it was still on the "EOL" log4j 1.x.

Since you're familiar with this: what happens with an existing upgrade - is strict mode enabled by a typical distro package upgrade, or does it only impact new installs?


I'm going to use this as an opportunity to vent.

Wouldn't it be a nice world where you can just declare something EOL and it disappears, and somehow magically it's not a problem anymore for anyone end-to-end? Things just don't work that way. Like, a product that was working perfectly fine for who knows how long suddenly needs to be modified because some ass-hat thinks that EOL software isn't suitable for production.

There are a number of reasons to upgrade a dependency or a product; an EOL label shouldn't be a part of the calculus. EOL may imply certain conditions that may influence the decision to upgrade, but for the sake of EOL you can eat a bag of dicks. Think harder about why you're proposing an upgrade is needed and how it'll add value to the business. EOL software is perfectly fine, especially because it's been battle tested.

We also use log4j 1.x all over the place and I feel zero need to change that.


It is perfectly reasonable to continue to use EOL software, but you have to assume maintenance costs in that case. It would be irresponsible to continue to use EOL software as is. EOL is a perfectly natural concept. Nothing in this world lasts forever, including our machines, electronics, cars, batteries, lives, and yes, software.


If software lasted a lifetime, life would be so much easier.


For the end user, not the maintainers. Who's gonna sign up to keep that ENIAC, PDP-11, ALGOL 58, python2.3, win3.1, etc, code running, stable, and not a source of exploits?

Or do you mean in the abstract, fairy-tale ending sense?


I think they meant in the tongue-in-cheek sense.


> EOL may imply certain conditions

And "EOL" is a convenient shorthand for those conditions.

Most use cases care about never getting patches for security problems ever again.


> There are a number of reasons to upgrade a dependency or a product; an EOL label shouldn't be a part of the calculus.

If the end user of the product or their needs don't matter, sure.


> We also use log4j 1.x all over the place and I feel zero need to change that.

I dunno, a widely known RCE seems a pretty good reason to take something out of production.


This is a case of “too old to be vulnerable”: the exploit only affects log4j 2.x


Ah makes sense. I should've looked it up and double checked before commenting, my bad.

Honestly, in that case I tend to agree with GP. As long as you recognize EOL means you are in charge of supporting it and possibly backporting security fixes and you're okay with that - why not use it?


I agree EOL is too often interpreted to mean 'Insecure' - in the case of closed solutions this is absolutely the case (presumably, since it is often difficult to audit closed and obfuscated solutions).

In many cases it is not; it is purely a 'ticking time bomb' - if new CVEs or requirements arise which cause issues in EOL versions, you may not even become aware of the vuln.

In counterbalance, new versions of code have new, unknown CVEs, new bugs, etc.

The issue is that of support; the fallacy and process issue is the assumption that 'current' support is of greater quality than 'historical' support, or that old software has disproportionately higher support costs than 'current' software.

The true issue is overdependence on FOSS software and unwillingness to contribute - all software has costs, and major vendors are 'coasting' on work provided by free volunteers and panic when those volunteers decide to no longer work for free.

If vendors were doing the legwork necessary to properly inhouse their dependencies, I would wager that much less software would end up discarded in 'fear', for lack of a better word - modern development is supposedly built on features and patches, and any codebase should be maintainable for a much longer period than its supposed EOL.


I don't see a problem, customer/employer pays and modifications happen.


For sure, there's a huge long tail. However, if a company is using 6+ year old EOL software that has flaws, that's outside of the vendor's control.

> What happens with an existing upgrade

sql_mode is a setting, so it depends on whether the default is being overridden: e.g. does the my.cnf config file specify a value explicitly? Does your application's connection logic set a session-level override for each connection? DBaaS vendors add an additional wrinkle since they tend to have custom UIs and APIs for global settings.
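
For example, an app or ORM can quietly override the server default for its own connections, which is why checking the global value alone isn't enough (a minimal sketch):

    SELECT @@GLOBAL.sql_mode;    -- the server-wide default (built-in, or from my.cnf)

    SET SESSION sql_mode = '';   -- a per-connection override: strict mode is now off...
    SELECT @@SESSION.sql_mode;   -- ...but only for this one connection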

> typical distro package upgrade

Many Linux distros made it so that attempting to install "mysql" actually installs MariaDB instead, so when going this route, it's unfortunately always very hard to predict what you'll get or what will happen!


> Many Linux distros made it so that attempting to install "mysql" actually installs MariaDB instead, so when going this route, it's unfortunately always very hard to predict what you'll get or what will happen!

MariaDB is indeed pretty pleasant to use and for most use cases can be a drop-in replacement for MySQL, should you not want to use MySQL, but also wouldn't want/need/be able to use PostgreSQL or something else: https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/

I might be in the minority here, but i actually rather enjoy MySQL/MariaDB, since MySQL Workbench is still the best DBA software that i've used, especially because of the ER modelling and schema sync functionality, and their approach to schemas/users within a DB feels more reasonable than that of Oracle/PostgreSQL.

That said, both PL/SQL and PL/pgSQL feel more powerful than whatever MySQL/MariaDB has. Also, in regards to MariaDB in particular, their versions have diverged from MySQL somewhat, so you might still want to look into the differences, if migrating over: https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/

As for predictability - nowadays i just run specific versions in containers (e.g. 10.7.1 or maybe 10.7, which corresponds to the latest PATCH release), to sidestep the distro differences altogether. It also lets me easily launch new instances, give resource limits to them, have multiple versions in parallel and mount the data directories wherever i need them, for easier backups etc.


Same. If I'm building and managing any special instances/VMs these days I avoid installing any packages but those required to run containers. There are certainly tradeoffs, but it makes software management so much easier. Upgrading is a matter of stopping and starting a container using an upgraded image and, if it's supported, rolling back is just a matter of stopping the newer container and restarting the old. A container hosting service/platform and its abstractions are better, but if you have to run some one-offs outside of that it's so much nicer to use containers. I call these pet cattle.

I still bind mount directories though. Docker volumes seem really convenient but a few years back I dealt with helping out a coworker with some data loss (pre-volumes) in a container's filesystem so I have some trust issues with how they implement storage.


> I still bind mount directories though. Docker volumes seem really convenient but a few years back I dealt with helping out a coworker with some data loss (pre-volumes) in a container's filesystem so I have some trust issues with how they implement storage.

This mirrors my experience and despite how Docker seems to suggest that non-abstracted storage is a liability and we should use volumes instead in many cases, i disagree: https://docs.docker.com/storage/bind-mounts/ In my eyes, the current implementation for volumes is for data that i don't really care much about, but would still consider extracting/moving later and have it be persistent, like when trying out new software.

Personally, bind mounts also make it extremely easy to operate with whole installs of software - for example, should i want to upgrade something like Nextcloud (and have enough space), i can just stop the containers, make a copy at /docker/nextcloud_stack/data/nextcloud_mysql_old and run a new container version against /docker/nextcloud_stack/data/nextcloud_mysql and see if it all works.

If the update has corrupted everything, i can just delete the broken folder, rename my backup and proceed with restoring the older version. If i want to mess around with the broken version, i can do it with 0 worries about what else could break. It's amazing, way better than relying on the POSIX approach to a file system, where your software, its configuration and its persistent data is all over the place.

I once had to manage a Tomcat install which had the webapps, config, logs directories in their idiomatic locations rather than one folder, which was pretty horrible from a management perspective. I had to constantly jump around and wasn't entirely sure how it all fits together (probably lots of symlinks).

It's curious that some/many Kubernetes distros lock down the local storage driver, though - there's a preset data directory, where all of your mounted data ends up, which feels like a step backwards from a customization perspective. I had to change the data directory for the entire Kubernetes install to be able to store my data in a painfully obvious /app or /data mount point. Then again, their ability to limit the sizes of these persistent volumes is pretty cool.


> Silent truncation only occurs if strict sql_mode is disabled. Strict sql_mode has been the default since MySQL 5.7, released over 6 years ago.

What happens if you upgrade from older version? Are defaults from previous version used, or is it default only on fresh installs?


If you don't have an explicit value set in your my.cnf config and the client sends no explicit value upon connecting, you get the current built-in default of the version you are using. If you set something somewhere it will be respected.



The strict sql_mode is often manually disabled. I bet it is off on most shared hostings.


Sure, but that's not MySQL's fault. MySQL/Oracle has no control over what ill-conceived setting overrides are used by DBaaS providers or shared hosting providers.

There are similarly dumb misconfigurations that can theoretically be made in most hosted open source software.


Implementing silent string truncation in the first place (by default(!)) is an excellent indication of how fast and loose and out of touch the original MySQL developers were with the database world. It is the worst database "feature" I have ever seen.


It made more sense at the time. Earlier days of the web, dynamic scripting languages used everywhere, etc. No emoji, barely any use of 4-byte characters at all. Large corps weren't using open source DBs, or sometimes open source anything. Simpler times!

And for sake of comparison -- if I understand correctly, prior to just 2 months ago, sqlite didn't even have the option for strict type enforcement [1]. Do you judge sqlite as harshly, or think its creator is out of touch with the database world?

[1] https://www.sqlite.org/datatype3.html


Relational databases have been used in production since the late 1970s. You would be hard pressed to find any that silently corrupted user data like that, ever, until MySQL came along.

Sqlite is more of an embedded database, but if it silently truncated character data under some conditions that would make it equally unusable. Better to have no type enforcement than defective type enforcement.


I think you misunderstood my point. Yes, relational databases have been around a long time. But in the early days of MySQL, most of the companies using MySQL differed quite a bit from the companies typically using DB2, Sybase, Oracle, etc. I've been using MySQL professionally since 2003 and am speaking from first-hand experience here: the web development world was largely separate from the traditional enterprise world at that time. It was considered a different use case, it's not like today where the biggest companies in the world are tech companies built on open source software.

Regarding "silent", MySQL at least emits a warning when this happens. Still not great / obviously wrong in a modern context, but just for clarity's sake it wasn't completely silent.

As for "defective type enforcement", that's a subjective opinion. In C, atoi("123hello456") returns 123 -- once you hit an invalid character, everything from that point is truncated. Is C atoi defective? Perhaps, but it's an opinion.

With strict mode disabled, MySQL's utf8mb3 worked the same way: four-byte characters are invalid in that charset, and everything gets truncated at the first invalid character. Would simply stripping the invalid characters, but keeping valid chars after them, necessarily be more "correct"? No, it's an implementation decision, especially considering that emoji did not exist at the time and barely any non-BMP (4-byte) utf8 characters were used anywhere.


It is certainly true that most users of relational databases before MySQL used them for applications where throwing away data is a potential catastrophe, and most uses of MySQL were for toy applications where, if that happened, it was no big deal. It was about a decade before MySQL even supported transactions.

atoi is deficient by modern standards, but at least it clearly documents what the behavior is and always has. It doesn't truncate anything. For a database, truncating data is unforgivable. That is what errors are for, so your database doesn't act as a black hole where your data may silently disappear forever.

And yes, if you can't return an error for some reason, you should preserve as much data as possible. Throwing away hundreds or thousands of characters after an initial anomaly is insane. But a database should never do that anyway, it should either store what you give it, or return an error. Same with a filesystem or any other trustworthy means of data storage. Sorry we were too lazy to tell you we threw your data away isn't acceptable, nor is relying on anyone to check for warnings that should be hard errors.


I would argue yes, that atoi is defective. Stripping characters is equally defective.

A QA walks into a bar and orders "123hello456" beers. The correct answer is not for the barkeep to give them 123 beers, it's to ask "I beg your pardon?" So the truncation in general is the problem. If the input is bad - too many bytes, wrong encoding, string too long - the operation should fail. It's not the db's job to fix it; that is the business logic's job.


All competent software tokenizes an input string before calling atoi on an extracted piece of it. Otherwise you can't even tell the difference between "0" and an empty string.


Thanks for the info, I've always wondered when SQLite would get type enforcement. Let's hope more awesome datatypes get added over time.

The world is moving towards strict typing, and since this is what my brain is wired for I'm very happy.


"On Error Resume Next" was the idea of error resilience back then. MSSQL has a similar feature: when you run a script and an error happens, it just keeps running statements. If you want to interrupt it, you have to wrap the whole thing in a TRY/CATCH statement and also roll back the transaction, if any, or it will commit a partial result.


Well, the things it would have been nice for MySQL to do would have been not implementing it this way in the first place (yes, that's probably not Oracle's fault), and moving utf8 to utf8mb4 a bit more quickly (good to know that's finally happening).

Also, I gotta say, now I have to look into this strict mode and what else it breaks, when in this particular case a "don't silently truncate strings" flag would be preferable and probably lower impact, since "strict" vs "security flaw mitigation" is not quite the same thing (does such a more granular flag exist?). ... unless that's all the strict mode does.

Hm, and since you brought this up elsewhere: is this "strict by default" in MariaDB - which is what most (Linux) people are (probably) thinking of?


> moving utf8 to utf8mb4 a bit more quickly

I definitely agree!

> now I have to look into this strict mode and what else it breaks

It's a mode to ensure nothing happens silently: illegal data type conversions, attempting to insert invalid values for the data type (too large, too long, invalid characters, etc), attempting to do an INSERT which doesn't specify some columns that lack defaults, etc. With strict mode enabled, all of these things throw hard errors, like in most other relational databases. Generally, it covers the entire class of things that are typically the most common complaints about MySQL on HN.
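
A rough sketch of the difference (hypothetical table; exact warning/error text varies by version):

    CREATE TABLE t (name VARCHAR(3)) CHARACTER SET utf8mb4;

    SET SESSION sql_mode = '';                     -- non-strict
    INSERT INTO t VALUES ('abcdef');               -- succeeds, silently stores 'abc' (warning only)

    SET SESSION sql_mode = 'STRICT_TRANS_TABLES';  -- strict
    INSERT INTO t VALUES ('abcdef');               -- ERROR 1406: Data too long for column 'name'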

Interesting wording re: "breaks" -- for practical purposes this is a real concern for existing applications, which may be doing bad things which previously didn't throw errors with strict mode off. Do you want your DB to suddenly fail loudly for these things (strict mode), or do you want your existing application to continue working in these cases? Can't have it both ways.

> Is this "strict by default" in MariaDB - which is what most (linux) people are (probably) thinking of.

MariaDB made it default in 10.2, ~5 years ago.

I don't agree that most Linux people conflate MySQL and MariaDB though. At large US tech companies that use MySQL, they're almost all literally using Oracle MySQL, or a patch-set like Percona Server which directly tracks MySQL, or an internal patch-set that does the same.

In recent years, MySQL and MariaDB have diverged a bit on feature set and focus; each has some interesting features that the other lacks. They are not drop-in replacements for each other.


So... while I'm broadly in agreement with you, and a huge fan of strict typing, I will note people use languages like JavaScript and usually are OK with things like string-to-integer coercion.

It seems to be that there's a world of difference between "silently truncate string" and "type coercion" even if I readily agree both are sources of security vulnerabilities.

Thanks for the replies though, even though I clearly annoyed at least 2 people due to the downvotes.


I enabled strict sql mode in MariaDB for a website just last year, after noticing the database was silently corrupting data. What made the transition much less painful was noticing that when strict sql mode is disabled, MariaDB produces a warning for the queries that corrupted the data, but you must run SHOW WARNINGS; to see the warnings after running such a query. Knowing that, I made the website run SHOW WARNINGS; after every database query and log if a warning occurred. This allowed me to deal with most of the warnings before they were promoted to errors by enabling strict sql mode.
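
For anyone wanting to do the same, it looks roughly like this (hypothetical table; the exact warning text differs between versions):

    CREATE TABLE posts (body TEXT CHARACTER SET utf8);   -- the 3-byte "utf8"

    SET SESSION sql_mode = '';                            -- legacy, non-strict behavior
    INSERT INTO posts VALUES ('before 😀 after');          -- only 'before ' is stored

    SHOW WARNINGS;
    -- Warning 1366: Incorrect string value: '\xF0\x9F\x98\x80 af...' for column 'body' at row 1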


> and moving utf8 to utf8mb4 a bit more quickly. (good to know that's finally happening).

That happened in 2010 - I'm not sure 12 years counts as "finally happening", or just how much quicker than 12 years ago you would have liked?

Other than a time machine, I'm not sure what else you would like them to do.


I mean the announced plan to make the "utf8" default be utf8mb4 instead of utf8mb3.

See https://news.ycombinator.com/item?id=29910023


They should have made "utf8" have UTF-8 semantics by default as soon as the defect was discovered, not twenty plus years later.


They couldn't. The storage size is different internally. The article covers this in the section on migrating from utf8 to utf8mb4.

Do you actually use MySQL or are you just complaining from a theoretical POV?


Pretty sure the name "utf8" isn't metaphysically wired to a defective physical representation. All they have to do is wait until the next minor release, add a compatibility name, make the generic name refer to a non-defective representation, and update the documentation and release notes accordingly.

If they were really conservative, they could wait until the next major release to repoint the name like that. Haven't there been four major releases since 2003?


This would be problematic for existing applications. Even if you leave existing binary data as-is (in existing tables utf8 becomes utf8mb3), changing the meaning of "utf8" in only new tables is still very problematic. Think about logical dumps, e.g. CREATE TABLE statements which use the "utf8" alias. For example, any self-hosted application which runs a .sql file in its install process. Or spinning up a dev environment for a new hire. Basically, anything that contains a CREATE TABLE with the utf8 alias now has a landmine:

In older versions of MySQL, InnoDB indexes were limited to 767 bytes. Meanwhile, VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case; but with utf8mb4 ("proper" utf8), it's now 4N. This meant that VARCHAR(255) CHARACTER SET utf8mb4 could not be indexed (in its entirety) in these older versions; the CREATE TABLE would fail if you try.
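
In other words, something like this is harmless under utf8mb3 but blows up under utf8mb4 on those older versions (a sketch; exact behavior also depends on innodb_large_prefix and innodb_strict_mode):

    CREATE TABLE t (
      name VARCHAR(255) CHARACTER SET utf8mb4,
      KEY (name)             -- 255 chars * 4 bytes = 1020 bytes, over the 767-byte limit
    ) ENGINE=InnoDB;
    -- ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

    -- The same definition with utf8mb3 fits: 255 * 3 = 765 bytes <= 767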

When you have a massive installed base of users, and those users have lots of dumps / install scripts containing "VARCHAR(255) CHARACTER SET utf8", you simply cannot change the meaning of "utf8" in a minor release.

I fully agree that MySQL could have made this transition faster/earlier than they did, but I disagree that it could be done trivially or terribly quickly.


Advertising the ability to store UTF-8 characters when you can do no such thing is a serious problem. Obviously you have to do whatever it takes to fix it, including fixing a deficient indexing implementation if necessary.

I believe the limitation you mention was fixed in MySQL 5.7, which would have been a reasonable opportunity to make "utf8" live up to its name.


Keep in mind that prior to 2010, MySQL and InnoDB were owned by different companies. Unilaterally deciding to increase the InnoDB index size limitation to permit utf8mb4 varchar(255) was not physically possible for the MySQL team prior to that point.

Anyway, I fully agree that the utf8 alias swap should have been done in MySQL 5.7, or perhaps even 5.6. I'm just disagreeing with your assertions that this would have been a trivial change.

Nor do I think it should have been a "stop the world" high-priority change prior to emoji existing in unicode (~late 2010). Non-BMP characters were very rare before that point. Hard to justify a business need for a complex migration just to store archaic hieroglyphs. For another point of comparison, MS SQL Server only added support for non-BMP characters in 2012, and only added support for utf8 encoding in 2019!


It kind-of is "wired" to a "physical" representation, but in a different sense.

https://blog.koehntopp.info/2022/01/12/utf8mb4.html#so-what-...


It's less a question of fault and more a question of whether developers need to be aware and look out for those gotchas. A security problem doesn't stop being a problem because your hosting provider did something vs. MySQL doing it by default.


Lots of people turn it off because Warnings are easier to ignore than Errors are


I do find this behaviour rather incomprehensible actually. I understand that if the "utf8" is BMP it can't have invalid characters in it. What puzzles me is that the approach they went for seems to be almost the worst possible one from a security and data integrity perspective.

Assuming a SQL error can't be thrown for some legacy reason (and you'd think a flag to opt into that would be an option), why not replace the invalid parse with U+FFFD like most things do? Or at a bare minimum, only strip the invalid character which at least makes attacks a bit harder and results in a little less data loss.


Quietly corrupting data and shipping with the wrong default settings is what MySQL is all about! Keeps the consultants busy.


It’s also easy to imagine how it happens… const char *buffer = malloc(…; memcpy;… etc… pretty fast pretty straight forward


Just wanted to reply to myself that I just learned from evanelias' comment a "strict mode" flag is indeed an option. I suppose it's time to audit all my mysqls to see which ones have this enabled, and what other mysql-isms this flag might break.

I know it's definitely not enabled on one updated and maintained BSD - this could be because they didn't want to break existing systems during upgrade.

I suspect this problem is still extremely common. But good to know there's some form of defense.


Another fun one that came up at my first job: If you run up against the size limit of a text column in MySQL, it silently truncates.

This came up when a developer used YAML for some configuration file. There was all sorts of confounding behaviour because YAML is still valid when truncated.

Every database has limits, but at least alert the user! I’d rather a query fail than alter data.

Edit:

This is something that might be useful in CS classes, to reflect what being a real-world engineer can often entail. Set an assignment with some silly nonsense like this and students have to debug it.


There's a setting for this, but like many settings MySQL has (along with InnoDB, the storage engine) historically chosen exceptionally bad defaults. It's possible to tune the thing to be screaming fast and pretty robust, but it takes a lot of expertise and TBH these days most people would be much better off going either NoSQL or Postgres instead.


Why were you putting YAML in a relational database?


iirc it was to allow customization of an e-commerce item. Basically a column as a document (in the sense of a document database) with custom attributes that could be…large.

It wasn’t my implementation. This was another team


It's really weird that they've chosen YAML for it, since its sole (questionable) benefit over the alternatives is that it's geared more towards being edited directly by humans... which is generally not something that you need to do with a database field. JSON is more usual for scenarios like this, not the least because most databases actually provide tools to parse and query it.


It's pretty weird to put any configuration-languages in a relational database in general. Large variable-length objects is not what relational databases are good at. You're mostly getting the drawbacks, and very few of the benefits in that use case. Even a filesystem is probably more suitable.


Drupal seems to keep a large amount of PHP serialisation of configuration in its database (a JSONish thing), probably because it's the standard data store for the site. Noticed this in fixing a dump one time. Got to use vim's function syntax for regex to ensure the character length count of the string matched the value of the new string.

I think virtually everything generically configurable in the UI that isn't part of a small core with dedicated fields (or in the sitewide PHP config include) is handled that way.


Most databases have some kind of special support for XML and/or JSON these days. And sometimes you just need to associate structured hierarchical data with each record without knowing the precise schema in advance. If it's non-hierarchical, you can always use a separate key/value table; but tree queries are not exactly convenient (or fast) in SQL.


Postgres is really happy with xml or json, as you please, and saves you from the complexity of having another data store.


Maybe also do a checksum or something, seriously it could just be CRC32. I get that it's not something you instinctively think to do. For configurations that are most likely parsed directly by something else though, it may be worth checking that nothing broke in transit.


If the point of CS education is to produce programmers and software engineers as opposed to computer scientists, yes.


What percentage of CS graduates go on to become computer scientists? Seems like the vast majority of CS degree holders aren't practicing "pure" computer science, as you would seemingly hope.


Probably a very low amount. When I was studying I participated as a student representative at the CS programme I was attending and this was something that was actively being addressed.

The problem was multi-pronged. The courses had become increasingly geared toward industry, and students that had an interest weren't really fit for continuing their studies.

In the end, the longer 5 year program (MSE) was allowed to remain more vocational while those seeking a bachelor's were brought along a much more theoretical path so they could then do a master's followed by a PhD.


It's like MySQL and PHP were meant for each other.


How dare you. Also accurate.


PHP has improved a lot.


yeah... it also has a strict mode... that is off by default.


The day I realized that is the day I stopped using MySQL. It was a very long time ago; I was burned by this exact issue. I don't know what's more fundamental to a database than storing the data you ask it to. If it's storing something else, then it failed at its job spectacularly. At the time I experienced this bug I think it didn't even give a warning, but I think it's actually worse with one: it knows it's storing incorrect data but does so anyway.

And it still has not been fixed after all those years. I don't know what use a character encoding is that partially supports Unicode (you just have to try to find out in what way) and is used by nobody else. I don't really care that at some point in the future utf8 is going to become an alias for utf8mb4; a DB capable of this behaviour cannot be trusted, and it allowed me to discover Postgres, which seems superior in just about every way, so I don't need to go back.


Any data in your database that can come from external input should be treated as untrusted and validated before it's used. Otherwise validation bugs or bypasses will result in bad data and exploits that persist beyond the fix. Edit: I’m not arguing against the need to use utf8mb4


While that’s true, I trust PostgreSQL to store exactly what I’ve asked it to store. At some point, you have to trust something to do its job, or else everything built on top of it is a castle of sand.

Imagine a bug like this in ext4. No one would reasonably contend that the layers on top of it should be validating that the files you write out are the ones you’ll read back in. We write unit tests for all kinds of stuff, but we’re not that thorough.


Agreed. The confusing part here as I see it is where validation layer A (correctly) asserts the data is valid UTF-8 and safe then assumes the database persists what it passes to it, since no error is reported.

Then, subsystem B trusts reading the database field (since it passed validation layer A).

Obviously more validation layers can be added, but at this point validation layer C called by subsystem B needs to know what the initial input from layer A is in order to differentiate it from the db value which was manipulated - a rather tricky thing to do sometimes. (I guess you could add a hash to the db to check the db is storing your strings, but really.. come on)

Upgrading to utf8mb4 is probably safer than hoping enough validation layers thrown at it solves the problem.


Yes. I’ve been burned by this exact issue in the past, not realizing that utf8 was not really utf8 in mysql-land. That is a major, major WTF IMO.

But I've also seen people relying on validation at the time of insertion so many times that I wanted to warn against that, too. Not arguing against the need for utf8mb4.


Sure, but the issue here is that the default mysql "utf8" encoding is not actually utf8. You can write as many validation layers as you want, but if they are assuming that utf8 actually means utf8, they won't help, and mysql will potentially screw it up when it gets stored.


I need to share something as literally today I fixed a bug in our project that's somewhat related.

MS SQL encodes everything with UTF-16. Emojis (code points) require up to 4 bytes. If it's a grapheme (emoji constructed from emojis), it will be even more.

We are using Django. If you check the length of an emoji, it will give you `1`, and Django assumes utf8 everywhere. If you try to save it to PostgreSQL with char field `max_length=1` it will work just fine, but on MS SQL it will fail as it requires 2 characters (4 bytes) in the DB!

I tried it with MS SQL 2017, 2019 with different collations on nvarchar and I'm pretty sure there's no way around it.

> Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and NVARCHAR(n), the n defines the byte storage size, not the number of characters that can be stored, it's important to determine the data type size you must convert to, in order to avoid data truncation.

https://docs.microsoft.com/en-US/sql/relational-databases/co...
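
A minimal T-SQL sketch of the mismatch (made-up table/column names), since n is a storage size rather than a count of user-visible characters:

    CREATE TABLE user_profile (flair NVARCHAR(1));

    INSERT INTO user_profile VALUES (N'😀');
    -- Fails: the emoji is a surrogate pair (two UTF-16 code units), so it
    -- doesn't fit in NVARCHAR(1) even though Python's len() reports 1.

    ALTER TABLE user_profile ALTER COLUMN flair NVARCHAR(2);  -- now the insert works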


Generically, anything by Microsoft will historically have used UCS-2, and will use UTF-16 these days, so this is utterly unsurprising to me as an experienced Windows dev. Conversely, Linux (and POSIX, more generally) deciding that filename encoding is a per-filename and untracked thing is a bit lit, from my perspective. Point being: when it comes to handling unicode and foreign characters, just, like...always read the documentation. Assume nothing.


Yep. Microsoft is the main reason for 2 of my favourite unicode links. https://utf8everywhere.org/ and https://simonsapin.github.io/wtf-8/

And, apparently it's mildly inaccurate to say it uses UTF-16... it's more like UCS-2 with UTF-16 hacked in, with no validation. Thus WTF-8.


No, it's UTF-16 with no validation at the kernel level. Invalid UTF-16 is also invalid UCS-2 as those code points were explicitly barred from use.

In practice, only malware will create such broken names. High level software (e.g. Microsoft's own VSCode) will not handle broken UTF-16. And indeed the in-built UTF-8 code page will lossily decode UTF-16 (unpaired surrogates are replaced with the Unicode replacement character).


Hm. Are you sure? Because the utf8everywhere article (and various microsoft related framework discussions) seem to suggest there's no validation anywhere. You can easily create partial codepoints and just hitting backspace in a text field can do it. That seems to imply there's no UTF-16 validation even at a higher level.

But I will readily defer to your expertise on this. I've not coded in microsoft land for like 18 years. MFC was my last experience in this, where I still have this vague memory of being shocked by an API returning an int32 and instructing on casting to a void pointer (overloaded response message). No wonder they had issues with 64 bit migration at the time.

Edit: cite on the utf8everywhere thing. "in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16. On Windows 7, the console displays such characters as two invalid characters, regardless of the font being used."

Maybe they've improved since though. But surely there's a lot of that baggage in the libraries.


I mean, Vista is ~15 years old at this point. If anything that's still part of Windows makes the backspace mistake then if nothing else it's impressive it's survived this long without being noticed.


Windows Vista came out fifteen years ago.


With SQL Server 2019, you can write UTF-8 data to `VARCHAR` and `CHAR` fields, if the table collation is `LATIN1_GENERAL_100_CI_AS_SC_UTF8`.
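
A minimal sketch of that (made-up names, needs SQL Server 2019+); note this changes the storage encoding, not the fact that n is still a byte count:

    CREATE TABLE notes (
      body VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    );

    INSERT INTO notes VALUES (N'naïve 😀');   -- stored as UTF-8 in a plain VARCHAR column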

I feel the pain of UTF-16. When I was writing the Rust crate for the TDS protocol, all N-columns for strings always require you to do a full copy from `Vec<u16>` to `Vec<u8>` and back.


I used this collation on 2019 when testing and the only thing that has changed was the error message (more verbose).


This kind of bug gives me nightmares. I'm afraid to ask how long it took you to solve.


2 days to identify and understand the problem because I wasn't able to reproduce it at all, only on prod. Then ~1 day to fix, but I decided to handle not emojis per se but Unicode characters in general (which makes more sense but requires more knowledge), which was finished within the next 2 days. In the meantime I set up a test project with PostgreSQL, 3 MySQL databases and 2 MS SQL Servers so all operations were executed on all 6 at once.

The worst thing is Django won't raise an exception on max_length exceeded even with MS SQL because on the application level its length is 1, so only the DB (DataError on constraint created by Django) will complain.


Note:

> The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. utf8 is currently an alias for utf8mb3, but it is now deprecated as such, and utf8 is expected subsequently to become a reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is also displayed in place of utf8 in columns of Information Schema tables, and in the output of SQL SHOW statements.

* https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...
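
For anyone sitting on old tables, the per-table migration looks roughly like this (hypothetical table name; on older versions check index-length limits and row size first, and make sure the connection charset matches too):

    ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- And on the client side:
    SET NAMES utf8mb4;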


Maybe they should rename/alias it to brokenutf8, to make it more likely that people will notice a problem.


That's not needed, what is needed is what they said they would do:

> and utf8 is expected subsequently to become a reference to utf8mb4

Once utf8 points to the right thing, the mistake will stop happening as no one is explicitly using utf8mb3 by accident.

Not sure why this wasn't done right away, and only will be "subsequently". Has it been done yet?


Hmm, I wonder if they'll deprecate "LATIN1" as an alias for "CP1252 + 8 arbitrary characters" (a charset that literally only exists in MySQL) as well.


I really wish they'd do this in a new major version release. But they don't seem to be keen on releasing 9.0.


Fun story: When I was working at one of the FAANG companies, I placed the (Night with stars, it does not seem to render here [0]) character in my preferred name. Some time later, I was notified by HR that I must change my preferred name and could only use UTF-8 characters with a maximum length of 3 bytes. I was quite confused by such a specific demand; I am pretty sure it was exactly this issue. I replaced it with ᗧ···ᗣ···

[0] https://emojipedia.org/night-with-stars/


Are you called Bobby Tables by any chance?


Fantastic response


Is Medium secretly mining crypto in the background? I have no idea what it is but their site makes my CPU spike and my fans turn on.

https://jdon.at/JTHj2G

Update - here's a video, it's making hundreds of graphql requests a minute…

https://jdon.at/z05ImC


In the past, Unicode was assumed to be 64k of codepoints, so a 3-byte UTF-8 sequence was considered "long enough", especially since there were surrogate pairs for the rare cases where you have to encode higher code points.

Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.

Yes, it could have been avoided if they had allowed arbitrary-length UTF-8 sequences from the beginning, but I can see that they probably just wanted to optimize a bit.

What I don't understand is why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8... (unless they always used 3 bytes for every character, which would be stupid as UTF-8 has been explicitly designed for variable-length encodings)
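
To put concrete numbers on the byte counts, a quick MySQL check (assuming the connection charset is utf8mb4):

    SELECT LENGTH('A'), LENGTH('€'), LENGTH('😀');                 -- 1, 3, 4 bytes
    SELECT CHAR_LENGTH('A'), CHAR_LENGTH('€'), CHAR_LENGTH('😀');  -- 1, 1, 1 characters
    -- Only the last one needs the 4th byte that the old "utf8" (utf8mb3) refuses to store.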

Bonus: Many filesystems also do not allow 4+ byte UTF 8 code points in filenames. Test your company's file server to see if it allows you to save a file as "(some random emoji).doc". A few very expensive storage systems also have problems with that (and they have the same workaround: convert your filesystem to a different encoding, instead of simply extending the existing encoding to allow 4+ bytes)


While non-BMP Unicode has only become popular somewhat recently due to emoji, it's been part of the relevant official specs for a quarter century.

I know you were kinda joking by using scare quotes for "recently", but let me fill that in for people who might be less familiar with the history. Unicode broke out of the 16-bit limit as of version 2.0, released in July 1996, so yeah, a quarter century. The first RFC for UTF-8 (RFC 2044, Oct 1996) explicitly supports UCS-4. As far as I know, there has never been a time where UTF-8 restricted to the BMP (ie 3 bytes) has been in widespread use. In other words, I am questioning your assertion that 3-byte UTF-8 has ever been considered "long enough," except perhaps in sloppy implementations made without much care for following specs.


Indeed. It's somewhere between sad and amusing for those of us who care about i18n that encoding issues in various software that we'd been pointing out for the better part of the last 20+ years suddenly became no big deal to fix in the last ~5 years, as 4-byte Unicode emojis got widely deployed for users whose languages would otherwise fit in US-ASCII or Latin-1.

It's also a lesson for future generations. If you want an encoding specification to be fully implemented make sure that doing so is synonymous with the ability to display something like dog poop emoji. Clearly "this allows millions of people to read their native language" was too boring of a reason to care.


Now if we could just come up with an application of accessibility APIs that appeals to the average non-disabled user, maybe we could indirectly advance the state of accessibility as well.


> except perhaps in sloppy implementations made without much care for following specs.

Describes MySQL perfectly ;)


Almost everything you claim here is wrong. UTF-8 has never been a 3-byte encoding: the spec initially specified [1] an up-to-six-byte encoding after Unicode had already gone past 16 bits [2], and it was then reduced to 4 bytes by RFC 3629 [3] in 2003. MySQL is the only piece of software that I know of where they invented their own length. (Though a lot of software often failed to validate it at all.)

There's nothing recent about it - it was 22 years ago! There's no significant optimisation advantage in the different length, and you can see from the patch that reduced the max-length [4] that it wasn't about optimisation. I don't think you can name a single file system that restricted UTF-8 to three bytes.

[1] https://datatracker.ietf.org/doc/html/rfc2044 [2] https://unicode.org/faq/utf_bom.html [3] https://datatracker.ietf.org/doc/html/rfc3629 [4] https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...


It is you who does not know what he's talking about:

> UTF-8 has never been a 3 byte encoding

I never claimed that

> There's nothing recent about it

The non-BMP characters are "recent" because 10 years ago the non-BMP was not allocated except for some small areas. Also I said it "became popular recently", due to emoji. Before that, non-BMP codepoints were rarely used

> I don't think you can name a single file system that restricted UTF-8 to three bytes.

WAFL[1] (unless you "format" it as utf8mb4, which was only implemented a few years ago...)

[1] https://docs.netapp.com/ontap-9/topic/com.netapp.doc.cdot-fa...


Okay, perhaps a more detailed breakdown of the false claims will help, so that nobody is misled by revisionist/apologist history.

> Unicode was assumed to be 64k of codepoints

Not in 2002, when MySQL restricted their utf8 to three bytes [1]. Before 1997, Unicode specified clearly that 21 bits was the limit [2]. By 2002, there were 94,205 characters, including CJK characters beyond the 16 bit range, and clearly more to come. [3]

> so a 3-byte UTF-8 sequence was considered "long enough"

Not by many. The MySQL developers chose very badly, here. I and plenty of other developers managed to implement UTF-8 more correctly around that time. It wasn't hard, as the specs are very straightforward.

> especially since there were surrogate pairs for the rare cases where you have to encode higher code points

Surrogate pairs have never been supported in UTF-8. The RFCs are explicit about that. [4] [5] (search for D800)

Maybe you're thinking of CESU-8, though that's not intended for interchange. [6]

> Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.

Not supporting Unicode properly has always been problematic; it's just that bug reports from affected users rarely reached the right people. Emojis have done the world a favour in making less competent developers actually notice their bugs in basic text handling.

> Yes, it could have been avoided

And was, by most developers.

> they probably just wanted to optimize a bit.

They apparently altered a config number [1], so it wasn't an optimization decision; the code at the time still had support for 6-byte utf8 [7]. I would guess that they found a bug in their support for longer utf-8 sequences/conversion and took the hacky way out.

[1] https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...

[2] https://unicode.org/faq/utf_bom.html

[3] https://en.wikibooks.org/wiki/Unicode/Versions

[4] https://datatracker.ietf.org/doc/html/rfc2044

[5] https://datatracker.ietf.org/doc/html/rfc3629

[6] https://www.unicode.org/reports/tr26/tr26-4.html

[7] https://github.com/mysql/mysql-server/blob/43a506c0ced0e6ea1...


> Many filesystems also do not allow 4+ byte UTF 8 code points in filenames.

I can't even think of a single example. Most filesystems just offer 255 8-bit units to filenames, where 4-byte UTF-8 sequences are totally a non-issue. ZFS supports utf8only=on, which enforces that filenames do conform to UTF-8, in which case... 4-byte sequences are still not a problem.

What's the filesystem that doesn't allow it?


Apple filesystems are infamous for breaking this convention. IIUC, they will support codepoints which expand to 4 bytes in UTF-8, but their implicit normalization rules can trip up programs which expect behavior closer to POSIX norms.


They do decomposed normal form, which is a little funky, but that does not mean they don't allow codepoints outside the BMP.


  > their implicit normalization rules can trip up programs which expect behavior closer to POSIX norms
does POSIX define any uniform normalization rules?


A POSIX filename is a sequence of up to 255 bytes excluding NULL and /

A reasonable person[1] might argue that this is too liberal. IMO the exclusion list should be larger, but silently and automatically rewriting an application's filename into something else at creation time is too far.

[1]: https://dwheeler.com/essays/fixing-unix-linux-filenames.html


No, but I read "POSIX norms" as what commonly happens on Unix systems: the filenames you get back are what you put into it.

Most of them don't concern over character encoding and UTF-8 sequences are just as-is. ZFS is again the special case, the normalization property will normalize names on lookup (they are stored as-is).


It's way more than emojis: people copy-pasting from Excel will inadvertently add utf8mb4 characters to your database. This is handled extremely poorly by MySQL and results in ugly characters being displayed.


It's not backward-compatible if someone relies on the errors. I'm guessing it's a case of spacebar heating.


For those unfamiliar with the spacebar heating reference: https://xkcd.com/1172/


> why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8

Because CHAR and VARCHAR columns have max length specified, and this translates to the corresponding amount being reserved in storage. For a variable-length encoding, this is normally computed assuming the largest possible value for every codepoint.


And here's the commit that changed it: https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e.... It was originally set to max 6 bytes but for some reason someone dropped it to 3.

They also seem to be slowly phasing it out. Internally it's utf8mb3 and utf8 is just an alias. The idea is to eventually make utf8 an alias to utf8mb4.


Back then MySQL used the memory engine for implicit temporary tables, and memory engine has no variable width data types. So a "varchar" becomes a "char", and a "varchar(255) charset utf8" becomes an allocation of six times 255 = 1530 bytes (filled with a lot of padding).

Reducing this to 3 bytes is still bad, but only half as bad as before.

Only MySQL 8 introduced innodb typed temporary tables (unlogged, even), so with MySQL 8 this is no longer a problem.


That was set up when Microsoft and Java had standardized on UTF-16. So this can represent the UTF-16 subset of Unicode, which is Unicode Plane 0, the Basic Multilingual Plane (BMP). The higher-numbered "astral planes" of UTF-8 were rarely used. All modern languages with a significant user base are covered in Plane 0. Plane 1, the Supplementary Multilingual Plane, with Cretan Linear B, Egyptian hieroglyphics, and such, was seldom needed. Few people had fonts for those, anyway.

Because of the way UTF-8 is encoded, it takes 3 bytes to represent the UTF-16 set. That's because it only takes one byte for ASCII characters. Hence, 3-byte MySQL data.

Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.


A UTF-16 "subset" of Unicode doesn't exist. All UTF-* encodings are just that: encodings. They can all represent the entire Unicode character set. (Except for UTF-7, which never was an official standard.)

What you're referring to is UCS-2 [0], which is UTF-16 without support for the high surrogates that make up the rest of the Unicode character set. But to imply that UTF-16 is a subset of Unicode is just not true.

[0] https://en.wikipedia.org/wiki/UCS-2


That they can all represent all of unicode is true, but they're not "just" encodings either. The current code space of unicode is built around UTF-16. That's why there are 17 planes.


And why surrogates are a thing, and why code points and scalar values are different concepts.

I hate UTF-16. It’s the worst thing that ever happened to Unicode. It ruined Unicode.


Code points, scalar values, glyphs, surrogates, encodings

Unicode should be called multicode really. If we want something "uni-" then all of them should be unified into one encoding. Hopefully UTF-8 does the job.


There's so much wrong with this answer. It completely confuses Unicode, with Unicode encodings, and with non-Unicode encodings.

> So this can represent the UTF-16 subset of Unicode

There is no UTF-16 subset of Unicode. UTF-16 has surrogate pairs, it can represent all of Unicode. You are talking about UCS-2, a format which has no surrogate pairs.

> That was set up when Microsoft and Java had standardized on UTF-16

Java standardized on UCS-2 initially (which is what you're talking about). In the mid 2000s they moved to actual UTF-16.

> The higher-numbered "astral planes" of UTF-8

"astral planes" are not a UTF-8 concept. They're a Unicode concept. A Unicode code point has 6 hex digits. The plane is just the first 2 of those 6 digits. So code points 0000– FFFF are in the BMP (Basic Multilingual Plane), etc. People nickname planes above the 00 plane, BMP, astral planes.

> Plane 1 ... was seldom needed. Few people had fonts for those, anyway.

So "rarely", that the majority of humanity writes in languages that aren't in BMP. CJK (Chinese, Japanese, Korean and sometimes Vietnamese) languages need characters outside of BMP.

> Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.

That's totally untrue. The demand always existed by the majority of humanity, you just never cared about this massive problem until it impacted you.


> That's totally untrue. The demand always existed by the majority of humanity, you just never cared about this massive problem until it impacted you.

I think you swung so far in the other direction you also landed in untrue territory. Originally the intent was to encode all characters in "modern use" and 16 bits was probably enough to do that for the entire world.

Also CJK is about one quarter of the world population.


"originally" is 20 years ago. People realized this mistake almost immediately before fixing it in Unicode 2.0. The mistake existed for something like 5 years last millennium.

There's no excuse for not taking seriously a large part of the world's communication needs in 2022 as something esoteric that no one but people in 5000BC needed.


UTF-16 can represent the full range of Unicode codepoints by using pairs of surrogates


Technical amendment: UTF-16 can represent the full range of Unicode scalar values with surrogate pairs. Code points includes the surrogates U+D800–U+DFFF, scalar values don’t. Like all other Unicode encodings, UTF-16 cannot represent surrogates.

That’s where the real problem lies: almost nothing that uses UTF-16 actually uses UTF-16, but rather potentially ill-formed UTF-16.


You're right. Replace UTF-16 with UCS-2 and the comment sounds at least slightly more correct.


Sort of. Applications using UTF-16 have to be aware of pairs at the application level. Many are not.


This isn't a consequence of using UTF-16 as such - Java, .NET etc could totally have an API around UTF-16 strings that handles surrogate pairs. The problem, rather, is that those languages introduced a 16-bit type that they called "character", even though it wasn't even a Unicode codepoint. And then used that type throughout all string APIs, including strings themselves (indexing etc).

In .NET land you're now supposed to use https://docs.microsoft.com/en-us/dotnet/api/system.text.rune instead. It transparently handles surrogate pairs, so the app needn't be aware of anything - and yet the internal encoding is still UTF-16.


This might be a tad unrelated to the original post, but the following article is one of my favorite primers on character sets/Unicode:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


Excellent article; even though I already knew most of it, given it was written in 2003 it must have been one of the most insightful and enlightening articles written.

Wonder if you have any more links like this? Please do share!


A "hello devops" article from Jan 3, 2022 discussing this topic was posted by /u/flokoe a week ago:

https://www.hellodevops.blog/posts/database-character-sets-a...

And the ensuing discussion:

https://news.ycombinator.com/item?id=29793916


The author mentioned MariaDB in the intro and then went on about MySQL; not sure which he actually used.

I think the latest recommendation for MySQL was utf8mb4_0900_ai_ci and utf8mb4_general_ci for MariaDB.

While I agree about the Postgres recommendation, it's a non-starter in the PHP world, unless doing something from scratch. Even then, the support for MySQL/MariaDB is much better.


What's that mean, non-starter in the PHP world? Postgres is well supported by frameworks and libraries in PHP, not to mention PHP itself.


For one, it's out of reach for lots of folks who use shared hosting services, where there's a managed LAMP stack. It's rare that they have Postgres. If you use AWS, of course, everything is up to you. There's also a difference in connection pooling and startup between MySQL/MariaDB and Postgres: the PHP engine generates the page and then closes itself and its connections, while Postgres assumes you are more persistent. Also, there's a difference in how GROUP BY works in MySQL/MariaDB vs Postgres, and wrapping your head around it, after using it the MySQL way for years, can take time. So not a drop-in replacement.


ONLY_FULL_GROUP_BY is the default in MySQL since 5.7.5.


Has anyone ever successfully converted a large legacy Apache/PHP/MySQL site to UTF-8? If so you deserve an award. utf8mb4 is just one issue and not the worst.


https://github.com/s9y/Serendipity/blob/05f58f90d743fe9ade24... is just the detection function I wrote for a PHP blog engine to detect whether it would even be possible to use utf8mb4 on the system. We completely ran into this issue and didn't know how to handle it for literally years. Lots of blogs to migrate, and who knows whether it worked for all of them...


The craziest issue I had was that I couldn't predict what character encoding the text in my database was in. Most users entered Windows-1252, some text blobs were UTF-16, others were in European character sets, and some were UTF-8. Some were Japanese Shift JIS. Don't ask me how any of this happened. In retrospect, I should have dumped all the tables from MySQL and used the excellent Python chardet [1] library to see what I was dealing with, done the conversions, and then re-imported the data. But then someone could copy UTF-16 from a Windows document and paste it in, so you have to convert going into the database anyway.

You have to set Apache to UTF-8, PHP to UTF-8, MySQL to UTF-8, and the MySQL driver you are using to UTF-8. It's not clear how these settings interact. Are there silent conversions happening, or do you always have to detect the encoding on data coming from the server? HTML pages have a character encoding specifier, but the BOM at the start of the file takes precedence (I think). I got it to work by always detecting the encoding of any text coming from the database and converting with iconv, but this turned out to be really slow and unreliable. It was truly a bigger mess, by an order of magnitude, than any other programming problem I've faced in my career.

Would not attempt again.

[1] https://github.com/chardet/chardet


Technical lead for vanillaforums. It’s a huge PHP MySQL app. Probably 10b+ rows in production DBs. We did the migration in 2017, if I remember correctly.


Not to take anything away, but by legacy I meant late-last-century sites. VanillaForums was released in July 2006 according to Wikipedia, which makes it Web 2.0. Wikipedia also says "UTF-8 has been the most common encoding for the World Wide Web since 2008". I'm glad you posted, though, because now I know that sites from that time frame onward are OK; it's the older ones that were never touched again that are the problem.


Yes. I had the exact same issue with '<' in HTML content stored in the database -- cms pages and also some JS that was stored in the DB and injected into the HEAD of all pages.


At least once, but it's been a while. From the top of my head:

- Set the encoding in the HTML head

- Also on any HTML form element

- Set the database collation correctly (I believe it also needs to be done per table)

- The PHP -> MySQL connection itself may need to have the encoding set

- Do all string manipulation with the mb_ functions

- Even ensure your code files themselves are saved as UTF-8
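A minimal SQL-side sketch of a couple of those steps (the database name is a placeholder; the PHP-side header and mb_ changes still have to happen separately):

    -- default charset/collation for new tables in this schema
    ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- the connection itself also has to talk utf8mb4
    SET NAMES utf8mb4;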


Been there, done that, with a phpbb installation. I manually converted all the non-UTF data in the database to UTF8, and updated the PHP code and database settings.

It worked fine, no major issues, just some work.


Sure, you just prepend mb_ to all php string calls >:)


Yes, but I have chosen utf8_bin as encoding and lived happily with this choice for a dozen years.


Wow! I was working on this issue in our DBMS product today!

Fun suggestion: try making a JSON string with a NUL character somewhere in the middle. It will be encoded as \u0000, which is valid UTF-8 (a single zero byte), but most C-based systems will truncate the string because they compute its length with strlen.

The Java community and some other software vendors designed Modified UTF-8, which replaces the zero byte with an overlong two-byte sequence. Sleek, aside from the fact that you are modifying data the customer wants to stay consistent.

Postgres explicitly bans such cases in VARCHAR; not sure whether it can fit in their JSON columns. Has anyone tried?


Null-bytes are not valid characters in JSON strings. Nor any other control characters for that matter


They are valid if escaped, as explicitly noted in the Section 7 of RFC 7159 [1]. (Annoyingly enough it doesn't explicitly say JSON strings are Unicode strings, it just says that a certain subset of JSON strings is interoperable with Unicode.) GP means that the escaped null byte can still cause issues for C interoperability.

[1] https://datatracker.ietf.org/doc/html/rfc7159#section-7


Exactly, null bytes aren't allowed.

> All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped in this way

> Any character may be escaped.

I personally find the JSON spec very explicit while succinct


Oh, actually I made a mistake in the GP. The following sentence:

> Annoyingly enough it doesn't explicitly say JSON strings are Unicode strings, [...]

...is false; I completely missed the very first section (I obviously searched for "Unicode", but failed to thoroughly check the results). I have other valid criticisms of the JSON specification, but that is not one of them, so please ignore that part of the rant since it was based on a wrong assumption. Thank you for (implicitly) pointing it out.


no worries, I recently spent a fair bit of time parsing JSON with sed so the spec is fresh in mind


The plain utf8 covers the "basic multilingual plane" (x0000-xFFFF), so it will get you very far, actually.

In our app, we finally went for utf8mb4 to allow people to enter emoji. As a side "bonus" you will also soon see some clever people entering their names to stand out, such as "𝙹𝚘𝚑𝚗". Note that this is not "John". It is actually a series of mathematical symbols:

  U+1D679   MATHEMATICAL MONOSPACE CAPITAL J
  U+1D698   MATHEMATICAL MONOSPACE SMALL O
  U+1D691   MATHEMATICAL MONOSPACE SMALL H
  U+1D697   MATHEMATICAL MONOSPACE SMALL N


One fun thing you can do in MariaDB is define a collation based off of the Unicode confusables list. The result of this is that select * from usernames where user = 'John' would still work :)

It also avoids registering both 'John' and 'John'. (changed the o)

Hedgewars user registration does this.
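Once such a collation exists, usage would look something like this (utf8mb4_confusables_ci is a made-up name here; as I understand it, MariaDB lets you define your own UCA collations via LDML rules in its charset config):

    -- sketch only; the collation name below is hypothetical and user-defined
    CREATE TABLE usernames (
        user VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_confusables_ci,
        UNIQUE KEY (user)
    );

    -- also matches look-alike spoofs of 'John'
    SELECT * FROM usernames WHERE user = 'John';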


That's a great feature. I was running a community site once where impersonation became a popular game, so we had dozens of users registering accounts with Cyrillic i, o, a, e, etc.


For looking up relevant further material, this is called a 'homograph' or 'homoglyph attack'.

It was also popular for site spoofing: sending links with these Cyrillic characters to lead people to a fake bank or search-engine page. Once the trick became widespread, the owners of the tech stack (browsers, registrars, DNS operators, etc.) mostly shut down this attack vector.


This is unfortunate, as it makes things unreadable for screen readers. I have seen many tweets from companies and organisations that use these characters as a stand-in for fancy "fonts" or to create emphasis effects in the text, rendering them inaccessible in the process.

In my own service I plan to offer users the ability to define a "stylised" nickname and an alternative text for screen readers, should they be interested in it (like an alt text for images). If Twitter had such a feature, I'd use it because I really like the effects, even if they're cheesy. :)


Of course, nobody "refused" to fix a "bug". Instead, a non-conformant behavior was already relied upon by legacy systems out in the wild and the "fix" was added in a backwards-compatible way.

Edit: Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese, so I can only assume someone "smart" in the MySQL dev team decided to "save space" (before emoji were a thing).


That's a terrible excuse. MySQL should have fixed this in a major release. I had to work with a production system that had all kinds of issues because of this bug (engineers assumed, with good reason, that UTF-8 meant UTF-8, when it did not).

This kind of reasoning is how we end up with vulnerabilities like the recent one in Log4j. Just because a behavior made sense in the past, or an unfortunate bug made it into production, is no excuse to let it inflict damage in perpetuity.


I don't see how introducing a new major release would fix this? People would keep using the old version (because of the breaking change) for a while, and you might even end up in a python2/3 situation.


To the extent that it's "a python2/3 situation"... Isn't that exactly what it is now, too, then? If you have a big change to do and postpone it "because of the installed base", then that just makes it worse the longer you wait. (Unless of course you're counting on your user base to shrink over time.)


> Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese

With only 3 bytes you'll completely miss plane 2, the "Supplementary Ideographic Plane" which includes tons of Chinese-Japanese-Korean Han characters.

I wish people would stop saying the supplementary characters are just for "emoji". Han unification was very controversial and ultimately unsuccessful. Plane 1 and Plane 2 are important, especially if you're going to sell software or products in China or Japan, where they are mandated.


Even that doesn't make any sense, because refusing to encode characters that require four bytes doesn't save any space; it just makes it impossible to encode those characters. Nothing about the other encoding lengths changes.

The only thing I can figure is that something somewhere is using a 16-bit quantity for decoded codepoints. Four-byte encodings are for codepoints above FFFF. (Which I guess is still someone's idea of "saving space.")

Edit: Apparently the max encoding length used to be six bytes, so there's literally no plausible explanation for this that doesn't end with "thank god I stopped having to deal with MySQL over a decade ago."


Then they should have considered implementing utf-8mb3le surrogate pairs. What a missed opportunity!


> Three bytes are enough to fit nearly any of the chars in use in any language

This is not a fixed length encoding scheme, it is a subset of UTF-8, which is a variable length encoding designed so that the most common characters only take one byte of storage. As a consequence you do not get 2^24 possible characters in three bytes, you get much less than that (less than 2^16 actually), and the benefit is compatibility and compression.

To represent the full range, UTF-8 requires up to four bytes, even though 21 bits or three bytes would do for a fixed length encoding. UTF-8 is far more efficient in storage and transfer bandwidth than a fixed length encoding would be without further compression, to the point where using a fixed UCS-4/UTF-32 style encoding is an effective way to nearly quadruple the memory requirements of a large class of programs.


Couldn't emoji fit in 3 bytes as well? I don't think there are that many...


utf8 is an encoding for Unicode codepoints. Those codepoints are spread over a space that is extremely vast (up to ~4 billion), which can require up to 4 bytes to represent. It turns out emoji sit high enough in that space that even a single one requires the full 4 bytes to encode.


> Those codepoints are spread on a space that is extremely vast (up to ~4 billion)

You are off by a lot. The maximum code point is about 21 bits high (0x10FFFF). The space is only about 1.1 million codepoints large.


Related: time to explain UTF-8! (In case anybody is curious. I personally think it's extremely clever and worth understanding:)

Characters <128 are encoded with a single byte: 0xxxxxxx

Characters >128 are encoded with multiple bytes.

A two-byte character looks like:

110xxxxx 10xxxxxx (11 useful bits, representing code points 128-2047)

A three-byte character looks like:

1110xxxx 10xxxxxx 10xxxxxx (16 useful bits, representing code points 2048-65535)

A four-byte character looks like:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 useful bits, representing code points 65536-2097151)

Now, technically this scheme could expand to 6-byte characters without getting confused with things like the BOM; however, any code points larger than 2^21 wouldn't be representable in UTF-16, which has its own set of constraints. This means the Unicode Consortium has basically limited itself to two million or so possible code points, which is why UTF-8 doesn't need to go beyond 4 bytes. (I wonder if a future Unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)
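You can also watch the variable-length encoding happen from inside MySQL (assuming a utf8mb4 connection):

    SELECT HEX(CONVERT('A' USING utf8mb4));   -- 41        (1 byte)
    SELECT HEX(CONVERT('é' USING utf8mb4));   -- C3A9      (2 bytes)
    SELECT HEX(CONVERT('€' USING utf8mb4));   -- E282AC    (3 bytes)
    SELECT HEX(CONVERT('💩' USING utf8mb4));  -- F09F92A9  (4 bytes)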


Unicode specifically limited itself to the range zero to U+10FFFF

Obviously nothing in the laws of nature forbids "a future Unicode version" from disavowing this limit, but we could say the same for whether "a future United States of America" could disavow the status of independent Indian Tribes it has previously recognised.


> (I wonder if a future unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)

On a thread a couple of years ago (https://news.ycombinator.com/item?id=20600873) it was mentioned that the UTF-8 encoding scheme can be cleanly extended to 36 bits, so even "utf8mb7" would be a possibility.


It could be he's thinking of the historical definition, which included support for 5- and 6-byte encodings with a maximum codepoint of U+7FFFFFFF, or ~2 billion.

https://en.wikipedia.org/wiki/UTF-8#History

That was restricted I believe primarily for compatibility with more limited encodings like UTF-16.

I guess it's possible that at some future point in human history, when UTF-16 has been purged from memory, the 5-byte encodings might be allowed again. :)


Almost all emoji are four bytes in UTF-8.


were all the other positions taken?


I enjoyed the article right up to the point where it concluded at the very end that the solution is to “switch to postgresql” without actually providing any rationale for that sudden and subjective ending.


MySQL is very similar to PHP in this regard. The good function name is actually broken so a new function with similar name is needed. This creates confusion for new people when adopting the technology.


Oh the good old "mysql_real_escape_string"[0] days...

I forgot most of my PHP knowledge but that somehow got stuck.

[0]: https://www.php.net/manual/en/function.mysql-real-escape-str...


You can blame MySQL for that one. The function name is straight out of the MySQL C API:

https://dev.mysql.com/doc/c-api/8.0/en/mysql-real-escape-str...


So in MySQL, utf8 does not mean UTF-8. Reminds me of ISO8601 in PHP, which does not mean ISO 8601.

https://www.php.net/manual/en/class.datetimeinterface.php


MySQL and PHP go hand in hand when it comes to these poor decisions.


Let's say for the sake of discussion that your MySQL db has a lot of tables encoded as "utf8." Are there any known drawbacks or gotchas to converting them en masse to "utf8mb4"? Is this a lengthy operation?


Schema changes to large tables are always huge pain in the ass in MySQL. Every team I’ve worked on goes to great lengths to avoid changing the schema, which leads to all kinds of exciting anti-patterns like recycling old fields even if their name no longer matches what is stored in them.


I suspect this is because for large databases, schema changes lock the tables for extended periods of time, and if this time exceeds a reasonable time for a service window, you need good database replication. And you probably don’t have good MySQL database replication unless you’re on AWS or similar.


That’s exactly what it is. If you have a big table you are gonna have a pretty long outage while you update that schema. Like in the last place I worked our largest table would take more than an hour for a schema update. So we never did schema updates.


At my last workplace, we had a service window of one weekend. Prior to this, an application had to be restructured to even make the schema upgrade able to finish in one weekend.


sounds like how I approach laundry


So you're right, but also wrong, thanks to good tooling. If you have enough discipline not to put triggers on your tables, then pt-online-schema-change has got you sorted. Does it take forever? Yep, but does it matter that it takes forever? Nope, because it doesn't lock tables.

And even if you do have triggers https://github.com/github/gh-ost you can still do it.
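For reference, a conversion run with pt-online-schema-change looks roughly like this (database and table names are placeholders; try --dry-run before --execute):

    pt-online-schema-change \
      --alter "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci" \
      D=myapp,t=comments --execute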


There are collation changes between utf8 and utf8mb4. Specifically, utf8mb4 uses the (much better) UCA 9.0 algorithm.

But this implies that all indexes will have to be regenerated, if they contain a column that is utf8 and will be converted. As they sort differently, this requires rebuilding the index from scratch. Depending on data size this can take measurable time and effort.


The author of original article links to this one in their post:

https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

And yeah, there's gotchas in there, the main one being that strings that barely fit now won't fit, and also maximum lengths on indexable strings.
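The per-table conversion itself is a one-liner (table name is a placeholder):

    ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

The indexable-length gotcha comes largely from the old 767-byte InnoDB index prefix limit: an indexed VARCHAR(255) fits at 3 bytes per character but not at 4, which is where the usual VARCHAR(191) advice comes from.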


I can’t imagine there would be too many, other than it is a tedious operation not easily automated. Depending on your data, might take a while.

Can’t believe that neither MySQL nor MariaDB has created a utility to do the conversions for you automatically.


That's kind of funny - Oracle has a similar issue. If you want industry-standard UTF-8, you have to specify "AL32UTF8" as your encoding. "UTF8" is kind of crazy - it's this monstrous abomination called CESU-8 (https://en.wikipedia.org/wiki/CESU-8), which isn't UTF-8 at all - it's actually this weird "UTF-16, complete with surrogate pairs, wrapped in a UTF-8 shell" thing.


That would be a serious problem if Oracle's idea of "UTF8" wasn't roundtrip compatible with UTF-8, but it is fortunately, so approximately no one should notice the difference.


> If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL.

Sure. But for many use cases, SQLite is enough.


This is a forum of professionals though, how many production apps can actually run a database with no HA, no replication, single process access, no remote access, and is weakly typed?

And if you suggest "$project that's built on SQLite" to do all that then congrats you just reinvented MySQL or Postgres but with SQLite as the DB engine.


Anything that is shipped on Android, on iOS, or as a macOS app, apparently. SQLite is incredibly useful and much more prevalent than you imagine.


Would the DBAs on this thread consider committing code to the MySQL [1] and Postgres [2] tuner scripts that give new DBAs all your learned advice and battle-hardened experience? This thread appears to be such an example.

[1] - https://github.com/major/MySQLTuner-perl

[2] - https://github.com/jfcoz/postgresqltuner


Another interesting “issue” is that the case sensitivity of character sets may lead to unexpected results with aggregations (e.g. https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/)

MySQL has many such issues which have turned me off from recommending it. It’s a shame because it’s a solid technology that mostly works and has a really long operational history.


Genuine question: why would anyone prefer MySQL over Postgres?


In the WordPress world, it is the only database.[0] If you do WP, you do MySQL. You may sniff at WP (I certainly do), but it is used for 43% of all websites, which is not a trivial number of deployments, and not a small market.

I suspect it has a large market share in the "started as a trivial PHP program, grew to large application, too committed to MySQL to migrate to something else" market, too. There's that one website with the blue F logo that went this path, but there's probably thousands more.

That's not the same thing as anyone actually preferring it, but that's two possible reasons why there's an awful lot of MySQL deployments out there.

[0] https://codex.wordpress.org/Using_Alternative_Databases


MySQL is much easier to administer and upgrade.

For example, I have two versions of Postgres running on my machine right now because Debian does not automatically upgrade between them - you have to export your data and reimport it into the new version.

https://www.pontikis.net/blog/update-postgres-major-version-...

MySQL just works. There are issues with it, but they are not as major as people make them out to be. Sometimes people like "technical purity" over just get the job done.


> MySQL just works

Right up until one of your users tries to use an emoji, apparently.


MySQL doesn't have to know that a db is UTF8 for you to store UTF8 in it. You're not necessarily using its collation functions or anything.

My understanding is replication and setting up MySQL is still easier than Postgres; it definitely was 10 years ago, that's why it's part of LAMP. The data loss issues were fixed forever ago when it switched to InnoDB.

Google AdWords and Facebook both use(d) MySQL a lot and I assume they knew about Oracle and Postgres at the time they were built…


> Google AdWords and Facebook both use(d) MySQL a lot and I assume they knew about Oracle and Postgres at the time they were built…

Why would you assume that? Both of the companies were tiny and full of children when these decisions were made. And Oracle was expensive and a non-starter for explosive horizontal scale.


Well, because I remember 2005 and the Postgres people were just as into posting about how superior it was as they are now.

(Where they had a point because MySQL hadn't switched to InnoDB yet.)


Keep in mind that that's a post from 2016. PG has made many improvements since then. Upgrading is a much easier job right now.

Also, it's strange to conflate PG with Debian issues.


I mean it's not really a Debian issue when Debian's design decisions were informed by the lack of automatic online upgrades at the time.


User mgmt on PG is terrible, same for replication; connection pooling also sucks, and InnoDB is faster overall than PG. PG's tooling is inferior to MySQL's, as are its clustering and upgrade process.

PG is great, but it still has a lot of issues.


Job security. I can’t think of any technical reasons.


Other way around - Postgres will give you more job security, since it's rarer and takes more time and work to administer.

People use MySQL because it's easy, and they want to spend their effort in other places.


That's like saying Python is rarer than JavaScript. Even if it's true, it's not actually rare at all.

In the last 6 years, I've spent approximately 2 hours administering PostgreSQL, and I'd be the one who had to do it.


It’s “easy” until it silently shreds data and gets your org hacked through truncation attacks.

Don’t confuse lazy with easy.


I have been told that MySQL / MariaDB is significantly more scalable than Postgres, although I expect that opinion is based on old data. I don't have firsthand knowledge of this, so I can't comment further. Came to the comments to see if anyone knowledgeable would react to that point. (maybe they can reply here :) )


Last I heard, it's the opposite: MySQL/MariaDB might be faster for single connections, but PostgreSQL can be much faster for heavy workloads. For instance, PostgreSQL added a feature to support sharing table scans between multiple connections. Suppose you have a million rows in a table "foo".

- Connection 1 executes "select * from foo". It begins at row 1 and starts a table scan.

- When the reader is at row 500,000, connection 2 executes "select * from foo". It starts returning results beginning at row 500,000, so both connections are sharing the output of the reader.

- When the reader gets to row 1,000,000, the first query is finished. It's seen all the rows. The reader starts back at row 1, and continues feeding results to the second query until it gets back to row 499,999. Then the second query is also finished.

Even though 2 queries were running, the DB backend was only doing the work of 1. With that setup, you could have 100 "select * from foo" queries running simultaneously without increasing the database load at all.

I don't know if MySQL supports that kind of thing now. In any case, that's an example of some of the incredible engineering PostgreSQL uses to be correct and fast.


> Even though 2 queries were running, the DB backend was only doing the work of 1.

Well, one and a half. Or the work of one query, but one and a half times.


If, in a customer facing database, you have to table scan, you have already lost.

For data warehouses this is a different story, but god help you if you are table scanning in an OLTP context.


We learned this a few years ago when clients started using emojis in their content and it got truncated. Very quickly converted everything to utf8mb4!


emojis like this?🆒


What is this voodoo? I thought HN didn't support emoji? It's sure 🆒 though.


WordPress has, of course, the lion's share of MySQL deployments by server count if not by row count. Yes, WordPress isn't the latest and greatest stuff. Yes, it's <bad thing> and <another bad thing>. But such is the curse of the customer base.

They upgraded everybody from utf8 to utf8mb4 (excluding some users of ancient versions of MySQL) with their version 4.2. It went live on April 23, 2015.

They announced the upgrade at the beginning of April 2015 here. https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgra... There's some interesting stuff in there about the practical difficulties of the upgrade. The biggest issue was, and still is, the need for prefix indexes.

With respect to Adam Hooper, his article was stale when it appeared, just over a year after the WordPress schema upgrade. He's right, of course. But the biggest user base was well on its way to abandoning utf8mb3 by that time.


Memories... When I was at Amazon, my starter project was modifying Redshift to support four-byte UTF-8 characters, not just three-byte.


Another fun fact is that string comparisons are case insensitive by default:

https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.htm...

This can definitely catch you by surprise and cause bugs that are only detected late.
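A quick illustration, since the default collations are all *_ci (case-insensitive):

    SELECT 'John' = 'JOHN';          -- 1 under the default _ci collation
    SELECT BINARY 'John' = 'JOHN';   -- 0, byte-wise comparison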


I decided to build all my new projects on postgres a few years ago. I am often reminded of why that was a good decision.


Maybe this is a cynical take - but we used https://pgloader.io/ a few years ago to migrate to Postgres, and have never been happier. MySQL has a lot of stupid decisions like this.


Specifically about UTF-8 in Postgres: "PostgreSQL collation is a massive footgun" https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af...

Postgres defaults to the system's collation for UTF-8, which caught me by surprise. Now I'm stuck with a production DB where text prefix queries, which should be a perfect use of a btree, are a full table scan. Probably gonna have to take some maintenance downtime to fix the situation :-\
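For what it's worth, one mitigation that may avoid a full rebuild (a sketch, with made-up table/column names, assuming a plain text column) is an extra index using text_pattern_ops, which lets LIKE 'prefix%' queries use a btree even under a non-C collation:

    CREATE INDEX users_email_prefix_idx ON users (email text_pattern_ops);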

And more generally about operating Postgres in production: https://blog.nelhage.com/post/some-opinionated-sql-takes/

As for Postgres, I have enormous respect for it and its engineering and capabilities, but, for me, it’s just too damn operationally scary. In my experience it’s much worse than MySQL for operational footguns and performance cliffs, where using it slightly wrong can utterly tank your performance or availability. In addition, because MySQL is, in my experience, more widely deployed, it’s easier to find and hire engineers with experience deploying and operating it. Postgres is a fine choice, especially if you already have expertise using it on your team, but I’ve personally been burned too many times.


What else would they reasonably default to? Are you proposing requiring collations to always be specified (a breaking change from the SQL spec)?


I think the behavior of the "C" collation is what most software developers expect. Plus, there is an advantage to the behavior being the same everywhere, rather than dependent on the system/environment settings.

But those preferences may be specific to my situation -- a software developer running an HA web application.

* If you're not a software developer, you might be surprised by the "C" collation, which puts "Z" before "a".

* If you're not running an HA web application, then it's easy to change the collation later -- just dump/restore the DB.


If you have any non-ASCII characters wouldn’t the C collation be very surprising? I’m having a hard time seeing how the C collation is a sensible default for UTF encoded text. If you want to limit to ASCII the why are you using UTF?


The C collation applied to UTF-8 just orders everything by Unicode codepoint.
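For instance (the exact linguistic collation name depends on which locales are installed):

    SELECT 'a'::text < 'Z'::text COLLATE "C";      -- false: byte order, 'Z' (0x5A) sorts before 'a' (0x61)
    SELECT 'a'::text < 'Z'::text COLLATE "en_US";  -- true under a linguistic collation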


I think the open-source nature is a major factor in the reason it's chosen so much. That's why I roll with it.


What do you use instead?


Since I didn't specify the default collation when creating my database, Postgres inferred it from my system, so I ended up with "en_US.UTF-8". The "C" collation is probably what most developers expect.

The only supported way to set the database's default collation is at creation time, so I'd need to create a new database, copy the data over, and switch my application over. (There appears to be some Postgres voodoo to change the default collation for an existing database, but that's too risky for me.)

Until I get that done, I need to explicitly specify the collation for all new text fields, e.g.

    CREATE TABLE t (
        f1 text NOT NULL COLLATE "C"
    )

https://simply.name/pg-lc-collate.html


Oh whoops, my previous response was about which collation I (want to) use. Were you asking which DB I use?

The answer is Postgres. I have some previous experience with MySQL and there are advantages and disadvantages. The biggest advantage so far is partial indexes.

But since I don't have as much experience with Postgres, there's still the occasional surprise, e.g. index names not being scoped to the table, this collation thing. And from what I've read online, I should probably set up some monitoring for vacuuming issues.

The article about Postgres operational footguns recommends outsourcing those issues by using a managed Postgres service (e.g. AWS RDS or Google Cloud SQL), and that's what I do.


MySQL or MariaDB/PerconaDB most likely. For all the footguns people are bringing up in this thread it's an extremely well trodden path where anything you could ever imagine doing with a DB someone has already done a hundred times with MySQL and documented it.


I love, love, love Postgres! The one feature MySQL has that Postgres doesn't is the ability to add a new column in an arbitrary position in the table. I like my tables to have a common "layout" and I would love to have this feature in Postgres.


This is annoying but it's not a huge hassle to duplicate/copy/rename, assuming your data is small and not sharded in some unique way


If your data is tiny then within reason nothing matters, any popular database will be just fine. Where you care is precisely when your data gets big enough to be unwieldy.


That's generally what I end up doing. It's not ideal, but it at least allows me to order the columns how I want.


MySQL also lets you control the layout of your table on disk via BTree primary storage and composite primary keys.


This seems like something that could be done client side instead. But maybe not an option for psql CLI


It is, and something an ORM can manage, but I spend a LOT of time in the psql CLI.


I'm a big PostgreSQL fan, but I'd maybe argue that you're making tradeoffs. PG has weird behaviour sometimes too.

I don't think it's always smart to just change out your data layer.

But, if you're starting a new project, I do think PG is one of the better options and tends to follow the principle of least surprise. (hence: big fan)


> PG has weird behaviour sometimes too.

Perhaps, but I'd argue that the "weird" behavior of postgres just tends to be clearly thought out design decisions that they made for a valid reason that may cause you some pain with how you use it (e.g. their process-per-connection model).

MySQL's "weird" behavior, on the other hand, just tends to be completely invalid footguns like this. Despite what some people are arguing in this thread, a 3-byte version of UTF-8 was never in any spec anywhere and was an invalid shortcut from day 1.


Oracle, Postgres, MySQL, mSQL, and SQLite all have weird behaviour in different situations. You pick your poison.

I hated how Postgres forced you to create a system user to connect; I wonder if it still requires this.


As a huge Postgres fan working at a shop using MySQL I'll point out the first thing I saw MySQL does that Postgres doesn't:

Index hints.

If you aren't hitting an index in Postgres you have to dig into table stats and figure out what is wrong, but MySQL gives you more control.
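For reference, a hint looks something like this (table and index names made up):

    SELECT *
    FROM orders FORCE INDEX (idx_created_at)
    WHERE created_at >= '2022-01-01';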

However, I would still rather work with Postgres AND have to juggle a connection pooler than deal with MySQL. Transactions on DDL are _great_ and the ability to use foreign keys across partitioned tables is how it should be.


But at least Postgres has much better tools for learning about the query plans it might use. The Postgres output of “EXPLAIN” is much much better than MySQL.

It’s a shame that almost every job I worked at uses MySQL and not Postgres. But that could be because those companies all got their start like a decade or more ago when Postgres was not as well known.


> As a huge Postgres fan working at a shop using MySQL I'll point out the first thing I saw MySQL does that Postgres doesn't:

> Index hints.

Eh, it's an extension. One you shouldn't use, but it's there.


I didn't know this!


> I would still rather work with Postgres AND have to juggle a connection pooler

Is this comment relating to the overhead of idle connections, which has historically necessitated the use of a pooler in front of PG? If so, I believe this is resolved in postgres 14

https://pganalyze.com/blog/postgres-14-performance-monitorin...


I also helped with a project once that used pgloader to migrate an old MySQL db to Postgres, and I think this article may actually explain one of the issues we found. We had a UTF-8-configured table in MySQL (we also discovered during the project that MySQL can have encodings set differently per table, while Postgres sets the encoding for the entire database) with UTF-8 data, and when we migrated it into a UTF-8-configured Postgres database, some of the UTF-8 characters were silently corrupted. They were UTF-8 characters in both databases, so the corruption didn't raise any errors, but they were different characters, so when the data was read back out by the application, the text was different.

We only caught this, because thankfully, we had written a manual checksum script, which looped through every table, read out all values from each row into the application, and hashed the results, then compared between when the app was connected to the source MySQL database vs the destination Postgres database. We ended up having to massage and fix those silently-corrupted characters.


People say this all the time but never elaborate on what they are nor link to any sources that help educate people what they are, so I just assume that when people say this, they don't know what they're talking about and are just parroting.


> Back in 2002, MySQL gave users a speed boost if users could guarantee that every row in a table had the same number of bytes. To do that, users would declare text columns as “CHAR”

This column type has existed in many databases since at least the 1980s.


Dear databases, please don't get hung up about string lengths when dealing with UTF8.

If I ask for a UTF-8 string with a max length of 100, please don't apply the worst-case scenario and allocate space for 100 emoji. Please give me a box of 100 bytes and allow me to write any UTF-8 string that fits into 100 bytes in there.

100 ASCII characters. 20 emojis. Any mixture of the two.

If I ask for UTF-8, it'll be because I'd like to take advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.


PostgreSQL recommends you use the text type everywhere, and use CHECK constraints if you really really really need to limit a field’s length on the database level. (VAR)CHAR is considered a code smell. So many fields are artificially limited to 255 bytes because that used to make a performance difference on MySQL 15 years ago.
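A minimal sketch of that pattern:

    CREATE TABLE users (
        email text NOT NULL CHECK (char_length(email) <= 320)
    );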


I feel PostgreSQL should push that as a benefit of their system. I read your comment and said "Where have you been all my life?"

"Still using strings with fixed limits like you're a C programmer in the 80s? Use PostgreSQL."


Yes! This and other charset oddities almost make me want to store all strings as varbinary and handle any conversions to/from a string type to utf8 bytes client side. It makes ad-hoc query access from a shell much more annoying though.


I wonder if you can do LIKE with a VARBINARY value.

"SELECT * FROM Users WHERE Name LIKE @namePattern" @namePattern = UTF8.GetBytes("%billpg%")


More like "100 ASCII characters. Unknown number of emojis."

You'll get codepoints truncated if you use bytes.


I would hope a database engine would throw an error instead of truncating bytes.


Why would the database "allocate space" unless it was truly necessary, in any case? While char(n) works that way, most applications don't actually use it, no matter the underlying DB system.


You're literally asking them to break the SQL standard. The meaning of N in CHAR(N) is the number of characters.


No one talks about the downsides of changing.

Moving from a 255-character max to 191 or 192 means a lot of your data needs to move into text fields, which means those fields can't be a primary key.


I'm used to mssql where nvarchar (the UTF-16 datatype) supports basically unlimited length... you can't make long PKs in mysql?


You're limited to smaller key sizes, so in practice that means ints, dates, and short chars.


friends don't let friends pk on variable length columns


non numeric primary keys are generally a bad idea

can't you specify collation by column?


According to the docs, MySQL is a bit more explicit and calls the old "utf8" "utf8mb3" nowadays, and notes that it may be removed at some point. Although "utf8" is still aliased to the footgun at the moment.

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...


Wow, never knew this, this is really bad, especially in this day and age where emojis are so prevalent.

MySQL should deprecate utf8 and give a warning if you try to use it.


It is deprecated:

> You should also be aware that the utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead.

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...


How the heck did someone downvote you? Basically your link is exactly what I was looking for and I'm glad they're making this deprecation.


Bad idea. If they deprecate "utf8" they will deprecate a standard. It was designed and specified at a time when the Unicode code space had a 21 bit limit. There are some technical (storage space-related) considerations with the suggestion to "just use utf8mb4 everywhere instead" because of how InnoDB's indices work.


MySQL "utf8" is still "utf8mb3", which is not a standard anywhere except MySQL. It cannot store the full range of 21-bit code points; I don't know why you keep repeating this. The maximum codepoint in 3-byte UTF-8 is 0xFFFF, which is 16 bits.


> If they deprecate "utf8" they will deprecate a standard.

No, they won't. They will be deprecating something they created that never conformed to the UTF-8 standard. That fact that unicode didn't have codes beyond 21 bits at that point is pretty irrelevant.

It was an invalid implementation from day 1, at least with the name of "utf8".


>It was designed and specified at a time when the Unicode code space had a 21 bit limit

21 bits is the current Unicode limit. Unfortunately UTF-8 in 3 Bytes only has 16 usable bits - the first byte starts with 1110 and the two continuation bytes start with 10, so it's 4 + 6 + 6 = 16 bits.


Does this even fully fix the problem? It looks like utf8mb4 is limited to 4 byte sequences, but as far as I understand, utf-8 is variable width. Can utf8mb4 even encode the scottish flag https://www.iemoji.com/view/emoji/2476/flags/scotland ?


Flags are not single code points. UTF-8 refers to how code points are stored. If you look in your link at "Hex Code Point(s)", it is that first one that would be a problem with "utf8" in MySQL, because of the 1 in the 0x10000 position. The other six code points required would fit in fine.

Unicode is developing more and more things that require code points. I'm not sure what the longest legal non-redundant series of code points that can validly represent a glyph somewhere is, but it's getting up there with all the emoji skin modifiers and such.


Emoji modifiers, both for gender and skin tone, do not produce that many extra code points. Combinations are made in the same way ¨ + e can combine into ë: a code point, followed by a combining code point, followed by a modifier. During text rendering, these code points are converted back into a single glyph.

All Unicode codepoints in use today can be encoded with at most four bytes of UTF-8. Theoretically the UTF-8 scheme could be extended to 6-byte sequences if that ever becomes necessary, but it won't be for a while. Crossing the 4-byte boundary would also introduce compatibility issues with UTF-16, so I'm sure the Unicode Consortium will do their best to prevent this from happening as long as they can.


I missed a word, sorry. I was idly musing about the longest legal code point sequences.


Aaaah, that makes sense. I think the flag of Scotland (󠁧󠁢󠁳󠁣󠁴󠁿) is the longest usable one I've seen, but you could stack a near-infinite number of items in Zalgo form on top of normal letters if you count those. I don't think Unicode has any restrictions on the number of combining characters, though most text parsers will probably enforce some kind of limit.


Is this a problem? Flags in unicode are defined by several special characters. The flag of Scotland isn't really a single character, it's "<waving flag><tag g><tag b><tag s><tag c><tag t><cancel>".

All of these characters are multi byte combinations. The hex for the flag is not a single, super wide character, it's 0xF09F8FB4 0xF3A081A7 0xF3A081A2 0xF3A081B3 0xF3A081A3 0xF3A081B4 0xF3A081BF. You might get some weird results if you take substrings from that, but it won't be a problem for the backing database store; each separate "binary character" is a four byte sequence (as denoted by the 0xF at the front of the number).


The biggest codepoint in Unicode fits into 4 bytes of UTF-8. UTF-8 would allow up to 6 bytes, but those codepoints are not in use currently. If they ever become in use, yes, you'd probably need a new character set again. But then a lot more things will break, as higher codepoints would be incompatible with UTF-16 also.


UTF-8 only allows 4 bytes, since 2003: https://datatracker.ietf.org/doc/html/rfc3629


Legal UTF8 is limited to 4 bytes, as Unicode only uses ranges that fit the limits of UTF16.

AFAIK The flags are a weird multi-code point encoding of the ISO country codes, and each individual code point is less than 4 bytes.


UTF-8 is variable width. The biggest valid codepoint is U+10FFFF, which has a 4-byte encoding in UTF-8. Other codepoints have 1-, 2-, or 3-byte encodings.


Variable width is unlikely to be a problem. At 3:5 the Scottish flag does not have an unusual aspect ratio. This is unlike the flag of Qatar, with a ratio of 10:28, or Nepal with both a 3:4 (approximate, not exact) aspect ratio plus an irregular shape.


> Choose CHAR columns. (The CHAR format is a relic nowadays. Back then, MySQL was faster with CHAR columns. Ever since 2005, it’s not.)

Is this vs VARCHAR? If so, that's fascinating to me; I'd love to see a link or benchmark. My understanding is that reading a stripe off a drive and then chopping it up indiscriminately at known intervals is faster than conditionally branching for strings of varying lengths.


MySQL is an awful piece of software.

Try diving into the Time zone insanity, collations, and of course this UTF-8 gem that has lasted since at least 2011.


In addition to emojis, I've also had issues with certain characters copied from Microsoft Word being 4 bytes.


I've had to "upgrade" a couple to utf8, and varchar(255) bites me all the time.


Best to be used with mysqli_real_escape_string ;-) https://www.php.net/manual/en/mysqli.real-escape-string.php


> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.

This sounds like an offended developer. Avoiding database systems is not possible if you want to do a lot of programming tasks.



A while ago, in my first engineering job, I was tracking down a strange bug related to Chinese character sets.

Turns out some of the db tables were using utf8 and latin1; changing them to `utf8mb4` fixed it right up!

https://josh.works/troubleshooting-chinese-character-sets-in...

It was a fun bug to work on. I learned a lot about character encoding, and enjoyed bringing a refined "process" to the table.


> And developers who wanted correctness were wrong to use “utf8”, because it can’t store “poo emoji”.

This sentence deserves an award :)

Edit: TIL: HN doesn't support emojis.


I just went through the process of converting a 2TB MySQL database over the summer. Quite a few gotchas along the way that make the process a huge pain.


Thanks to emoji, (almost) everything now supports Unicode.


This HN posting should be marked with (2016)

Is the issue still an issue?


Yes and yes.

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...

> The utf8 Character Set (Alias for utf8mb3)



Year added above. Thanks!


Also make sure mysqldump is using utf8mb4 or you'll get character replacement in your backups. This often needs to be configured separately.
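On the command line (or via the [mysqldump] section of my.cnf) that's roughly:

    mysqldump --default-character-set=utf8mb4 myapp > myapp.sql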


In 2022 with a new release of MySQL or MariaDB, is it still possible to create a database with "utf8" encoding? Does it print a warning?


I spun up a MariaDB cluster yesterday and each node came configured with utf8mb4 by default.

Testing it now, manually choosing utf8 for new tables shows no warning.


Thanks for the report! Glad real UTF-8 is the default at least.


I'll just stick with the latin1 charset for now... it's slower and increases the size of your tables to use utf8, etc. [0]

[0] https://stackoverflow.com/questions/12449336/utf-8-vs-latin1...


>Of course, they never advertised this (probably because the bug is so embarrassing).

Not as embarrassing as being owned by Oracle.


Will something terrible happen if I just convert my tables to utf8mb4? Will I have to upgrade each column?


Does anyone know if utf8mb4 is variable length or does each character take full 4 bytes?


It is variable length, by definition. If it were four bytes per code point it would be named something more like UCS-4 or UTF-32.


Thanks!


In MySQL, use postgres instead.


I wrote about supporting emojis in Ruby on Rails / MySQL here:

https://josephecombs.com/2018/05/06/how-to-support-emojis-wi...


This one is very fun to debug in production when you're on MySQL 5.8


It's because of things like that it's still hard for me to respect MySQL in 2022.

I use Postgres since 2008. It has never bitten me even once.


For something that was fixed 12 years ago?

If Postgres has never bitten you, then you aren't using it much.


It’s not fixed. MySQL still has a datatype called “utf8” that, in fact, is not utf-8 but a limited subset thereof.

PostgreSQL has some sharp edges, sure, but far fewer obvious footguns like these.


Just use ASCII.


use Postgres instead of MySQL


> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.

Wat? That is... not a great takeaway from this experience. "Oh, this library had bug reported in it! I know, that means I should just write my own, because clearly they're all buggy and my software is always perfect."


"Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."

https://news.ycombinator.com/newsguidelines.html


Well, I mean, the good news is that if you write your own database system, you'll have so many other bugs in the way that you'll likely never reach this particular one!

I suggest it can be called the Sirius Database.

"It is very easy to be blinded to the essential uselessness of [the new database you just wrote] by the sense of achievement you get from getting them to work at all. In other words—and this is the rock solid principle on which the whole of the Corporation's Galaxy-wide success is founded—their fundamental design flaws are completely hidden by their superficial design flaws."


>their fundamental design flaws are completely hidden by their superficial design flaws

That hit me right in my 9-5 "not invented here" nightmare.


It's especially specific to MySQL — switching to Postgres is great because in addition to better performance and features there are all of these rakes in the grass you can forget about. Unicode just works, you never lose data because someone didn't scrupulously check the warnings after every query, etc.


Doesn't Postgres have its own set of unique problems?


I’ve encountered far fewer and nothing treacherous like the “you lost data but we didn’t tell you about it” stuff MySQL is famous for.


Indeed, the worst of which is that you can't do in-place upgrades - upgrading cluster versions is an insanely complex ride.


I routinely upgrade versions - on AWS RDS this is just a click or two in my case.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...

When you say "complex ride" what does that mean?


> When you say "complex ride" what does that mean?

On managed postgresql services like RDS it's easy, yes, because Amazon does all of the nasty work for you and has already seen all the kinks you may run into.

If you are on your own and using Docker, you essentially need a third-party image that has the old and new binaries (https://github.com/tianon/docker-postgres-upgrade). Good f...ing luck if you dare run PostgreSQL inside a Kubernetes environment, because adapting that guide to that is a can of worms large enough to make anyone squirm.

If you are on your own and are using distribution packages, it's more complicated (https://blog.samuel.domains/blog/tutorials/from-stretch-to-b...).

If you are on your own and use postgres from source, well... have fun, there's a reason why I prefer using distribution-maintained packages when possible.


I am not sure if that procedure covers major versions (13 -> 14 for example), but if it does, they are ahead of the curve. On GCP, you have to backup the database and restore the backup to a new database. On self-hosted installs, you have to pick one of these processes: https://www.postgresql.org/docs/14/upgrading.html


To confirm, RDS does support in-place major version upgrades for Postgres.


They do an in-place upgrade using pg_upgrade, I believe. They do have to regenerate system tables, and I think they also automate some updates to system statistics. But for very large databases this really is the way to go, I think.

They do suggest a backup before you run this in case there are problems, most folks have periodic backups anyways, but the backup is not used.


It does. It’s really easy to outsource.


Fun fact: InnoDB in-place updates have plenty of caveats too: the new column must be last, and it cannot have non-default on-update/delete clauses. And if you work around that with a table-rename solution like pt-online-schema-change, then you might crash your server. (Some MySQL 8 versions crash on table renames.)

EDIT: Oops, guess you meant version updates not DDL. Leaving comment for posterity.


Postgres doesn't have a way to change the order of columns. Of course you could make a view, but my code doesn't use a view and I don't really want to make one just for when I want to check the database.


Out of curiosity, how does this cause a problem for you? I’ve been using SQL databases heavily since the mid-90s and I’m not sure I’ve ever been in a situation where I had to use SELECT * but couldn’t either process the results by column name or have the client read in whatever order needed. I strongly favor only selecting the fields I care about, however, so I’m assuming there’s a case you’ve encountered.


Wouldn't a single application of a temp table be sufficient?[0]

0: https://brianchildress.co/reorder-columns-postgres/


If that’s the biggest problem you have with your database, be happy.


I love that expression, "rakes in the grass". It brings back childhood memories of quicksand, stop/drop/roll, and stepping on rakes being things I'd always have to watch out for as an adult, plus all the Looney Tunes cartoons that used them.


Instead you run face first into the process-per-connection limit rake...


PG is actually overall slower.


In my experience that’s only true for very simple queries which play to MySQL’s strongest points, not complex ones or under high transaction volume. I haven’t found that to be a good win since the 2000s because caches usually soak up the easy queries and as your app gains complexity you hit performance cliffs which are harder to deal with, especially without some of the more advanced SQL features. MySQL 8 seems to have improved in this regard.


A) it's an obvious joke B) even if you take it literally, they never suggest implementing something new

You're misreading this on multiple levels


I'm pretty sure you missed the joke.


This is a pretty uncharitable reading considering the very next line is a recommendation to use PostgreSQL if you need a database.


god I hate mysql so much - long live postgres, one of the greatest pieces of software ever created


In MySQL, if you want to make sure that your Unicode fields are mangled, never use ‘utf8’, use ‘utf8mb3’. Since ‘utf8’ is an alias to ‘utf8mb3’, that alias might eventually be updated to point to ‘utf8mb4’ which won’t mangle your characters.

The only safe choice here is to explicitly use ‘utf8mb3’.


This is an old article and it was incorrect when written, and it's incorrect today. The author may possibly have learned since writing it that UTF-8, by design, encodes up to 21 bits of character code point in at most 4 bytes of data. It works and behaves as it should.


UTF-8 was not restricted to U+10FFFF until November 2003. Prior to that, implementers had to assume it could encode up to 6 bytes of data.

The "utf8" mode in MySQL can only encode 3 bytes of data (up to 16 bit codepoints), which is less than 4 (less than 21 bit codepoints).

"utf8mb4" correctly encodes all UTF-8 codepoints by current standards.


> "utf8mb4" correctly encodes all UTF-8 codepoints by current standards.

Yes, all good and known, but not really what the misunderstanding is about.

"utf8" follows the definitions of UTF-8 as they stood at the time, adhering to Unicode's former code space limit.


MySQL's adoption of a utf8mb3 limitation for "utf8" was over seven years obsolete when it was adopted. It would have been obsolete in 1996.


Well, that sounds great, but I've had to migrate numerous fields to utf8mb4 because we kept getting paged on emoji.


Same. Inbound SMSes with emojis. Outbound SMSes with emojis. Names and labels with emojis etc. But the implementation of "utf8" still isn't wrong, nor a bug - just like the 20 year old specs of UTF-8, they just failed to take the future into account.


The OP is wrong about the article being incorrect.



