Feature Casualties of Large Databases (brandur.org)
120 points by craigkerstiens on Dec 1, 2020 | 35 comments



This is a post I can't help but entirely agree with. ACID transactions, indexes, views, constraints, and other features have enabled so much productivity. Then it came time to scale beyond a single node, and replication protocols added caveats galore, because we poorly understood how concurrency and isolation (strict serializability, repeatable read, etc.) work in a distributed systems environment. Applications continued to scale, and we threw the baby (SQL/relational) out with the bathwater (the inability to scale databases). Aside from horizontal scaling, database features have mostly stood still for the last few decades.

So what did we do? We took the bare essentials (B-trees and other storage engine data structures) and made those distributed. Now that we've (painfully) rebuilt the database from the ground up, sharded and distributed this time, I hope we can get back to adding features that make developers' lives drastically better.

In my opinion, one of the missing features is resource and time constraints/budgets at various levels of granularity. Once the network gets involved, the non-deterministic cost of SQL can really balloon and bring things to a halt. Beyond that, we need simpler and more intuitive settings for these databases, ideally assisted by tools like OtterTune. The cherry on top will be a database that identifies areas where you're coordinating more than you need to and automatically relaxes to (or notifies you about) something weaker that is faster and more scalable.
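As a rough sketch of the coarse, statement-level version of this that Postgres already offers (the '5s'/'30s' budgets and the app_user role here are arbitrary examples):

    -- Cap any single statement in this session at 5 seconds:
    SET statement_timeout = '5s';

    -- Or give everything a given role runs a default budget:
    ALTER ROLE app_user SET statement_timeout = '30s';

What's missing is this kind of budget at finer granularities, and for resources beyond wall-clock time.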


At least the ones you listed are all supported by CockroachDB. There are still some missing bits, but the gap is rapidly closing.


> Make it easy to make a nullable field non-nullable, not requiring a problematic and immediate full table scan.

It is not easy, but it is at least very viable to add NOT NULL constraints on Postgres 12+ with minimal locking:
- First you add a not-null CHECK constraint as NOT VALID
- Then you validate it (this requires a full table scan, but with minimal locking)
- Then you set the column NOT NULL, which Postgres 12+ can do without another scan because the validated check already proves there are no nulls
- Afterwards you can remove the unnecessary check
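A sketch of those steps (Postgres 12+; the table and column names are hypothetical):

    -- 1. Add the check without scanning existing rows (brief lock only):
    ALTER TABLE users
      ADD CONSTRAINT users_email_not_null
      CHECK (email IS NOT NULL) NOT VALID;

    -- 2. Validate it: this is the full table scan, but it holds only a
    --    light lock that doesn't block reads or writes:
    ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;

    -- 3. Postgres 12+ sees the validated check and skips the scan here:
    ALTER TABLE users ALTER COLUMN email SET NOT NULL;

    -- 4. Drop the now-redundant check:
    ALTER TABLE users DROP CONSTRAINT users_email_not_null;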


> First you add a not-null CHECK constraint as NOT VALID - Then you validate it (this requires a full table scan, but with minimal locking) - Afterwards you can remove the unnecessary check

This is still quite long and complicated. What stops Postgres from automatically tracking whether a column has any nulls on each insert?


If that's feasible, shouldn't Postgres do it automatically when adding that constraint (perhaps optionally, denoted by something like CONCURRENTLY)? Where's the catch?


The catch is the time you need to wait for, or spend migrating, the data that doesn't yet satisfy the constraint.

All that NOT VALID does is add the validation for inserts and updates; it does not check the existing data. That is what lets it get away with a less restrictive level of locking.

However, there may be data which stays invalid for a long, long time, so the "concurrently" validation could be running forever.


"Small" databases are getting bigger. A laptop can manage a few billion records in a database without breaking a sweat.

At some point you have to ask yourself about the diminishing value of the data you're storing. A sales transaction at a clothing store for $78.50 is worth a lot. The 42nd temperature measurement in as many minutes for a smart home is not.

Features that allow you to treat these different kinds of data differently are going to be important.


A less obvious referential integrity performance requirement is ensuring your FKs are indexed.

This is not done by default or required by most databases, but you will pay a performance penalty if you don't do it.

For example, given a User (Id, Name) table and an Order (UserId, Amount) table, you may have an FK between User.Id <-> Order.UserId.

If a User record is deleted, the database needs to ensure no Order exists with that UserId to maintain FK integrity, so it will do something like (SELECT Count(*) FROM Order WHERE UserId = @DeletingUserId) to check whether it's OK to delete that User record.

If you do not have an index on Order.UserId column, then the deletion of the User record will be slow when the Order table is large.

This works the other way too: if you insert an Order row, the inserted UserId is looked up in the User.Id table to assert it exists, so an index is needed there too (although User.Id, being the primary key, is already indexed). But you get the idea.
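In other words, the fix is just making sure the index exists (generic SQL, using the hypothetical tables above; the index name is made up):

    -- Without this, deleting a User row forces a scan of Order to verify
    -- that no child rows still reference the deleted Id:
    CREATE INDEX idx_order_user_id ON "Order" (UserId);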


> A less obvious referential integrity performance requirement is ensuring your FKs are indexed.

For the sake of precision, this is a requirement in MySQL¹:

> MySQL requires indexes on foreign keys and referenced keys so that foreign key checks can be fast and not require a table scan

¹=https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign...


>If you do not have an index on Order.UserId column, then the deletion of the User record will be slow when the Order table is large.

Not only slow per se. The full table scan in this case would also usually take a lock on the whole Order table, instead of the narrow block/range-level lock in the index that an indexed FK would allow. Such a table lock may force other threads/sessions to wait, and it increases the surface for deadlocks between seemingly unrelated transactions.


I think a lot of developers who think they have referential integrity and strong consistency requirements are misled. I've seen the case where, for example, User A is leaving Company Z, and the developers have, for some reason, rigged it so that removing A's access to all documents in the world has to happen in one giant cross-shard transaction, because it's consistent, which means it's correct. But... that's crazy. There are a hundred equally correct ways to do it with tombstones and ordering, without cross-shard or even cross-row transactions. If you give a developer something that looks like BigTable, they'll learn to deal with ordering instead of using atomicity as a crutch.


Cool, let's implement a distributed database when it turns out there's no reason to.

That's not crazy - most people don't have enough data to outgrow one or two datastores, so a consistent approach is fine.

Choosing among 100 equally correct ways when one simple way works is ???


This article reads oddly to me. It seems to me that serious players do not compromise on these things at all.

For example, CockroachDB (https://www.cockroachlabs.com/) aims to support extremely large databases in a distributed fashion, and does not compromise on any of the points the author raises.


Deletes leave tombstones and performance plummets, pretty much like in Cassandra. If your app doesn't do many deletes (<2-3% of operations, in my experience) it can work fine. You can lower gc_ttl, but that will limit having consistent/complete backups.
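For what it's worth, that retention window is tunable per table via zone configs, something like this (the table name and the 600-second value are arbitrary examples; shrinking it is exactly the backup trade-off mentioned above):

    -- Shorten how long CockroachDB keeps old/deleted row versions around:
    ALTER TABLE measurements CONFIGURE ZONE USING gc.ttlseconds = 600;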


Regarding this, the MariaDB Galera cluster engine will allow a strict mode in its next release: https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-2...


What is considered a "large" database here?


Yeah, no shit: it's hard to get things like referential integrity and transactions working at scale; it takes many genius engineers and many years.

A better title would be:

Feature Casualties of Large Databases (except for old-fashioned databases like Oracle, DB2, etc. that have been doing this stuff at huge scale for decades).


To be fair, I'm more referring to installations where it's been necessary to scale up beyond a single node. Things like referential integrity and transactions get harder for sure, but they're not unsolved — as noted in the article, we have prior art now in the form of products like Citus, Spanner, and CockroachDB.

But that said, even though Oracle et al. do support these features even for large installs (on a single node), a lot of the RDBMS niceties tend to be thrown out anyway. DBAs get afraid that non-trivial `SELECT` operations won't be able to use an index and will cause production havoc, adding fields as `NOT NULL` gets hard to do, and even the simple act of adding an index can get expensive/scary.

This is a big part of what this piece is about: even where these features are still possible, there's a bad tendency not to use them. Why is that, and how do we fix it?


> even where these features are still possible, there's a bad tendency not to use them. Why is that, and how do we fix it?

Uh... databases ~~suck~~ have terrible usability issues? And so does SQL? Like, really hardcore. I took some DB classes at uni and I've learned a lot since too, but no matter how many times some SQL wizard explains to me the "beauty of null," I just don't get it. Why would you want to use three-valued logic when you can use two-valued logic? Why doesn't SQL have custom types? I avoid almost-programming languages like the plague: if I'm going to write a PowerShell script that's longer than 50 lines, I pull out C#; if I'm going to write a bash script that's longer than 50 lines, I pull out Python; and if I'm going to write a SQL procedure that's longer than 50 lines, I do it in the application.

For a job, I had to write a SQL script that, for each row with some conditions true, checks the value of a string column, splits it into a list of elements by commas, then splits each element into pairs by a horizontal line, and returns the rows that have duplicate keys. Write a query that does this on MSSQL 2012. Then tell me if it works on the top five other SQL vendors. Now do it again, but instead of string data with char delimiters, it's protobuf data.
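Just for flavor, here's roughly that query sketched in Postgres (table/column names hypothetical, and assuming the pair delimiter is a hyphen). Note that regexp_split_to_table and split_part are Postgres-only, which is exactly the portability problem:

    -- For each interesting row, find keys that appear more than once in a
    -- comma-separated "key-value,key-value,..." string column:
    SELECT t.id, split_part(elem, '-', 1) AS key
    FROM some_table t,
         LATERAL regexp_split_to_table(t.csv_column, ',') AS elem
    WHERE t.is_interesting          -- hypothetical boolean filter
    GROUP BY t.id, split_part(elem, '-', 1)
    HAVING count(*) > 1;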

Other old programming languages like C and C++ have had their usability issues: conflicting implementations, basic functionality missing (std::string::contains). But SQL feels... perpetually 20 years behind.

PS: Use Kusto Explorer (https://docs.microsoft.com/en-us/azure/data-explorer/kusto/t...) for an hour and then tell me SQL has lots of nice features again. I have a Pavlovian reaction just from seeing "NOT NULL".


Yes and no. When you describe "...split it into a list of elements by commas..." style problems, you are describing un-normalised data. SQL is explicitly designed to deal with normalised data, so it isn't a fault that it struggles.

NULL is unpleasant to deal with, but it is unavoidable - sometimes the value of data isn't known and the database has to represent that somehow.

But I'm with you on getting out of SQL quickly as the code gets longer. It is a bad language and it is better to be working in something else.


PostgreSQL supports arrays as a first-class feature. There's very little issue with "unnormalized" data when the data will never be used as a lookup key within the database.


SQL was designed somewhat before PostgreSQL decided to support arrays. It is a terrible, terrible language for dealing with non-normalised data.

If you are doing string parsing in SQL, mistakes were made.


Oracle (and DB2) have had multiple nodes for decades, with support for referential integrity and transactions across nodes. I guess perhaps you are talking about geographically distributed nodes with slow pipes between them?

But in any case, yeah, there's a bizarre tendency for humans to want to latch on to hard-and-fast rules. I've heard things like "don't join more than 3 tables together", and you encountered at your workplace a rule that SQL updates should only affect a single row at a time. Many workplaces have such farcical restrictions, resulting in behaviours like those of the allegorical monkeys in the room (https://workingoutloud.com/blog/the-five-monkeys-experiment-...).

My theory: small children go through a phase where they love to know what the rules are, and they love to set down rules for others. Some people never grow out of that, and they hold a worldview where everything would be OK as long as everyone followed the rules. And if for some reason that didn't work, then all you have to do is add some more rules. Such people may not be the most creative, but they cling fiercely to their rule-based approach to life. Since it's always easier to add rules than to remove them, a poorly managed organisation can find itself trussed up tightly in these rules, an all-around rubbish place to work where everything takes forever.

While that covers the prevalence of rules such as "no SQL statement may update >1 row", you also ask why people may tend not to use features such as referential integrity in avant-garde databases such as Spanner et al. Personally, I would avoid using such features if only because they are so challenging to engineer and are likely to contain bugs for a few months or years as they are bedded in. That's especially true once a system is under heavy usage and the impact of any issue is enormous. It's easier to just keep on doing whatever you are already doing.


> ...multiple nodes for decades, with support for referential integrity and transactions across nodes...

By definition, that would mean relying on something like two-phase commit. You're essentially running on a single "node" that just so happens to be physically dispersed across multiple machines, since the system would become partially or totally unavailable if the nodes can't communicate with low latency.


If something works, or sort of works, people will not want to touch it. Large datasets become complicated, and any optimization can turn bad for several use cases. Basically, it's overuse of the RDBMS.


> Yeah, no shit, it's hard to get things like referential integrity and transactions working at scale, it takes many genius engineers and many years.

Well, the problem is that such mechanisms _must_ be in place (in well-designed applications); they're not optional.

The consequence is that, if they're not implemented at the lower (db) layer, then they're going to be implemented at the higher (app) layer.

There is often a discussion of indexing/referential integrity for schemaless data stores. One may think that such data stores solve those problems; they don't - they push them to the application.

Therefore, it takes many genius engineer-years... anyway. Solutions provided by the data stores, like FKs, just happen to solve the problem out of the box (where applicable, which of course is not always the case).

The article actually mentions this, although not so dramatically :-)


Yeah, well, it all depends on what you mean by "scale" or "large". The vast majority of real-world applications are well served by a single-source-of-truth approach, which makes it relatively easy to get these features.


Distributed computing and networks, amirite?

I'm sorry, but if you can't see that ACID transactions don't scale over multiple machines, then of course you will bemoan how a large database gives up your prized feature.

Except: what are you doing that is SO LARGE that you must have ACID semantics for it, and are willing to go through all the hoopla of cross-network transactions and joins? Because that hoopla is elusive and rare (impossible to do performantly at a pragmatic scale). You are much better off designing your large-volume loads not to require ACID, and keeping everything that is ACID-critical in so-called small databases.

This is the tradeoff that NoSQL offerings figured out long ago, but somehow many engineers continue to insist on using one database for every problem, and then wonder why one solution is not reliable enough for critical data, or another doesn't scale for data that doesn't require transactions.

This is also why the people writing the applications must also be the people operating the databases, or at least be aware of the operational characteristics of the underlying databases. The choices you make in the application affect that. For many use cases, you can design reasonable concurrency without transactions, as long as you have atomic updates. But it takes sitting down, realizing that limitation, and then designing for it.
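For example (a sketch; the table and its version column are hypothetical), a single conditional UPDATE is atomic on basically every store, and gets you compare-and-swap semantics with no multi-statement transaction:

    -- Succeeds (1 row affected) only if nobody changed the row since we
    -- read version 7; the app checks the row count and retries on 0:
    UPDATE accounts
    SET balance = balance - 50,
        version = version + 1
    WHERE id = 42 AND version = 7;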

Another complaint I am reading about is nullable fields. The fact of life is that you are kidding yourself if you think your database will enforce your schema and do all your schema migrations. You need to write application code that can tolerate an evolving schema. ALTER TABLE stops working at a certain scale. You cannot go back to every record and update it with new guarantees at the data store; it is simply not practical, and honestly a waste of resources.

Even with SQL databases and the ALTER TABLE business, I've inevitably found our organizations writing custom schema migration logic in the application. So you still end up with a forced mess of application schema migration logic and database schema migration logic. I think that, long term, you should put your data migration logic into your app and stop trying to do it in your store. I've done that and have been much happier for it. All the devs know where the migration code goes (the app). The flexibility is considerable and not restricted to what the db layer might support.

Referential integrity and the lack of joins are also why we denormalize data. Are your data structures overly normalized? (I'm sorry, uhhh... "fully normalized", according to your databases professor?) Well, just how normalized do they really have to be? How much can you denormalize without taking real risks or giving up real performance? I bet a lot more than you think. And how many ways can you come up with for dealing with the denormalized approach when you need to re-normalize? I bet more than zero. And how often do you need to do it? Likely never, or infrequently enough that so-called "bandaid fixes" to your denormalization are actually practical solutions, bemoaned only in classrooms devoid of production experience.


> SQL is the most expressive language ever for querying and manipulating data, and in the right hands, that power can make hard things easy.

... what? The most expressive language ever?? You're kidding.

Python? Scala? R? Julia? Java? Rust??

I mean, SQL is fine and all, and folks keep trying to unseat it which I feel is folly but to call it the most expressive language... Not even close.

SQL is rigid and structured. Basic math and string operations are frustrating at best. It's a small, easy to learn and powerful language. But it's far from expressive.

edit: a lot of replies are implying I took the quote out of context, allow me to apologize:

For querying and manipulating data, SQL is not the most expressive language ever; I wouldn't even call it expressive. It's baroque, rigid, and structured. Spark (with, e.g., Scala) is far and away a more expressive framework/language than SQL ever will be. Python with pandas is far and away a more expressive framework/language than SQL ever will be. The list goes on.

I'm not trying to shit on SQL -- I live and breathe it, and I wouldn't be making money doing what I'm doing without it -- but it's not expressive.

Take a look at the first lines of the Wikipedia article[0] for Expressiveness (referred to as Expressive Power):

> In computer science, the expressive power (also called expressiveness or expressivity) of a language is the breadth of ideas that can be represented and communicated in that language. The more expressive a language is, the greater the variety and quantity of ideas it can be used to represent.

I get what the author is trying to say, but expressiveness is the wrong term. Datalog is arguably more expressive than SQL.

[0] https://en.wikipedia.org/wiki/Expressive_power_(computer_sci...


(Author here.) I think you're conflating a couple of different major classes of languages — declarative versus procedural. I said specifically "most expressive ever for querying and manipulating data", and I'd stand by that.

The nice part about SQL compared to the languages you've cited is that you need only specify the "what" and not the "how". I tell it what data to fetch, where I want it from, how I want it grouped/sorted (etc.), and the engine figures out the best (and probably a very efficient) way to do that. This is different from a procedural language like Python or Rust, where it'd be on me to also define every implementation detail: get data from here, join by doing this and with that algorithm, sort what comes out, ...

This gives SQL a number of very nice properties:

* It's very fast and succinct to write.

* Even very complex queries are likely to be executed as optimally as they can be (because database engines tend to be very good at this).

* It's so high-level that it can actually get more performant over time. See for example Postgres 12, which eliminated an optimization fence for CTEs (common table expressions) [1], which means that a lot of queries got faster automatically, without the user having to do _anything_.
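As a sketch of that last point (table/column names hypothetical): in Postgres 11 this CTE is always materialized, acting as an optimization fence, while Postgres 12 can inline it into the outer query so the user_id filter can use an index on orders.

    WITH recent_orders AS (
        SELECT * FROM orders
        WHERE created_at > now() - interval '7 days'
    )
    SELECT * FROM recent_orders WHERE user_id = 42;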

I love a lot of the languages you listed, but SQL is a different animal, and amazing in a way that's quite unique.

---

[1] https://info.crunchydata.com/blog/with-queries-present-futur...


I don't know. Procedural and declarative languages can both query and manipulate data. Maybe I'm being pedantic, but that's not my intent.

For all of SQL's warts, it is the gold standard, and I agree with most everything else you say (it's succinct, it's easy to get optimal behavior, etc.), but compared to what you can express with Spark in Scala, it's not the end-all-be-all of expressivity.

In fact, several SQL-compatible query engines have extended SQL to let you work around its limitations and manually hint at what you want the engine to do. And even then, there are tons of situations where you fight the engine to do what you want.

If I had a nickel for every data scientist or finance person I unblocked because what ends up being a recursive join was making their query's run time infinite...


I like SQL a lot. It makes sense to me. But I also really like discrete math and set-builder notation, so that might have something to do with how much I like SQL: I build my queries up as if querying the database were a discrete math problem.
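For example, the set-builder expression { u.name : u ∈ users, u.age > 30 } maps almost one-to-one onto (hypothetical table):

    SELECT u.name FROM users u WHERE u.age > 30;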


You took the author's quote out of context despite pasting/typing in the entire quote, which means you either did that on purpose or you copy-paste/type things without reading them. Twice.


You can call Java from Oracle's PL/SQL, so that makes it a superset including Java, right? :P (I've done things I'm ashamed of)


Our sysop loved PL/Perl for stored procs...



