The database ruins all good ideas (squarism.com)
251 points by kristianp on July 17, 2021 | 164 comments



The title of this piece is great: very catchy. But I don't think the content supports the title - by the end of it I wasn't at all clear /why/ the database ruins all good ideas.

A few other points. First, horizontally scaling database reads is actually reasonably straightforward these days: one leader, multiple replicas, load balance reads to the replicas, and use a mechanism such that users who have just performed a write have their read traffic sent to the leader for the next 10 seconds or so.

Not trivial, but also not impossibly difficult - plenty of places implement this without too much trouble.
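
For anyone curious what that routing layer looks like, here's a minimal sketch in Python, assuming a hypothetical per-user last-write timestamp kept in the session or a cache (the exact bookkeeping varies by stack):

    import random
    import time

    READ_YOUR_WRITES_WINDOW = 10  # seconds; matches the "10 seconds or so" above

    def pick_connection(last_write_at, leader, replicas):
        """Send recent writers to the leader so they read their own writes;
        everyone else gets load-balanced across the replicas."""
        if last_write_at is not None and time.time() - last_write_at < READ_YOUR_WRITES_WINDOW:
            return leader
        return random.choice(replicas)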

Scaling writes is a lot harder - but a well specc'd relational database server will handle tens of thousands of writes per second, so the vast majority of projects will never have to solve this.

When you do need to solve this, patterns for horizontally sharding your data exist. They're not at all easy to implement, but it's not an impossible problem either.

The article talks briefly about mocking your database: definitely never do this. How your database behaves should be considered part of your application code under test. Running a temporary database for your tests is a solved problem for most development frameworks these days (Django supports this out of the box).
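
To make that concrete, here's a rough sketch of what this looks like with Django's test runner, which creates and tears down a throwaway database for you (model names are made up, and Account.email is assumed to have a unique constraint):

    from django.db import IntegrityError, transaction
    from django.test import TestCase

    from myapp.models import Account  # hypothetical model

    class AccountConstraintTests(TestCase):
        def test_duplicate_email_rejected_by_the_database(self):
            # Exercises the real unique constraint, not a mock's guess at it.
            Account.objects.create(email="a@example.com")
            with self.assertRaises(IntegrityError), transaction.atomic():
                Account.objects.create(email="a@example.com")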

Overall, my experience is that the database /enables/ all good ideas. Building stateful applications without a relational database in the mix is usually the wrong choice.


> The article talks briefly about mocking your database: definitely never do this

I honestly thought you meant to never ‘ridicule’ your database, and I was thinking “that might be a healthy level of respect” :)


I think a better title would be “reality reveals that ideas were actually bad”.

The issues described in this blog are about the challenges of persisting a consistent set of data, when you need to support a lot of read and write access. That is just a fundamentally difficult problem. Databases are the most sophisticated tools we have for doing that, but really they just tend to be a set of (hopefully) well implemented trade offs that attempt to nicely balance all of those conflicting requirements.

If a person can’t find a way to adequately persist their data, then their idea is probably bad, or they just don’t have the competence to implement it. An equally suitable title for this post could have been “why doesn’t magic exist”.


Exactly. Consistently and reliably storing business data is something that actually turns out to be important in the real world.

If you can't actually retain your business data, it really doesn't matter how many "good ideas" or fancy deployment strategies for the presentation tier you have.


horizontally scaling database reads is actually reasonably straight-forward these days: one leader, multiple replicas

There's one feature that I'd like to see in that area: partitioned writable replicas. In the same vein that you can partition the table storage across an index, I'd like it to be possible to assign different writers to different parts of a table/database. Of course, you'd still need a single primary replica to handle the transactions that transcend the configured partitions, but we already have routing engines that can transparently redirect an incoming query to any available replica, so it's partly there.

There's probably corner cases lurking that I can't even think of, but in my mind it's the only thing missing from building a truly web-scale (multiple zones, multiple datacenters) ACID-preserving relational database.


> There's one feature that I'd like to see in that area: partitioned writable replicas.

This is getting more common these days. Google’s Spanner database does this. The database is divided into “tablets” which are each given their own Paxos state. Spanner moves tablets between server instances to do load balancing. There’s no need for you to do the partitioning manually, although if you run into performance problems, you may need to understand how tablets work.

Spanner’s definitely not unique in doing this, but Spanner does have a nice writeup of how it works:

https://static.googleusercontent.com/media/research.google.c... (PDF)


This is essentially the idea of VoltDB [1]: All tables are partitioned and each partition is assigned to a single thread. As long as each query only touches a single partition, you get linear horizontal scalability and serializable isolation. The drawback is that if a query touches multiple partitions, only that one query can be executed on the entire cluster, so this architecture only works for very specific OLTP workloads.

[1] https://www.voltdb.com/


I’ve manually implemented this approach before and I would be a little scared of doing it automatically.

When you do it manually you are very aware of queries that need to go to multiple shards and really do whatever it takes to avoid them.

We sharded by account ID, and all of an individual user's data would be on that shard.

Basically the only queries that need to transcend the shard are when you need to enumerate accounts by some non-ID value like an email address. You have to ask all the shards in that case.

Things like creating an account are tricky if you want to make sure that each has a unique email address. You need to use two-phase commit across all the databases to do it reliably.
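
The fan-out case looks roughly like this - a sketch, assuming one DB-API connection per shard (the real thing also needs timeouts, retries, and so on):

    def find_account_by_email(shard_connections, email):
        """Account-ID lookups go to exactly one shard; an email lookup has to
        ask every shard because email is not the shard key."""
        for conn in shard_connections:
            with conn.cursor() as cur:
                cur.execute("SELECT id, email FROM accounts WHERE email = %s", (email,))
                row = cur.fetchone()
                if row:
                    return row
        return None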


Check out TiDB. It sounds too good to be true, and it is, because transactions must be co-ordinated across multiple nodes on the network.

Thus the write throughput can be very high but the write latency is also very high.


> I'd like it to be possible to assign different writers to different parts of a table/database

If you're happy letting the database do that for you, that's how current CockroachDB works.


Vitess does just that, letting you seamlessly scale writes in the same way you would scale reads today, scaling linearly with additional nodes.


> The article talks briefly about mocking your database: definitely never do this.

I definitely recommend doing this for some tests: it's the only way to check how your application behaves in case of failures.

The best thing would be to have a DB where you can inject failures, but I'm not aware of any. So test the happy path on the real DB and the sad path on the mock/fake (which might be a light wrapper around a real DB, but with the ability to inject failures).
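
Something like this is what I mean by a light wrapper - a sketch assuming a DB-API style connection (psycopg2 here), where the test flips a flag to force the sad path:

    import psycopg2

    class FlakyConnection:
        """Delegates to a real connection, but can be told to fail on demand."""

        def __init__(self, real_conn):
            self._conn = real_conn
            self.fail_next = False

        def cursor(self, *args, **kwargs):
            if self.fail_next:
                self.fail_next = False
                raise psycopg2.OperationalError("injected failure")
            return self._conn.cursor(*args, **kwargs)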


That's true: the one place where it's a good idea to mock the database is when you want to simulate what would happen if a specific database error occurred that can't be triggered another way.


Scaling writes is theoretically really easy via sharding, and I've done it several times. Unfortunately in typical web stacks these days ORMs absolutely do not support sharding in any reasonable way, even if your data model supports it easily (users don't collaborate so you can 100% segregate by user ID and never need cross-shard joins except to static tables).

Agreed re: db mocks, database bugs and quirks should be captured and addressed in tests, not discovered only in production. I don't care how mature the DB you're using is, there's always going to be some edge-case behavior that you can't predict until you actually see it happen, even just simple stuff like exactly how an error filters back to the application when you send data that violates a constraint. I've seen way too many tests that make assumptions about these things that turn out to be incorrect, leading to horrible behavior like writes getting through on prod that should have been blocked, because pre-checks turned out not to be enforced the way the test writer assumed they would be.


> The article talks briefly about mocking your database: definitely never do this. How your database behaves should be considered part of your application code under test. Running a temporary database for your tests is a solved problem for most development frameworks these days (Django supports this out of the box).

Spring Boot for Java does this too (generally using the H2 in-memory database). I have found that for CRUD services, doing this vs. pure unit tests that mock the DB yields far better results (provided you don't involve heavy volumes in the tests, to avoid test-data maintenance issues). With this approach, I have found database-side edge cases/constraint violations that wouldn't be caught by mocking the database. Also, it is a much more comprehensive test if you take the approach of a) loading an initial state, b) doing business-logic mutations and c) comparing the final state with the expected state; e.g. you might find side effects that you hadn't considered testing for and so on.


I always go further and use the same database for my tests e.g. use postgres rather than H2.

It is easy enough to set up and tear down using docker.
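
For example, a session-scoped pytest fixture along these lines (image tag, port and credentials are placeholders; a production-grade version would poll pg_isready instead of sleeping):

    import subprocess
    import time

    import pytest

    @pytest.fixture(scope="session")
    def postgres_url():
        container_id = subprocess.check_output([
            "docker", "run", "-d", "--rm",
            "-e", "POSTGRES_PASSWORD=test",
            "-p", "54329:5432",
            "postgres:13",
        ]).decode().strip()
        time.sleep(3)  # crude readiness wait
        yield "postgresql://postgres:test@localhost:54329/postgres"
        subprocess.run(["docker", "stop", container_id], check=False)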


After a bug slipped through because of it, I switched CI to run the same point release of the database, and validate that the schema is identical to production.


That is a good idea. Though H2 supports specifying the database dialect so it is a pretty close and usually quicker approximation.


If speed is one of the things holding you back from testing with the native database, running the database data on a tmpfs mountpoint is worth looking at.

https://vladmihalcea.com/how-to-run-integration-tests-at-war...


For me, it is more about zero setup while getting most of the benefits, which is "good enough" (tmpfs works great, but every new developer would need to do this + the setup/permissions needed for docker + differences between Windows/Linux and/or desktop vs CI host etc).


It looks like tmpfs is a Linux-only feature.

Otherwise I have a single docker-compose.yml file that sets up postgres using "Docker for Desktop" on Mac and Windows and works natively on Linux. CI (Linux) uses the same docker-compose.yml.

I haven't used docker in production, but it has proved very useful in test.


and now as easy as adding `@AutoConfigureEmbeddedDatabase`

https://github.com/zonkyio/embedded-database-spring-test

also helps with having it verify all your database migrations as well


But what if the data setup is convoluted? How does one avoid a horrific domino stack of fixtured data?


I find using libraries like factory_boy and pytest's fixtures mechanism makes that horrific domino stack pretty manageable.
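
A small sketch of that combination, assuming pytest-django and a hypothetical Account model:

    import factory
    import pytest
    from factory.django import DjangoModelFactory

    from myapp.models import Account  # hypothetical

    class AccountFactory(DjangoModelFactory):
        class Meta:
            model = Account

        email = factory.Sequence(lambda n: f"user{n}@example.com")
        name = factory.Faker("name")  # randomised unless you pin it

    @pytest.fixture
    def account(db):  # pytest-django's `db` fixture grants database access
        return AccountFactory()

Each test just asks for `account` and gets a fresh row, instead of maintaining a pile of hand-written fixture data.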


Man I fucking hate factory boy, by default it fuzzes the input, meaning your tests are non-deterministic. Honestly, I don’t understand why people use it; how many models do people have where meta-programming factory classes makes sense?


I think it comes down to whether you want to maintain a fixture or a factory. In my experience, it can be easy to set up a fixture but over time it increasingly becomes a pain -- sometimes enough that setting up a factory is worth it.


Before thinking of removing the RDBMS, replacing it with a NewSQL/NoSQL database, or trying to horizontally shard, ask yourself: what is the performance I really need?

* Read-only queries can easily be scaled out to read replicas.

* Write transactions are harder to scale out: they stay on a single primary, but a single well-tuned primary goes a very long way.

As an example of publicly available numbers, GitLab.com runs more than 250K read-only txs and more than 60K write txs per second on a single Postgres cluster, with room for further vertical scalability [1].

Do you need millions of transactions per second? Then go ahead. Don't? Then you can scale very very far with a properly tuned and administered relational database like Postgres.

[1]: https://about.gitlab.com/blog/2020/09/11/gitlab-pg-upgrade/


If the number of users, number of active users and all major tables grow 100x, would postgres be able to support the application, without major re-designs?


At 100x, gitlab would be generating $15 billion in revenue each year. Having to redesign their database structure wouldn't be the thing that keeps them up at night.

I'll also add, that the point is that a devtools unicorn is able to run with a single postgres cluster, so maybe it isn't the rdbms that's going to be the limiting factor on your startup with a hundred users.


Presumably they would then have 100x the cashflow? Enough for a redesign?


But what about when there will be problems with your single point of failure?

Sometimes you'll run into situations where an instance of any software has degraded performance for reasons beyond either your control or understanding. That's one of the situations where horizontal scaling lets you deal with such circumstances better, instead of outright impacting everything and everyone connected to it.

A front end web server container crumbles under the load? No worries, just redirect the traffic to others. A back end API container has problems with the JVM reserved memory and/or the thread pool for processing connections has been filled and all new incoming requests just get queued up? Just have more instances to cope with the load and don't let one instance under load affect all others.

But what are you supposed to do if long running SQL exhausts the resources of your DBMS? What if a background process needs to run some really complicated writes, but your application needs write capability without slowing down?

That's not to say that bad configurations or bad code shouldn't be addressed, but any means of fighting it and minimizing the impact is worthwhile in my eyes.


> A front end web server container crumbles under the load? No worries, just redirect the traffic to others.

You can do basically the same thing with read replicas, promote one of them to primary and shut down the machine causing issues.


Why have the RDBMS as the default? Why assume you need ACID and transactions when they're overwhelmingly likely to bring you nothing but trouble (the number of web applications that make effective use of database-level transactions is approximately zero).

You're far better off starting with a system that does seamless active-active by default, and then figuring out what kind of transactional guarantees you need. An RDBMS might make sense for a few specialised use cases, but most of the time it doesn't.


> Why have the RDBMS as the default?

It's possibly the sanest default for a data persistence layer. I'd argue otherwise: use an RDBMS unless you know what you are doing.

> the number of web applications that make effective use of database-level transactions is approximately zero

Almost every driver and layer (ORMs, etc.) that connects to a database uses transactions, transparently or not. So actually they are using transactions all the time.

> You're far better off starting with a system that does seamless active-active

Think about what active-active means. Either it means that you serialize everything on every active node (pointless), or you need to deal with concurrency and data synchronization. Which in turn means either supporting distributed transactions (e.g. via Paxos or Raft protocols) or doing conflict resolution (which is essentially a nice way of saying "data loss"). Both problems are much, much harder than the (usually harmless) consequences of using transactions on a "classical" RDBMS.

The only drawback of transactions is that writes are limited to scaling on a single node. But if this single node can scale to the numbers I mentioned (and they do, and more), most use cases will likely fit within this envelope.

That is why RDBMSs like Postgres are the best and safest default approach for a data persistence layer.


> Almost every driver and layer (ORMs, etc) that connect to databases use transactions. Transparently or not, but it uses them. So actually they are using transactions all the time.

They're using them but they're not getting any useful effect out of them, since the database-level transactions don't actually align with business-level transactions. Fundamentally you can't actually use database-level transactions in a useful way in a web application because you can't share a transaction between multiple HTTP requests (without introducing much bigger problems).

> Either it means that you serialize everything in every active node (pointless)

Not pointless - you gain redundancy which is the whole point. Performance will be bad but for a low-traffic system it's often actually fine.

> Which in turn means either supporting distributed transactions (e.g. via Paxos or RAFT protocols) or doing conflict resolution (which essentially is a nice way of saying "data loss"). Both problems are much, much harder than the (usually harmless) consequences of using transactions on a "classical" RDBMS.

The consequence of using transactions on a classical RDBMS at scale without thinking carefully about it is deadlocks, which are just as bad as distributed transaction problems, and you still have to do conflict resolution in the case where a transaction gets aborted, which will happen sooner or later (most RDBMS fans just ignore it and lose data). If you want decent performance then eventually you'll have to model your data properly using CRDTs (this is true even in a classical RDBMS) and at that point you gain nothing from traditional RDBMS transactions.

> Therefore being RDBMSs like Postgres the best and safest, the default, approach that should be taken for data persistence layers.

RDBMSes are not safe. The way they achieve their integrity guarantees is by rejecting, and dropping, unexpected writes. This almost always loses data in practice.


> Why assume you need ACID and transactions when they're overwhelmingly likely to bring you nothing but trouble

Maybe this for consumer web applications, or for CRUD apps.

But my experience with B2B and SaaS applications is that sooner or later, I always need transactions (or locks) for something. Maybe it's for the billing code. Maybe it's for batch jobs or workflow logic. And that's when I'm really happy to have a transactional database.

The alternative to having transactions is setting up a Raft or Paxos server to handle distributed locks, and those require a lot more ops effort than a copy of PostgreSQL.


Well if it's not distributed then it's a single point of failure, almost by definition. I hate maintaining a Zookeeper cluster as much as anyone, but anything that runs on a single server is not a real alternative (and if a single-server SPoF is acceptable, then just running the batch job / workflow processor / whatever it is on a single server works just as well - maybe with a pool of workers to scale out the actual processing, but a single coordinator, sort of like how Jenkins or Hadoop do it).


My most common batch/workflow usecases tend to involve a few hundred workers, and tolerance for brief downtime (especially if jobs don't fail outright).

PostgreSQL works surprisingly well for something like this. It has a rich set of coordination and locking primitives, and it can usually be recovered and restarted in under 10 minutes if the server fails. Yes, connection overhead from 300 workers is a real problem, so it doesn't hurt to put a thin layer in front of it.

One thing that an RDBMS buys me is flexible feature set and a solid ecosystem. Want write-ahead logs and automatic point-in-time recovery? It's out there. Need a specific locking primitive? PostgreSQL probably supports it. Etc.
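
As a concrete example of those primitives, the classic FOR UPDATE SKIP LOCKED job-claim pattern lets a few hundred workers pull from a shared jobs table without trampling each other (a sketch, assuming psycopg2 and a hypothetical jobs table with id/payload/status columns):

    import psycopg2

    def claim_next_job(conn):
        """Atomically claim one queued job; rows locked by other workers are skipped."""
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE jobs
                   SET status = 'running'
                 WHERE id = (
                       SELECT id FROM jobs
                        WHERE status = 'queued'
                        ORDER BY id
                          FOR UPDATE SKIP LOCKED
                        LIMIT 1)
                RETURNING id, payload
                """
            )
            job = cur.fetchone()  # None when the queue is empty
        conn.commit()
        return job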

If I were looking at 2,000 workers, yeah, it would be time to set up a Raft server. But a good RDBMS will do the job for a surprisingly long time.


I wouldn't say "flexible" exactly - RDBMSes come with a lot of stuff, but it's all coupled in surprising ways and you have to use it the way the system expects. They remind me of those do-everything application frameworks that come with a built in web server and task queue and distributed transaction framework and what have you, and you can do these amazing demos in 10 lines of code but as soon as you want to customize the behaviour or step slightly off the beaten path you're in for a world of pain.

I do see where you're coming from, but I find they're more trouble than they're worth in the long term. YMMV I guess.


Every app I’ve worked on has used transactions to maintain data integrity. And before you say I could have stored everything in a document, I had to move away from MongoDB because it couldn’t handle the ways I needed to query the data in a performant way.


"Maintain data integrity" sounds good, but what does it actually mean? In practice with an RDBMS it means you drop a write on the floor (or worse, deadlock) if it violates a constraint. And in practice that's almost never good enough, so you end up having to build the same kind of application-level validation logic that you would have written in a non-RDBMS system anyway.


RDBMS and non-RDBMS both have their place; I have used both in the same system several times, each for the things it was good at. Transactions allow you to be confident while making complex changes that, if a failure occurs, all partial changes will be rolled back. Making use of database-level validations and enforcing referential integrity is essential for keeping data consistent over the long term and making data migrations easier. Sure, in trivial applications you can just dump data to a document store and have a validation soup handle the rest, and that too can be implemented cleanly, but knowing when to leverage an RDBMS over other DBMSs is an essential skill for an engineer. Because that is how you build scalable systems. Not by throwing JSON stores at everything and calling it a day.


If scalability is your concern then you can't use any of the supposedly core features of an RDBMS, since fundamentally there is no way to have a transaction across multiple nodes without solving a much bigger problem.

Validation is vital but the datastore is not the place to do it, because handling invalid data by dropping it on the floor is almost never the right behaviour.

There is no substitute for actually understanding your data model, but once you do 99% of the time you'll find using an RDBMS comes with minimal benefits and significant costs.


My experience is the exact opposite: once you understand your data model, 99% of the time you will find that NOT using a RDBMS comes with minimal benefits and significant costs.


It is extremely difficult to design a good RDBMS schema without understanding the data model, and once you do, it is there, documented in its entirety, with best-in-class tooling for anyone else to come along, pick it up, and be up to speed with it. Additionally, you don't have to forgo document storage; most if not all modern RDBMSs support json(b) types.


RDBMS tooling is a long way away from best-in-class, and the data model is extremely awkward in a way that actually distorts your modelling (no collection columns, no sum types...) and there's no real support for keeping track of schema evolution. I agree that recording your schema model explicitly and keeping track of it is very important (using something like Avro's schema registry), but RDBMS tools are not actually that great at it and using an RDBMS brings in a lot of other baggage.


The point is that you can go pretty far on a single cluster (GitLab example). That 99% figure is trivially wrong in that case.


Why would you input data in a way that violates referential integrity? That would be pretty bizarre.

That being said, if your code doesn't catch and handle errors, then there's a couple of deeper problems already.


> Why would you input data in a way that violates referential integrity? That would be pretty bizarre.

Well if you can't get input that violates that integrity then what are you gaining by enforcing that integrity?

> That being said, if your code doesn't catch and handle errors, then there's a couple of deeper problems already.

Sure, but how can you do error handling without data storage, given that your application itself is stateless? Maybe you retry the failure a few times, but ultimately your only option is to drop the data on the floor, which is almost always not what you want.


If you allow invalid records and multiple versions of a schema in a collection, you will forever being paying for that mistake every time you read from the database.


If you're fine with just dropping that invalid data, you can always just delete it at any point in the future. (Yes, that should horrify you - but so should the idea of just erroring and dropping it when someone first tried to write it, which is what an RDBMS would do). The overall effect is the same either way.

Fixing up data that was written wrongly takes work, there's no getting away from that. But if you have a record of the original write attempt then at least you can do that work and recover the data (if you decide the cost/benefit is worth it). Whereas if you have the datastore reject it at write time then you're SOL before you've even started.


> Yes, that should horrify you - but so should the idea of just erroring and dropping it when someone first tried to write it, which is what an RDBMS would do

Have you ever actually encountered any RDBMS that just silently "drops" your write without informing you, or is this just how you think they work?

Otherwise, isn't this just like any other data write erroring out -- MongoDB driver saying "Service down!", simple flat-file write saying "Disk full!", etc -- in that if your app doesn't react to this, it's not the data store but your code that sucks?


> Well if you can't get input that violates that integrity then what are you gaining by enforcing that integrity?

Perhaps I'm wording this badly?

Your application should _not_ be generating data which can't go into the database correctly. If it is, that's a different set of problems. ;)


What are you claiming the purpose of integrity constraints in the database is? If your application doesn't generate data which can't go in the database correctly, what's the observable difference between a database that enforces constraints and a database that doesn't?


The difference is that one of them informs you when you've failed at generating valid data to save, and the other doesn't.


I feel like this misses the point. The reason all the other stuff besides the database is so straightforward is because you can localize all the state in the database so you can deal with the hard parts in one place and with maximum support.

The database doesn't ruin all good ideas. It makes all those ideas for other tiers possible.


The mistake is thinking half the system is doing nothing. It’s a bit like saying I’m paying to store my backups offsite but most of the time it’s not really used so it’s a waste of money. It just doesn’t hold.

Most SQL problems that most of us have to deal with stem from inadequate indexes and/or poorly written queries. No fancy active-active setup will ever solve those issues.


It might be a waste. Think RAID 1 vs RAID 5. There might be plenty of ways we're leaving performance on the table.


Which one of those leaves performance on the table? One is optimized for performance and the other for disk usage.


Maybe it's a bit of a tortured metaphor, but perhaps with continued research we can find a way to use more of the hardware, just like how RAID 5 provides more storage than RAID 1 while still keeping parity.

I don't think it's foolish to keep asking for more. We still might find it.


Databases are the reason all that other shit is so easy. Databases are the heart of computing but somehow this clown industry still doesn't get that at a large scale.

I’ll replicate your clown ass todo list app with a few DB tables, a few queries and 1 HTML table, don’t make me fucking do it :D


Also, developers never design their own database schemas anymore, then when some crappy ORM (like Django’s) generates a tediously slow schema they blame the database.

It’s weird how developers get asked about OO design patterns in interviews, but not once have I ever been asked about database design (beyond some useless stuff about 2nd normal form).


Huh, almost every lead backend job I've interviewed for had a lot of DB design questions, more than normal coding stuff because it's so critical to scaling. It's core to the role even for lower level positions, so it really should be more common.


Clearly you are not a Node.js dev.


Or a Django dev, or a Rails dev, or any other ‘framework developer’.


As a matter of fact, I have done both of those a lot, and I'd argue that knowing how model design interacts with the database is even more important because of what footguns the ORMs can be.

But you are correct, interviews for those types of jobs for whatever reason don't tend to focus as much on DB stuff, though they really really should given that a quarter of your job will end up being unfucking the system away from the poor choices that your predecessors made (if you're guaranteed to be serving 1mm DAU and everyone will be generating thousands of records per month, maaaybe don't use normal ints for uids...).


> end up being unfucking the system away from the poor choices

The number of times I've had to fix UUID4s stored as a PK string in my career... It's bloody infuriating. It's like your typical 'ORM developer' just doesn't understand that a UUID is a 128-bit number; that in Postgres there's an extension for it (which Django doesn't use); how much slower a string lookup is compared to a numeric lookup; and that indexes are physically stored in order, meaning that a UUID4 will be inserted at random points on disk, utterly killing insert performance.
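
For the Django case specifically, the fix is usually a one-liner - a sketch with hypothetical models, where UUIDField maps to a native 16-byte uuid column on Postgres instead of a 36-character string:

    import uuid

    from django.db import models

    def uuid_str():
        return str(uuid.uuid4())

    class Order(models.Model):
        # Native uuid column: compared and indexed as 128 bits.
        id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)

    class LegacyOrder(models.Model):
        # The anti-pattern above: the same value stored and indexed as text.
        id = models.CharField(primary_key=True, max_length=36, default=uuid_str)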

Generally these kinds of performance fuckups happen because you get developers who treat the database as a magical performance box that needs an occasional index. One of the worst systems I've seen was an accounting system that used triggers to update the balance, so you'd insert a ledger entry that would update a single row. Guess what that'd do? It'd block all other inserts until the update was committed. The previous developers kept trying to improve the performance by adding more servers, but it was just burning money because they were inadvertently locking a row (rather than calculating balances on the fly).


I'm a "framework developer" (and primarily a PHP framework developer at that) and I definitely got those kinds of questions the last time I was interviewing, and the company I work for still asks schema design questions.

Designing a good schema is, in my opinion, the single most important thing for creating useful software. I would much rather work on janky spaghetti code that operates on a clean and well-mapped schema than beautiful SOLID, commented, well-documented code that runs against a terrible, fragmented datastore.


The article is talking about Oracle RAC, and the "crossover" he is talking about was usually really quick Infiniband, not a crossover cable (so, a separate NIC for good reason!). The person that wrote the article doesn't seem to understand RAC enough to actually comment properly. A properly tuned RAC instance will scale horizontally very very well.


Former (recovering?) Exadata user here. I'll go as far as to say that configuring a best-practices RAC cluster is easier than configuring a best-practices MySQL cluster. RAC solves some additional problems, like discovery and load balancing, that MySQL kicks out to the user (e.g. you have to manage DNS and run a tool like ProxySQL separately).


If you look back far enough, it wasn’t infiniband but a small separate fibrechannel fabric, and further back again I recall buying specialist dual-ported SCSI drives that were marketed primarily for cluster i.e. quorum-disk purposes. More than one goat was required to make these configurations work.

(I seem to recall that many SAS drives now are dual-ported but designed as such to fulfil a desire for multi-path resiliency within storage arrays; this may have different characteristics to those needed of a quorum device)


I’ve seen Oracle RAC in use in many banks in particular where uptime and resilience are critical requirements. It scales well horizontally, but at a very high cost.


How much redundancy do you really have if both (or all N) systems are within 10km of each other?


Alternatively a private L2 network for a N > 2 RAC cluster.


And this is why Google wrote Spanner. I think cockroachdb tries to solve the same problems.

If you need ACID compliance and you need a lot of it, everywhere, all the time, now there are better options than giant Sun/IBM boxes.

Databases are not the problem.


> If you need ACID compliance and you need a lot of it, everywhere, all the time, now there are better options than giant Sun/IBM boxes.

Oh I dunno, I'd still say most claims to be "better" at data integrity with performance, and a lot of it than a good old zOS mainframe with a soundly designed DB2 database on it are [Citation needed]-class...

But yeah, databases certainly aren't the problem here.


It's still a fairly new thing for people outside of Google. It's been a few years, so I imagine it's better now...but when I tried CockroachDB, it was pretty easy to trigger horrible write latency. Bad enough to be a non-starter.


Which version, and what networking did you have between the cluster nodes?


I don't recall other than it was before 2.x, but I don't think it's terribly important. The cockroachdb folks were very transparent about it, including a couple of major re-works to address performance. Here's a blog post about one of them: https://www.cockroachlabs.com/blog/2-dot-0-perf-strides/

I was just trying to say that a stable, high performance SQL database with distributed writes is a fairly new thing.


The article is confusing (confused?), mixing up different concepts like containerisation and active/active database configuration while not even going into detail but only mentioning concepts in passing by. I don't get the point.


I have been working on rewriting a monolith into individual services during the past year at work, and what we finally implemented for consistency across the different services' databases was a signal queue: whenever we encounter an inconsistency, a doc is pushed containing the info, and a corrective service is always reading from that queue and doing the necessary updates to tables. We made a hierarchy of source-of-truthness and we use that to decide what to do with inconsistent data. Users do sometimes have to see an error message like "We had an error and we are fixing it in XXX time" based on the current queue load, but it has been working fine for us mostly.


… is that less operational overhead than running a single process on a server?


Why was that change made? From my understanding users are constantly seeing outages now where they see an error message and have to wait. Were there more outages before? How was the previous system?


Am I just dense, or was the idea the CEO had just not a good idea and simply garden variety premature optimization?

Also, it seems unfair to say that the database ruins all good ideas when there continues to be significant impressive innovation in the space in cloud services. Like Amazon Aurora for one. If you really want to treat your RDBMS like an elastically scalable (with significant caveats) resource priced by the request, you can.


I think the title implicitly meant “good”, as in “you think it’s a good idea until you press it against the grindstone that is reality”.

Also I think these issues arise at the database because that’s where the write concurrency is pushed.


If you really want or need to go active-active but also don’t want to think about it much and can live with MySQL my preference would probably be AWS Aurora multi-master if you can live with its limitations:

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

As noted there and elsewhere though often you are better off just scaling up.

If you do want to really think about it then NewSQL (Cloud Spanner or others) could be the way to go but the cost, performance, compatibility/portability and other concerns make this a much more complex decision that isn’t easy to capture in a blog post (or HN comment ;).


No it doesn’t. They scale amazingly well if you throw money at the problem. Most people never get there. When you do you will know. I’ve been there. When you’re spending $3 million on hardware and licenses a year you either have a viable business or fucked up badly. That’s the real decider.

The answer is to start siloing customers or application concerns out into separate clusters depending on your operating model.

If you spent the last decade writing shitty cross database SQL then you’ll have to silo by customer. That’s the only real constraint.


I'd phrase it differently: the database is the litmus test of your supposed "good idea". If your good idea doesn't survive referential integrity or even data consistency, your idea doesn't deserve the label "good".

Of course, the next "good idea" is to sacrifice data integrity in the name of performance. That can work, but usually it's just a technical form of growth hacking. Sacrificing data integrity without understanding the problem space is a disaster waiting to happen (but luckily, it may not happen on your watch so you can continue hacking).


Very good points there.

I have directly experienced the problems associated with blindly assuming that you can write your own consistency guarantees in Redis. I didn't do it but I had to write some tooling to rewrite the AOF to get some data back...


I'm sorry, are you saying that a good idea should survive the loss of referential integrity or data consistency?

I don't feel that's what you mean, and it's probably my fault for misreading your comment.


I read it the other way: if abandoning correctness is the only way to make your idea feasible, your idea kinda sucks.


Yeah, I probably could have phrased that better. The sibling comment is correct, I meant that if your idea requires abandoning referential integrity or data consistency, it's probably not a good idea. Since it's the database that actually enforces those constraints, it may seem that the database causes the problem. But in most cases, the problem is the data model or the idea itself.


I see.

I mean, one of my ideas requires many transactions and records that require data consistency. I have no idea how to make sure it can scale to Internet scale. Do you know any good resources? Or is it just "get mysql, pay thousands a month for either a cloud provider or colocated hardware"


I would say that most resources would be a mixed bag in most situations, but much of the advice skips over something very basic that experienced people tend to assume you know: you are always going to be limited by the number of times you can actually sync a write to disk and/or how fast you can read back from the disk, and these are the main upper bounds. One mistake I see inexperienced engineers make in modern environments is not realizing that in their particular cloud instance they are explicitly or implicitly using the cloud vendor's network storage, and they are entirely hamstrung by that. The other is not configuring the database to use all the RAM, but those are usually covered by the existing tuning guides.

The next from a design perspective is to leave natural places to shard/split the data up so you could represent the data consistently on multiple systems. Sometimes it's pretty bonehead things like your database tables related to login are affecting your critical path transactions and/or also tied to the referential integrity of your application data in ways that are hard to split apart. It goes back to the IO thing where if you are reading and writing login or UI traffic you sure as heck are not writing some other critical time sensitive transaction especially if it's to the same disk. The other obvious thing is to be trying to run analytic queries (aka look at many records) on a database tuned for mass INSERT/write traffic and having no plan to do that with a different resource.

If you are still worried about it I would suggest just prototyping it using a currently modern API/RPC server in front of whatever datastore you come up with. As systems have gotten more distributed the straight line speed of connecting to the database directly makes less and less sense in my opinion and having an API server you control allows you to be clever later in ways that might not be obvious today (especially keeping application-aware counters so that you can see what your application is actually doing as query-style logs for databases are extremely expensive in performance). Either way if you sit down and consider the performance characteristics of the underlying system, think about each class of transactions are you considering from your application in a write/read context (hopefully from measured data), and using tools like EXPLAIN you'll quickly start finding what pieces of the puzzle might need to move to start scaling.

The answer also might just be that some parts of your application would be better served by something other than a database, like a message queue, a more straightforward ISAM or key-value store, or, depending on your situation, just flat files written onto the fastest local storage you can get, to be transformed and loaded into a database sometime later.


Thanks for giving me some thoughts to turn over in my head. I have to admit, having been able to support most of my projects on a simple LAMP stack, I may simply be overly worried about scaling. But my hope is always to have a sideproject skyrocket.

Did you read the article a little while back about Discord's challenges with storing tons of messages and how they upgraded their tech stack? Did it make sense to you?


The Discord post made sense because they understood what their workload was and did the type of 'classifying' queries that I previously described. They also decided to move off Mongo when they were hitting the 'real limits' of the underlying system from not being able to fit critical things in RAM. So they had a solid understanding of what was going on and they moved to rectify it. I don't want to presume to know anything about your side project, but there's always going to be a limit you hit in any system, and you just need to build monitoring to understand when you are approaching that cliff.

They also had some clear characteristics of their system for their critical query that they could take advantage of: a new message is written to generally once, the messages table scales linearly, and updates and deletes are relatively rare. Using those things they were able to seek out a datastore that helped maximize the thing that's happening the most which is the write, and they were able to engineer a system accordingly. They also know that the messages would grow linearly without bound at a variable rate, so they could plan for that capacity. To me, that makes a lot of sense.

Especially if it is a side project, you can't over-engineer capacity, but you can at least do the measured math of what actions are actually happening in your system and what your expectations for those queries are. I would suggest that oftentimes you'd want to focus on the critical queries that, if slowed down, would have the biggest net negative, but you'd have to find those. Every user action in a system should be attributable to some combination of reads or writes that can be reasoned about, to try and figure out where there might be problems of contention, linear growth, exponential growth, growth up to a limit for a table, or something else. Either way, once you start classifying some of the critical paths in your system, often either an answer or better questions start to come to the top. Without some of this information you can't back-of-the-envelope whether or not your system can take a perceived load.


I'd echo the sibling comment, but also add that thousands a month isn't that much if you have a profitable idea that is internet scale. Thousands a month is the cost of one software engineer.


Sorry, you are absolutely correct that a profitable idea at internet scale can absorb monthly hosting costs of thousands a month. I was using it as a placeholder for "throw money at hardware/cloud services".

Especially if costs can scale with the userbase, it's a very good problem to have.


> The answer is to start siloing customers or application concerns out into separate clusters depending on your operating model.

This is such an underrated solution (siloing or sharding your data in some way). I think people don't do it because:

1. The tooling doesn't make it super-easy (e.g. good luck sharding Postgres unless you're willing to pay for Citus)

2. "Trendy" companies in the past decade have been network-type products (social networks, etc), where the structure of the data makes it much harder to silo (people need to follow each other, interact with each other's content, etc)

3. We as an industry took a several-year detour over to NoSQL land as a promised solution to scalability.

Life would be a lot easier if you could say something like:

* I want a Postgres node.

* I'm happy to shard my data by some key (customerId, city, etc) and am willing to accept responsibility for thinking through that sharding key.

* My application has some logic that easily knows which DB to read/write from depending on shard key (see the sketch below).

* There's some small amount of "global application data" that might need to live on a single node.
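
A minimal sketch of what that routing logic can look like, assuming a hypothetical DSN per shard and a naive modulo scheme (a directory table or consistent hashing is what you'd want once shards need to move around):

    import psycopg2

    SHARD_DSNS = [
        "dbname=app_shard_0 host=shard0.internal",
        "dbname=app_shard_1 host=shard1.internal",
        "dbname=app_shard_2 host=shard2.internal",
    ]

    def connection_for(customer_id):
        """Route to the shard that owns this customer."""
        return psycopg2.connect(SHARD_DSNS[customer_id % len(SHARD_DSNS)])

    def load_invoices(customer_id):
        with connection_for(customer_id) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, total FROM invoices WHERE customer_id = %s",
                (customer_id,),
            )
            return cur.fetchall()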


Sharding by customer has a lot of benefits for B2B applications also. You can run each customer in a separate instance in their own VPC if you want to, and guarantee to your customers that their data is never commingled with another customer's. This makes a lot of sense if you have a relatively low number of relatively high-value customers.

It radically reduces the infosec/screw up blast radius. For example it means however badly you screw up it is just about impossible to accidentally show data from customer A to a user from customer B.

You can let customers specify region/availability zone and bring their own keys for encryption making lots of enterprise security compliance things easier (for regulated use cases).

There is no one true way of scaling that suits all possible use cases and as engineers, we should think for ourselves and build a solution that works well in our situation, rather than thinking you have to do a certain thing because that's what google/facebook etc do. Their scaling problems are most likely really different to yours.


> The tooling doesn't make it super-easy (e.g. good luck sharding Postgres unless you're willing to pay for Citus)

There’s YugabyteDB.


For your "life would be easier" version, isn't that pretty much what Citus does? And Citus is open source so paying for the managed version is optional.


saying "Citus is open source" doesn't mean its as simple as running `yarn install citus; yarn run citus`. Its is much much more complicated than typical open source. You pay for support for a very good reason.


On paper I’m not dealing with the same problem as the people in this thread, but I can’t help thinking that the root causes are the same.

I have a team that won’t even look at the fact that we measure performance by response time, but we are lumping together write traffic and two classes of read traffic into a single monolithic chunk of code. We are SaaS where our customers write and their customers read. And of course search queries are by far the slowest traffic.

But there’s some sort of mental block about splitting things up that is only slowly changing, and our app is so “flexible” that a load balancer would struggle to tell which urls involve search functionality.

I think people just want to “know where to look” but if your code is on a cluster there already is no “there”. You’re probably looking at some log aggregator anyway, so who gives a tinker’s damn if they come from separate log files on the same box or the same log file?


I just automatically silo by customer because I am so scarred by so much shitty SQL written by ??? that it just makes sense for privacy reasons alone.

Row based permissions are not great imo, either performance is crap or you have some weird bug.


Totally agree there as well.


Databases scale just fine. It's called sharding. Azure has Hyperscale (which is Citus), AWS has Aurora. The actual problem with databases is upgrading to major versions without downtime. All the other problems are just noise.


Agreed. But one impediment to sharding is bad DB design, which is rife right now because everyone starts with Django/Rails/Laravel and starts treating the DB like a document store that happens to have some relational features.


Article would benefit from an example or two of ideas ruined by databases. The examples given all have to do with implementation details, not ideas. Databases actually enable a lot of good ideas, and reliable well-understood implementations.

Pointing out that database servers don’t scale horizontally like web servers has the same level of insight as pointing out that cars don’t scale like kittens.


Can’t be bothered. Aurora in Postgres/MySQL and then just scale that vertically till it stops.


I think that being able to have separate databases for different domains is pretty good, although it does rely partly on the application layer to keep (UU)IDs consistent.

I'd be curious to know if there's a way to have databases talk to each other to just sync up primary keys for referential integrity. That could maximise the benefit of decoupled databases while still having good referential integrity. And a network disconnection would still mean existing PKs would be in place, it just wouldn't be aware of new ones. Not perfect, but not bad, perhaps.



> Or maybe you use sqlite in dev/tests and something bigger in prod.

I had that for a while. SQLite for the test suite and Postgres for actual runs. It was a shitshow, having to support two separate SQL dialects with separate feature sets. Then I realized that you can actually bootstrap and start a Postgres in tmpfs in about one second, so that's what my tests do nowadays.
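
Roughly, the trick is just initdb into a tmpfs directory and start a throwaway instance with fsync off - a sketch assuming the Postgres binaries are on PATH and /dev/shm is a tmpfs (Linux):

    import subprocess
    import tempfile

    def start_test_postgres(port=54321):
        datadir = tempfile.mkdtemp(dir="/dev/shm")
        subprocess.run(["initdb", "-D", datadir, "--no-sync"], check=True)
        subprocess.run(
            ["pg_ctl", "-D", datadir, "-o", f"-F -p {port} -k {datadir}", "start"],
            check=True,
        )
        # Connect via the unix socket in <datadir> on the chosen port;
        # stop with: pg_ctl -D <datadir> stop
        return datadir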


Nice, does MSSQL have something similar? I know there is localDB, but that has its quirks and databases need to be cleaned up manually.


I’m not convinced the author knows what they’re talking about. The answer to the question they posed is basically “ACID”, so unclear what all the verbiage is. Also unclear why they expect all readers to have an architecture featuring multiple application servers all sharing a single database, since that is neither classic monolith nor classic microservices.


I get the same impression. Relational with ACID guarantees can't be implemented in a distributed system. Depending on which guarantees you want to give up, you might be able to get close. A lot of newer platforms give up guarantees to achieve something like this.

But if you have multiple nodes that have to sync every transaction over a dedicated NIC, well, that's not a distributed system. It's just multiprocessing with extra latency.


> Relational with ACID guarantees can't be implemented in a distributed system.

CockroachDB and YugabyteDB would like to have a word.


It's a causality problem, not a technology problem.

Information can't be changed in two places simultaneously. That latency, no matter how small, requires us to either eat the latency, give up guarantees, or alter our system's behavior in some other small way.


Of course. But thanks to Raft per table, the queries are coordinated in such a way that to the observer it looks like ACID.


More people should know about YugabyteDB. Granted though, the article talks about "traditional RDBMS".


just because it's the thing I know best at the moment, it's worth checking out Galera which actually does do multi-master replication. Of course you have to live in MySQL/MariaDB world which is a non-starter for a lot of peeps.


>Each tier is either easy to reason about scaling out horizontally except for the database.

Vitess [1], a database clustering system for horizontal scaling of MySQL, or PlanetScale [2], which is the SaaS version. Of course everything is good on paper until you run into edge cases. But I am convinced that within this decade the scaling problem or hassle will be a thing of the past for 95% of us.

[1] https://vitess.io

[2] https://www.planetscale.com


"I learned a ton about splitbrain mostly through trauma"

The most typical way to learn about splitbrain it seems. I don't think I've experienced more work trauma than split brain trauma.


SQLite is a remarkably good solution to most of these problems, if deployed correctly.

For your main line-of-business database? Of course not. But a deployment of rqlite[0] for your service workers in a read-heavy workload? It cuts a round-trip out of the VM, mocking is trivial, there's a lot to like there.

[0]: https://github.com/rqlite/rqlite


You're missing the whole point of the article. If you set this up in an active-active configuration (if that's even supported) you have the exact same distributed systems problems as the author.


The whole point of having a database is that you need persistent data that is consistent for all clients. You can’t have a bunch of SQLite dbs and get that.


rqlite author here. Happy to answer any questions about it.


I liked the article idea but I feel like it is two sort of related things in one.

I'd have liked to learn more about why it's only one db. For instance because they're often insanely efficient and one server can handle many queries.

I was also hoping that it'd suggest specific alternatives.


The issue with scaling the database is that SQL joins (generally) do not scale under CAP.

So, alas, you either go SQL in the early stages and then need to do considerable engineering to down-convert to say, Cassandra or DynamoDB.

Or you accept reduced database language sugar and complexity up-front (no joins, limited index/views, or architect with explicit sharding) with a more scalable database approach.

There's basically no magic sauce for scalable SQL. Frankly most of the people telling you otherwise are selling varying degrees of snake oil.

As people will point out, SQL databases on modern hardware scale pretty freaking large. So you can get a lot of mileage putting off the "true scalability refactor". The good news, by then you should know your queries and data that need full scaling, you aren't guessing ahead of time.


SQL is just a language. Any techniques which can be used scalably (client-side joins for instance) could also be used by an implementation supporting the SQL language. Perhaps your argument holds for some of the more well-known RDBMSes out there, but I don't think SQL necessarily has to be unscalable in the general case.
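
For example, a client-side hash join is nothing SQL-specific - a sketch, assuming the rows have already been fetched from whichever nodes hold them:

    def client_side_join(users, orders):
        """Join orders to users on user_id entirely in the application,
        the way a scalable SQL executor could do it across nodes."""
        users_by_id = {u["id"]: u for u in users}
        return [
            {**order, "user_email": users_by_id[order["user_id"]]["email"]}
            for order in orders
            if order["user_id"] in users_by_id
        ]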


Let's take the general case of a join: two tables that are keyed/indexed by fundamentally different data.

To distribute data, you have a data distribution scheme. For Cassandra and various distributed hash maps (which is the typical approach), it is a consistent hash function. But even if you do distribution using natural ordering, the same problem exists:

The data you are joining is going to be on different nodes on a row-by-row basis. Hashing will produce this by the design of the hash function. Natural ordering will do this because different key datatypes will order differently.

In the case of large-scale distribution across a LOT of machines (which is what you invariably have to go to once you exhaust the options in big iron), that means a huge amount of network traffic, with each retrieval needing to be resolved for consistency if you want AP. If you rely on CP, then your join is dependent on SO MANY nodes correctly communicating that you become extremely exposed to network partitions, retries, etc.

Thus you either shard your data so all data is on the same machine (but your joins are necessarily subsets of the overall data: only the shard), or you don't and prepare for extremely bad performance, unreliable performance, or approximations of correctness.


But how is that specific to SQL?


To me this boils down to the hard part of concurrency is shared memory.


Isn't that like saying reality ruins all good ideas. :)


> In addition, you have footguns with non-obvious behavior around [Docker] volumes.

Anyone know what these are?


He may be referring to overlayfs which has had its share of bugs along the years.


A dockerized database server is something I cannot understand; I understand bare metal, I can accept virtualized, but I cannot find a good use case for a mid-sized or large server (dozens of gigabytes to dozens of terabytes) being dockerized, and I don't know why a smaller server would be a problem.


A docker image is simply packaging. There’s a small, almost always negligible, performance hit. But the value comes in being able to ship a consistent application and configuration pairing. This becomes really valuable when shipping multiple databases for multiple services.


I know what docker does; I have just never seen database servers deployed en masse with it. One of my teams manages over 250 database servers, but nobody has ever supported the idea of using docker for that (all the servers run on Windows).

Shipping multiple databases for multiple services does not necessarily mean multiple database servers. In the database world I saw large servers with many databases more often than multiple servers with one database each; in the second case it was vendor laziness (I can't give the name), not a reasonable business or technical reason. When I asked about it, the answer was "we'll consolidate in the next release".


Sure, there are a lot of variables here. If you're a Windows shop, I would not expect a push for docker, since it is Linux-based. As for multi-tenant logical databases, that is also highly dependent on workload and failure domains (and significant cost if you're talking Oracle).


For our team, the reason is standardization. We have our entire baremetal infrastructure running Kubernetes. There is no process for running anything in not-a-Docker-container. This does come with some caveats for stateful databases, but it's still easier to manage than a special-snowflake environment for database deployments.


Vitess on k8s gives you some nice options... sure, a little extra application logic, but so what. You can easily break out of Vitess sharding wherever it's needed and you still get the benefit of k8s scheduling.


The writer clearly hasn't heard of MongoDB :P


Aren't there quite a few solutions today?

Fauna, Upstash, DynamoDB, Cosmos DB, Firestore, PlanetScale.


Use CockroachDB.


To solve what problem?


To get replication and horizontal scalability for free (well, within the license fee). But I’d go for YugabyteDB instead of CockroachDB.


CockroachDB is a terrible technology that causes almost guaranteed data corruption due to its lack of ACID guarantees and is written in a language with a GC, which contributes to GC pauses. The dev team refuses to listen to feedback to port their code to C ;(.


Because "just rewrite your program in C to avoid GC pauses" is such a flawed argument when it comes to any production system that it isn't even worth discussing.

The reality is there are limited resources to work on any given project and "rewrite" is generally not the correct way to fix a given problem.

Especially since the first thing you are going to need to do is show that a networked system isn't resilient to GC pauses or that GC pauses are frequent enough to impact throughput (very, very few applications are actually materially impacted by sub-second spikes in latency).


GC pauses are not the only GC issue. Thrashing the caches and blowing up memory usage by an order of magnitude can also be very bad for performance. In a database system, memory is very precious - the more of it you can use for caching / buffering user data, the better the performance.

As for the sub-second spikes in latency, these tend to multiply in a distributed system. If serving a client request takes N internal requests, the likelihood of hitting a GC pause somewhere is much larger than if you did only one local request.

Not sure about Go, but none of Java's "free" GCs guarantees low pauses. There is a stop-the-world fallback even in the most advanced ones like ZGC, so you never know when it will stop for more than 1 s.
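
As a back-of-the-envelope check on that multiplication effect: if a single internal request lands on a node that is mid-pause with probability p, a request fanning out to N nodes hits at least one pause with probability 1 - (1-p)^N. A toy calculation (the value of p is an assumed, made-up figure):

    // Toy calculation: chance a fan-out request hits at least one GC pause.
    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        p := 0.01 // assumed probability that one internal request hits a pause
        for _, n := range []int{1, 10, 50, 100} {
            atLeastOne := 1 - math.Pow(1-p, float64(n))
            fmt.Printf("fan-out %3d -> P(at least one pause) = %.1f%%\n", n, atLeastOne*100)
        }
    }

With those numbers, a 1% per-node chance becomes roughly 40% at a fan-out of 50 - tail latency ends up dominated by the slowest participant.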


Go GC pauses are bounded at 0.5 ms:

https://blog.golang.org/ismmkeynote
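
If you'd rather measure than take the keynote's word for it, Go reports its observed pause durations via runtime/debug.ReadGCStats. A rough sketch (the allocation loop is arbitrary, just enough to trigger a few cycles):

    // Rough sketch: churn some allocations, then print observed GC pauses.
    package main

    import (
        "fmt"
        "runtime/debug"
    )

    func main() {
        var keep [][]byte
        for i := 0; i < 1000000; i++ {
            keep = append(keep, make([]byte, 128))
            if len(keep) > 10000 {
                keep = nil // drop references so the memory becomes garbage
            }
        }

        var stats debug.GCStats
        debug.ReadGCStats(&stats)
        fmt.Println("GC cycles:", stats.NumGC)
        fmt.Println("total pause:", stats.PauseTotal)
        // stats.Pause holds recent pause durations, most recent first.
        for i, p := range stats.Pause {
            if i == 5 {
                break
            }
            fmt.Println("recent pause:", p)
        }
    }

On a current Go runtime the individual pauses printed here are typically well under a millisecond, though (as the sibling comments note) pause time is only one part of the GC cost.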


Is it a hard guarantee, or just a soft goal like in G1, which tries to stay within the target pause time but can fail to in many ways? Can it handle 100+ GB heaps?

Even if it is a hard guarantee, then, from the link you posted, the collector is not even generational, so it will scan the whole heap quite frequently, and that is going to influence average performance quite visibly - you definitely don't want a database system touching all of its cached memory every so often.

Database systems are really all about memory and I/O management. You shouldn't outsource those core features to a universal algorithm, unless you wish to forgo any competitive advantage (at least in the performance department). So this pushes the devs into off-heap manual memory-management territory, where dragons live (at least in Java, again - maybe Go is better in that regard). I've been there, and I don't recommend it.


> Because "just rewrite your program in C to avoid GC pauses" is such a flawed argument

False. All popular databases are written in C, and continue to be because of the GC issue. I would not use a general-purpose database written in Java because of GC, for example. We'll see how well Go works in practice.

> sub second spikes in latency

Go is supposed to have sub-second GC pause latency, but understand that most SQL queries are sub-millisecond, so GC latency is still a significant issue compared to query times.

Go might be acceptable now for niche databases like column-store for certain use cases, though.

Also, see the excellent comment above about distributed systems and server cache issues. You can't do application performance analysis with GC literally everywhere.

The puerile knee-jerk hatred for C on HN has to stop - almost every piece of software you use is written in C, from scripting languages to operating systems to web servers to databases.

Source: DBA who's worked with current databases, as well as a custom database written in Java with significant (i.e. brutal) GC problems that required a total refactor (rewrite) to "work" at all.


I'd like to reply specifically to the "Go is supposed to be sub-second GC pause latency" part - Go's GC is crazy fast; it pauses for only microseconds, comparable to the socket transport overhead. It can't be thought of as Java-like at all.



The fact that database engines written in Java exhibit high latency is not a reason to dismiss all other programming languages except C and C++. It still makes sense to consider modern languages like Rust, Go, Zig, etc.


I've run production systems on H2 that ran rings around the same company's dedicated-DBA systems.


This is daft. CockroachDB exhibits extremely strong isolation and consistency levels. I believe it is strictly serializable under most circumstances?

Also - anyone who says that X database must be rewritten "because GC" is just making an incredibly un-nuanced argument about a nuanced problem. People have built production-ready databases in both Java and Go. If you care about low/predictable tail latencies then you have a bunch of other more important problems to solve before you worry about the behaviour of a modern garbage collector. For example: how good is your cache hit ratio? How are your synchronous replication protocols affected by grey failures? That kind of thing.


I believe it only guarantees serializable isolation. You may get strict serializability, but it doesn't appear that you can know for sure whether any transactions were not linearizable.


> its lack of ACID guarantees

Our transactions implementation is our crown jewel. You might want to check your sources.


Weird, nobody mentioned NuoDB.


At a glance, they don’t publish a price list. There might be a 500 GB trial version behind “You are not authorized to access this page.”


Even without using NewSQL databases, it is often very easy to structure your application for application-level sharding.

For example, for https://corridorchat.com/, we have a relatively small number of business accounts (tenants), but with many users per tenant. And new tenants are created relatively infrequently in the scheme of things.

So I have an architecture with a central 'corridorchat central' PostgreSQL database and a scalable number of shard clusters, all managed with Patroni + Consul, with fallbacks that are read-only until they need to be promoted. Consul DNS allows the application to look up either a read-only replica or a writable one.

To know what shard a tenant is hosted on, it is necessary to read from the central database. This requires one of the read replicas - and I can create as many of these as I need. Many transactions then require writing to the shard database for that tenant - but since I balance tenants between shards (and have several shard databases per cluster to allow for future scaling if a shard becomes too hot) I can add more shard clusters as needed. Writing to the central database is constrained, but it is a very rare operation, so there is no expected scaling problem there.
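
For the curious, a stripped-down sketch of that routing step in Go. All table, column, and DSN names here are invented stand-ins (the real lookups go through Patroni/Consul as described): read the tenant's shard from a replica of the central database, then send the write to that shard's leader.

    // Minimal sketch of tenant-to-shard routing; names and DSNs are invented.
    package main

    import (
        "database/sql"
        "fmt"

        _ "github.com/lib/pq"
    )

    // shardFor looks up which shard a tenant lives on, using a read replica
    // of the central database.
    func shardFor(central *sql.DB, tenantID string) (string, error) {
        var shard string
        err := central.QueryRow(
            `SELECT shard_name FROM tenants WHERE tenant_id = $1`, tenantID).Scan(&shard)
        return shard, err
    }

    func main() {
        // Read-only replica of the central database (resolved via Consul DNS in practice).
        central, err := sql.Open("postgres", "postgres://central-replica.service.consul/central")
        if err != nil {
            panic(err)
        }
        defer central.Close()

        shard, err := shardFor(central, "tenant-123")
        if err != nil {
            panic(err)
        }

        // Connect to the writable leader of that shard and do the tenant-scoped write there.
        shardDB, err := sql.Open("postgres", fmt.Sprintf("postgres://%s-rw.service.consul/app", shard))
        if err != nil {
            panic(err)
        }
        defer shardDB.Close()

        _, err = shardDB.Exec(`INSERT INTO messages (tenant_id, body) VALUES ($1, $2)`,
            "tenant-123", "hello")
        if err != nil {
            panic(err)
        }
    }

In practice you would cache the tenant-to-shard mapping and keep a connection pool per shard rather than opening connections per request, but the lookup-then-route shape is the same.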


This is great until you have a tenant too large for a single shard. It's pretty difficult to perfectly shard relational data. At some point you'll run into a need for cross-shard queries.

I think for most workloads, this approach should work for a good long while.



