If All You Have Is a Database, Everything Looks Like a Nail (pathelland.substack.com)
178 points by kiyanwang on Dec 8, 2020 | 154 comments



I think

> "Isolate applications. Keep different applications away from other applications’ tables."

is only good advice if the tables are application-specific data and you don't do microservices in that stupid braindead way that makes everything from the admin panel to data visualization its own "application" with its own database, where doing things that would be even the simplest of queries becomes a project in writing what are effectively bad performance joins via random http APIs. I.e., have a data model and understand where the most painless boundaries are; don't throw up dozens of DBs for the hell of it.


> doing things that would be even the simplest of queries becomes a project in writing what are effectively bad performance joins via random http APIs

In some ways, is GraphQL trying to slap a fresh coat of paint on what is otherwise a bad problem under the hood? At my employer, we're trying to adopt it, but we've never addressed the underlying performance issues of the microservices, and now the slowness is quite apparent with the GraphQL wrapper tier.

This isn't a diss at GraphQL. The queries are awesome to write. The developer experience of writing, and being the client of, a GraphQL server is excellent. But I suspect I'm not alone in a journey that is playing out as a mediocre implementation because we are avoiding the core problem.


If you find yourself doing in-code joins across microservices, you need to stop and think:

Are my services correctly scoped? Am I doing these joins because there is maybe a higher level domain that goes across these two services?

Should I re-scope my services? Or should I build an Aggregator-View service that holds a JOINED copy of the data?

These aggregator-view services often make sense as Backend-For-Frontends. They must be restricted to a read-only API since they mirror data from other microservices.
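For illustration, a minimal sketch of such a read-only aggregator (SQLite and the event/field names here are invented for the example; the upstream services remain the systems of record):

    import sqlite3

    # Local, denormalized read model owned by the aggregator.
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE order_view (
        order_id TEXT PRIMARY KEY, customer_id TEXT,
        customer_name TEXT, total_cents INTEGER)""")

    def on_order_created(event):
        # Mirror only the fields this view needs from the orders service.
        db.execute("INSERT OR REPLACE INTO order_view VALUES (?, ?, ?, ?)",
                   (event["order_id"], event["customer_id"],
                    event.get("customer_name", ""), event["total_cents"]))

    def on_customer_updated(event):
        # Keep the mirrored copy in sync with the customers service.
        db.execute("UPDATE order_view SET customer_name = ? WHERE customer_id = ?",
                   (event["name"], event["customer_id"]))

    def get_orders_for_customer(customer_id):
        # The only public surface: read-only queries over the joined copy.
        return db.execute("SELECT * FROM order_view WHERE customer_id = ?",
                          (customer_id,)).fetchall()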

GraphQL is a dead end when you are doing joins in code.


What you are describing is, in my opinion, the exact problem happening with almost all microservices. Building an aggregator microservice to join together two services just adds another microservice to an already complex microservice landscape...

People on HN seem to hate monoliths, but a monolith with just a few pieces outsourced to microservices is much easier to handle: no joining microservice, no race conditions when trying to do transactional processing across several microservices, and none of the other problems of microservices.


I agree that a well-maintained monolith is simpler and more stable than microservices.

The emphasis is on well-maintained. If you’ve ever had team A write a crazy query and kill the database for team B, you’ll start to appreciate the damage containment and encapsulation that microservices provide.

EDIT: Of course there might be other, better solutions to this problem.


With “kill the database” I mean things such as transaction deadlocks.


>If you find yourself doing in-code joins across microservices, you need to stop and think:

This is a place where far too few people stop to think about what they're doing. I've seen too many cases where an operation requires coordinating requests across multiple service methods THAT ARE ONLY USED IN THAT PARTICULAR OPERATION! The devs spend all of their time a) coordinating things, b) trying to make it performant, and c) realizing none of the gains of having microservices to begin with.


It does beg the question, if the developer authors a GraphQL server to do joins, why didn't the application just do that in the first place? Is it always a symptom of a dead piece of engineering?


The alternative is an explosion of REST API routes as each application has different requirements. It is much simpler to have one GraphQL endpoint that is flexible enough for all consumers.


I've seen the explosion of REST endpoints you mention, several times. But isn't GraphQL just adding a thin veneer over that, giving the appearance of a single URL, but actually kind of containing all those REST routes under the hood? If you have an explosion of REST routes, you'll likely end up with an explosion of GraphQL queries too.


Yup, pretty much. However, the implications of that veneer can be significant.

Providing a single unified interface which provides granular, unified querying of disparate types has several benefits. In particular, it can simplify logic in the consuming application, and reduce network load (both in terms of the number of requests, and especially in the amount of data returned).

For example, a pet website backed by a traditional REST architecture may have separate endpoints for, say, `/pets`, `/owners`, and `/purchases`. In this context, the front-end may need to make calls to all three of these endpoints, retrieving the full payload for each - only to discard most of the fields and keep the 2 or 3 relevant ones from each entity.

By comparison, a GraphQL-based approach would allow a single consolidated query for just those specific fields (from all entities).
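Roughly, the difference looks like this (endpoint URL and field names are made up; this assumes the `requests` package):

    import requests

    # One request to one endpoint, asking for only the fields the page needs,
    # instead of three full REST payloads that mostly get thrown away.
    query = """
    {
      pets { name species owner { name } purchases { date totalCents } }
    }
    """
    resp = requests.post("https://api.example.com/graphql", json={"query": query})
    pets = resp.json()["data"]["pets"]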

Of course this isn't relevant in every use case, and there's no silver bullet - and in many cases, a REST based approach may well be better.


No.

It contains only a subset of the REST routes and allows combining them efficiently. That is what avoids the explosion. You don't have to add any new "REST route" equivalent in GraphQL if some user just wants a new combination of data but doesn't want to make 10 requests to get it.


That's the theory. But I've only seen it work in practice if there is a single backend data source, and that data source is a database. In which case, we've been able to achieve similar results using OData for several years.

Often in reality, you end up with specialised GraphQL queries for performance reasons, because they are reading and combining data from a plethora of backend data sources - cloud-based APIs, on-premise APIs, external APIs, databases...


That's the theory and it can very well be practice. It has worked very well for the teams where I used it or have seen it being used. And I'm talking about a service that provides an API to a frontend (or other backend services) and aggregating data from multiple datasources of different kinds into one graphQL API.

> Often in reality, you end up with specialised GraphQL queries for performance reasons

Ah, is that so? And with REST you don't?

You seem to have _a lot_ of experience with GraphQL. I wonder why your previous post has the form of a question. Or was that just a sneaky way of giving your opinion without backing it up?


> Ah, is that so? And with REST you don't?

No, that's exactly what you end up with using REST; I'm pointing out that, at least with deployments I've seen, GraphQL ends up with the same problems, only hidden behind a single URL.

> You seem to have _a lot_ of experience with GraphQL

I didn't (and wouldn't) say that I have a lot of experience with GraphQL (I have far more with REST/HTTP-based APIs), but I've seen it used enough times to know it's not a panacea, and doesn't always solve the expected problems, especially when it's fronting dozens of (sometimes painfully slow) backend data sources.

> I wonder why your previous post has the form of a question

It doesn't? I was proffering an opinion, based on my observations when GraphQL is fronting many data sources.

> Or was that just a sneaky way of giving your opinion without backing it up?

I would politely and respectfully point you towards HN's comment guidelines[0]

[0] https://news.ycombinator.com/newsguidelines.html#comments


> I'm pointing out that, at least with deployments I've seen, GraphQL ends up with the same problems, only hidden behind a single URL.

Cool, let me give you a mini example, you'll immediately understand it.

Two endpoints:

1. users, which contains video IDs and can efficiently be queried by user ID (indexed in the database)

2. videos, which contains video data and can efficiently be queried by video ID (indexed in the database)

How do I get all uploads of a user? I get the user by id (request 1) then all his videos by making one request per video (n requests). No big problem for the database (it's all indexed) and sure, we can improve the performance, but for now the bottleneck here is the number of requests that goes over the network if we use REST. This doesn't work. You need to make a new endpoint or change an endpoint to make this work in a performant way.

In GraphQL we have one endpoint with two different queries. So the same problem can happen if someone queries naively. But, they can also write one nested query that says "give me user for id X and for each video id, give me the video data". It will be one request from frontend to backend, but still multiple selects to the database, unless we use some "smart" GraphQL framework that does magic for us.
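For concreteness, that nested query might look roughly like this (schema and field names invented for the example):

    # One round trip instead of 1 + n; the server resolves the nesting.
    query = """
    {
      user(id: "X") {
        name
        videos { id title durationSeconds }
      }
    }
    """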

But we already solved a big part of the problem. And maybe that is already performant enough for what we need, and we don't need to improve the database query part. Lots of time and code saved on the backend and frontend. Yay.

You might say "but we didn't have indexes". Then my answer is: well, you are no worse off with GraphQL, and you still gain the benefits for all the cases where you do have indexes in place. If you have none of these, you seem to be doing something wrong.

> It doesn't? I was proffering an opinion, based on my observations when GraphQL is fronting many data sources.

Yeah, seems you edited your post. When I read it, there was 100% a question-mark in it. ;)


> How do I get all uploads of a user? I get the user by id (request 1) then all his videos by making one request per video (n requests). No big problem for the database (it's all indexed) and sure, we can improve the performance, but for now the bottleneck here is the number of requests that goes over the network if we use REST. This doesn't work. You need to make a new endpoint or change an endpoint to make this work in a performant way.

The answer to this problem is resource expansion. Dylan Beattie actually has a nice and somewhat humorous presentation about how to address this with REST interfaces: https://youtu.be/g8E1B7rTZBI

Here's the part that addresses such data access in particular with a social media site example (people who have photos, which have comments etc.): https://youtu.be/g8E1B7rTZBI?t=1807
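In its simplest form that's just query parameters on the existing resource; something like this (the `include`/`fields` parameter names are a common convention, not a standard - purely illustrative):

    import requests

    # One REST call that expands the nested resource and trims the payload.
    resp = requests.get(
        "https://api.example.com/users/12",
        params={"include": "videos", "fields": "id,name,videos(id,title)"},
    )
    user_with_videos = resp.json()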

Sadly, approaches like that and even HATEOAS are not awfully widespread, mostly because of the added complexity and the support by frameworks simply not being quite there.


Yeah, sure. You can use resource expansion and also field targeting (to only get the fields you need), add some type system on top of it, make sub-resource expansion look a bit nicer by... I don't know, creating a small DSL for it. And then standardize documentation and tooling around it.

Voila, now you have your own implementation of GraphQL that might even be better than the original one, who knows. Only one way to find out!

Until then I will continue to use GraphQL wherever it works well. :)


> In GraphQL we have one endpoint with two different queries. So the same problem can happen if someone queries naively. But, they can also write one nested query that says "give me user for id X and for each video id, give me the video data". It will be one request from frontend to backend, but still multiple selects to the database, unless we use some "smart" GraphQL framework that does magic for us.

In reality, you'd more likely have an endpoint like `/videos/user/12` or `/users/12/videos`, which would provide all videos for a user, but we'll put that aside for the sake of an example.

Hmm, so I think you're saying that there would still be separate queries, but you're "batching" them, such that client sends multiple queries simultaneously to the backend, and then the backend is responsible for executing them? With HTTP/2 (which supports request multiplexing), you get a similar result using REST endpoints.

> Yeah, seems you edited your post. When I read it, there was 100% a question-mark in it. ;)

I'm not entirely sure if you're kidding, but I assure you I didn't edit my post.


> In reality, you'd more likely have an endpoint like `/videos/user/12` or `/users/12/videos`

Sure and then you do the same with images and friends. And all of those have tags. Friends are also users so they also have videos and images... which have tags. Friends can also have friends of course... it can even be circular!

I wanted to keep my example simple, but here you go...

> Hmm, so I think you're saying that there would still be separate queries, but you're "batching" them, such that client sends multiple queries simultaneously to the backend

Sorry, I used the same term for two different things. Nothing happens in parallel. The frontend sends one request (not one query...) to the backend, which contains two GraphQL queries (one of them a nested query). It is all one HTTP request and one response. No batching here; you can send this request using curl or Postman.

HTTP/2 does not help here, because the second query is dependent on the first one (hence a subquery). Because you need to know the video ids first. Sometimes HTTP/2 helps, but not in this case.

> I'm not entirely sure if you're kidding, but I assure you I didn't edit my post.

Not kidding at all - maybe I mixed it up?


I've worked with quite a few REST APIs where you can send a GET request along the lines of "GET /users, include: videos".


It's essentially a facade pattern[0], which in itself can be a good thing.

For example, if all consumers move to GraphQL and away from interacting with all the separate microservice REST endpoints individually, you then have the option of refactoring your architecture (merging/adding microservices, for example) without worrying about disrupting consumers, so long as you ensure that the GraphQL queries still work.

[0]: https://en.wikipedia.org/wiki/Facade_pattern


Look into the patterns of CQRS, event sourcing, flow based programming and materialized views. GraphQL is an interface layer, but you still have to solve for the layer below.

API composition only works when the network boundary and services are performance compatible to federate queries. The patterns above can be used to work around the performance concern at a cost of system complexity.


> API composition only works when the network boundary and services are performance compatible to federate queries.

Isn't that the whole point?

The main reason individual responsibilities are peeled out of a monolithic service into independent services is to be able to offload those requests to services that can be scaled independently.


> even the simplest of queries becomes a project in writing what are effectively bad performance joins via random http APIs

Don't forget the part where the queries are impossible to test, because you can't spin up real instances of all 15 APIs in a test environment, so all the HTTP calls are mocked and the responses are meaningless!


> Don't forget the part where the queries are impossible to test, because you can't spin up real instances of all 15 APIs in a test environment,

Why? What's stopping you?

> so all the HTTP calls are mocked and the responses are meaningless!

Why are you talking about mocks in the context of integration tests?


You're the one who said "integration test".

A single DB query is normally something that _can_ be trivially unit-tested. But if the DB query is split across half a dozen complex services, suddenly it can _only_ be integration tested. Which usually means slow iteration and no local testing.


> You're the one who said "integration test".

Yes, I was. And that's the problem, because it seems you didn't realize that tests that involve checking how 15 APIs interact are indeed tests that check how different services are integrated.

Do you disagree with anything?

> A single DB query is normally something that _can_ be trivially unit-tested.

Well, not really. You unit test the call to the repository, not the call to the DB. Calls to the DB are tested in integration tests. Think about it.

> But if the DB query is split across half a dozen complex services,

Aren't you trying to tests how half a dozen complex services integrate?

I mean, take a step back and read your own description of the problem. As soon as you talk about interacting "complex services", that's already a clear tell that you're way beyond unit-testing territory.

> Which usually means slow iteration and no local testing.

No, not really. It still means you have to add unit tests that cover how your individual clients and repositories and business logic works under the expected scenarios of a distributed system, and after you pass all of those you check how your service interacts with other services in integration tests.


> Calls to the DB are tested in integration tests

This might be our disagreement -- this is IMO outdated terminology, and in my experience not modern practice. It's super easy nowadays to swap in a local lightweight database like SQLite, which is 99.9% compatible with your production database, and use it for even tiny unit tests. Validating any single database call would definitely be a unit test in any of the codebases I've worked with.
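For reference, the kind of swap I mean is roughly this much code (a minimal sketch using the standard-library sqlite3 module; `fetch_active_users` is a made-up function under test):

    import sqlite3

    def fetch_active_users(conn):
        # The "unit" under test: a single query against whatever connection it's given.
        return conn.execute("SELECT name FROM users WHERE active = 1").fetchall()

    def test_fetch_active_users():
        conn = sqlite3.connect(":memory:")  # stands in for the production database
        conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
        conn.executemany("INSERT INTO users VALUES (?, ?)",
                         [("alice", 1), ("bob", 0)])
        assert fetch_active_users(conn) == [("alice",)]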

"Integration" tests in my mind cover cross-service interfaces and infrastructure which is a PITA to access locally. For example, interacting AWS infrastructure whose behavior you can't reliably mock locally.


If you're spinning up a database, however lightweight, I'd still call it an integration test.

I think part of the terminology problem is that a lot of apps and services are basically code for turning an HTTP request into an SQL request, and then some more code for turning a SQL response into an HTTP response. That means there's very little in the way of discrete logic to unit test without a bunch of mocking. In a system like that, I think the only meaningful tests are integration tests.

For people who only work on that kind of glue code, it's easy to forget that testable units really do occur in other kinds of coding. But I promise they do.


> this is IMO outdated terminology, and in my experience not modern practice. It's super easy nowadays to swap in a local lightweight database like sqlite which is 99.9% compatible with your production database, and use this for even tiny unit tests.

I actually see it as more "modern" to use an actual database.

I've been around long enough to have seen trends go from databases -> mocks -> fakes -> back to databases again!

Mocking is fine for some cases, as are fakes (e.g. the lightweight SQLite wrapper you mention), but for anything non-trivial, such solutions often have "blind spots" - such as foreign keys or unsupported data types. Around the time Docker was becoming a thing, I (and others) were growing tired of seeing issues related to these limitations.

At some point, it pays to have integration tests that just hit a real database. And with Docker, you can do that trivially - it takes only a few seconds to spin up and populate a fresh Postgres instance. I haven't looked back since.


https://www.testcontainers.org/modules/databases/ is brilliant for this in the JVM world.
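There's a Python port of the same idea too; a minimal sketch, assuming the testcontainers package and a local Docker daemon:

    from testcontainers.postgres import PostgresContainer
    import sqlalchemy

    # Spins up a throwaway Postgres in Docker, runs the test, tears it down.
    with PostgresContainer("postgres:13") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.connect() as conn:
            conn.execute(sqlalchemy.text("CREATE TABLE users (name TEXT)"))
            conn.execute(sqlalchemy.text("INSERT INTO users VALUES ('alice')"))
            rows = conn.execute(sqlalchemy.text("SELECT name FROM users")).fetchall()
            assert rows[0][0] == "alice"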


There's a fair amount of truth to this. Many applications are glorified SQL generators that effectively JIT-compile the application logic to SQL. So if you want to test the application logic, then you need to run it on the runtime it has been compiled for, that is: the database.

Testing the application code in isolation would be like trying to test that your compiled binary contains the machine code instructions you expect, rather than testing that it actually runs correctly. There are niche times when that's useful, but most of the time, not really.


>>This might be our disagreement -- this is IMO outdated terminology,

It really isn't. The concept of integration tests, and the test pyramid, is not only standard practice but also a basic concept that somehow some developers struggle to understand and use.

> and in my experience not modern practice.

That could only start to make sense if modern practice consisted of screwing up your test plans and failing to do any form of testing of how different components integrate.

I assure you that that is not the standard practice in any competent shop, big or small.

> It's super easy nowadays to swap in a local lightweight database like sqlite which is 99.9% compatible with your production database, and use this for even tiny unit tests.

You're just stating that it's super easy to screw up badly by not knowing what you are doing. And I agree.

Take a step back and read what you're saying. You're replacing an external service accessed over a network by multiple clients with an internal service running in the same process that only your app accesses, generating entirely different databases and SQL, seeding the database with little to no data, and then running a few isolated tests.

And somehow this makes you believe you're testing how your system integrates?

I mean, even if you use frameworks with support for code-first persistence models, you should be aware that different databases are generated for different targets. Take for example ASP.NET Core's Entity Framework Core, which notoriously fails to impose constraints with its SQLite driver.

How, pray tell, do you believe your tests are meaningful if you're not even testing a system with the same topology, let alone the system you expect to put in production?

This is by no means "modern". This is outright wrong on so many levels, including the mental model of how the system operates.

If you're really interested in best practices and how to do an adequate job testing your stuff, you should be aware that every single test beyond unit tests is supposed to be performed on the application intended to be deployed to production. The application is built and packaged after the unit tests pass, and that same package passes through a whole obstacle course until it reaches production. It makes sense if you start to think about it. Isn't the whole point of testing to ensure that the app you want to run in production, and not a variant of it or a different build, runs as expected?


> don't do microservices in that stupid braindead way

Best advice I've read on HN this year.


If you need microservice architecture, it's likely because either the size or the rate of change of your data prevent you from using a single RDBMS with synchronous transactions. So your microservices talk to something like Cassandra.

You might want some of the advantages of microservices without yet reaching this scale. Then you likely have most microservices stateless, and can admit you have a single database where you can do joins (but not updates) across microservice borders.


CQRS


There's a model of software based around shipping events around, with subscriptions between systems. The separation serves at least a couple of important purposes, as you perhaps know. Each system has a DB, often embedded, suited to its needs and materialized from its subscriptions plus its own data; mutated predictably.


Transmitting events is fine, I like transmitting and receiving events. But if you end up having a bunch of DBs with a lot of overlap I don't know how much you're gaining as it pretty much turns into another permutation of DB sharding.

Of course microservices done wrong doesn't mean event-based systems done right are bad.


> But if you end up having a bunch of DBs with a lot of overlap I don't know how much you're gaining

You're gaining component isolation and separation of concerns, and the ability to scale services horizontally without bothering about shared state.

Also, you gain the ability to develop each system without having any constraint imposed by third parties.

In other words, you're making your service not only simpler but also more robust, easier to operate and maintain.


Each individual microservice becomes simpler, but the whole system becomes much more complex than an equivalent monolith with a single DB.

Microservices have their advantages, but simplicity and testability aren't part of them.


> Each individual microservice becomes simpler, but the whole system becomes much more complex than an equivalent monolith with a single DB.

Indeed a distributed system forces developers to think through how dependencies communicate, which otherwise would be taken for granted.

However, that misses the whole point of microservices. Services are peeled off monoliths because the monolith no longer meets operational requirements, which is trivial to provide with a distributed system. I'm referring to aspects such as reliability, or ceasing to be able to scale vertically.

In addition, some people simply mindlessly develop monoliths, not realizing they already have a distributed system once we account for the fact that using managed services, or even just running a database, already opens that door wide.

> Microservices have their advantages, but simplicity and testability aren't part of them.

Not really. Contrary to your claim, microservices make systems easier to test. The only thing that changes is that you're no longer able to drop a breakpoint in a request and track it all the way through to the database. However, you get cleanly separated components with very crisp boundaries that make it trivial to test interactions.

Perhaps the main cultural clash added by moving from a monolith to a distributed system is that developers start to need to think things through when they put up an interface.


> is only good advice if the tables are application specific data

Isn't that the whole point?

> and you don't do microservices in that stupid braindead way that (...)

I feel that this cynical jab is not only a flimsy straw man but also adds absolutely nothing to the discussion.

Independent services need to have their own independent database. Simple. Each service has its bounded context, persists it in its database, and just runs. It doesn't matter what data the bounded context holds or what your opinions on microservices are.


A lot of posters here seem to have been deeply burned from microservices designed along the wrong lines. I mean, sure, it happens. You're going to make mistakes just like you can misjudge how to separate concerns in a set of classes. It shouldn't be an issue to fix it.

Maybe some teams focus on pure separation before they have a solid design? Maybe it's just a culture of feeling like service boundaries can never change?

I don't really get it either.


I think what we're seeing here is Conway's law in action. Most places I've worked don't define their bounded contexts based on separation of concerns, but rather on separation of reporting.

That is: team A has been given arbitrary stewardship of generic app functions T, U, V based on management purview. Team B has been given stewardship of generic functions W, X, Y, Z. So the bounded context is along those lines. But wait, both V and W are related to billing, you say? No matter!

Since the cliché is you can't solve people problems with technology, the same applies to micro-services: it's much harder to convince managers to shift accountability than it is to move the code from one micro-service to another. Even if you do move the code, you get to have your shit and eat it, too.


> I think what we're seeing here is Conway's law in action. Most places I've worked don't define their bounded contexts based on separation of concerns, but rather on separation of reporting.

You're not wrong. I mean, one of the goals of microservices is to peel off responsibilities to a service that can be owned and managed by a small team.

However, what I've been seeing in these anti-microservices rants is developers struggling to acknowledge the need to reason with loosely coupled interfaces, to reason about partial-system failure modes that were always there in monoliths but were either ignored or turned into critical system failures, and to accept the microservices tax.


I think some of the problems I've encountered are related to teams trying to do microservices before each service can actually have a small team.


That pretty much sounds like my workplace in action.


I like the idea the author mentions of message queues in the database. It also sounds like it would be easier to separate if need be. I also like the idea of keeping everything in a single database as long as possible. Even if it turns out not to scale, it is a lot easier to figure out your business logic in that environment. Anything which avoids microservice madness and Kubernetes keenness as long as possible is very attractive to me.
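The usual shape of that pattern on Postgres is a jobs table plus SELECT ... FOR UPDATE SKIP LOCKED, so competing workers never grab the same row. A minimal sketch (table and column names invented, assuming psycopg2):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection string

    def claim_next_job():
        # Atomically claim one pending job; SKIP LOCKED makes other workers
        # pass over rows already being processed in another transaction.
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            """)
            row = cur.fetchone()
            if row is None:
                return None
            # The worker marks it done (or failed) after actually processing it.
            cur.execute("UPDATE jobs SET status = 'running' WHERE id = %s", (row[0],))
            return row[1]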

Maybe slightly contrary to the spirit of the article, but a few days ago there was talk about implementing the backend in your database here: https://news.ycombinator.com/item?id=25290339. Someone mentioned how Cloudflare was initially implemented on a single Postgres database: https://news.ycombinator.com/item?id=22883548


There is definitely merit in your comment - basically, solve your problem first - but in practice, extracting functionality (such as a message queue), especially if the system is older, is very difficult and expensive, so it gets put off and becomes even more expensive.

I've seen a few projects where instead of committing to a large re-architecting of an existing application, or extracting specific components, they went for a full rewrite. I don't think they were very successful (maybe for the managers that pushed for a new department to be cobbled together from consultants and self-employed folk), because the problem was not well-understood and they over-engineered towards e.g. microservices and/or everything into an event bus before understanding the problem and the consequences of their decisions.

At one project, I'm convinced the decision to rewrite was political (the existing application was a C#/.NET application, the new manager Decided to have the new stuff made in Java on AWS. Oh and for every IT system to be migrated to AWS as well. Then the army of consultants and self-employed folk Decided to go for a microservices architecture linked together on an event bus (AWS Kinesis, for some reason), spent two years developing only to realize that hey, having a separate 'order' and 'inventory' microservice is kinda tricky when you actually have to check inventory to make an order. [/rant over])


>At one project, I'm convinced the decision to rewrite was political (the existing application was a C#/.NET application, the new manager Decided to have the new stuff made in Java on AWS.

When my startup was acquired, they eventually landed on rewriting our C#, AWS-hosted software so it would be in Java and run on in-house data centers.


On the grand scale of things, I don't have that many years of work experience under my belt. What do you think would be the optimal evolution of a company from day 1 in this context? Obviously it's all speculation, but I would be interested to know how you think the message queue on database idea would go if implemented from the start.


If you do only one thing then following the "functional core, imperative shell" idea of application development will set you up nicely for easy refactors in the future.

https://www.destroyallsoftware.com/screencasts/catalog/funct...

In short. Keep mutations to the edges of your application and keep your business logic purely functional.

Day 1 may be too soon to introduce queues etc, but having a clean separation between application logic and attached resources (like your storage layer) will greatly help with maintainability, testability and extensibility from the get go
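A tiny sketch of the shape (names invented; `db` stands for whatever repository/adapter sits at the edge): the pure core decides, the shell does all the I/O.

    # Functional core: pure, trivially unit-testable, no I/O.
    def price_with_discount(order_total_cents, loyalty_years):
        discount = 0.1 if loyalty_years >= 2 else 0.0
        return int(order_total_cents * (1 - discount))

    # Imperative shell: talks to the database/queue/HTTP, then calls the core.
    def handle_checkout(db, order_id):
        order = db.load_order(order_id)              # I/O at the edge
        total = price_with_discount(order["total_cents"], order["loyalty_years"])
        db.save_invoice(order_id, total)             # I/O at the edge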


I've been mostly functional throughout my career, so that method would have appealed to me before. The idea is to never mutate but always create a copy of data to prevent race conditions. Fundamentally though, you are mutating data. We have to accept this. Creating copies at each iteration does not change this. The question then is what abstraction allows mutation without allowing data races.

Recently I was looking into how I would implement a simple TCP protocol in C++. I think Boost's ASIO (asynchronous IO) provides the correct abstraction for this. It allows you to write your code as if it were single-threaded. Code which needs to be serialised can run on the same "strand". Strands simply serialise code, even if the code runs on separate threads.

I think this abstraction would work for the example in your link as well.


> What do you think would be the optimal evolution of a company from day 1

The optimal solution is whichever one gets the product shipped and making a profit. All else can come after. Because if you spend too much time planning for scale, even before you reach any scale (unless you specifically have pre-knowledge of such), you will not be successful.


Just one data point, that might be of use to know:

We're currently running into an issue where we've used a table as a message queue in a database.

We gained a compliance requirement that we must log all access (including direct DB queries) to certain info in the database. One of the tables being used as a queue contains this data and is polled hundreds of times per minute, making our logs ridiculously huge, and we now have to move the queue somewhere else and/or split up the data.


Surely moving the data doesn't change the compliance requirement?


No, but in our case, moving the sensitive data to a table queried less frequently than one used as a queue shrinks our logs considerably.


TBH, that's the most obvious solution for e.g. task management. I am sure the majority of applications started at least 10 years ago began their task management using the app database for queue management, before the proliferation of messaging queues.

The couple of webapps I worked on moved to a RabbitMQ-based solution relatively painlessly (though old BaseTasks tend to linger for a long while still).


Depends what you need the message queue for. I think it can be a sensible choice; it depends how much of the rest of the system you need to start building. If you're then working out retries, etc., start looking to see if there's a prebuilt solution.

I find a few things I tend to look for in solutions:

* Are other people using it for the same sort of scale of problem I have? This cuts both ways - some projects are complicated and work well for people who have much harder scaling issues than I may have, and some projects are too simple to deal with scale issues I have. Some backend systems may need to process billions of things, others sit and take 20-30 a week.

* Are there plenty of people using it? More people = more chance bugs are known about with workarounds / already fixed, more chance other people have hit the same problems I will.

* Have people been using it for a while now? Unless there's a good reason, a project a year or two old that's stable and solves your business requirements can be far better than a shiny new thing. Let other people hit the issues and upgrade next year.

* Can I debug it? This is what actually came to mind reading your comment. A database is nice when you can keep a full history of what's been going on.

* What exists where I have to build the absolute minimum amount of stuff? A nice example here is python-rq. For some of my problems I can easily set it up, it does what I need and probably more and happily chugs along. I need to build next to nothing to use it. Also, whatever you pick first likely reveals another unspecified business requirement and finding that out ASAP is important.

* Is there some industry standard? Company standard?

Don't be afraid to adjust the way you were planning to build your project to fit better with some industry standard approach. If you want to use package X (say, luigi or airflow for data processing) what philosophy does it go by? Can you tweak the way you're working to fit in with that? There are often good reasons for certain choices, and even when it's less clear it'll at least avoid you fighting your systems so much.

Not too precise I'm afraid as it all comes under "it depends" but picking something:

Job queues / processing, try for example

* Amazon Batch

* Luigi (+ AWS Batch potentially)

* Python RQ (with Redis)

* Just files on S3 (an underutilised option imo)

* SQLite or whatever database you're using now if you need something more custom


I agree about avoiding microservices where there's no real need, and that a single database can go a long way if well designed, since RAM today is abundant everywhere and stuff like moving indexes onto dedicated disks with dedicated IOPS is as easy as ever. But why the hate against Kubernetes?

Even your bog-standard Java three-tier app gets some tangible benefit from being dockerized (repeatability, whole-OS image versioning) and placed on a node that abstracts everything else away (load balancing, routing, rolling deployments, etc.).

At the lowest level, it's just a descriptor for infrastructure; it impacts nothing of your app or business logic unless you want it to.

Even single-node testing environments are covered one way or another (my favourite now is minikube); the only real limit is having a requirement for self-hosting.


> but why the hate against kubernetes?

You seem to equate kubernetes and containerization. Kubernetes is just too complex for most usecases, that's all. There are simpler solutions (in the sense of less powerful) that still allow using containers and scale-out but are less maintenance.


> You seem to equate kubernetes and containerization. Kubernetes is just too complex for most usecases, that's all.

Kubernetes is just a way to run containers. You configure your application, specify how your containers are deployed, set the plumbing between containerized application, and that's it.

What problems are you experiencing with Kubernetes that make you feel it's too complex?


I think a lot of people here conflate using Kubernetes (not actually more difficult than any other container service) with running a Kubernetes cluster (a difficult mess, and it was even included in the original post).


You need something to run the containers anyway, and with all the management overhead Docker and other systems add, one can just cut to the chase and let an orchestrator do it.


Or one can use a service that does the orchestration in a certain way so that one does not have to care about it, as long as one can live with the constraints of the service. E.g. AWS Fargate or GC Run.


> E.g. AWS Fargate or GC Run.

I really can't understand this line of reasoning. How exactly is something like Fargate simpler than Kubernetes?

I understand the argument in favour of using Docker Swarm over deploying your own Kubernetes on premises or on heterogeneous cloud, but Fargate?

And using GC Run to run containerized apps completely misses the point of containerization. Containers are way more than a simple way to distribute and run isolated apps.


Well, Fargate is simpler in the way that I define CPU+RAM, give it an image, tell it how many of them to run in parallel and let it run. If something crashes it will be restarted. If I want to deploy I push a new image.

That's pretty much it. The container cannot be accessed from the outside except for the defined interfaces like http, the containers cannot discover each other unless I use something else to coordinate between them. That is all restrictive, but it also makes things simple.

I still need to do health-check and such - so if I don't really need long running containers, I can make it even simpler by using lambda. (well in theory, I don't like lambda, but that's because of the implementation only)


> Well, Fargate is simpler in the way that I define CPU+RAM, give it an image, tell it how many of them to run in parallel and let it run.

Kubernetes allows you to do just that with about a dozen lines of an infrastructure-as-code script.

> That's pretty much it. The container cannot be accessed from the outside except for the defined interfaces like http, the containers cannot discover each other unless I use something else to coordinate between them. That is all restrictive, but it also makes things simple.

Kubernetes does that out-of-the-box, and allows you to keep your entire deployment in a repository.

> I still need to do health-check and such - so if I don't really need long running containers, I can make it even simpler by using lambda.

With Kubernetes, health checks are 3 or 4 lines of code in a deployment definition.

And it's used to manage auto-scaling.

Oh, and unlike Fargate, you can take your Kubernetes scripts and deploy your app on any service provider that offers Kubernetes.


A dozen lines of code is a lot. Especially when it's in a language you don't know and don't already have installed. Often, the most difficult program to write is "hello, world", since when it fails, it fails in ways that are difficult to understand and specific to your installation and infrastructure. It can be hard to get help for those.

I'm sure that once you've bitten the bullet and learned Kubernetes, these things are easy. But for a lot of use cases it's great to avoid having to bite that bullet, do a one-click thing, and get back to your core development.


> A dozen lines of code is a lot.

Are you really trying to argue that using 1 or 2 lines of YAML per component to define how an entire application is deployed and scaled is a lot?

How many lines of CloudFormation would you need to do just that?

And... I mean... What's your alternative to an infrastructure-as-code config file? Clicking around a dashboard to treat your app as a pet?

> Especially when it's in a language you don't know and don't already have installed.

It's self-describing YAML (or JSON). It's not an arcane special-purpose DSL like CloudFormation or SAM, or a convoluted Python program like CDK.


> Kubernetes allows you to do just that with about a dozen lines of an infrastructure-as-code script.

Apart from what was already mentioned - what if I exceed the number of machines in my cluster? What if I want X GB of RAM and there is no machine that big in the cluster? I don't even have to ask these questions if I use Fargate, because I don't care about the underlying cluster at all.

> Kubernetes does that out-of-the-box, and allows you to keep your entire deployment in a repository.

You've got it backwards. Kubernetes allows a configuration where you can access the containers. And now, if I have to deal with a system using Kubernetes, I don't know if someone sometimes accesses the containers unless I look at the code and check that it is up to date / successfully deployed.

> With Kubernetes, health checks are 3 or 4 lines of code in a deployment definition.

Yeah and code in your running service and monitoring and alerting... and and and. Maybe for you these are small differences, but for me that is a huge maintenance burden.

> Oh, and unlike Fargate, you can take your Kubernetes scripts and deploy your app on any service provider that offers Kubernetes.

That's certainly an advantage. But it doesn't make Kubernetes simpler. And that is what we are discussing.


> but why the hate against kubernetes?

Because it adds a massive amount of (to some degree hidden) complexity. In particular, it adds a lot of additional (potentially subtle) failure modes.

And while this is more a property of microservices than of Kubernetes, there is some demand for a more opinionated, simpler alternative. (There was an HN article about that recently.)


Reproducibility is great. I would probably employ Docker for that if that were my role. However, this is completely from my own limited personal experience: I've worked with multiple devops people at earlier stage companies who have harped on about how we needed to focus on potential future scale when the company had hardly any customers. This was detrimental to our ability as programmers to understand the domain and requirements due to everything being so much harder to implement and test. It's not that I didn't try. I am also not convinced you need to worry too much about things like load balancing in the beginning. In these contexts, I never saw Kubernetes being used for reproducibility alone.


Most of the apps I write have an SLA, and a load balancer is a great piece to have ready for simplifying/automating rolling updates, even if the backend has a single node.


Fair enough.


> even your bog standard java three tier app gets some tangible benefit from being dockerized (repeatability, whole os image versioning)

Whole OS apart from the kernel. It's rare that a Java app has a dependency on a particular OS version, but, in my experience, even rarer that the dependency does not include the kernel.


That was one of Java's big original selling points: write once, run anywhere.

They dropped that kinda fast. With, like, version 1.1.


No, not at all. People routinely develop on OS X and deploy on Linux without issue.

Here, though, I just meant that I can develop on Ubuntu 18.04, build on CentOS 6, and deploy on RHEL 7 without issue.


I mean, they dropped the slogan. The idea still held up, and in a lot of ways, still does, though the jungle of different versions and Maven version hell makes it less universal than the initial hope.


Arghh, the biggest issue at my current place (I'll be leaving before year end) is the number of "engineers" who are pathologically afraid of relational databases.

You have dozens of microservices per team, split between k8s, Mesos, AWS, Azure, and physical servers, each one using everything under the sun to store data, up to and including storing files in S3.

Needless to say, the number of issues due to data inconsistencies is incredible. And of course the number of actual users of the overall product is tiny, it being an enterprise product - maybe a few thousand concurrent users at peak.

They now want regional replication and are screwed due to so many sources of data, where Postgres alone would have done.


I currently have a bunch of clowns who are indistinguishable from most of the clowns in my profession. Application slow? We’ll just add caches instead of trying to create an architecture.

“But we have an architecture.” No, you don’t. You have a set of idioms that you arrange like dominoes until an answer you like comes out the other end. That’s only an architecture in the way suffering is a personality.

You don’t need more caches. You need an architecture and dynamic programming. Someone save me from promise caches. First couple times they were so cool, now they’re a golden hammer for lack of even the most rudimentary of data flow analysis.


I've long thought that caching is to software architecture what ketchup is to diner food. Something's not right with your meal? Just apply more ketchup.


I'd think that was the salt of software architecture, but there are limits to which you can fix up a dish by adding salt, beyond which the dish becomes inedible. So I'd guess "upgrade hardware" would be the salt of software architecture.


That sounds like needless idealism

Do you really think your front page needs all the queries rerun for every single user that visits? Even though things won't change for, let's say, even 30s?

Even the front page of HN is cached. Their system is pretty snappy, but not caching what's trivially cacheable just doesn't work in the real world.


That suggests that I believe the right amount of ketchup on food is zero. That's incorrect. I'm not saying you shouldn't ever cache anything. I'm saying that caching is used to cover up a lot of architectural flaws, when sometimes we would be better off if we fixed the flaws.

And I'd add that your notion that the only two options are "cache X" and "rerun all queries" is a part of the problem here. There are more things, Horatio.


"There are only two hard things in Computer Science: cache invalidation and naming things."

Seriously, don't build a cache until you have exhausted all other performance-improving avenues. I am not against caches, but throwing a new cache into the code every time a performance issue is found - or worse, preemptively adding a cache for a possible performance issue - is an endless source of cache-invalidation bugs.


Yes, don't cache everything needlessly; use it wisely.


Isn't dynamic programming (specifically memoization) a specialized type of caching? Am I missing something?


Only in the sense that the lion is a specialized type of mammal, rather than king of the jungle.

Caches are global state in sheep’s clothing. In many languages and implementations, they can be mutable shared state, which leads to unexpected side effects and concurrency bugs that may cross transaction or user boundaries. It’s a covert channel where code interacts without ever talking to each other. In languages without static analysis you cannot determine quickly which pieces of code talk in this fashion. It’s all secretly coupled instead of overtly coupled. You only learn by vast experience, and in the interim you make mistakes, which makes the oldest team members appear capable and the newest unworthy of trust. But it’s the people who tilted this playing field that you can’t trust. In a word: toxic.

Done right, memoization shows up clearly in your sequence diagrams, near the heart of your architecture. The top level of the code knows what’s going on, and orchestrates communication. People walking in off the street can figure out what the hell is happening, and be productive in short order.
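A toy contrast, with invented names (`db.lookup_price` is hypothetical): a hidden module-level cache versus memoization the top level can actually see and own.

    import functools

    # Hidden cache: module-level state, invisible to callers, shared across requests.
    _price_cache = {}

    def get_price_hidden(sku, db):
        if sku not in _price_cache:
            _price_cache[sku] = db.lookup_price(sku)
        return _price_cache[sku]

    # Explicit memoization: the caller constructs and owns the cached function,
    # so its scope and lifetime are visible at the top of the call graph.
    def make_price_lookup(db):
        @functools.lru_cache(maxsize=None)
        def lookup(sku):
            return db.lookup_price(sku)
        return lookup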

It can often force you to think about when your system of record and your source of truth diverge. Usually I see people cache low quality precursor data for a decision, and then they have to cook it every time.

The thing I liked most about AngularJS was that you were meant to fetch all of your data fairly early in an interaction, mise-en-place style. If the data you got was a little garbage, you cooked it before any interface code ever saw it. The view just saw the nice version.

Doing so affords another architectural tool. All of that cleanup code can be moved to the backend or the database in a single refactoring, getting it out of the client. And the overt coupling makes you contend with the fact that a piece of code talks to too many things to get its job done. It's a mirror some don't want to look in.

The problem is, all of this involves making decisions, and a lot of developers either never learned how, or learned to distrust their decisions. And rather than finding ways to mitigate a bad decision, far too many of us have instead spent years or decades learning how never to make a decision at all. We'd be a lot better off with strong opinions loosely held.


Perhaps you need an architecture which uses caches ... And you were both right.


This is standard practice in the microservices world: bin ACID compliance and replace it with network partitions, because you're so fashionable you have no users and no data.

In the system I'm working on (it's MongoDB-tastic) someone had added a try/catch around everything (EVERYTHING) that reverts the service manually in the catch block. I don't need to explain to people here how moronic rolling your own transactions like this is...

Everywhere using microservices appears to be the same :-/


I don't think we've quite cracked how to have a bunch of things behave the same without being the same thing. With DVCS, containers, monitoring systems and the like, we are nibbling at the edges, but there's a lot left to be desired.

You and I should be able to benefit from a set of corporate- or at least division-standard data retention services without sharing a database, let alone each other’s schema.


Looks like every single engineer / lead wanted to have their favourite bleeding edge technology.


Missed the obvious pun: "Everything Looks Like a Null"

I do wish practical database development/administration was taught in college. It's such a necessary skill...


For people who are interested, Harvard's CS165[1] is a great course about how to build storage engines for relational databases.

Topics include memory vs disk I/O, the memory hierarchy, buffer pools, B-Tree indexes, query optimization, and more. Throughout the course, you have a project to build a columnar relational datastore with C.

I don't think the lecture videos are all publicly available, but you can learn a lot from the slides and readings.

[1]: http://daslab.seas.harvard.edu/classes/cs165/


Thanks for this link. Never before had I come across a course with learning objectives as aligned to my learning goals as this. Much appreciated.


Looks great. What level of student would that course be aimed at, do you know (e.g. 2nd-year CS undergrad, 3rd-year)?


I would say, 3rd year or 4th year CS undergrad.

Pre-requisites would include a foundational understanding of C pointers, tree data structures, and algorithmic complexity (i.e. time and space complexity).


Regular old SQL databases are not sexy, so sadly they don't get the attention they truly deserve.


I think they are. I learned to code running an old (hand-written) forum as a kid (MySQL + PHP). I tend to use PostgreSQL and Python nowadays, but my mind is heavily trained on that relational core, and when NoSQL became popular it was so annoying to see the "hackers" in my hometown writing slow systems in Mongo and JS because they were never introduced to relational DBs and thought they were becoming deprecated.

I'm glad that phase has passed, and glad NoSQL has its place as a key-value store (though I'm sure the younger hacker scene likely still relies on NoSQL as it doesn't take as much effort to spin up a store and get results; in contrast you need to think ahead about what tables you create and how they relate to each other).


I don't get that. Databases underlie everything. It's foundational.


When you are viewing the world through 19-year-old eyes and a mind filled with naiveté and dreams of revolutionizing the world, then studying the technology literally your grandparents were using seems like taking a history class. Why waste your time with SQL when you could be studying CUDA-backed, AI-driven graph databases?


Thankfully my dreams of a career in science were crushed before I pursued one in programming so I don't relate to that.


It used to be when I was in uni. Then everyone complained it was not cutting edge enough for the job market.


"Everything Looks Like a NULL"*

No one likes to read sloppily formatted SQL.


"* Looks LIKE NULL" perhaps?


Nothing looks LIKE NULL as that would always evaluate to NULL, which would be filtered out in a WHERE clause.


I like it


I've seen some really great courses offered by MIT and CMU on building database systems (will have to find the exact courses). And a database development course is something I would love to see from OMSCS (from Georgia Tech). Sadly, right now, the only DB course offered is not very rigorous and really targets how to use a database from an application development standpoint (i.e. create DB, tables, SQL queries etc)



I often feel like bioinformatics is this steampunk alternate reality where, because HPC clusters don't generally do database administration, all the technology and the ecosystem that has developed around it has been built on flat files. Let me tell you, it's not great.


It’s fine, you can just nest stuff by having a tsv in which one of the columns has an arbitrary number of values separated by commas. One of those values might even be some map structure with pairs indicated with equals and pairs separated by pipes. It’ll all be just fine.


We heard you liked delimiters, so we put delimiters in your delimiters, now you can delimit without limit.


Wow, this week at a new job I just found 50 database columns that contain a serialized PHP array; this makes me laugh.

Also does anyone know how to cheaply migrate PHP arrays into actual relational tables, I'm, uh, asking for a friend.


>delimit without the limit

Oh man, you just made my day.


You got half of your answer in your sentence.

That's not only bioinformatics; the entire HPC world tends to avoid databases.

They usually prefer HDF5 or similar, and there are reasons for that. It is much easier to scale a million nodes accessing a flat file over a DFS than over a database.


Also, in these fields with HDF5, you tend to write once, read often. Bioinformatics and other HPC-using researchers have totally different resource consumption than web services. 'Data' really means something completely different.


> Also, in these fields with HDF5, you tend to write once, read often

Server-oriented DBMSes specialized in write-once-read-many workflows do exist.

However, you are right: research has a completely different data consumption model than web services.

And in HPC, it is:

- Much more efficient to do sub-millisecond massively parallel data access over a parallel DFS one network switch away than it is to do it over a DBMS.

- Often much more convenient to move a flat file around to do analysis/model modifications of scientific results on your laptop.


It is much easier to scale a million nodes accessing a flat file over a DFS than it is over a database.

They are also much easier to distribute. I can just upload my arbitrarily large HDF5 file to your FTP server and you can just open it in MATLAB/Jupyter and start playing around with it. Doing the same with a database (other than SQLite) is really hard: it requires that our database versions align, and you'll probably need help from someone in your IT dept. to get the right version installed, and so on.
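To make the portability point concrete, a minimal sketch of that HDF5 workflow (this assumes the third-party h5py and numpy packages are installed; the file and dataset names are made up):

    import h5py
    import numpy as np

    # Producer: write once on the cluster.
    with h5py.File("results.h5", "w") as f:
        f.create_dataset("expression_matrix", data=np.random.rand(1000, 50))
        f["expression_matrix"].attrs["units"] = "TPM"

    # Consumer: anyone with the file can open it locally - no server, no version alignment.
    with h5py.File("results.h5", "r") as f:
        matrix = f["expression_matrix"][:]
        print(matrix.shape, dict(f["expression_matrix"].attrs))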


The biggest hindrance to good engineering is everyone not knowing everything. I know that sounds stupid. But I can't explain enough to most developers why they need to understand how to safely deploy database changes before they can even consider writing an application that uses a database. And I can't explain enough to the architects why the developers need to be given agency and options and allowed to make their own decisions and just implement what works. Both seem to be lacking critical insight, and so we always end up with slightly-crappy things that work just enough to "ship", then spend the rest of the time trying to figure out why they keep breaking.


If you're in a position to worry about outgrowing your data storage strategy you either really, really muffed it or you're doing pretty well for yourself.


Or your project relies on significant data ingestion and produces artifacts based on that data... There are tons of examples of this.


OK, granted. But there are many more examples of projects that were more complicated than they needed to be because someone thought they needed to use a cool new storage technology they read about but didn't really understand, when something tried and true would have served the purpose just as well.


Yeah, as the article says towards the start:

> Neither of these mattered much unless your company was successful.

I think this pretty much applies to every blog post about "good practice".


Or you work in an enterprise.


If every one of Walmart’s 2.1 million employees used an enterprise system every day, that system would be less than 1% the scale of Snapchat (250 million DAUs).

Internet scale and enterprise scale are completely different animals.


But Snapchat, like most web-scale companies, doesn't care whether a single operation gets lost or whether a small number of users have data loss. Whereas Walmart, at the very least, should care whether every single employee got their salary and whether their taxes were booked properly. Therefore a lot of "solutions" from so-called web-scale companies are too naive for an enterprise.


All the more reason that a traditional SQL database is well suited for many enterprise applications.


Most of the databases at big enterprises are tiny. Just because the company is big does not mean every department of it handles tons of data.


I'm an SRE. The architect decided some time ago that we needed a MongoDB instance deployed in our cluster. I fought against it at first, as it is best to get a managed solution instead. The dev's position was, "how hard can it be to operate a database? You can use a Kubernetes operator now." lol. I just Slacked this article :) Thank you!!


Hosted MongoDB is still hot garbage; it just costs more, and you can't fix it yourself when Atlas has dropped the ball.


Usually managed systems cost less if you factor in salaries. Like everything, it depends on the specifics though.


Lambda(TM) your entire infrastructure and save the most ;)


I have several instances in production now, and it's honestly been pretty good. Not running anything over M30 though.


One thing that many devs don't pay much attention to is as soon as you go with SOA/microservices and separate DBs you are now in the realm of distributed systems with all the fun that entails.


Two years into a job with a dozen applications where sharing databases amongst them is the de facto standard. There aren't even any views. I feel like I'm in a never-ending loop of debugging and troubleshooting and never have time to burn it all to the ground. It is time to move on.


Tell WordPress, Drupal (what's left of it anyway), and co., who are storing HTML fragments, templates, and other markup as BLOBs "in the database".


Why do blobs cause transactional problems? Create the blob. Give it an ID. Give it a path/URL. Do all that before going anywhere near the database, at which point the blob is all settled and immutable. Then stick the tiny record into the database which identifies the blob. Done; the blob is now live, from the POV of the database consumers.
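A minimal sketch of that ordering (sqlite3 standing in for "the database"; the content-addressed directory layout and table are just one illustrative way to make the blob settled before the row exists):

    import hashlib
    import pathlib
    import sqlite3

    BLOB_ROOT = pathlib.Path("blobs")   # hypothetical blob store directory
    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, path TEXT)")

    def publish(content: bytes) -> str:
        # 1. Settle the blob first: a content-addressed ID makes it immutable by construction.
        blob_id = hashlib.sha256(content).hexdigest()
        path = BLOB_ROOT / blob_id[:2] / blob_id
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)

        # 2. Only then touch the database: one tiny row makes the blob "live".
        with db:   # the transaction covers only this small insert
            db.execute("INSERT OR IGNORE INTO documents (id, path) VALUES (?, ?)",
                       (blob_id, str(path)))
        return blob_id

    publish(b"<p>some rendered HTML fragment</p>")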


One thing to consider is backups: how can I restore my application to a previous known and valid state after hardware/software/human error?

If the application state is split, for example, between a database and a message queue, having consistent backups becomes harder.


The article should be named, "If all you have is a transactional database, certain kinds of things look like nails".

My interest is in analytical, non-document DBMSes, and the text is mostly irrelevant in my world :-(


I was hoping that the author would talk about RDF data stores. It may take a while, but I expect some combination of RDF and relational databases with virtual RDF and SPARQL support will do a lot to solve data silo problems.

That said, good article.


Fascinating. These problems have been around for years and they are not going away.


And if all you have is a hammer, everything looks like C++.


And if you don’t even have a database, nothing is realistically possible.


Relational databases are just "the way we do things around here". They may have been smart at one point, but they haven't made sense since we started putting web frontends on our applications (it's not a coincidence that the most successful "database" for that was MySQL which, at the time, supported approximately none of the alleged virtues of databases), and they certainly don't make sense now.

There's one thing that SQL databases are good for: semi-ad-hoc reporting queries (and their transactional stuff is grossly overengineered for this use case, so remember to turn it off). Produce a read-only SQL database for your analysts/data scientists to use, either regular dumps or updated in real time. But you absolutely must treat it as write-only from the point of view of your live system; doing programmatic queries of an SQL database is madness on several levels: SQL is not machine-friendly, the performance characteristics of any given query are obscure, and the transaction system means you're virtually guaranteed to deadlock yourself sooner or later.


Relational queries and ACID are still absolute requirements for managing your data. Web frontends are just that, frontends. Is there something mathematically changed that means that relational logic is somehow now false?

MySQL was a hack that people who didn't understand relational databases used as a glorified K/V store. It was "fast" compared to the alternative of Postgres, but of the three qualities (good, cheap, fast), MySQL only won on the last of those, and only for a definition of "fast" that most of those web frontends didn't need.

"SQL is not machine-friendly"? Why not? It's also not that relevant to the performance of a relational query. That's driven by the performance characteristics, which are driven by the query planner using the statistical analysis of that data.

"...the transaction system means you're virtually guaranteed to deadlock..." you mean, if you try to break ACID, then the database enforces constraints to stop you and make you reconsider your data access properly? That's like saying Rust's borrow checker doesn't allow you to program race conditions.

Storing JSON in a K/V store and doing joins in code is a regression, not an advance. Not expressing your data's constraints as actual constraints means that bugs can happen that shouldn't.


> Relational queries and ACID are still absolute requirements for managing your data. Web frontends are just that, frontends. Is there something mathematically changed that means that relational logic is somehow now false?

The web request/response model makes ACID pointless. The whole point of ACID is that if two users try to update something at the same time, you won't lose either of their updates. No web app implements that: if user A loads the edit page and then user B loads the edit page and makes an edit, their update is going to be lost when A saves the page. (Or maybe the webapp implements its own locking, but it won't - and can't - use database transactions for it, because there's no way to know whether the user is still editing or has closed the page).

> MySQL was a hack that people that didn't understand relational databases used as a glorified K/V store. It was "fast" compared to the alternative of postgres, but of the three qualities of good, cheap, fast, MySQL only won on the last of those and for a definition of "fast" that most of those web frontends didn't need.

Exactly! MySQL did none of the things that "good" databases do, and yet MySQL-based companies were far more successful. Conclusion: the things that "good" databases do aren't actually that valuable.

> "SQL is not machine-friendly"? Why not?

It's irregular, hard to generate safely, and hard to parse. It makes no sense as a machine-to-machine interface (and it was never designed as one).

> That's driven by the performance characteristics, which are driven by the query planner using the statistical analysis of that data.

Only if the query is complex enough for those to be a dominant factor. For a simple primary-key lookup, today's databases typically spend 3x as long parsing the SQL as actually executing the query.

> "...the transaction system means you're virtually guaranteed to deadlock..." you mean, if you try to break ACID, then the database enforces constraints to stop you and make you reconsider your data access properly? That's like saying Rust's borrow checker doesn't allow you to program race conditions.

It's just the opposite: SQL databases rely on magically figuring out the locking for you, and don't give you enough visibility or tooling to let you manage it in a safe way. You can't look at a query and know which locks it will take, in which order, unless you're intimately familiar with that particular database's internals. Application programmers know to make sure every thread obtains locks in the same order, or ideally to use higher-level abstractions. In SQL queries you don't have that option.

> Storing JSON in a K/V store and doing joins in code is a regression, not an advance. Not expressing your data's constraints as actual constraints means that bugs can happen that shouldn't.

I'm all for type-safe datastores (and RDBMS type systems are pretty limited: no decent collection types, no aggregates unless you count temporary tables, which come with a giant set of pitfalls). I can believe that referential integrity could be worthwhile. But SQL is a very poor interface, and database-level ACID is worse than useless in today's (web-based) architectures.


> The web request/response model makes ACID pointless. The whole point of ACID is that if two users try to update something at the same time, you won't lose either of their updates. No web app implements that: if user A loads the edit page and then user B loads the edit page and makes an edit, their update is going to be lost when A saves the page. (Or maybe the webapp implements its own locking, but it won't - and can't - use database transactions for it, because there's no way to know whether the user is still editing or has closed the page).

That's a problem that is solved by CRDTs and the like, not any form of data store. But it's just one use of web frontends. Banking doesn't use that, neither does a large chunk of the world's applications.

> Exactly! MySQL did none of the things that "good" databases do, and yet MySQL-based companies were far more successful. Conclusion: the things that "good" databases do aren't actually that valuable.

Conclusion: early web startups were successful irrespective of what data store they had. Irrelevant to today's internet, and judging on the wrong criteria. Companies weren't "MySQL-based"; their business models disintermediated and undercut existing ones. Correlation is not causation. Companies that ran on Sun/Solaris were successful as well.

SQL is not intended as a machine-to-machine interface; it's a language for expressing relational queries. Prepared statements and caching remove that "3x as long parsing".
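For what it's worth, a minimal sketch of that point (Python's built-in sqlite3 keeps a per-connection statement cache, so the parameterized text below is compiled once and reused; the table is made up):

    import sqlite3

    conn = sqlite3.connect(":memory:", cached_statements=128)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

    # The SQL text stays constant; only the bound parameter changes, so repeated
    # primary-key lookups don't pay the full parse cost every time.
    lookup = "SELECT name FROM users WHERE id = ?"
    for user_id in (1, 2, 3):
        print(conn.execute(lookup, (user_id,)).fetchone())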

> Application programmers know to make sure every thread obtains locks in the same order...

Evidence suggests the opposite, thus the need for things like Rust's borrow checker. Studies have shown that programmers can't work out what the race conditions and locking requirements are.

> But SQL is a very poor interface, and datbase-level ACID is worse than useless in today's (web-based) architectures.

SQL is a crap language, agreed. But your definition of "web-based" architectures seems to be picked to be the CRDT-related stream of changes. Which, by the way, could be implemented in an RDBMS by writing the CRDTs as transactions and enforcing the ordering with the appropriate rollbacks and commits.


> That's a problem that is solved by CRDTs and the like, not any form of data store.

But once you accept that you have to use CRDTs or similar, traditional-RDBMS ACID becomes pointless, and you can and should use a datastore that makes better tradeoffs.

> But it's just one use of web frontends. Banking doesn't use that, neither does a large chunk of the world's applications.

Banking doesn't do that, but it doesn't do RDBMS-level transactions either. People cargo-cult the idea that you should use an ACID database, but I'm not sure I've ever seen a web application that actually made use of the database's transaction support, and fundamentally I think it's probably impossible - people say things like "if you get a deadlock, just retry", but if it's something you can retry then it didn't need to be a transaction in the first place.

> Studies have shown that programmers can't work out what the race conditions and locking requirements are.

Which is exactly the problem with using RDBMS transactions! It's hard enough to figure out deadlocks when you can see the lock statements in the code - it's impossible when you don't even have tooling that can tell you which locks will be held by which parts of the code before you run it.

> But your definition of "web-based" architectures seems to be picked to be the CRDT related stream of changes.

The thing about the web is that it makes everything a distributed system. Even if you only have one server, the fact that your clients are accessing it over a network means you have all the problems of distributed systems and are better off using a system that's designed to deal with them.

> Which, by the way, could be implemented in an RDBMS by writing the CRDTs as transactions and enforcing the ordering with the appropriate rollbacks and commits.

You wouldn't want to do a DB-level rollback or commit; the whole point of using a CRDT is that even when two writes conflict you can still commit them to durable storage straight away and resolve the conflict asynchronously. So the DB-level transactions don't help you.

You can implement a perfectly good event sourcing architecture on top of an RDBMS by effectively using the tables as queues - I worked on a system that did exactly that back in 2010, and it worked very well. But at that point you're not really getting much value from the database - its transaction support only gives you ways to shoot yourself in the foot, its flexible query system likewise, its schema support is better than nothing but not by much. There's a lot of really impressive engineering under the hood, but the parts that would be useful to you aren't actually exposed where you can use them (e.g. the MVCC implementation could be really useful, but your database will go to great lengths to pretend that it isn't there and there's only ever one version of your table at a time. E.g. the indexing implementation is often very good, but you don't have enough control over which indices get built when). There are better options these days.
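For readers who haven't seen it, a minimal sketch of what "tables as queues" can look like (sqlite3 and the table are illustrative only, not the system described above):

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE events (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        type    TEXT NOT NULL,
        payload TEXT NOT NULL)""")

    def append(event_type, payload):
        # Producers only ever INSERT; the table is append-only.
        with conn:
            conn.execute("INSERT INTO events (type, payload) VALUES (?, ?)",
                         (event_type, json.dumps(payload)))

    def poll(last_seen_id):
        # Each consumer remembers the last id it processed and reads strictly past it.
        rows = conn.execute("SELECT id, type, payload FROM events WHERE id > ? ORDER BY id",
                            (last_seen_id,)).fetchall()
        return [(i, t, json.loads(p)) for i, t, p in rows]

    append("user_registered", {"user": "alice"})
    append("user_renamed", {"user": "alice", "new_name": "alicia"})
    print(poll(0))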


There appears to be some confusion in the terminology here:

- Database transactions are not meant to prevent the "user A and B are editing object X at the same time" situation.

- Database transactions ensure that when you ask the database to modify X, it either happens in its entirety or it doesn't. They're not meant to prevent concurrent edits; they're meant to keep the data in the database consistent.

- CRDT's allow user A and B to edit object X concurrently, because it's mathematically proven that two instances of the CRDT can be merged later without conflict.

You also don't need to start locking things yourself if all you want is to prevent user A from overwriting user B's changes. Add a column "version" to your table, and when the user submits a change, include this version number in the change request. Then, upon database UPDATE, check if the version in the DB still matches expectations. If it doesn't, you reject the UPDATE and can report back to the user what happened. You could easily apply this logic using triggers, which will ensure that no matter the source of the UPDATE statements, older versions can never overwrite newer ones.
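A minimal sketch of that version-check pattern, often called optimistic locking (sqlite3 and the table are illustrative; the trigger variant is omitted):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
    conn.execute("INSERT INTO pages VALUES (1, 'original text', 1)")

    def save_edit(page_id, new_body, version_seen_by_user):
        # The UPDATE only matches if nobody has bumped the version since the user loaded
        # the page; otherwise zero rows change and we report the conflict instead.
        with conn:
            cur = conn.execute(
                "UPDATE pages SET body = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (new_body, page_id, version_seen_by_user))
        return cur.rowcount == 1

    print(save_edit(1, "user B's edit", 1))   # True  - first writer wins
    print(save_edit(1, "user A's edit", 1))   # False - stale version, rejected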


> - Database transactions ensure that when you ask the database to modify X, it either happens in its entirety or it doesn't. It's not meant to prevent concurrent edits, it's meant to keep the data in the database consistent.

It keeps the database consistent. But fundamentally that's only worth something if you can use it to make things consistent for the user.

> - CRDT's allow user A and B to edit object X concurrently, because it's mathematically proven that two instances of the CRDT can be merged later without conflict.

Right. But the way that works in practice is that your datastore needs to store both user A's edit and user B's edit and then handle that convergence, and if you use a datastore that's actually designed for distributed use like Riak or Cassandra then it will give you the tools for doing that. An RDBMS simply can't do that because it's built around maintaining the illusion that there's a single version of each row (even when it's actually using MVCC behind the scenes). Yes, you can build your own CRDTs "in userspace" but you have to do things like use two tables to represent a 2P-set (two columns would be bad enough, but no, SQL databases don't have collection column types so you really do have to create two extra tables for each set property you have), at which point the database won't be giving you any help.
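For concreteness, a minimal sketch of the two-tables-per-set shape being described, as a 2P-set (an element is a member if it has ever been added and never removed; sqlite3 and the names are illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tag_adds    (item_id INTEGER, tag TEXT, PRIMARY KEY (item_id, tag));
        CREATE TABLE tag_removes (item_id INTEGER, tag TEXT, PRIMARY KEY (item_id, tag));
    """)

    def add_tag(item_id, tag):
        conn.execute("INSERT OR IGNORE INTO tag_adds VALUES (?, ?)", (item_id, tag))

    def remove_tag(item_id, tag):
        # In a 2P-set a removal is permanent: once in the remove set, re-adding has no effect.
        conn.execute("INSERT OR IGNORE INTO tag_removes VALUES (?, ?)", (item_id, tag))

    def tags(item_id):
        return [t for (t,) in conn.execute(
            "SELECT tag FROM tag_adds WHERE item_id = ? "
            "AND tag NOT IN (SELECT tag FROM tag_removes WHERE item_id = ?)",
            (item_id, item_id))]

    add_tag(1, "draft"); add_tag(1, "todo"); remove_tag(1, "draft")
    print(tags(1))   # ['todo']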

> Add a column "version" to your table, and when the user submits a change, include this version number in the change request. Then, upon database UPDATE, check if the version in the DB still matches expectations. If it doesn't, you reject the UPDATE and can report back to the user what happened. You could easily apply this logic using triggers, which will ensure that no matter the source of the UPDATE statements, older versions can never overwrite newer ones.

That's hacky and manual - you're effectively duplicating something that the database itself is doing at a lower level, so you could do it much more efficiently with a datastore that would just expose that to you. And it still doesn't really give the behaviour you want - you don't want to refuse to store the incompatible change, you want to save both versions and somehow merge them in the future (possibly by asking the user to do so); an approach that destroys data by default is not ideal.

Yes, databases are Turing-complete and you can build whatever you want in them, but if you don't go with the grain of what the tools and built-in features expect then you're giving up all the advantages of using that kind of database in the first place. You're better off picking a datastore that has first-class support for being used in the kind of way you want to use it.


See, the problem is that RDBMS's and SQL seem to be more complex than the problem you're applying them to, so it seems like there must be a more straightforward solution. However, as of 2020, every shallow attempt to replace them with something simpler has suffered from the "hidden" problems that you don't realize you have until you try to live without an actual database.


Not my experience at all. I'm working at a company that makes a great product while avoiding SQL RDBMSes for its main datastore; some of my previous companies did the same thing as well. No datastore can save you from having to understand your dataflow, but if you skip the RDBMS and the SQL you can avoid a lot of unnecessary overhead.



