Hacker News
Just use Postgres for everything (amazingcto.com)
707 points by KingOfCoders on Dec 10, 2022 | 430 comments



I think that the job of a CTO is to minimize the surface area of "built in-house" so that your team can focus on things for which your customers actually pay you.

Sure, PostgreSQL can be used for sessions, but Redis solved this problem on virtually every platform a long time ago. Your customer probably doesn't even know what a session is, but they will definitely learn about them when your in-house implementation inevitably encounters an edge case.
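For context, the contract a session store has to satisfy is small: write a value with a TTL, read it back, let it expire. A minimal sketch of that contract with an in-memory stand-in (in Redis this maps onto SETEX/GET; the class and method names here are illustrative, not any library's API):

```python
import time

class SessionStore:
    """Toy stand-in for a TTL key-value store (SETEX/GET semantics)."""
    def __init__(self):
        self._data = {}  # session_id -> (payload, expires_at)

    def set(self, session_id, payload, ttl_seconds):
        self._data[session_id] = (payload, time.monotonic() + ttl_seconds)

    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        payload, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[session_id]  # lazy expiry on read
            return None
        return payload

store = SessionStore()
store.set("sess-abc", {"user_id": 42}, ttl_seconds=1800)
print(store.get("sess-abc"))  # {'user_id': 42}
```

The edge cases the parent alludes to live around this contract, not inside it: clock handling, expiry under memory pressure, and failover are exactly what Redis has already worked through.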

Of course, PostgreSQL has wonderful search capabilities, but it was never intended to be used as a search engine. Solr was created in 2004 and is still being improved and used in production every day. Do you know what will happen to your full-text search when your customer types in a mix of English and Chinese characters?

Yes, PostgreSQL addition of SKIP LOCKED was neat but AFAIK the author of that feature himself recommended using a traditional job queue unless you had a very good reason against it. RabbitMQ was designed as a message queue and when you read the documentation you realize that they had encountered virtually every single problem and figured out a way to deal with it so that you don't have to. The documentation pretty much tells you what problems you are going to have later so that you can plan for them today.

Choose boring technology (tm), follow industry best practices, and enable your team to get their work done using the right tools for the job, so they can deliver the product to the customer, and leave the office on time.


I get where you're coming from, but I think that folks often underestimate the maintenance burden of additional components in their system.

Sure, if your use case requires it, use Redis for sessions or RabbitMQ for queues. But you can also use a library with postgres, or even write the 40 lines of code yourself.
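For the skeptical, the "40 lines" version really is small. Here is a sketch of the claim-and-process core, using SQLite so it runs anywhere; on Postgres you would replace the optimistic status check with `SELECT ... FOR UPDATE SKIP LOCKED` inside a transaction so concurrent workers never block on each other (the table and function names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'  -- pending | running | done
)""")

def enqueue(payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()

def claim_one():
    """Claim the oldest pending job, or return None.
    On Postgres: SELECT ... FOR UPDATE SKIP LOCKED instead of this check."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' "
        "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    job_id, payload = row
    # Optimistic claim: only succeeds if no other worker got there first.
    cur = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'pending'",
        (job_id,))
    conn.commit()
    return (job_id, payload) if cur.rowcount == 1 else None

def finish(job_id):
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()

enqueue("send-email")
job = claim_one()
print(job)  # (1, 'send-email')
finish(job[0])
```

What this sketch deliberately omits (retries, visibility timeouts, dead-lettering) is the gap the dedicated brokers fill; whether you need that gap filled is the whole debate.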

Each component has its own debugging requirements and tooling. Each component adds a bunch of complexity. Sometimes it's worth it, sometimes it's not. It's not as clear-cut as you're making it out to be. There are pros and cons to both alternatives.


I actually had this debate at work a few years ago. We needed a queue system and I wrote a postgres prototype in about 100 lines of code that would have worked fine for our use-case and would have likely scaled with us for years of growth.

My boss said he just wanted to install the best thing and be done with it forever. So we ended up spending 10x the time and wrote like 10x the code to integrate a third-party solution and we only ended up using the most basic features. Not to mention additional infrastructure.


People underestimate how bad databases are at implementing queuing systems. I’ve helped replace such systems multiple times.

But it doesn’t need to require 10x the time or code. That just sounds like overengineering. Database-as-queue vs. 10x ultimate solution is a false dichotomy.


People overestimate their requirements, because almost nobody has ANY idea what computers are capable of, or how HUUUUGE big tech is compared to the rest.

You are not Google and you are not Steve Jobs


Yeah, they suck at it. There are so many good queuing systems, and if you're on any cloud there's gonna be a cheap managed queue that is super reliable and near zero effort. If you get beyond trivial scale, just use a managed queue.


> how bad databases are at implementing queuing systems.

What difficulties did you meet?


I think there's a middle way, where you rely on the heavily-tested open-source solution that implements what you want, and a sprinkle more. RabbitMQ, for example, is that layer your 40 lines of code would provide, but it pays dividends in good nights' sleep.


RabbitMQ is not free. You need to learn it, configure it, understand it, interface towards it, install it, upgrade it. Lots and lots of things.


So those 1000 lines of code (your prototype was 100 lines and this was 10x that) would still have been a week tops, right? That doesn't seem too bad a price to pay for scaling potential, even if in hindsight it wasn't needed. You can't always know in advance which features you are going to need and which ones are superfluous.


If what you need is a simple work queue, you really need to scale very far (on the order of several thousand requests per second) to outgrow a solution built on top of PostgreSQL.


In other words: boss has no idea, boss doesn’t trust your technical nor business skills

Conclusion: abandon this company... you're probably just a cost center and nagging headache in his eyes. If you're not valued or respected, there's only one way up, and that's out.


It takes effectively zero effort to maintain my redis instance in AWS, so I’m not sure you’ve got a real argument here…


This is all fine and dandy but the article has a great point here. Redis is absolutely amazing, but if you bring it in you have to care about more stuff. Mo' stuff, mo' problems as they say. You now need to synchronize writes between your Redis and your DB (hello "after_commit" hooks and similar, how is your read-after-write doing?). You need to install metrics for Redis (lest you find that suddenly your application spends a huge amount of time in MGETs or blocks on set operations). You need to have good failover in place at AWS and make sure you do not save anything non-transient into it – yes, this is how Redis is supposed to be used for transient stuff, but are you positive your application can cold-start well enough with a blank Redis? Oh, and now everyone on the team needs to run a Redis locally, and a matching version at that - hello docker-compose...
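To make the write-synchronization point concrete: the usual cache-aside pattern only stays consistent if you invalidate after the DB commit succeeds, which is exactly what those "after_commit" hooks are for. A toy sketch, with plain dicts standing in for Postgres and Redis:

```python
db = {}      # stand-in for Postgres (source of truth)
cache = {}   # stand-in for Redis

def read_user(user_id):
    if user_id in cache:           # cache hit
        return cache[user_id]
    value = db.get(user_id)        # miss: go to the source of truth
    if value is not None:
        cache[user_id] = value     # fill the cache for next time
    return value

def write_user(user_id, value):
    db[user_id] = value            # 1. commit to the database first
    cache.pop(user_id, None)       # 2. then invalidate (the "after_commit" hook);
                                   #    invalidating before the commit lands
                                   #    reintroduces stale read-after-write

write_user(1, {"name": "Ada"})
print(read_user(1))  # {'name': 'Ada'}  (miss, filled from db)
write_user(1, {"name": "Grace"})
print(read_user(1))  # {'name': 'Grace'}  (invalidation kept it fresh)
```

Every code path that writes the DB now has to remember step 2; that, multiplied across a codebase, is the "mo' stuff" part.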

In brief: yes, a specific datastore is usually a better fit for the job, but until your app requires more performance than Postgres can deliver, if you already _have_ Postgres you might as well stick to it.

Same for SQS - SQS is incredibly performant, but there are a whole lot of features it does not have that a PG-based queue system like good_job will give you out of the box. Just off the top of my head: priorities, separate queues, scheduling - and, do not forget, atomicity with your main transactional workload.

So while it is usually - when everything works great and the workload fits - not a big deal to run a specialized store, it can be more economical and simpler to just stuff everything into the DB until you outgrow that.


> Mo' stuff, mo' problems as they say.

This is basically not true, is my point. There is no meaningful "problem" with throwing up a Redis instance in AWS, this just doesn't mesh with my experienced reality.


Speaking from experience, you can get away with not understanding Redis for a while. Then one day you'll wake up and everything will be on fire, because you used Redis wrong, and now your main and replica are in a death spiral DoS'ing each other trying to pass a 1TB replication log back and forth.

You don't need to learn your tools for doing simple stuff under normal circumstances. You need to learn them to do bespoke surgery on live data while everything is on fire and the customers are threatening to fire your company. Or better yet, so that you can avoid doing that altogether by anticipating limitations during design.

That being said, doing everything in Postgres is also going to bite you if you have moderate scale. This is really the same mistake again. Postgres looks like a big truck you can just load up and load up, until you wake up one day and there are cascading failures across all services that touch that database, because you wrote a dumb query that took a lock for 5 entire minutes while it did an HTTP request. Its robustness will lull you into thinking something is working well when it's actually barely working.

(Before you object, yes, it is a better idea not to have multiple services talk to the same database, I hear you. And no, you shouldn't ever hold a database lock while doing an HTTP request, believe me I know. These things can happen.)


Er, I really feel like you’re not understanding how simple AWS makes managing a Redis instance.

I’ve been using Redis for nearly 10 years and it’s been a seamless and pleasant experience. Honestly it sounds like you’re taking your specific experiences and overgeneralizing.


Partly I'm sharing war stories because that's a fun thing to do, I'm not being entirely serious. Partly it's that you said that wasn't your experience of reality, so I broke off a little piece of mine and offered it to you. I'd suggest we're both generalizing, and as long as it's not taken too seriously, that's fine; it's how shop talk works.

I don't know the nature of the applications you've been working on those last 10 years, but it was more or less the main database for a high bandwidth, low latency service I was working on, also using Elasticache.

Problem spaces vary. If you're using it as a cache with modest load and consistency requirements, maybe you never need to understand it. But those sorts of requirements often creep & change out from under you.

So if you're saying, Elasticache did a good job of abstracting Redis, sure, I agree. If you're saying, there is no additional cognitive load to adopting a new service in your data path, because you don't even need to understand it - that puts a shiver up my spine, and makes me hear Pagerduty alerts in my head.


Keep in mind the context of this thread. It's some shitty blog suggesting you do every fuck thing on postgres. Myself and the other guy are simply suggesting that running a single-node instance of redis is an infinitely better and simpler choice than implementing a cache or a job queue on a rdbms.

I don't feel designing for a guaranteed high-availability application was part of the discussion at all.


If you don't find it applicable, feel free to discard what I'm saying, I take no offense. But I just don't quite understand your perspective. Maybe I have oncall firefighter brain rot, but a distributed job queue is exactly the sort of thing I'd want to be available, and a cache is something I regard as being very dangerous and requiring utmost care.


You use SQS, couple clicks and you’re done. You shouldn’t be using a cache for a queue. You use a queue for a queue.


Yes, mostly. There are a few things to take into account though:

- No multiple queues
- No priorities
- Practically no scheduling (the delay is very limited)
- Creating and tearing down a queue takes a lot of time, and the number of queues is subject to AWS account limits
- The FIFO/LIFO semantics (remember: no priorities?) will bite you when you least expect it

It does have great durability unlike Redis though and will scale to much, much larger queues in an easier way.
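One common workaround for the no-priorities point above is one SQS queue per priority level, polled in fixed order, which is exactly the kind of application-side code a PG-based queue spares you. A sketch with in-memory deques standing in for the queues (with real SQS these would be separate queues hit by `receive_message` in this order; function names are illustrative):

```python
from collections import deque

# One queue per priority level, highest first.
queues = {"high": deque(), "low": deque()}

def send(priority, message):
    queues[priority].append(message)

def receive():
    """Drain higher-priority queues before even looking at lower ones."""
    for priority in ("high", "low"):
        if queues[priority]:
            return queues[priority].popleft()
    return None

send("low", "rebuild-report")
send("high", "charge-card")
print(receive())  # 'charge-card'  (high-priority queue drained first)
print(receive())  # 'rebuild-report'
```

Note the caveat: this gives priority only between polls, and a steady stream of high-priority work will starve the low queue entirely, which is another thing a real priority column with `ORDER BY` handles for free.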


Not disagreeing, but purely for interest, Redis contains a queue-like primitive.

https://redis.io/docs/data-types/streams/

You can make good job queues out of this, combined with sharding or consistent hashing, for low(ish) latency applications. Each shard has a stream, they operate on data stored in Redis, and you pass them the key to this data over their stream.

But SQS is great, and a great rebuttal to the article. Totally easier to prototype a job queue that way than with pg, and you probably won't need to move off of it.
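The sharding step mentioned above is usually a consistent-hash ring: every producer computes the same shard for the same key, and adding or removing a shard only remaps a fraction of keys. A minimal sketch under those assumptions (the stream names and vnode count are illustrative, not anything Redis prescribes):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each shard gets several virtual
    points on the ring so membership changes move only ~1/N of keys."""
    def __init__(self, shards, vnodes=64):
        self._ring = []  # sorted list of (hash, shard)
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}:{i}"), shard))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def shard_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["stream-0", "stream-1", "stream-2"])
# Every worker computes the same shard for the same job key:
print(ring.shard_for("job:1234") == ring.shard_for("job:1234"))  # True
```

Each shard name would correspond to one Redis stream, and producers XADD to whichever stream `shard_for` picks.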


I have used both SQS and PG-based queues and for the smaller workloads/smaller systems (read: "not very very large systems") I now prefer the latter. There is also a non-trivial amount of stuff that we turned out to need for operating SQS at scale on the application side, basically to compensate for the things SQS does not have. It is great it doesn't have those things, but if you have a smaller application you might want to have those things instead and sacrifice a bit of scalability.


The advantage being that you could sort things to implement priorities and such? Did you use listen()/notify() at all?

ETA: seeing your list of missing features now, that all makes a lot of sense. In my mind the biggest advantage of SQS is that it glues together all the other AWS offerings, so you can go SQS -> Lambda for an instant job queue (with concurrency limits, etc. so you don't blow your hand off - perhaps undermining the simplicity argument). But everything you're saying makes sense if your job queue needs any degree of sophistication.


Sure, but you still need to run redis vs click a couple of buttons and create a queue, offloading the entire management to AWS staff.


Redis isn’t just a cache. That’s memcached. Also SQS absolutely sucks for a job queue as soon as you want to do anything like control concurrency or have job priorities, but if your needs are simply “I need a background job queue” then SQS is likely a great choice.


I agree with the OP's point of the overhead of adding more things to maintain.

For Redis though, the overhead is far more trivial than something like Kubernetes or Kafka, or even Elasticsearch or MongoDB


The overhead is actually the conversation we're having right now about whether postgres or Redis is better. It's not that postgres is hard to use or less performant, but that there's mental overhead in "here's how you use Postgres for session management, here's how you use it for application building". "Use Redis for this, Postgres for that" is easier to grok.


That's not the scenario they're describing: postgres has most likely already been designed and tuned to scale to their workload, and using postgres means you don't have to replicate that work for another system.


As long as it works the first time and everyone on the team is fine installing a local Redis, there is very little problem, provided the code doesn't make assumptions about read-after-write consistency for jobs. There will be "problems" (or, rather, things you will find out you haven't accounted for) when, for example, an improper URL is used and your Redis fails over. Or you do not have a replica configured (someone decided "let's save some of team XYZ's budget; is this really necessary? it's transient after all"). Or you started using something that saturates your Redis. Or you haven't configured alerting on Redis metrics...

It's all normal stuff, by a long shot not the end of the world, but it is stuff that you need to do, and it is more stuff, and it can bite you if you come unprepared and "just clicked a few instances into existence last year".


Chiming in to concur. Redis is amazing and simple software. You can use a managed service like Elasticache or install the binary on a VM instance. Folks using a relational database for a job queue or a cache when an infinitely simpler and more appropriate tool is available are just making poor technical decisions.


But it's not simpler when you consider all the things you've got to do around and after installing that binary on a VM instance. Consider the overhead of managing it - monitoring it, updating it.

Failover when the node dies. Clustering for high availability?

Backups? For a cache, probably not, for a job queue broker, probably necessary.

Making sure your app deals with inserting into Redis on successful transactions and not when a transaction is rolled back.

Getting up and running can be fairly painless, staying running on all edge cases and handling partial failures is what gets you.


I’m confused by the scenario you have in mind, where a) a singular postgres install which does everything is acceptable, but b) as soon as redis comes into the picture, suddenly you need HA and monitoring and apparently transactions with full ACID integrity?

It’s just a nonsensical and unfair comparison. You can run a single Redis instance with normal rdb disk syncs and don’t ever update it for years on end without issue. Is that guaranteed resilient? Absolutely not, but that’s not the scenario in discussion. We’re talking about the context of a bootstrap/MVP scenario, not an enterprise setup.

I’d take a single-node redis job queue every time over an HA citus/postgres cluster improperly acting as a queue.


I think the point is they have a Postgres server running anyway as the datastore, and the job queue being in Postgres gives you HA, backups and transactions for free. I think Redis in particular won't give you transactions, right?

Needing transactional semantics for jobs alongside an application operation makes a lot of the simpler queue/tool choices difficult.
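The "transactions for free" point is worth spelling out: when the queue lives in the same database, the job insert and the business write commit or roll back together, so a crash can never produce an order without its follow-up job. A sketch with SQLite standing in for Postgres (the `orders`/`jobs` tables and the simulated crash are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT)")

def place_order(total, fail=False):
    """Insert the order and its follow-up job in ONE transaction.
    With an external queue, a crash between the two writes loses the job
    (or enqueues a job for an order that never committed)."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))
            if fail:
                raise RuntimeError("simulated crash before enqueue finished")
            conn.execute("INSERT INTO jobs (payload) VALUES (?)",
                         ("email-receipt",))
    except RuntimeError:
        pass  # both writes were rolled back together

place_order(9.99)              # both rows committed
place_order(5.00, fail=True)   # neither row committed
orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(orders, jobs)  # 1 1
```

Getting the same guarantee with Redis or SQS requires an outbox table or two-phase dance, which rather undercuts the "couple of clicks" framing.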


Thank you for the nuanced assessment, I would tend to disagree still.


Every added service also gets multiplied by the number of environments you need. Sure, prod is just one. But you need a staging env too. And CI needs the service to run the tests. And you probably want a feature-branch environment. And local development of course. It adds up and every service you add gets multiplied by the number of environments.


And each developer has their own dev env. QA’s may also run things locally.

And then there is monitoring, dealing with security issues, and upgrading.

It all multiplies.

Elsewhere people are debating whether Redis adds that much more overhead. I think they are missing the wider point. It’s about how many pieces you will have. 1 is simpler than 2. 2 is simpler than 3. 3 is simpler than 12.

Nowadays there are so many great specialized tools that do certain things really well. And for most use cases, you don't need them. And when a time comes that you do start needing them, you add them then. And people will grumble about the idiot who implemented full-text search inside of Postgres, while being completely blind to the 7 years of saved time NOT managing elasticsearch across 38 environments and quarterly upgrades.


Now you have to pay for 2 services instead of one (elasticache is not cheap), and you may need to account for two differently configured redis instances (setting it as cache store requires a different configuration than setting it as a job queue).

You'll also need to code for two integrations (ORM + whatever you're using redis with), which may be a solved problem or not, depending on your stack. And even then, it's still more complex than just postgres, and more error-prone, considering you'll either have to ignore enqueue reliability or find a complex way around it.


Have you considered you’re just a shitty dev?


SaaS is nice, but you still have to worry about Redis drivers, local development set-ups, etc. And unless you can make do with AWS’ free tier, that Redis instance isn’t free.


This doesn't apply to everything. But yeah redis does what it does really well. I almost feel like I want to spin redis up before I spin a DB up. It is excellent tech. Can use it for caching. Sessions. Distributed locks (niche but when you need it it's fantastic).

But yeah, I think this is more aimed at the folks who decide to spin up a large service when they just need a messaging layer that their DB can solve for them. That has a non-insignificant cost.

Redis is pretty low maintenance tho in many configs (k8s, on a vm, or a service like on aws have all basically been zero maintenance for me).


Perhaps you’re not using it in “clever” ways, as recommended in TFA?


Those 40 lines of code give me nightmares. What if the dev disappears and all we have left is the compiled binary, the source gone?

Little tiny custom stopgaps are the worst in a big system.


> Choose boring technology (tm)

I'd call Postgres pretty solidly "boring technology", including for session storage and job queues. People were storing sessions in SQL databases when I got my start in 2005!

It won't address every scale and every use case, but then, that's never your project's requirement anyway.

I frequently see the term "boring technology" treated as a euphemism for "what I'm accustomed to".


Absolutely on point.

Postgres can and will scale to workloads that average developers can't comprehend. I've used Postgres professionally since 2003 doing everything from run of the mill web work to high volume log aggregation systems with massive datasets. Postgres can ingest and index hundreds of thousands of records per second on pretty modest hardware by today's standards. One just needs to learn the tool.

Postgres is far more "boring" and battle tested than Redis. All the hipster tech that came out of Web 2.0 companies (including Cassandra and Redis) has hallmarks of being built by junior developers who refused to learn and then build upon state of the art and went on to reinvent many wheels quite poorly.


You had some reasonable points but then you segued into name-calling and baseless negativity. Both redis and Cassandra are well-built, well-understood, common industry tools that have been in use at scale for many years now. With their own trade-offs, yes, but quit it with the "built by junior hipster developers" BS; it's not a good look.


I stand by my point. It was all reactionary stuff. I went to the conferences and saw the talks by those bright eyed young developers. Their disdain for RDBMS and SQL was as fervent as it was misguided. They hailed a new era of Big Data and NoSQL. Hadoop was gonna be the way to store it all or you weren't webscale.

Fast-forward a decade and most of it is in the dustbin of unmanageable tech while good ole RDBMS outlived them all. Mongo is about the only one I still hear about and see pop up in job ads. It might share the fate of Hadoop too when people learn that Postgres can index JSON too and even be sharded if you need to (you likely don't).


If you think that redis and Cassandra are in the “dustbin” then I don’t know what to tell you, except that you’re possibly very out of touch. Especially redis, it seems to be used everywhere


You may be confusing redis with mongo which definitely suffered that. Redis is simple and well designed when used as directed.


Won't adding items to the queue lead to an fsync on each commit? That's the first thing that comes to mind and worries me.


Fsyncs can be combined. You don't need 40k fsyncs for 40k inserts. While one fsync is running, further inserts are queued and fsynced together when, e.g., the previous fsync is done.
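This batching is usually called "group commit": writers that arrive while a flush is in flight all become durable on the next fsync instead of each issuing their own. A toy simulation that just counts fsyncs (a sketch of the idea, not Postgres's actual WAL code):

```python
class GroupCommitLog:
    """Toy write-ahead log: commits queued while a flush is pending all
    share the next fsync instead of each paying for one."""
    def __init__(self):
        self.pending = []      # commits waiting for the next flush
        self.durable = []      # commits already made durable
        self.fsync_count = 0

    def commit(self, record):
        self.pending.append(record)

    def flush(self):
        """One fsync makes every queued commit durable at once."""
        if self.pending:
            self.durable.extend(self.pending)
            self.pending.clear()
            self.fsync_count += 1

log = GroupCommitLog()
for i in range(40_000):
    log.commit(f"insert-{i}")
    if i % 100 == 99:          # flusher wakes up once per 100 commits here
        log.flush()
print(log.fsync_count)  # 400 fsyncs for 40,000 inserts
```

The trade-off is latency: each commit waits for the shared flush, which is why the batch window is kept short in real systems.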


Counter argument: 1 postgres instance is far easier to manage than a slew of servers. I can spin up a heroku environment with a postgres very quickly, and with little effort.

If you're starting out with a team of say 4, and that team will stay small, don't waste your time integrating complex tech.

Solr is DEFINITELY better than postgres full-text search. However, pg_search and a clever index have let me solve 95% of all our search needs in almost no time with zero maintenance. Once you get into search complexities, Solr quickly outperforms postgres, but then you have a lot of setup and environment management to do.

So it is all about team size. My instinct is at around 10 engineers you should start thinking about which complexity is worth pulling out of postgres and into its own service based on needs.


Solr is better in snots use cases but does things like sparse indexing very poorly and Postgres is actually worlds better


"Snots use cases"? Is that an unusual autocorrection for COTS?


I very much disagree with the idea that a CTO's job is to minimize code built in-house.

Every app has different needs and there's always some code that would be a lot simpler if it were written in-house and specifically built for the needs of the application.

Abstraction has a cost and if you take everything off the shelf you'll end up with a much higher overall level of abstraction in your codebase. Plus, your engineers aren't going to understand third-party code as well as in-house code.

I've worked at companies that wrote almost everything in-house and I've worked at companies that had a phobia of in-house code. Both had problems and I think the real solution is somewhere in the middle.

Edit: I also get a "nobody was fired for buying IBM" vibe from this.


Not only that, but in-house offerings can be a competitive advantage. Sometimes it's over-engineering, and sometimes it's the secret to what makes your business successful over competitors.

Building in-house lets you build only what is really needed and nothing more, adapt to your specific needs, and identify challenges unique to your domain and tailor solutions to them.

A good CTO has good intuition about when it makes sense to invest in an in-house solution, when a self-managed open-source solution is best, when a paid managed offering is best, and all manner of hybrids.


Honestly the biggest competitive advantage isn't in the thing you're building in-house, it's just the fact that you can debug it 10x faster than your competition.

I never understood this until I saw some dysfunctional enterprise companies. I've seen entire teams spend 3 months doing fuck all because they can't get a local dev environment or build pipeline to work.

If you only have Postgres, and you're weighing up whether to add a second dependency instead of installing a Postgres plugin or writing 50 lines of code, of course it's going to seem like an obvious choice ("I can get this working right now instead of spending a day or two on it! Wow!").

If you make a habit of making those decisions again and again, then your project is going to spiral out of control before you even realise it.


I think this may underplay the additional operational cost and risk in deploying multiple classes of services.

You added RabbitMQ for that one queue use case? You suddenly need to handle some health-check edge case in prod because your programmers don't have experience with it. Just added redis? You now have an extra set of server and client SDKs to regularly patch.

Etc. Sure, there’s a point where it’s logical to add a new class of services, but it’s not remotely close to zero (which is how I read the comment).


The enemy's gate is down. If a CTO brings the surface area of "built in-house" down to zero while accomplishing all of their objectives, ad infinitum, they win the game.

Obviously, it would be impossible to maintain such an advantage in real life for any extended period of time. However, orienting your team towards that goal gives the CTO a way to quantify risks and costs associated with accomplishing their objectives. Health checks and SDKs are standardized commodities that can be implemented and maintained at a predefined market rate that's always approaching zero. Finding and fixing a bug in your proprietary code has a potentially infinite cost.


Finding and fixing a bug in your own code that interfaces with a Kafka cluster that's tripping a distributed-systems edge case is infinite squared. Solving it requires a huge volume of knowledge about complicated systems and topics, as well as debugging across system and server boundaries.

It's all trade-offs. Complexity is complexity, whether the code is yours or not.


Thinking about software costs in this way is so oversimplified that it ends up being wrong in practice. I'm reminded of the McNamara Fallacy.

Also, SDKs are not standardized. Third-party software still gets updated which means code maintenance for your team. And, even worse, it's software that is generally opaque because, by definition, it wasn't written in-house. I worked at a place where they had a phobia of in-house code and the end result was huge amounts of time wasted on updating libraries and debugging issues caused because we didn't really understand the code we were using.

The real answer is deeply understanding your product and finding the right mixture of in-house and third-party software that maximizes simplicity while also allowing for flexibility and growth in ways that matter for your specific product.


Good reply, thanks.


Now your engineers need to understand all these technologies instead of just postgres. As usual the correct answer to when to repurpose existing tech in your stack and when to add new tech depends on your specific needs, team size, etc.


The majority of engineers understand basic SELECT and INSERT semantics. Correctly building queuing and caching systems on top of database concurrency primitives is an order of magnitude harder than just using RabbitMQ and Redis.


> I think that the job of a CTO is to minimize the surface area of "built in-house" so that your team can focus on things for which your customers actually pay you.

The "built in-house" things are the only things that actually make your company competitive and provide shareholder value. There is no value in downloading and installing commonly-available stuff from the internet.


Which is exactly why if you aren’t a database vendor, you shouldn’t have your engineers spend their time maintaining a proprietary database technology.


If maintaining a proprietary database technology solves a problem that provides some shareholder value, then go ahead and do it.

Maintaining off-the-shelf software that you got from GitHub provides no shareholder value whatsoever, however. It is a pure cost center.

As a CTO you do not want to be the guy that runs a pure cost center department.


There is negative value in rebuilding things that could have been downloaded for free from the internet. Postgres, MySql, Sqlite, RabbitMQ, Redis, Kafka and all the other common tools each have had thousands upon thousands of hours sunk into bugfixes, correctness guarantees and performance work. Rebuilding a poor version of that does not provide shareholder value at all.


Many others have said this elsewhere in this thread, but using all those things is not free. Every added service is an extra burden to maintain, all the way from writing code through to running and maintaining it in production.

Sometimes, the cost (both immediately and long term) of introducing a new component into your stack will exceed the cost of building and maintaining an inferior in-house solution, and sometimes it won't. Either way, the cost of these freely downloadable services is not just their cost to download.


The problem isn't using one of these things; the problem happens when you use each of these things. Now you have a distributed system where data lives in different systems, and you have data inconsistencies or have to deal with distributed transactions. The advantage of using postgres is you can continue to use a single transaction for data integrity.


I’m not sure I follow. Postgres has been around forever, it’s basically what I would call boring technology. Sure they’re still innovating, but the core tech is mature. An example of a “not boring” database would be like Cockroach or Planetscale.


I think "use postgres for everything" is a good, solid foundation to start with (unless you have a really good reason not to and know the exact use case beforehand). Can redis, rabbitmq, solr etc. do a better job for a specific use case? Of course, because they are dedicated tools for the job. But the question is: is the built-in tool in postgres "good enough"? And for many things it is.

I see postgres like a tractor. It's the most important machine, and you can attach a lot of different implements to get a lot of jobs done. Sometimes you notice a task needs a specific machine the tractor is not suited for; then you invest in that machine. Investing in all the specialized machines from the beginning, just because each can do one single job better than the tractor, is not an efficient way of farming. There is a price to having all those machines too, and it does not necessarily result in a better outcome.


Redis doesn’t “solve” sessions.


This is the usual debate on "hammers and nails".

You are absolutely right, but what if(just an example) you need transaction behavior between your queue system and your database? Good luck with 2pc or using saga and similar.

Or, as other people stated, how much will cost to maintain postgres + redis + rabbitmq vs only postgres?

In my opinion, the golden rule is: *use the most general purpose tool you know unless you hit a hard limit of the tool, in that case start moving to a specialized tool*.


Couldn't say the first sentence any better. If your company doesn't make money directly from the "thing you built", then it's losing money, lots of money.


This is build vs buy vs run. Yes, what you're saying might mean slightly less is built, but there's a lot more to run. Having ops people who know Postgres inside out might go better than having ops people who have to try and know four totally different technologies a bit.


Building on this, for many companies, the leader of the IT org has the main responsibility of focusing on end (or outside) customer needs at the highest quality with a low "Total Cost of Ownership".


Part of being a CTO is knowing when that specialization, both in tools and in-house development, makes sense for your context.


Choose PostgreSQL as long as possible.

This is probably good enough for 99% of the stuff running out there.

For everything else you have architects which will do the right thing for you.


What is a session?


I love postgres for most things, but these days (especially while my product is in early development, embedded, or just not internet-facing) sqlite is amazingly workable.

Killer postgres features, however: row-level security (fantastic when you're using something like postgrest for rapid backend development [1]), and a built-in full-text search engine that's 'good enough' for use cases like when you have an enormous users table and need to index something simple, like email addresses for quick login.

[1] https://postgrest.org/


> need to index something simple enough, like email addresses for quick login

It sounds like you might be unfamiliar with the common trick to index long text fields in Postgres: you make an index of hashed values, and use the same expression for lookups to ensure the index gets hit.

In case you are familiar with it, maybe it helps someone else who stumbles upon this. :)
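Concretely, the trick is just an expression index; a sketch (the `users` table and lowercasing are my assumptions, not the parent's exact setup):

```sql
-- Index the md5 of the (lowercased) value rather than the long string itself.
CREATE INDEX users_email_md5_idx ON users (md5(lower(email)));

-- Using the identical expression in the WHERE clause lets the planner hit the index.
SELECT id FROM users WHERE md5(lower(email)) = md5(lower('bob@example.com'));
```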


I'm confused, how is matching against a hashed index faster than just matching against a string field? Surely postgres' indexing engine should treat these two things more or less the same, perhaps quietly performing the text -> hash conversion on the email field for quick lookups when it's used directly in an index (and perhaps performing even more optimizations than this basic transform)?


I can confirm that indexing on a hashed value is definitely faster on SQLite at least.

The reason is that indexing on a hash produces a much smaller index so more of it can fit in memory. By minimizing disk seeks, you can speed up query time even if the Big O is the same. In both cases it should be O(log n) since SQLite uses btree indices.


This is purely up to the schema designer or dbadmin. You can create a Postgres index specifying “using hash” to specify that the index will not contain the contents of the field, just its hash.

I’m pretty sure that still necessitates a hit to the db to prevent false hashes, but that’s going to be the case with your approach, too.
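i.e. something like this (hypothetical `users` table):

```sql
-- The index stores only fixed-size hash codes of email, never the strings themselves.
CREATE INDEX users_email_hash_idx ON users USING hash (email);

-- Equality lookups only; Postgres rechecks the actual heap row,
-- so hash collisions cannot produce false matches.
SELECT id FROM users WHERE email = 'bob@example.com';
```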


Surely you give up LIKE queries benefiting from the index this way?


Last time I needed this, LIKE queries still did sequential scans anyway, so the presence of an index or not did not matter.

Also note that you can have multiple indexes in Postgres (well, any RDBMS).

For emails in particular, you can even do partial indexes (eg. imagine filtering emails on @gmail.com to hit a separate index that's only half of your full table index). This requires some care with queries, but a wonderful feature regardless.
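A sketch of that gmail split (hypothetical `users` table; this is the "some care with queries" part, since restating the index predicate is the safe way to let the planner prove the partial index applies):

```sql
CREATE INDEX users_gmail_idx ON users (email) WHERE email LIKE '%@gmail.com';

SELECT id FROM users
WHERE email = 'bob@gmail.com'
  AND email LIKE '%@gmail.com';  -- repeat the partial index's predicate
```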


With a trigram index in PostgreSQL you can even do index-supported leading wildcard searches.
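Via the pg_trgm extension, e.g. (hypothetical table):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN index over the trigrams of the column.
CREATE INDEX users_email_trgm_idx ON users USING gin (email gin_trgm_ops);

-- A leading-wildcard LIKE can now use the index instead of a seq scan.
SELECT id FROM users WHERE email LIKE '%@example.com';
```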


I think GP may be referring to the use of GIN indexes https://www.postgresql.org/docs/current/textsearch-tables.ht...


Yes that.

When you get over a million users...

I've yet to try the index of hashed values trick though, when I revisit this problem in the future I'll make sure to take note of this!


I've used it successfully at admittedly a small scale to enable full text search of notes-type records on a TEXT field.


I was worried they were talking about doing it manually, and was thinking "surely there's a built-in way to create a hashed index" -- Glad to learn there is!


A couple of obvious reasons why it's not the same: a hashed index is not useful for anything but direct equality comparison.

If you need any normalization like lowercasing, you need to do that yourself.

Next, you can't do collation/ordering using the index (and greater/less than comparisons), unless the hash function maintains ordering properties.

By having the option of an expression index customised to your data, you can get the fastest query performance.


You're probably right. I'm not really a DB admin; I just found the built-in full-text search when I was tasked with reworking a million-plus-row users table to go fast. Boss thought we needed sphinx or elastic or something hahaha


Alternatively you might be able to just use a hash index if you don’t care about range query performance.


it's a huge waste of space and now you have to have triggers somewhere to keep the index field consistent.

the database should be able handle case insensitive indexed lookups directly.


you can use postgres generated columns to keep the index field automatically consistent. https://www.postgresql.org/docs/current/ddl-generated-column...
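e.g. (a sketch; hypothetical `users` table and md5 as the hash):

```sql
-- Postgres computes and stores the hash on every INSERT/UPDATE; no triggers needed.
ALTER TABLE users
    ADD COLUMN email_md5 text
    GENERATED ALWAYS AS (md5(lower(email))) STORED;

CREATE INDEX users_email_md5_idx ON users (email_md5);
```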


I've seen this kind of thing a lot before and I'm saying that it's almost never needed. All you're doing is building your own bespoke indexing system on top of a database that is doing it (much better) already.


You're augmenting the indexing system. By using hashes you get a column with a fixed, predictable size. If the average size of your large strings is larger than 16 bytes (or 32 bytes if you store the hex string) you'll get more rows per memory page. If you've got many millions of rows the savings add up. A little bit of savings lets a smaller DB instance go farther.


Oh wow that makes perfect sense, I see exactly why this solution would work better now, very good point.

The other thing is that if you're inserting your emails without running some ToLower() function on them first during validation, you're probably making a bit of a mistake. There's some other discussion in the thread about this.


You should definitely normalize e-mails before hashing or doing any comparison on the back end. Especially if you've had a stage during account creation where you verified delivery of that normalized address. When you take in an address later you normalize the input and compare it to your stored value (that itself was normalized before storage).
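In SQL terms the normalization step is just something like (a sketch; table name hypothetical):

```sql
-- Normalize once at the boundary; store and compare only this form.
INSERT INTO users (email) VALUES (lower(btrim('  Bob@Example.com  ')));

-- Later lookups normalize the input the same way before comparing.
SELECT id FROM users WHERE email = lower(btrim(' bob@example.com'));
```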


This is using the DB as designed. Not a bespoke solution.


Couldn’t you use an expression index, or even a stored generated column with a normal index on it?


Just use an expression index.
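e.g. for case-insensitive lookups (hypothetical table):

```sql
-- Doubles as a case-insensitive uniqueness constraint.
CREATE UNIQUE INDEX users_email_lower_idx ON users (lower(email));

SELECT id FROM users WHERE lower(email) = lower('Bob@Example.com');
```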


Can't you use CITEXT for this?


It is not very useful to add full text search to an email field used for login. A regular unique index, perhaps case insensitive, is what you should be using.


Today I could be a lucky 10,000 - are emails case sensitive? I had always assumed you could pre-process (ie lowercase) them before insertion so that database case sensitivity was not an issue.


Just use the citext extension and move on with your life. This is a solved problem: citext preserves the original casing, but querying and indexing against it are case-insensitive.
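e.g. (a sketch):

```sql
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE users (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email citext UNIQUE NOT NULL   -- stored as typed, e.g. 'Bob@Example.com'
);

-- Comparison (and the unique index) is case-insensitive.
SELECT id FROM users WHERE email = 'bob@example.com';
```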


> are emails case sensitive?

IIRC, as per spec/rfc, e-mail addresses ARE case sensitive.

However, the de-facto standard is to ignore that and deliver emails for Bob@example.com, bob@example.com and BoB@example.com all to the same mailbox.


Technically speaking the portion of the email address before the @ is case-sensitive. However in practice it is ubiquitous that they’re treated as case-insensitive across all mail platforms.


This means however that you should store the original email and not just lower case it on insert. Imagine if you could reset password for Jane.Doe@example.com by registering the jane.doe@example.com address (assuming example.com does differentiate between the two) and requesting password reset for that.


Surely this is why standards are important. An email server could use whatever logic it wants to determine which account to deliver an email to. But if email is to be used by other services as an authentication mechanism, there had better be a widely adopted standard for how emails get delivered.


It goes deeper than that. If emails are case sensitive, everything changes in the context of unique accounts. If you have jane.doe@ and Jane.doe@ attempts to login - what do you do?


You create a contact address from a normalized version of the entered address (after address verification) and an independent account ID. You can also generate an account ID derived from that normalized address.

The positive response of the address verification will tell you the address is deliverable and the user has access to it. Later if someone tries to register a capitalized form of the address it'll get rejected because of that account ID collision. Then the user can be pushed to a password recovery path where they'll need access to the e-mail/MFA to get control of the account.


My point was that I think it is bad user experience if my email is "jane.doe@", but autocorrect has me input "Jane.doe@" (something I have experienced before). As a user, I "entered the same thing". On a technical level, they are different, but a decision must be made as to what is the true representation.

Amusingly, the context of this thread was in using case-insensitive search for email fields, but if emails are truly case sensitive, this is all moot, because you can only do direct comparisons.


In practical terms e-mail addresses are case insensitive. So if on account creation you normalize the address (lower case, trim whitespace), send a verification e-mail, and they successfully verify, you can safely derive an ID from that normalized address. It won't matter later if autocorrect tries a mixed-case address, since you normalize and compare it on the back end.

If you run into a case where their e-mail server enforces case sensitivity they have bigger problems to deal with. E-mail has long been a system that requires loose adherence to the specs.


Does anyone know a single example of a case sensitive email provider or email server implementation? I believe I saw a positive answer to this 10 years back (an old university mail server?) but these must be quite rare.


There's an extension for trigram similarity operators which is useful for a quick and easy fuzzy search for small things like email and name:

https://www.postgresql.org/docs/current/pgtrgm.html#id-1.11....


is case-insensitive fulltext faster or slower than a case-insensitive index search on a varchar?


The index should not be slower unless something is seriously wrong.


Do you need (or at least benefit from) your database to run in-process? Because that is the only advantage I can see to SQLite over Postgres. Which makes it the better candidate in many places, but not for anything like a server.


For me it's maintenance. Sysadmin level of effort on a SQLite file is near 0


I think this is true only if you're willing to give up resiliency.

You'll have to shut the app down to back it up. That problem gets worse if you want some kind of a cold standby, since backups should be frequent.

You could replicate it, but that requires running a separate daemon and starts to beg the "why not just have Postgres be the daemon?" question.

Sysadmin tasks start to get very difficult when someone starts with SQLite and expects to get Postgres-like features out of it. I'd rather run Postgres than try to replicate or continually back up SQLite.


> You'll have to shut the app down to back it up.

This is not the case:

1) SQLite recommends using the official backup API, rather than copying files on disk. The backup API can be used while the app is running.

2) Litestream is the hot new tool on the block. It streams incremental DB changes to a backup stored on S3, for up-to-date point-in-time recovery.
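As an illustration of (1): Python's standard sqlite3 module, for instance, exposes that online backup API, so a consistent snapshot of a live database is a few lines (file names here are hypothetical):

```python
import sqlite3

# The source may be in active use by the app; SQLite's online backup API
# copies a consistent snapshot without taking the database down.
src = sqlite3.connect("app.db")
dst = sqlite3.connect("app-backup.db")
with dst:
    src.backup(dst)  # wraps sqlite3_backup_init/step/finish
src.close()
dst.close()
```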


Adding to (1), you don't have to code anything to the backup API - the standard "sqlite3" command line has a ".backup" command that does it for you.


But how do you make sure a pool of web servers all have the same consistent view of your database?


Oh yeah, if the app is big enough to have multiple replicas, I'll use postgres over sqlite. Mainly, sqlite for me is for little apps/services. I do know people that have a setup for distributed sqlite, but IMHO when you want more than one machine talking to the db, postgres is a much better fit.


Yeah it's nice not to have to think too hard about that when you're trying to focus on a product getting shipped.

Plus, not everything is a public-facing website. It's amazing how little overhead it takes.


> not everything is a public-facing website

exactly. single-node availability is often "good enough." when using sqlite for a simple service, I've cranked things out from first-line-of-code to production in 30 minutes. When the app only gets a hit every minute or so and only from internal traffic, no need for the overhead of a highly available service.


That's fair, though I've never done any sort of sysadmin on a Postgres server either..


It makes the dev setup trivial, since there is no database server around.


Setting up postgres on the same machine is also trivial.

Modern, production-grade, web-scale machines are able to run more than one process, these days.


Sure, but an embedded database is still simpler than client/server. For many tasks, postgres does not offer any meaningful benefit over sqlite so why add the complexity?


I've never worked on such a project that didn't benefit from postgres, but if I did, I would still use postgres, because upgrading from sqlite to postgres, learning a second SQL dialect, and converting SQL between dialects sounds like a useless nightmare that costs me nothing to avoid.


I feel it's the modern hype around the lonely/indie/solo developer creating a new thing that changes the world. Not caring about "complex" things until the project reaches 1 request per minute and finally has some data to lose is totally fine at that goal/scale.


I would agree. I'm currently also playing with Postgres triggering some data out to a sqlite data wrapper, to be distributed by Litestream to servers so they can read the data locally.


Why would you add sqlite here? Postgres can do all of that without the extra tech of litestream.


Yeah I'm super confused about what is going on in this architecture


I think the design is for local (edge node) read replicas?


One thing I miss in sqlite is numeric types that store binary-coded decimal. (Is there a plugin for that?) This is very useful for currency calculations.


I spent a decade using SQL Server for everything (back when Postgres wasn't close) and it is an exceptionally effective strategy. Deployments are simple, debugging is simple, operations are simple.

Do that until scale forces you to start specialisation.

It's surprising how far you can go with database-as-MQ. Even database-as-IPC can work for smaller systems.


The original implementation of MSMQ was built on top of MSSQL. I don’t remember when it switched to a bespoke storage technology though.


How do you handle high availability? Ideally I would want to not lose many transactions and to have automatic failover. SQL Server has that, but it's not the default config, and it becomes expensive af.


Do you really need it early? I got like 2 failures in 20+ years; I just let it fail.


Exactly.

A lot of startups are convinced they'll need Google scale from day one. Then, of course, the overwhelming majority fail in the first year.

Get a big, reliable, and cheap vhost/server somewhere, use as many "can't really go all that wrong" components as possible, like postgres, minio, etc., and dockerize everything. If you want to get "fancy", use ZFS and set up some snapshots and backups. Most solutions don't even need 100% uptime. Communicate maintenance windows to customers and you'll be fine with a total of an hour of downtime (or whatever) in the first year. Most big, really complex, early over-engineered and unnecessarily "optimized" solutions have enough footguns that you'll probably end up with more unscheduled downtime in the first year anyway.

In the rare event the startup really succeeds and customer demand, load, uptime requirements, etc demand it you can throw revenue/funding/etc at a K8s control plane on your favorite hosting provider, use a managed postgres/db/whatever, and S3 compatible object store, etc. Or, if things get really big skip all of that and hire in house talent to manage a couple of racks of leased hardware (same opex as cloud but almost always SUBSTANTIALLY cheaper) in geo redundant/distributed colocation facilities.

I've launched multiple startups with this strategy and it's gone very well. My current startups all run from the same big (but 10yr old) hardware that has loooooong since paid for itself even with lots of GPU, storage, etc upgrades over the years. People can be kind of scared of hardware but I've never had downtime or data loss caused by a hardware failure in almost 20 years of this approach.

People are always amazed when I do things with ML, TBs of data, lots of bandwidth, etc and I tell them my total hosting costs are $150/mo.


> My current startups all run from the same big (but 10yr old) hardware

I'd love to hear more about the setup. I suspect something as simple as a disk failure would cause an outage, although I suppose you can detect a soon-to-fail disk via SMART and resolve that with scheduled maintenance/downtime. But what about a power supply failure? Do you keep redundant backup parts on hand?

There's definitely something nice about having a hardware error on a cloud VM result in that VM cycling out to new hardware automatically. In contrast, something as simple as buying a new off-the-shelf PSU feels like a ~1 hour downtime event (longer if you don't have purchase card authority, it's night time, you need to order online, etc.).


I like to use ZFS raidz2 at a minimum. More or less bulletproof from a storage/hardware standpoint.

The system I referenced currently has 8x 4TB NVMe drives on ZFS raidz2 and 8x 16TB spinning-rust drives, also on raidz2. I use sanoid to snapshot like crazy (down to 15 minutes) and then syncoid to push snapshots (and then some) to the spinners' ZFS array, plus zfs send to a remote for offsite copies.

Modern switching power supplies are incredibly reliable. Then do proper “half load” dual power supplies from dual conditioned power feeds with UPS and generator. Losing power to a machine almost implies an extinction level event or complete incompetence on the part of a data center operator.


Man, I agree but I also really disagree.

I think you should keep complexity down by only using postgres, but just use it as a SQL database until you scale enough that it's a problem.

90% of the time it's never going to scale that far.

When it does, use the tools people have made to do those jobs correctly. Do not hack your way into half-made tools bolted onto a database that isn't specialized for that job.

It seems like you're making your life simpler by having only one system, but you're having that one system do so many things at the same time that it's going to be an absolute kludge in the long term.

Does anyone have experience trying something like this? If you did, is this how it turned out or did it actually work all right?


Yep, let’s not forget that your “simple” Postgres cluster gets complicated really fast when you start having to graft layers of tools on top of it. There’s a reason why Spanner and DynamoDB exist. God forbid someone try to use Postgres to solve the same problems and actually get the configs wrong leading to years of inconsistent data. Definitely never seen that happen…


It's somewhat situation-dependent, but I've developed healthy systems that scale to surprisingly high throughputs with just a couple of well-designed SQL databases, read replicas and memcached.


Gall's Law:

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

A SQL database, maybe supplemented by a cache, can carry most projects as far as they'll ever go. And if you outgrow it, replace it with something that meets your new needs.

Any given project's "right tool for the job" can change over the project's lifetime, and optimizing for problems you don't have is quite harmful.


Nailed it.


I need to store ~50 million records with around 40 columns of strings, integers, decimals, various data. Only needs to be indexed by two string columns, but every day I want to "upsert" ~10 million records with data that has potentially changed, plus update a column that represents the date that the row was last updated on every row.

Postgres seems to be inefficient at storing rows with large numbers of columns, and "upserting" data (using UPDATEs or INSERT ... ON CONFLICT) results in huge amounts of disk writes, I think because Postgres writes the entire row even if only a single column has been updated.

I could store some of the regularly updated columns in a separate table, but I'd thought I'd ask, is there a database out there that is more suited for this type of workload? Because I feel like "just use postgres for everything" is not the answer here.


> I think because Postgres writes the entire row even if a single column has been updated.

This is because of the consistency model postgres uses: Roughly, the old row remains in existence as long as transactions started before the UPDATE are still running, and the new row exists alongside it for that duration. Transactions started before the UPDATE can't see the new row, and transactions started after the update can't see the old row, because the transaction id (txid) is added to the query and is compared to hidden columns on each table/row (xmin and xmax).

This is one of the things VACUUM cleans up, actually deleting those old rows once all the old transactions have ended.
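You can actually watch those hidden columns (a sketch, assuming a hypothetical `accounts` table):

```sql
-- xmin: txid that created this row version; xmax: txid that deleted/updated it.
SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 42;

UPDATE accounts SET balance = balance + 1 WHERE id = 42;

-- Same logical row, but a new ctid and xmin: a brand-new row version on disk.
SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 42;

VACUUM accounts;  -- reclaims the dead old version once no transaction can see it
```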


I mean, you could certainly try MySQL to see how it compares. It's hard to guess in advance because it seems like this is a very specific scenario.

But 40 columns should be a cakewalk for any RDBMS to handle. And I assume you're not storing KBs of data in the individual columns? Because if your strings are large enough to be allocated as references to separate blob storage, rather than in the row, that could be a performance problem.

Bulk operations with millions of rows can also sometimes take way longer if you're doing them as a transaction that can be rolled back. If it's acceptable to disable that, that could be a huge improvement.

You can also sometimes find massive speedups in using bulk SQL statements (upserting 1000 rows per query, rather than 1 row per query) or CSV file import rather than SQL.
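The multi-row upsert shape would be roughly this (hypothetical `items` table; the trailing WHERE additionally skips no-op writes):

```sql
INSERT INTO items (sku, price, updated_at) VALUES
    ('A-1', 10.00, now()),
    ('A-2', 12.50, now())          -- ... batch ~1000 rows per statement
ON CONFLICT (sku) DO UPDATE SET
    price      = EXCLUDED.price,
    updated_at = EXCLUDED.updated_at
-- Optional: don't rewrite rows whose values haven't actually changed.
WHERE items.price IS DISTINCT FROM EXCLUDED.price;
```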

And obviously make sure you're using indexes wherever appropriate.

Because generally speaking, upserting ~10 million reasonably-sized records is the kind of thing that should only take a few minutes on an SSD. You're not going to do it in seconds, but it shouldn't be taking an hour or anything either.


The issue isn't so much the speed (yes, I use bulk SQL statements and can upsert like 4-5k rows per second); it's more the disk writes, which I find unsettling: after a few months I've noticed several tens of TBs of disk writes from Postgres onto the SSDs, which seems like unnecessary wear.


Then it seems your setup is totally fine. If you're upserting 10 million rows a day with lots of columns, then of course you're going to see TBs of disk writes. And remember that SSDs can't even write individual bytes the way HDDs can; they necessarily write a whole page at a time, which might be 4K or 16K, even if you're just updating a single integer.

Your database and SSD are functioning totally normally, as designed.


Several tens of TBs of writes over a few months is literally nothing to be concerned about for any reasonably modern enterprise SSD in a production setting.

This seems like a bizarre rationale to make a database choice.


> I think because Postgres writes the entire row even if a single column has been updated.

You might want to have a look at HOT [0] tuples if you haven't already.

[0] - https://www.cybertec-postgresql.com/en/hot-updates-in-postgr...
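For context, HOT only kicks in when no indexed column changes and the heap page has free room; a lower fillfactor leaves that room. A sketch (hypothetical `items` table):

```sql
-- Keep 20% of each heap page free so updated row versions can stay on-page.
ALTER TABLE items SET (fillfactor = 80);
VACUUM FULL items;  -- rewrites existing pages with the new fillfactor (heavy lock)

-- Check what fraction of updates are HOT:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'items';
```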



Thanks for the pointer, that looks very interesting!


If you want to do writes without a ton of amplification on PG this is the only way right now.

Eventually maybe PostgreSQL will get a second storage engine that uses an undo log instead, like the somewhat stalled zheap effort for instance.

That said the numbers of rows you are talking about are easy peasy for PostgreSQL.


It sounds like you are dealing with a denormalised data model. Relational databases do not handle denormalised data models well, and 50 million records is a small dataset.

Consider remodelling the data and going up to 3NF [0], or to a higher normal form if that is more appropriate for your use case. Throwing a different relational database engine at the problem might band-aid it for a while, but one day it will come back and bite you. Wikipedia has a good starting point [1] for the 1NF ↝ 2NF ↝ 3NF ↝ … design steps. Depending on the nature of your data, you might have to consider the Boyce–Codd normal form (3NF++, so to speak). Maybe not.

Once you have remodelled the data, selects and upserts will be very efficient and quick, and they won't constantly update/rebuild the index on the entire single main table. You will have to rewrite your queries, but if the data model is normalised, joins will be straightforward and fun to write, and updates will only touch what has actually changed in each separate table.

If the data model normalisation is absolutely impossible (e.g. you don't own the data model but rather your customer does), consider replacing upserts with the MERGE statement.

[0] https://en.wikipedia.org/wiki/Third_normal_form

[1] https://en.wikipedia.org/wiki/Database_normalization#Example...


Why would you want to replace upserts (i.e. INSERT with ON CONFLICT) with MERGE? Without additional context I'd say it's an antipattern, for a few reasons.

First is that MERGE is a much more versatile operator (not a compliment if you need a specific narrow functionality like upsert) whereas INSERT / ON CONFLICT was specifically added to make thread safe upsert simple. How would you do thread safe MERGE in Postgres, off the top of your head? It's something that ON CONFLICT solves for you.

Secondly, MERGE is only available starting with Postgres 15, which has just been released.

So no, don't replace upserts with MERGE unless there's a very specific reason.


I have suggested MERGE as a last resort (due to the main table having circa 40(!) columns), not as the preferred option – please refer to the last paragraph in my comment. Yes, it is difficult indeed to recommend something more specific without knowing the exact details.

MERGE does have a number of its own peculiarities that are database engine specific and the MERGE behaviour is not portable across RDBMS's.

For instance, Oracle simply does not care about the transaction context width in which a MERGE is used, and the transaction will either succeed or fail no matter how long the transaction takes. Whereas MS SQL Server is very, very touchy and will kill off the transaction if, say, an index update caused by a MERGE is taking too long (from the SQL Server perspective). So it is better to minimise the transaction context containing a MERGE for the MS SQL Server and carefully assess the index update performance. I would expect Postgres to have the behaviour closer to that of Oracle, but I have not looked into it.

More generally speaking (and if we ignore MERGE implementation-specific details for a moment), remember that the UPSERT pattern came about as a workaround for the missing MERGE functionality, which was added to the specification very late, and vendors were slow to implement it. MERGE covers a few very useful use cases, especially complex UPSERT scenarios, but it does not obviate UPSERTs in simple scenarios. Both are useful.


My main gripe is the idea that MERGE is somehow better than INSERT ON CONFLICT for upserts, which I don't think is ever true. I don't think I agree that it was added as a workaround - it was added rather as a proper solution, but for a more limited use case which is popular in OLTP workloads: simple thread-safe upsert. MERGE definitely has its place in complex ETL-style workloads, where you may also want to delete stuff, or change data in the source table as well as in the target table.

Regarding SQL Server: it's not true that SQL Server will kill transactions based on some timeout (unless a deadlock is detected). I worked with SQL Server extensively, including using MERGE for upserts (btw, you have to use serializable isolation to make MERGE thread-safe, so it's not so easy to get rid of transactions altogether). SQL Server doesn't kill transactions, except when there's a deadlock. FWIW, in the SQL Server community there's a well-known notion of MERGE being buggy (even though it's been supported for like 15 years!) and hard to reason about (especially when there are triggers on the involved tables), see for example: https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

I agree with you that Postgres' MERGE is most likely modeled after Oracle, but haven't looked into it either.


> My main gripe is the idea that MERGE is somehow better than INSERT ON CONFLICT for upserts […]

I, for one, find the MERGE syntax to be easier to read, more flexible and versatile and, most importantly, standardised. It is, effectively, UPSERT++ if you like, as it also allows one to delete rows from a table if there is a condition match. Compare

  MERGE INTO
    target
  USING
    source
  ON
    target.col1 = source.col1 AND
    target.col2 = source.col2
  WHEN MATCHED THEN
    UPDATE SET               -- or: WHEN MATCHED THEN DELETE
      col3 = source.col3
  WHEN NOT MATCHED THEN
    INSERT (col1, col2, col3)
    VALUES (source.col1, source.col2, source.col3)
with

  INSERT INTO target (col1, col2, col3)
  SELECT col1, col2, col3
  FROM source
  ON CONFLICT (col1, col2)
  DO UPDATE
    SET col3 = EXCLUDED.col3
Personally, I prefer the MERGE option, but your mileage may vary.

I also deem MERGE, as a technical term, to be more concise and much closer semantically to the intent it describes, as opposed to INSERT ON CONFLICT. Rows are routinely updated in a database (it is its job, after all), and a row update operation is not inherently a conflicting update. Conflicts (semantically) connote exceptional situations that have to be dealt with, well, in exceptional ways. But I digress, as it is more of a linguistic subject.

> […] it was added rather as a proper solution, but for a more limited use case which is popular in OLTP workloads: simple thread-safe upsert.

As far as SQL (the language and the standard) is concerned, SQL is unaware of threads or the thread safety; SQL is concerned with transactions and with the transactional integrity. The database provides and ensures the ACID behaviour and guarantees, and it may or may not even use threads to accomplish it (as a matter of the fact, we know that most do but it is an implementation detail).

And MERGE is perfectly suitable for a variety of processing scenarios, OLTP included. In one of my past projects from a few years back, replacing a series of hand-rolled UPSERTs with a single carefully written MERGE yielded a 1500% performance increase for real-time freight item scan events flowing into the main table (granted, the inability to improve the shoddily designed schema was a hard constraint, therefore a certain level of creativity was necessary). That was at a scale of 10+ million events processed per hour, which is not huge but substantial.

> MERGE definitely has its place in complex ETL-style workloads

As a side note, ETL workloads are not expected to modify a database in place, which would otherwise make them one of the most vile and reviled integration anti-patterns. The (E) and (L) steps imply a distinct source and a distinct target, and the (T)ransform step takes place outside the DB, either in the integration layer or in a specialised ETL tool. That aside, I fail to see what is so ETL-specific about MERGE; if there is a complex UPSERT scenario, a MERGE could be a good candidate (or not); the onus is on the engineer to carry out the analysis and make the right choice.

> Regarding SQL Server - it's not true that SQL Server will kill transactions based on some timeout (unless there's a deadlock detected).

It is true (or it was in SQL Server 2012). In the project I mentioned earlier, I inherited a design and an implementation that were nothing short of an unmitigated dumpster fire, which also included «let's add another few random compound indices, because what could possibly go wrong?». That had to be scrapped and re-engineered from scratch, as it turned out that SQL Server had a penchant for allotting a specific time window for a transaction to complete (irrespective of whether the transaction was serialised or not), and if an index update took longer than the window allotted to the transaction, the DB engine would kill the transaction off. Such peculiar behaviour was so unexpected that it caused multiple catastrophic cascading failures during the first production deployment attempt, and the deployment had to be aborted and rolled back. I had to enlist a DBA to work on the post-mortem and the subsequent redesign to understand the root cause. It turned out to be documented SQL Server transaction engine behaviour. A painstakingly difficult, lengthy and time-consuming low-level index performance analysis, and a subsequent meticulous complete index redesign, solved the problem in the end.


> As far as SQL (the language and the standard) is concerned, SQL is unaware of threads or the thread safety; SQL is concerned with transactions and with the transactional integrity. The database provides and ensures the ACID behaviour and guarantees

When I say thread-safety, I just mean the general notion that running a certain SQL statement or procedure concurrently from multiple processes/threads/users/connections will result in correct execution without race conditions.

SQL is definitely concerned with something very close to the notion of thread-safety, namely transaction isolation (the I in ACID). It's the property that controls concurrent execution of queries in the database, and it deals with what are called "phenomena" in the SQL literature. Phenomena are essentially a set of specific types of race conditions which occur under different isolation levels.

And because the default isolation level in most popular DBs is "Read Committed" (that is, a very relaxed isolation level allowing a lot of race conditions), some pretty basic operations such as upsert/merge are not thread-safe (or, if you dislike this term, you may say they "have race conditions", or "do not avoid certain phenomena").
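As a concrete illustration of such a race (a sketch; the table and column names are invented for this example, not taken from the thread): under Read Committed, a "check then write" upsert lets two sessions both observe the missing row and both attempt the insert, whereas Postgres's atomic form avoids the race entirely.

```sql
-- Naive upsert that races under Read Committed: two concurrent sessions
-- can both see no row and both run the INSERT branch; one fails with a
-- unique violation (or, in the update branch, one increment is lost).
BEGIN;
SELECT value FROM counters WHERE key = 'hits';
-- if no row was found:
INSERT INTO counters (key, value) VALUES ('hits', 1);
-- else:
UPDATE counters SET value = value + 1 WHERE key = 'hits';
COMMIT;

-- Atomic, race-free alternative: the conflict check and the update happen
-- as one statement.
INSERT INTO counters (key, value) VALUES ('hits', 1)
ON CONFLICT (key) DO UPDATE SET value = counters.value + 1;
```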

> It turned out to be the documented SQL Server transaction engine behaviour.

Would appreciate the link - it's either something that I haven't seen, or we're just using different terms for something deadlock-related.

> (T)ransform step takes place outside the DB

In a perfect world, probably yes, but in reality there are plenty of cases where you have some staging tables that are then merged with production tables - that's where MERGE is a good candidate, as it can handle all three of insert, update and delete.

Cheers!



That's not recommended. Column-oriented (relational) databases store data by column instead of by row, which helps in large-scale OLAP reads of data but is extremely slow when writing or updating individual rows (since multiple rows have to be grouped into a column segment).

Postgres and other OLTP databases are a better fit.


It might not be the best tool available, but Timescale's columnar capabilities [0][1] look great to me. I mean, being able to achieve 90%+ compression on a row-oriented database is not something to be ignored.

[0] - https://www.timescale.com/blog/building-columnar-compression... [1] - https://www.timescale.com/blog/timescaledb-2-3-improving-col...


Yes I've been considering evaluating ScyllaDB, I think it could be a good fit for my use case.


ScyllaDB (a next-gen Cassandra clone) is a "wide-column" database, or better called "advanced key/value".

This is very different from columnar/column-oriented databases which are still (usually) relational and just store data by columns instead of by rows for OLAP (large-scale analytics) usage.


Correct. A wide-column database like ScyllaDB is still a row-based store. Usually described as a "key-key-value" because it has both a partition key for data distribution and a clustering key for sort-ordering within a partition.

Not the same-same as a "column-store" like Druid or Clickhouse or Pinot.

[Disclosure: I work at ScyllaDB.]


I mean, isn't the 40 columns the problem?

You might benchmark it against a couple of other database flavors; I suspect most DBs would have issues, although maybe not the same issues as Postgres.


This sounds like a data lake or delta lake type of situation. If all you need is daily bulk updates but don't care about individual row updates nor concurrent access, I would probably not use a full-fledged database in the first place.

e.g. https://docs.delta.io/latest/delta-standalone.html


Postgres is a perfectly fine answer for the OLTP row-oriented updates that you're doing. It seems you don't need relational features and can probably use simpler key/value datastores but unless you actually have a performance problem, there's no issue here.


I was going to answer that the answer is column families [1], but I just realized that is a CockroachDB-specific feature.

[1] https://www.cockroachlabs.com/docs/stable/column-families.ht...


My current stack is PostgreSQL (on Supabase) + retool (or similar front end tools).

PostgreSQL Extensions: http, pg_cron, timescaledb

To help with development, I am using https://www.npmjs.com/package/sql-watch (written by myself) to do continuous development and testing (TDD/BDD).

The stack "doesn't scale" but the turnaround time for development is crazy.


I prefer the model of event triggers writing to a DDL audit table and calling NOTIFY. Then your script can just LISTEN for changes after you run your idempotent scripts. Then you're capturing all changes to the DB, not just the ones you're expecting.

Because it's always that guy making manual DDL updates that screws everything up, isn't it Mr. Hosick. ;)
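The event-trigger-plus-NOTIFY pattern described above might be sketched roughly like this (the table, function, and channel names are my own invention, not from the comment):

```sql
-- Audit table that records every DDL command executed against the database.
CREATE TABLE ddl_audit (
  id          bigserial PRIMARY KEY,
  happened_at timestamptz NOT NULL DEFAULT now(),
  command_tag text,
  object_id   text
);

-- Event trigger function: log each DDL command and notify listeners.
CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger AS $$
DECLARE
  r record;
BEGIN
  FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
    INSERT INTO ddl_audit (command_tag, object_id)
    VALUES (r.command_tag, r.object_identity);
    PERFORM pg_notify('ddl_changes', r.command_tag || ' ' || r.object_identity);
  END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Fires after every DDL command completes.
CREATE EVENT TRIGGER audit_ddl ON ddl_command_end
  EXECUTE FUNCTION log_ddl();
```

A script can then `LISTEN ddl_changes;` and see every schema change, including the manual ones.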


Big fan of retool too. I also like to throw FastApi in front of Postgres on lambdas or cloud runs.


What are some other frontend tools you've found to be effective?


I've quickly looked at a few but for admin applications, retool seems to be a good choice. I don't have a lot of insight into other tools.


Be careful when using Postgres as a queue with priorities, especially if your messages are large. Deleting rows doesn't free up disk space in Postgres, and plain VACUUM can only return space to the OS from the end of the table file. As a result, the queue can require an enormous amount of space to work. The only way to reclaim the space is VACUUM FULL, which rewrites the whole table and will lock the queue for a long time. I've had a lot of headaches when we tried 'use Postgres for everything' in production :)


Also, Postgres is too slow for large analytical databases. You need a columnar database to make fast queries on >1 TB of data.


As always: it depends. For some workloads something like Citus [1] might allow you stay within the PostgreSQL ecosystem even when you are trying to do OLAP.

[1] https://github.com/citusdata/citus


1 TB is peanuts. You can usually get by even with a lot more. Once that's exceeded, though, you can just switch relatively easily to a different flavor of Postgres.

It's why AWS Redshift exists: Postgres with column-oriented storage.


Does anyone have experience with some postgres columnar store extension like https://github.com/citusdata/cstore_fdw ?


My experience was not enough support for common postgres features


Agree. Here is a list of the limitations: https://github.com/citusdata/citus/tree/main/src/backend/col...


AWS Redshift works wonderfully in that capacity.


You could use TimescaleDB which is a Postgres extension that adds support for columnar tables and time-based chunking. Works brilliantly IMO.


there's also things like pg_repack which don't lock the tables (but will incur high IO while freeing up the space).


Oh god, I am just looking at a piece of software written by some really untalented folk (who are no longer with the client's company) and it's an unholy mess of caches, Kafkas, Elastics... because they prematurely decided that Postgres just won't cut it, as they have almost 100k items to keep track of! And since that mess of random architecture started generating millions of Kafka events, their last act of engineering was to add Kubernetes to the mix, because hey, more instances will solve the problem, right? I'm begging them to let me rewrite the whole thing in just Postgres; I think I'd rather quit than fix the original.


Résumé driven development at its finest


  > Use Postgres as a message queue with SKIP LOCKED instead of Kafka (if you only need a message queue).
I wouldn't describe Kafka as a message queue. It's more of a distributed commit log.


Messaging is the first use-case given by the Kafka docs [1] and is the first core capability they mention on the Kafka landing page [2].

1. https://kafka.apache.org/documentation/#uses_messaging

2. https://kafka.apache.org/

I use (and think of) Kafka as a message bus, with distributed commit logging being the mechanism for how it accomplishes that task.


It's really not. It's a distributed log.

Because it's really a log, and it's partitioned, it has a whole host of issues that make it a very poor messaging system, like hot partitions, head-of-line blocking, etc.

If what you want is a proper messaging system where the underlying model is still a distributed log then you should look at Apache Pulsar. It can provide the same semantics as Kafka but can properly implement job queues and messaging semantics ala SQS, Google PubSub, et al.


I have done this, and the one thing I would note is that if you are requesting a maximum number of rows per call, and something in the queue breaks for all returned rows (say it crashed with an exception and the status field isn't updated), then you can run into a situation where every call to get more rows returns the same broken rows and the queue becomes blocked. I hadn't figured out a great solution for this, but otherwise my SKIP LOCKED queue worked great!
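One mitigation for the broken-rows problem (a sketch; the table, column names, and limits here are my own assumptions, not from the comment above) is to track an attempt counter and a claimed-at timestamp, so that poisoned rows drop out of the candidate set and rows claimed by crashed workers get reclaimed:

```sql
-- Claim a batch; rows that keep failing (attempts >= 5) fall out of the
-- candidate set instead of blocking the head of the queue forever.
UPDATE jobs
SET status = 'running', attempts = attempts + 1, claimed_at = now()
WHERE id IN (
  SELECT id FROM jobs
  WHERE status = 'pending' AND attempts < 5
  ORDER BY id
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
RETURNING *;

-- Reaper, run periodically: rows stuck in 'running' (e.g. the worker
-- crashed before updating the status field) go back to 'pending'.
UPDATE jobs
SET status = 'pending'
WHERE status = 'running'
  AND claimed_at < now() - interval '5 minutes';
```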


Yes I do agree. But looking into many startups, many of them use it as a message queue with the ability for transformations (E.g. Spark).


It's fine for streaming, i.e. Spark and friends. I generally define streaming by the property that the success of each individual message is entirely uniform, i.e. the system can either process messages or it cannot. If this holds, none of Kafka's drawbacks should significantly affect you.

When people talk messaging they normally mean something lower latency, potentially out of order and where the success of each message is likely independent of other messages. For these use-cases Kafka is not well suited. For these cases you will want something like Pulsar, SQS, PubSub, etc.


Unfortunately many (most?) people use Kafka as a message queue with larger message sizes.


> Use stored procedures

For God's sake, please don't


Hard disagree with this attitude but I see it all the time.

Stored procedures are much faster than writing logic in some remote server (just by virtue of getting rid of all the round trips), require far less code (no DAOs, entities and all that crap which simply serves to duplicate existing definitions), and have built-in strong consistency checking primitives - which can even be safely delayed until the end of the transaction.

And what people do is, they throw all these advantages away because they can’t be bothered working out how to integrate the stored procedure code ergonomically into their workflow.

I mean - I even use an IDE (JetBrains) to write pl/pgsql. It’s just another file in my repo. Get to this point and stored procedures are a game changer.


Like you said "they can’t be bothered working out how to..." That's a fact which is not going to change.


I second that. SPs/funcs have this weird tendency to always stay hidden in the fringes and out of sight, easily forgotten when adding new functionality, easily overlooked when making changes elsewhere.


I think stored procedures can be perfectly safe provided you follow these rules:

- they live in source control

- they are covered by automated tests

- they are applied using some form of automatic database migration system (not by someone manually executing SQL against a database somewhere)

If you don't have the discipline to do these things then they are likely best avoided.


> If you don't have the discipline to do these things then they are likely best avoided.

I'd go further and say you should avoid databases and maybe even persistence entirely if you don't have the discipline to do the above. Sprocs will be the least of your problems otherwise.


Aren’t those also the absolute bare minimum bar for any code in a production system?


The realization that database procedures are code, not data, even though they reside on the database (where the data lives) is the difficult part.


That’s baffling to me. Who doesn’t realize that that thing, which looks and behaves exactly like all other code, is code?


Before the development of decent migration systems it was incredibly common for database structure - including stored procedures - to be treated independently of source code in a repository.


True, of course. There were also undoubtedly a lot of production systems that didn’t even use version control for non-database code. Industry practices certainly evolve over time. But it’s difficult to imagine a scenario where a team is aware of version control, uses it for the things they realize are code, but somehow doesn’t realize that stored procedures are code.


I know a place that operated like this for years, so I don’t have to imagine.


These sorts of places also tend to have database admins in one team and programmers in another team. All database changes go through the database team with tickets or whatever. It's a huge pain in the ass to navigate and enable quick changes.


It's a common thing to miss. There's a reason SQL injections are (unless things have changed recently) among the most prevalent classes of web exploits.


The folks who treat databases as "that thing behind the ORM."


Yeah, I think so. But my hunch is that the majority of people who tell you never to use stored procedures have been burned by these techniques not being used for them.


Your hunch is off. We version-control DDL/DML/DQL/etc. like any other software.


"But my hunch is that the majority of people" - I'm not saying no-one does this, I'm saying I expect a lot of people don't do it.


> they live in source control

so that probably excludes 95% of legacy codebases out there from the 90s,00s


We tick all the boxes. Please contain your arrogant presumptions.


I was confused as to why you seemed to be taking offense here, then I realized that you posted the comment I was replying to.

For "you" in my comment, please read "one" instead:

> I think stored procedures can be perfectly safe provided one follows these rules:

> ...

> If one doesn't have the discipline to do these things then they are likely best avoided.


From my experience that only happens if there are dedicated DBAs and you have too many systems running - then you forget one. If you keep the server code and the stored procedures in the same repository, with migrations, this problem goes away.


I believe that stems from people frequently not including them in version control, or not doing tests.


We version-control our SPs/funcs. We have unit tests, we have integration tests.


> SPs/funcs have this weird tendency to always stay hidden in the fringes and out of sight, easily forgotten when adding new functionality, easily overlooked when making changes elsewhere.

This is the classic "carpenter blames his tools for crappy results" argument. Implementation isn't easy.


It's not. You're just making guesses.


If the developer doesn't know, or doesn't document, that the project has code embedded in the database, that's on the developer, not the tools. The use of any developer tool requires a certain level of competence to use it successfully.


We version-control all of it. Your assumptions don't matter for reality.


Such is the life of picking complex tools


For some people I suppose


I thought so for 30 years but changed my opinion recently. I even argued with the author of Redis for some time to add some functionality so we didn't have to write Lua and have another deployment target.

Now I do think there is a benefit in stored procedures and triggers (E.g. for audits) if they don't contain too much logic or complexity.


> if they don't contain too much logic or complexity.

I think this is the catch. Most folks arguing against SPs have been burned by huge, complex stored procedures with nested dependencies and deeply intertwined business logic and rules. I completely agree that you shouldn't use an SP that way. But using them to help perform maintenance, to audit, or to perform data correction all makes sense when they are kept small and simple.


I've only given up trying to understand a system once. It was when I was handed over an application that used stored procedures for everything. Including recursive stored procedures... The rest could be figured out, but they were just too much.


I feel this. Once had something locking up a production SQL Server instance, and it turned out to be a dreadful partially-recursive web of sprocs, views, and TVFs that worked fine until apparently one day the query optimiser decided otherwise. Spent hours tracing what the heck was going on.


Why? What is your specific reasoning? Using EXPLAIN and SP's to help fix cache misses, slow queries, poor index performance, etc. is generally considered a good thing.

As a side note, I did not realize $deity was concerned about DDL/DML, so thanks for pointing it out. I never really thought about it.


Yes. Please use stored procedures. Don't listen to this guy.


Your app logic concerned with data executes somewhere, right?

So… why exactly would you exclude compute running close to the storage?

You can use that minimal latency.

Of course, people can create an uncontrolled mess of spaghetti code out of sps/funcs… like they can with any kind of code.


Stored procedures have some advantages (fast to debug/try out a query from your service without copy+paste all the time, etc.), but also disadvantages (unreadable git diffs, big-bang rollouts on changes).

We made this tool to get the best of both worlds:

https://github.com/vippsas/sqlcode


I’d go with this approach for my next startup, unless I had specific performance requirements. (Having built a company to Series B on the standard Django/Celery/Redis stack).

Very hard to overstate the benefit of having all data in one DB that your developers can trivially run, mutate, and step-debug in your application.


Very hard to undersell just how bad this is. Now every action on your application is running database queries.


Which is totally fine for most early-stage applications.


It is, until it isn’t, I suppose. I’ve worked at places where the ORM (or equivalent) was nicely utilized and we were able to cache things via Redis as we got bogged down, then migrate to a multi-layer structure with caching based on date + popularity across a DB cluster, NoSQL store and browser store (this was really nice).

But: I’ve also had to deal with “SPAs” that evolved a few dozen independent routes and a massive backend, all coupled to the database with no ORM/class layer. This was not so nice to refactor when it came time to scale.


Just one thing:

1. Teach people transaction isolation levels.

2. Have some standard written down rules about what SQL type to use in which situation and what approaches to use for table structure.

3. Don't overuse triggers or stored PL/SQL procedures or similar; they are hard to test and debug.

The two main problems I have seen with SQL databases is:

1. People not understanding transaction isolation; most inconsistency bugs or strange behaviour that nobody seemed able to explain, I have seen were due to this. As far as I can tell this is a HUGE problem, even though SQL databases are used much less, and at least theoretically it's normally taught in any bachelor-level course about SQL.

2. People wasting time on discussions about types and table structure, for example whether this or that int type should be used. For a lot of use-cases it's just fine to initially use bigint (64-bit) for everything. With 32-bit there is often some edge case where it isn't enough. For example, if you do 5_000_000 increments per second on a 31-bit (i32 >= 0) counter, it overflows in ~7 minutes, but if you do the same with a 63-bit counter, it takes ~58494 years. Sure, you sometimes might have to go back and optimize storage, and there are cases with clearly constrained numbers, but it's the best default for integer numbers that aren't clearly bounded. Similar "start with this" default approaches can be written down for most situations on a single DIN-A4 page of paper or so.


I think (1.) is especially tricky b/c it works while you're small (experienced this myself with Hibernate, not understanding isolation levels) and then breaks as the number of concurrent transactions grows.

When the company rapidly grows you get mysterious bugs - in the worst case customers seeing other customers' data because of the wrong transaction levels.


Yes, that's why I think it's a huge problem for the industry. How the heck is it possible that so many developers with bachelor's and master's degrees, who otherwise do a reasonable job, don't even know that they have a dangerous knowledge gap there???

Though, regarding:

> in the worst case customers seeing other customers' data because of the wrong transaction levels.

should not happen even with wrong transaction levels, that indicates some additional serious design problems IMHO, likely related to premature optimizations


I feel fairly competent but would definitely struggle to debug a transaction isolation problem. I think part of the issue is that I have always operated at an ORM level, on personal projects and at large (>400 dev) companies.

From my understanding getting this kind of error would involve something going wrong at a pretty low level? I am not sure how a developer could cause this at the ORM level.


It's very easy to cause at the ORM level, because by default a chain of SELECTs may read different committed data unless you add row-level locks like FOR SHARE. In a horizontally scaled backend you usually need these locks, but it takes scanning the DB log to figure out how to get them in the right place.

You can't effectively use an ORM without detailed knowledge of the generated output, unfortunately. It does not add locks for you, so it's probably just wrong in prod when horizontally scaled and requests span multiple statements.
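A concrete illustration of the kind of race meant here (a sketch; the table and column names are invented for this example):

```sql
-- Two sessions, each doing a read-modify-write under Read Committed: both
-- can SELECT the same balance before either UPDATE commits, so one of the
-- two decrements is silently lost.
BEGIN;
SELECT balance FROM accounts WHERE id = 42;       -- both sessions read 100
UPDATE accounts SET balance = 90 WHERE id = 42;   -- both write 90; one decrement lost
COMMIT;

-- With a row lock taken at read time, the second session blocks until the
-- first commits, then reads the updated value.
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 42;
COMMIT;
```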


This is one of the many problems I have with ORMs, you end up needing to know more about the ORM and the database than you would need to know if you just used stored procedures.

I don’t recall ever having a transaction isolation issue with my busy stored procedures - though I often use SELECT .. FOR UPDATE which is maybe cheating because an ORM can’t do that efficiently - but I’ve certainly seen them in ORMs.


> even though SQL databases are used much less

Really!!!


> 3. don't overuse triggers or stored plSQL procedures or similar they are hard to test and debug

Mandate automated DB unit testing [0] from day 1, just like you would for the rest of your code base.

[0] - https://pgtap.org/


> don't overuse triggers or stored plSQL procedures or similar they are hard to test and debug

I just don’t find this to be true at all, quite the opposite in fact.

It’s true that (AFAIK) there isn’t a stepping debugger for pl/pgsql, although I would love to hear that I’m wrong.

But it’s trivial to debug procedures using RAISE NOTICE commands, and you can run your tests non-destructively (in a transaction that you abort), which makes setting up the state for debugging much easier.

I also personally find that I write fewer bugs in plpgsql simply because there is less abstraction (no DAOs, ORMs, caches etc) and I’m working much more closely with the actual data structures I care about. And the referential integrity checks tend to catch those I do write, early.
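A minimal sketch of that workflow (the function, table, and column names are invented for illustration):

```sql
-- A procedure instrumented with RAISE NOTICE for tracing.
CREATE OR REPLACE FUNCTION apply_discount(order_id bigint, pct numeric)
RETURNS numeric AS $$
DECLARE
  new_total numeric;
BEGIN
  UPDATE orders SET total = total * (1 - pct / 100)
  WHERE id = order_id
  RETURNING total INTO new_total;
  RAISE NOTICE 'order % discounted to %', order_id, new_total;
  RETURN new_total;
END;
$$ LANGUAGE plpgsql;

-- Exercise it inside a transaction, watch the NOTICEs, then roll back:
-- the database is left exactly as it was before the test.
BEGIN;
SELECT apply_discount(1, 10);
ROLLBACK;
```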


> although I would love to hear that I’m wrong.

You'd be wrong there: https://www.pgadmin.org/docs/pgadmin4/development/debugger.h...


> People not understanding transaction isolation, most inconsistency bugs or strange behaviour no seem to be able to explain I have seen where due to this. As far as I can tell this is a HUGE problem, even through SQL databases are used much less and at least theoretically it's normally through in any bachelor level course about SQL.

Well, same people won't have much success with NoSQL either, as they inevitably will skip reading on how this particular NoSQL DB handles it.


I really like postgres, but I'm starting to wonder if some of these articles are GPT or not.


Good to see I'm not the first one to notice. Some of the author's points are super vague:

> Use Postgres for caching instead of Redis with UNLOGGED tables and TEXT as a JSON data type.

> Use Postgres as a message queue with SKIP LOCKED instead of Kafka (if you only need a message queue).

I'm sure if we try hard enough we could find some sort of meaning in these points, but then the content is coming from the reader not the author.
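Reconstructing the first point, it presumably amounts to something like this (a sketch with assumed table and key names, using jsonb where the post says TEXT):

```sql
-- UNLOGGED skips WAL writes: faster, but the table is truncated after a
-- crash, which is acceptable for a cache.
CREATE UNLOGGED TABLE cache (
  key        text PRIMARY KEY,
  value      jsonb NOT NULL,
  expires_at timestamptz NOT NULL
);

-- Set (upsert) a cache entry with a TTL.
INSERT INTO cache (key, value, expires_at)
VALUES ('user:1', '{"name": "Ada"}', now() + interval '5 minutes')
ON CONFLICT (key) DO UPDATE
  SET value = EXCLUDED.value, expires_at = EXCLUDED.expires_at;

-- Get, ignoring expired entries.
SELECT value FROM cache WHERE key = 'user:1' AND expires_at > now();
```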

Also from the author's mastodon:

> Using ChatGPT As a Co-Founder


I wonder if the author is trying out GPT as a kind of auto-complete for ideas. I have not had a lot exposure to GPT, but it does seem to have something like an "accent". I was listening to This American Life, and they had a mini game show about accents and ages. So, maybe that's just on my mind.


The "Use stored procedures" link points to ChatGPT, so I'd say that's a distinct possibility here.


I've just started at a company using MSSQL and it feels like a massive backwards step. This article indirectly shows what's missing with MSSQL - an 'ecosystem' of people developing products on top of it, and a much larger number of people working on what is, ironically, a free product.


Having experience with both, at small-to-medium DB sizes, each has pros and cons. MSSQL wins on out-of-the-box tooling, easier monitoring, collations, a better query planner, out-of-the-box encryption, out-of-the-box data compression, etc. I have seen more than 100 databases of hundreds of GBs running on a single MSSQL instance without a problem. I heard stories about data corruption; it never happened to me.

Postgres wins on (lots of) datatypes, index types and the plugin ecosystem, e.g. PostGIS is great. The process-per-connection model is meh, but can be mitigated. It loses a lot on all features based on operating system libraries; different OSes give different results, which is not great, not at all. Also, if you are using Postgres at the enterprise level, better have some kind of support, which may cost a lot.

So far I had less issues with MSSQL than with Postgres. YMMV, ofc.


MSSQL is a beast of a RDBMS. There’s little it can’t do better than Postgres, so you either have to read up on it or the niche you’re in doesn’t fit.


The db that makes you use bit instead of boolean is a beast? I disagree. IMO it’s one of those technologies waiting to die.


If your niche is storing booleans in the DB… then I could see it matters just a bit? Otherwise it’s a stupid legacy nuisance with zero prod impact.

You could’ve pointed out the lackluster json support or lack of generalized indexes, but this? Come on.


MSSQL features that are missing in PostgreSQL are impossible for the community to build. ;)

I am using both PostgreSQL and MSSQL in the production since 2003 and they are simply not in the same league. Especially tooling around MSSQL is miles ahead.


As long as you have good interfaces in your code to make it possible to swap out Postgres to more specialized components when needed, I think it’s a good idea.

Saving ops resources by managing fewer services when getting started is a good idea, until scale necessitates dedicated technologies for certain components.


I wouldn't even worry about coding to interfaces. When you need to replace Postgres because you have more than 1M users, you will probably have the revenue to refactor your code to swap out Postgres dependencies. Because you have good integration and E2E tests, right?


At such time you will also have plenty of interesting new problems that require changing most of your code anyway.

I agree that hedging against having to move from a FOSS database is pure waste.


Good tests are a lot easier to write with good interfaces.


Database abstraction layers are leaky, if you don't test things like isolation levels or other tricky behaviors, those tests will have limited value.


Limited value is better than no value.


That will make you lose a lot of productivity boosts SQL can give you.

Obviously, if you anticipate very high traffic and applicable usage patterns, do use that advice, but if your app's anticipated usage pattern is in fact not "very high rps per buck made", then I recommend the opposite.

I went the whole way from very-well-abstracted-away services/repositories to an almost complete lack of abstraction.

Direct SQL or an ORM in your functions, operating on many models at once, with a real Postgres database available to all your unit tests, treating the SQL code as part of your application logic. A transaction per test, so that unit tests are fast to run.

I've been very happy since. SQL is very powerful if you use it as SQL and not as a glorified KV store.


To me, good modularity at the data layer doesn't mean abstracting away Postgres or even SQL. To me, it means having a separation of concerns between loading data, writing data, and _processing data as domain objects_. Don't have your core business logic operate over database rows, but in-memory objects. You can test the loaders/writers separately to your actual logic at that point.


> Don't have your core business logic operate over database rows

This is the part where I don't agree, as SQL is very expressive, and needlessly loading data to your app is also not performant (making the transition to something else required sooner).

I think writing parts of your business logic in SQL that make sense to be written in SQL is just fine.

If you only use it to load and write entities, then that is basically a glorified KV store.


Often writing abstract interfaces costs more than switching things out by changing your code. So it's often better to keep concerns separated and make code easy to change; creating abstract interfaces is often not worth it for a lot of products.


YAGNI!


I've gone this route before and debugging your system can become problematic, especially when using adapters to format tables as JSON. But it works and the dev speed is unparalleled.


Though Postgres is great, I personally disagree. If you're starting a brand new project now I think you should use something redundant and distributed like CockroachDB or Vitess.


This seems like overkill to me; a single Postgres master with replication/standby gets you very far. Why pay the ops burden of Cockroach when it buys you nothing in your first N years of business pre-PMF?

Concretely speaking most startups should be using RDS or whatever hosted Postgres their cloud provider offers, not running their own.

I think it’s sensible to use what one knows best, but for default advice to others I think ”simplest viable option” is generally better.


> a single Postgres master with replication/standby gets you very far

So does a distributed database. A primary/replica has an ops burden too, and measuring that complexity is subjective.

With primary/replica you still take downtime during failover, you need to make sure you're testing the failover, upgrades aren't zero downtime in a lot of cases, etc.

You're essentially treating your database as a pet and a clustered option is much more cattle vs pet. This saves you time to develop other features (including working on reliability -- the most important feature)


"Why pay the ops burden of Cockroach" - sounds like you haven't used it. Considerably easier than postgres.


If I know how to use Postgres as a relational DB, it does not mean that I know how to turn it into message queue. I guess I can build some kind of implementation but that would take time to research and implement it. And I can spend that time to implement RabbitMQ. Which is likely million times more popular than Postgres for this particular task, so I'll have more information.

I don't agree with this post.

One should use old, proven and stable solutions. If you need a message queue, use RabbitMQ. If you need a cache, use Redis. Don't implement non-standard, non-conventional solutions if you can avoid it. Find out what the industry standard is and use it. Avoid dying solutions, avoid new solutions.

But also remember that you should have something unique. You don't want your business to be easily copied by someone, because then you will compete down to zero margin. It's not necessarily a tech thing. But it might be a tech thing. You don't build Cloudflare using off-the-shelf nginx.


Add ClickHouse for the analytic, columnar use case and you have everything... Oh wait, GitLab already pointed that out XD:

https://about.gitlab.com/blog/2022/04/29/two-sizes-fit-most-...


> Use Postgres to generate JSON in the database, write no server side code and directly give it to the API.

Who does this? "give it to the API"... ? Still sounds like there's "server side code" there if there's an "API" involved. Surely they're not meaning let front-end code hit the database directly?


It's possible to have Postgres serve HTTP, actually. I've seen many experiments like this. I don't remember the name, but... actually... Yeah, thanks ChatGPT - It's called PostgREST: https://postgrest.org/en/stable/

I've obviously not used it and to be honest, I probably wouldn't. But IMO Postgres is a radically different type of framework. People just think of it as a SQL database, it can be much more though. Tooling is just a bit lacking.

Edit: Heh, Retool is sponsoring them. "PostgREST for the backend, Retool for the frontend". Cool idea. I might reconsider...


PostgREST is an external web server; postgres isn’t serving HTTP, though its effectively doing everything else that the backend requires besides serving HTTP when you use PostgREST.


Yes, postgrest isn't the one I was actually remembering that DID serve HTTP itself. I know there was an extension that did this though.

It doesn't matter anyway. Serving HTTP is not a problem particularly well solved by Postgres, and needing an extension instead of whatever hyper-optimized http server you usually use is increasing complexity, not decreasing it.


> thanks ChatGPT

I find the thing quicker than asking google, pretty often.

I really want a CLI interface to have an "ask the oracle" kind of feature.


In this particular case, it would actually have been slower. I can't phrase things on Google like I do on the chat; I first have to distill my query down. And when I actually am trying to remember something, it's easier for me to be descriptive, because the keywords Google needs can be the important ones I forgot.

PostgREST is page two of "postgres http server" for me.


You can go even farther and generate an entire GraphQL api with row level security, custom functions, etc. I've had a great experience developing with Postgraphile - https://www.graphile.org/postgraphile/


Looks and sounds great, but nodejs


Oracle database can do all of these things as well, and they even have a complete low code development and hosting environment running in the database, Oracle Apex. I know there are a couple of companies that actually serve their home page straight from their oracle database.


But it's also Oracle DB, i.e. expensive, with especially expensive support and tons of hidden, annoying complexity that isn't publicly well documented. For most companies, doing the same in Postgres is just cheaper, even if there's a bit less low-code provided out of the box.


Sure, my go-to database is also Postgres rather than Oracle. Good enough for most things, without the downsides of the company Oracle. Just wanted to highlight that Postgres isn't unique here, and that serving HTTP from a database is actually not that special.


Sad to say, but the number of companies leaving oracle is increasing in my local bubbles


> Surely they're not meaning let front-end code hit the database directly?

Why not? You can make Postgres speak SQL or GraphQL, it removes an additional network indirection, and there are very clear ways in which Postgres does and doesn't scale well. (EDIT: Or put a dumb translation layer/service in between; they exist too and are cheap to run and manage.)

You can write your core logic in JS, Rust etc. and plug it into your database as virtual tables or stored procedures (it's surprisingly simple).

Any DDOS protection and similar is anyway in front of whatever you write as some form of proxyish thing.

And just needing to scale Postgres instead of a bunch of different systems is so much simpler. Depending on how you run it and what input characteristics you have, it might even be more efficient and cheaper to scale that way (or it can be noticeably more expensive), and it's usually cheaper to manage.


> Surely they're not meaning let front-end code hit the database directly?

In this case they mean "let Postgres generate the JSON for API responses" instead of having the API interface do it. The "Generating JSON in Postgres" section in the linked article shows how this is done.


"let Postgres generate the JSON for API responses" would have been a far less confusing way of expressing that idea. I had read the linked stuff and none of it looked like front-end code hitting a PostgreSQL db directly.

yes, if you can create some JSON without having to transform in an intermediate code layer, that's handy/useful. I often don't run in to too many scenarios where things are trivial enough to allow for that - there's usually some app-level logic that comes in to play with respect to visibility/permissions/etc.


I do, works really well if you need json with nested objects, you can get an entire tree of objects with a single call and query to the database.
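For anyone curious, a hedged sketch of what that single query can look like (the table and column names here are made up): json_build_object plus json_agg return a whole nested tree in one round trip:

```sql
-- Hypothetical schema: authors(id, name), posts(id, author_id, title).
-- One query returns each author with a nested array of their posts.
SELECT json_build_object(
         'id',    a.id,
         'name',  a.name,
         'posts', COALESCE(
                    (SELECT json_agg(json_build_object('id', p.id, 'title', p.title))
                     FROM posts p
                     WHERE p.author_id = a.id),
                    '[]'::json)
       ) AS author_json
FROM authors a;
```

The COALESCE is there so authors with no posts get an empty array rather than SQL NULL in the JSON.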


What's the favored way to run highly available postgres these days?


I'd like to know answers on this.

The main thing I like about MySQL-based solutions is the availability of Galera multi-master software, which for simple configurations is very easy to get going. Drop a fairly cookie-cutter config on each host, and you're done:

    # Galera Provider Configuration
    wsrep_on=ON
    wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so

    # Galera Cluster Configuration
    wsrep_cluster_name="test_cluster"
    wsrep_cluster_address="gcomm://First_Node_IP,Second_Node_IP,Third_Node_IP"

    # Galera Synchronization Configuration
    wsrep_sst_method=rsync

    # Galera Node Configuration
    wsrep_node_address="This_Node_IP"
    wsrep_node_name="This_Node_Name"

Combine with keepalived on each node which does a health check, and if one goes down/bad, the vIP is moved over to another host in the cluster with minimal fuss.


Is that part of Galera free or licensed?



My favorite route right now is running a postgres operator on Kubernetes & letting it do all the work for me.

Zalando's operator use Patroni under the hood, to create a cluster over streaming replication. It also has Spilo, which orchestrates pg_basebackup or WAL-E for point-in-time backup. https://github.com/zalando/postgres-operator#postgresql-feat...

CrunchyData operator seems to have built their own streaming replication system coordinated by Raft. https://access.crunchydata.com/documentation/postgres-operat...

Both are fantastically featureful well-integrated operators that are super well maintained. Both are very recommendable.


I use Patroni behind HAProxy setup to automatically point to the new active host in the case of a failover. It's not a huge scale setup though so I'm sure there will be better ways or other input on this.



AWS Aurora has been working well for me in general.


How painful is the cost?


It's been a while now but we saved about 10-100x moving from aurora to dynamo for a pure/simplish OLTP use case. Lost a lot of flexibility but the aurora costs were going to send us under.

We also had a few issues where aurora was giving us *incorrect query results* and aws support were not helpful. Most of the issues were denied, even with simple reproductions, and then got fixed years later (~2-3) after we had given up. We still use it today but stay away from cutting edge features, unusual query patterns, and high volume use cases.

Having said all that, SQL still seems like the correct choice for getting an MVP out quickly.


Not OP, but a big Aurora fan.

Not that bad actually, but it varies by use case.

Reasons aurora can be cheaper than you think:

1) Autoscaling. Adding a new reader takes 15 minutes. Instead of provisioning for peak traffic and paying for it 24/7, run an extra reader for a few hours each day.

2) Metered IO. Aurora IO can handle 400k+ iops when needed. Or it can hum along at 100 iops. You pay only for what you use; you don't have to provision for peak load 24/7.

Switching to Aurora saved us money over vanilla RDS. Self hosting postgres may be cheaper, but not by as much as would first appear.


Here's a cost calculator: https://calculator.aws/#/addService/RDSPostgreSQL

My rule of thumb is about $1,000 per month for a reasonably large DB in RDS.


Run Yugabyte maybe


Yup, I like the general idea of using PG for everything.

The author forgot to mention the LISTEN/NOTIFY feature in PG. Perfect for low-volume pubsub.
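For reference, the feature is only a couple of statements; a minimal sketch (the channel name and payload are made up; note that payloads are capped at about 8000 bytes by default and notifications are only delivered when the transaction commits):

```sql
-- Session A (subscriber): start listening on a channel.
LISTEN job_events;

-- Session B (publisher): send a notification with a text payload.
NOTIFY job_events, 'job 42 finished';

-- Or from SQL functions and triggers:
SELECT pg_notify('job_events', 'job 42 finished');
```

The commit-time delivery is a feature here: a listener never sees a notification for a write that later rolled back.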


NOTIFY / LISTEN is handy too. (pubsub)


We actually use Dapr, which provides an abstraction layer over all of those things including state management, pub/sub, etc. So it's really easy to switch between them without changing the code at all.

https://docs.dapr.io/developing-applications/building-blocks...


I'm just about there and agree in general.

Currently integrating Procrastinate (https://procrastinate.readthedocs.io/en/stable/) to use Postgres as a job queue in a Django API backend.

Dropping dependencies is very nice especially when dealing with an MVP / small apps.


I do Django freelancing and have had a fair amount of small clients with relatively few users. Adding workers and a message broker adds a lot of complexity over just a single Django server. I even had the thought to build something like this but looks like someone already did! Thanks for sharing this, looks like a great tool to know about.


Why not the battle tested celery?


Battle tested Celery has dropped the ball so many times during operations and updates, that I would be very uncomfortable giving that recommendation.


Here's a really good article about why you might want to involve your database in queues (short version: transactions): https://brandur.org/job-drain


See article?

But also, Celery has an awful DX. I think the question is more, why do I need celery when Postgres can do the job itself?


I love postgres, keen to recommend it and did use a couple of the suggested tips in my last startup. However:

- After just the first couple of months we had to replace full-text search with Elastic because it just wasn't fully up to the task.

- We did introduce Redis for caching, which was maybe 2 hours of setup overhead (using Rails, deploying on Render for $5/month).

YMMV.


If we are talking about a really small business, yeah, fine. But I really don't see the point of such articles without any numbers. Do some benchmarks and compare Postgres with Kafka, Postgres with Elasticsearch, etc. Define the use cases, or the minimum number of table rows or requests at which performance deteriorates.


And not just performance, also costs related to scalability


Dear Stephan, in my experience (6 years as a tech evangelist for AWS, met with thousands of startups and large companies on 6 continents - not to brag, just to give you context; AWS' success and relevance is only very slightly attributable to my tiny contribution), what you suggest might work if and only if the company is not really successful and its IT needs do not change quickly over time.

If instead we're talking about a fast-growing startup, a single solution like Postgres (btw, I adore Postgres, to be clear) is going to become an issue.

You can dramatically improve your performance if you allow the use of Redis, as a start. And your title could have been "Just use Postgres + Redis for everything", and it would have been half as bad, or quite good.

I think it's ok to try to keep things simple, but limiting yourself to JUST Postgres is not going to work.


> might work if and only if the company is not really successful

What is your definition of success? I’ve seen Postgres scale to pretty large loads. I’ve also seen plenty of complex stacks where the scale never warranted the complexity.

That said, Redis + Postgres is a pretty simple stack, so I’m not complaining about your suggestion. I’m more curious about specifics.


I have seen Postgres queries being answered in the low-ms range. Why would I want to use Redis? Only if my load test suggests that Postgres is a bottleneck and Redis is faster.

Keep complexity down to the minimum that makes things functionally work. Load test and fix the bottlenecks. Add complexity when you know the gain.


I'll tell you what Postgres, Kafka, Cockroach and Mongo DON'T do for your startup. Make sales.


A blog called “amazingcto” that isn’t loaded over HTTPS..


Oh wow. It shouldn't even work on http... it should redirect to https. If you're calling yourself an amazing CTO.


It loads via https for me


You can also use pg_notify for a simple pub/sub.


The best part of Postgres was it wasn't MySQL.

...geospatial types were just a nice addition later. ;)


Agree. In my case, I went from Excel stare'n'compare to MS Access and finally MySQL for several years.

MySQL worked great for my lil piece of company business involving a few million records. But then I volunteered to do some analysis with files an order or two of magnitude larger (10 million-100 million). MySQL started coughing up memory errors. Uh oh. Embarrassing.

I should mention that I'm doing all this on a Windows desktop, as that's what is installed and locked down on the machines they give me.

Enter Postgres. No matter what I throw at it, it gets to the end of the queries without errors—in Windows, no less. With an NVMe drive and 32 GB of RAM, I can run a complex report on 70M records in seconds. Loading and indexing takes about 15 min.

MySQL was very, very good to me for a long time, but for the really big stuff I'm sticking with Postgres.


Can someone explain why I shouldn't just use something like cockroachdb, yugabyte or spanner for my primary database nowadays? I've seen the pain and resources wasted from teams needing to find ways to shard mysql and postgres and make them scale. Sure, you might say I might never need that scale but why not just avoid this problem entirely? If I'm building something I intend to support hundreds of millions of users why would I choose a technology I know will fail to scale to that?


Because you are working for a company that, like 99% of companies, has less than a million users?

You want data analytics, financial reports, the ability to query your data in new ways you didn't think of when you first built the product? NoSQL comes at a great cost.

When you hit a million users, the entirety of your codebase will be on fire with scalability issues anyway, so it won't make any difference?


Those databases I mentioned are not NoSQL, they are relational databases.


Using a sharded database is not just for scale.


So instead, add more complexity?

Isolation?

Please do elaborate.


Our open-source project [1] was built with nothing but Rust + Postgres on the backend side of things, and it allows us to iterate quickly and have very simple deployments for self-hosting. The queue, which is the core bottleneck of our stack, will eventually get replaced by Redis, but at a scale of less than 10k jobs/s it scales incredibly well.

[1]: https://github.com/windmill-labs/windmill


Just use the right tool for the job at hand. If it’s Postgres for now, sure. There’s really nothing more to this. Period.

But saying this would not make for a blog post. I can’t help but categorize this post as juvenile at best. Generalizations like “use XXX for everything’ should be avoided at all costs. No software product serves well to such sweeping generalizations. I’m surprised that it’s coming from a CTO. They should know better if they’re worth their salt .


It comes from a CTO coach (me ;-) who has seen dozens of startups entangle themselves with systems until they work more on tech than on delivering features - or come to a standstill after VC-driven layoffs b/c of the complexity of their systems.

"Just use the right tool for the job at hand."

Yes, but I have heard exact that phrase for decades to rationalize tech decisions for tools that weren't needed.

I know you're different, but think of all the people who have the same problems.

If you need complexity b/c you're Netflix or Uber, go ahead. If you're the 95% others who don't need that complexity, then don't do it.

"I can’t help but categorize this post as juvenile at best."

Thanks, I guess this is the nicest thing to say to someone 50+!


> Thanks, I guess this is the nicest thing to say to someone 50+!

Your username is also juvenile, well done you!

Fwiw, my experience has also led me to be on the side of "just use Postgres" unless required otherwise. I've seen enough people glue whatever crazy technologies together where a relational database would've been enough.


"Your username is also juvenile, well done you!"

It was Codemonkeyism before (𝅘𝅥𝅮 "Code Monkey think maybe manager want to write god d** login page himself") but I thought I had grown to be king of the asylum.


It's often better to use the tool that is capable but isn't the absolute best possible option to solve multiple problems, rather than taking on the burden of running five different "best" tools for five different problems.


If you are at a startup, sure. But many of us work at big companies that have the scale that requires specialized solutions. I'm just frustrated that everyone always seems to assume everyone is working on POCs at startups.


Somebody just jumping on a new tool thinking it's the right tool for the job might still end up with worse performance than someone proficient with another tool (like Postgres) might get in less time.

No matter the scale or how big it is.

If you are talking of specialized solutions, that means you know exactly where an existing, well known solution is failing you and why.

Eg. if you drop all foreign key references and constraints in Postgres, you might get similar write performance to other databases which can't make those guarantees when you do need them.


I've worked at big companies, where I have championed introducing "the right" technology for scale reasons... and have in some cases later regretted it because with hindsight we would have been fine sticking with what we had already, at a greatly reduced cost in terms of time and complexity.


"Right tool for the job" is as foolish as any other approach: this discounts the experience someone or even a team might have with another tool that is not a perfect fit, but might still do the job just as well. If you ignore this and instead go with a different tool for every little thing, your project will be a dependency nightmare no developer can master easily.

Postgres is an all-purpose database that has it all: relational databases were created to model real world problems, and while they might not perform best for all the usecases, they can usually model them just fine while protecting from many programming errors.

And then Postgres has features on top: partial indexes, object DB features, replication etc.


> Just use the right tool for the job at hand. Period.

That's Postgres.


Engineering is trade-offs. Complexity is an enormous one best avoided unless you absolutely cannot work with the simpler system.


Context matters. If you’re going to be scaling huge, this advice is silly. If you’re just starting, or planning to rewrite later, or for good reasons never expect huge scale, staying as simple as possible makes sense. Also, tons of devs build to scale huge for systems that will never, ever scale huge.

I have two competing values in this space: someone else has solved problem x better than you will, and you will live to regret every dependency you introduce.


I just use MongoDB for everything. It's easy to scale both vertically and horizontally. Having a replica set with at least three MongoDB instances, geographically distributed, also enhances availability and avoids data loss. MongoDB also keeps the most frequent queries in memory, which makes it super fast. ObjectIDs are generated client-side, making writes fast. GridFS makes storing files easy, fast and robust, etc...


I tend to use redis but in my experience Mongo works for a similar purpose (plus has querying which is nice.). Use Mongo/Redis as the working store (build data client side, sync to mongo/redis which then inserts into db when it gets the chance.)


My company has a similar philosophy. “Just use DynamoDB for everything.” Seemingly they don’t care about costs, but otherwise it is mostly working.


He forgot one, you can also use Postgres for ML tasks too:

https://postgresml.org/


Fully agree with this philosophy. With json support and timescale you can get really far. If you run into performance problems you’ve already won.


Can I use a few Postgres tables as a Redis-style in-memory database? Can I finally have more connections to Postgres without using pgbouncer?


What's wrong with using pgbouncer? Today it's not rare to run an HTTP proxy in sidecars for every service.


This is good advice, unless it doesn't fit your use case. As an example, using pg instead of redis is great until you need to handle hundreds or more requests per second.

Using pg instead of kafka works great until you have to ingest a few gigs of data a second.

Using pg instead of mq works until you have a few thousand messages a second.

Etc etc.


We tried using postgres for a rewrite of an existing system but the lack of support for columnar data made it not possible.


AWS Redshift. Same Postgres application drivers with a columnar/analytic focus.


"Use Postgres for caching instead of Redis with UNLOGGED tables and TEXT as a JSON data type."

What the heck?! No!


More info on why not?


Because it is not a cache. High-entropy tables require a lot of vacuuming, which adds overhead to actual database operations. While you can set autovacuum parameters per table, they still are not free.

If you need something like Redis or memcache, use that for anything that is heavy on writes and deletes.


If the suggestion truly is to try to squeeze as much utility as possible out of a single RDBMS setup, and if you already need and use an RDBMS, then I would consider it as an option. If not, it's an absolutely idiotic choice. And that's the problem with this article - it really comes off as a suggestion to supplant your Redis setup, a dedicated key-value store, with Postgres, a full-fledged and highly complex relational database.


You offload some load, like PG connections, from Postgres to Redis. Your PG shared buffers would also be available for other stuff.


This seems like a collection of terrible advice. These things are all very conditional, and such blanket statements are nonsense. Postgres is great, but it's not all the things for all the applications. I wouldn't hire (or work for!) this guy, fwiw.


I think a better way of handling this is creating layers of abstraction between data storage and code.

Annotate whether this is a hot cache or cold storage, and be able to flip out the infrastructure.

ETL frameworks support this. Probably should be more widely used (web devs, etc).


Then there is http://postgis.net/ for geospatial requirements.

Serious question, though: when the data span several instances, how well does PostgreSQL do compared with Elasticsearch?


Why does the article talk about client complexity, then say Postgres can be used for everything, but then not address client complexity. Does Postgres have a nifty server-side rendering pipeline that could replace front-end complexity?


Yes: HTML. Seriously, just generate HTML on the backend and serve web pages like it's 1999. Easiest way to get 100% on all Lighthouse scores.


So long as you never have multi-page forms and don't worry about error handling on bad submissions and the back button, you're golden!

I still have nightmares of post-redirect-get complexity and giant balls of mud in session scope to support the back button.


Don’t write any code, easiest way to not have bugs.


You joke, but I have literally never had as great of a success at work as that one time my team didn't end up building the software.

We had great ideas about scheduling and caching and task priorities...

...and then we asked the customer what they wanted, and they wanted none of it.

So we built none of it, and produced a solution that just did the stupidest thing, and did it without any edge cases, without ever crashing, reliably, as a cronjob, once on Sunday night.

Some people would be disappointed that they couldn't put this on their CV because it didn't involve FancyTech #413, but damn it, I am still proud of that stupid thing.


I interviewed a (junior) candidate once who had tried and failed at a hackathon to build a Rails system that connected restaurants and shops with excess food to charities that gave food away to those that needed it.

I asked him what he’d done to work around the technical difficulties, and it turned out that he’d set up a Wordpress site with a phone number of the guy running the scheme, and a Google sheet to manage contact details.

A better definition of MVP I’m yet to see.


My take is that there are many unnecessary complexities (at least in many cases) where an old, battle-tested, "boring" technology suffices. The graph just seems to be an illustration of that, but it focuses on storage (there could be a better graph focusing on the database only, but I guess the message is conveyed).


"You probably don't need as many back-end services as you think you do" isn't quite as good a title.


If only there were an infrastructure-in-a-box that gave you all the features he mentioned in a well-maintained distribution.

Then rather than saying just use Postgres you say just use X.

I like postgres but I've only used it in production once. My experiences were fine. We used alembic for database migrations. It was a Python Flask app.

We also used it as a message queue and stored JSON form data.

There's still hand crafted code for using Postgres as a message queue.

A bit like Linux distributions which are aggregations of desktop or server software collections of well integrated software.

I tried to build a stack that could be spun up with all goodies included.

But I would not want to inherit something hand stuck together.

But at the same time, the thought of setting up Kubernetes for everything from scratch is also a lot of work.

If you need to use cloud, that's also a lot of work.

https://devops-pipeline.com


So I should replace our 200 node Kafka cluster with Postgres?

Ok, I’ll get started on Monday.


> Use Postgres as a message queue with SKIP LOCKED instead of Kafka (if you only need a message queue).

you have a 200 node Kafka cluster just as a message queue?


Tell me how it went ;-)


Be like a former boss of mine. Replace your entire backend strategy with stored procedures. Fire your backend developers. Save cash. Get a promotion for saving the company millions a year.


PostgreSQL handles queues well. The tricks are SKIP LOCKED, hiding the DB behind protocols/microservices, and judicious use of UNLOGGED tables.

Here is a 2015 talk on critical messaging using PG.

https://github.com/jmscott/talk/blob/master/ams-pgopen-20150...
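For context, the SKIP LOCKED trick mentioned above boils down to something like this (the jobs table here is hypothetical): each worker atomically claims the next pending row while skipping rows that other workers currently hold locked, so workers never block each other:

```sql
-- Hypothetical queue table.
CREATE TABLE jobs (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL,
  done    boolean NOT NULL DEFAULT false
);

-- Each worker runs this in a transaction: claim one pending job,
-- skipping any rows currently locked by other workers.
BEGIN;
SELECT id, payload
FROM jobs
WHERE NOT done
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 1;
-- ... process the job in application code, then mark it finished
-- ($1 is the claimed id, bound as a query parameter):
UPDATE jobs SET done = true WHERE id = $1;
COMMIT;
```

If the worker crashes mid-job, the transaction rolls back and the row becomes claimable again, which is the main correctness win over hand-rolled polling.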


The truth lies somewhere in the middle most likely.

Replacing redis with PG for example feels very "when you have a hammer everything looks like a nail" to me.


I used to think this, but now that I know a bunch of these cloud tools, I think these frameworks standardise things between companies.

Kafka is well documented and understood. Your custom postgres solution is not.

Best case you write high quality maintainable code and have a single dependency. Worst case is you create a complete mess that takes months to onboard people.

Right now I'm contracting on a project that uses AWS BATCH, docker, kubernetes, terraform, Nextflow, Django and postgres. It took me 2 days to on-board myself and start delivering features because everything was standard.


I've worked in the field for a while now and I've never worked with Kafka. Nor have most of the peers I've worked with. And I wouldn't even call it the most used message queue or data-stream solution, so I don't understand why people treat it as ubiquitous software. SQL, on the other hand, is. And while it's certainly not built for message queues, that's not the point of the article.


That's not my argument. I'm saying it'll be easier to onboard developers onto a standard message queue (like Kafka) than to onboard developers onto some custom thing you wrote.

The original article is rebranding "Not invented here" as "Minimise dependencies"


But kafka isn't a standard message queue, and therefore won't make onboarding easier except for the ones who already know it, which is not a significant majority, right?

Or at least we'll have to agree on the definition of standard here. For the equivalents of SQL, I'd say we have AMQP or MQTT, and Kafka does not seem to support either (my googling tells me it has its own protocol); deployment-wise, it doesn't seem easier to set up than, let's say, SQS; while used and supported in most common languages, most queue/job frameworks do not seem to support it OOTB (I looked into Ruby, Python, PHP, JavaScript), so you'd have to write your own custom Kafka handling code from scratch. So what would make it easier than just riding Postgres?


For me standard means used by more than one company and well documented. It also means that most companies use it in the same way. Basically all 3rd party Frameworks and tools are "standard"

Custom means that you wrote a unique solution for yourself. No-one else uses your solution. New hires have to learn it from scratch either by reading your code or your own custom documentation.

There is obviously more work to onboard people onto a custom solution. You can hire people who understand the standard solutions.

Postgres isn't a queue or a message broker. You are writing your own queue that depends on postgres. Your custom queue will have complexity that new hires have to learn. No new hire will have used your custom in house queuing system before.

If you want to use SQS that's a great idea. I support using standard solutions.
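For a sense of scale, the "custom queue on Postgres" under discussion can be quite small. A minimal sketch (table and column names invented here, not from any commenter's actual system) using `FOR UPDATE SKIP LOCKED`:

```sql
-- Hypothetical minimal job queue backed by Postgres
CREATE TABLE jobs (
  id         bigserial PRIMARY KEY,
  payload    jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- A worker claims the oldest job; rows locked by concurrent
-- workers are skipped rather than waited on.
BEGIN;
DELETE FROM jobs
WHERE id = (
  SELECT id FROM jobs
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING payload;
-- ...process the payload, then COMMIT.
-- A ROLLBACK (e.g. on crash) leaves the job in the table to be retried.
COMMIT;
```

Whether this is "writing your own queue" or just using SQL is arguably the crux of the disagreement; either way, it's the pattern a new hire would have to learn.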


The link to full-text search uses a relatively small dataset. Can anyone provide advice about how large a dataset has to be before this breaks down?


I don't know when it breaks.

But I'm on a single medium sized Linode and doing full-text search over 120M small (< 1KB) to medium (< 10KB) text documents and it's still so fast I'm not sure that I will ever need to consider an alternative.
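For anyone curious what such a setup involves, here's a minimal sketch (schema and names invented; the commenter's actual setup may differ, and the generated-column syntax requires PostgreSQL 12+):

```sql
-- Keep a tsvector in sync with the document body automatically
ALTER TABLE documents
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;

-- A GIN index makes @@ matches fast even over large tables
CREATE INDEX documents_tsv_idx ON documents USING GIN (tsv);

-- Ranked search
SELECT id, ts_rank(tsv, query) AS rank
FROM documents, to_tsquery('english', 'postgres & search') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```

Note that the built-in configurations are per-language ('english' above), which is related to the mixed English/Chinese concern raised upthread.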


Ok but postgres is the wrong choice for everything - never ever have I seen a database so out of touch with actual database requirements.

Working replication - barely, and still awful two decades after MySQL had it

In place upgrades - nope

Access control that doesn't require root access to filesystem - nope

The list goes on...

Disclosure: I do postgres in production for 10000s of machines in like 49 countries, I understand how databases work and have been doing this for 20 odd years, it's a bad choice because postgres is written by purists, not people who deal with it in the real world.


Funny you say that, because I got the same feeling from MySQL. I mean, dropping data, ignoring constraints, all done silently. A database that cannot keep my data safe is just not a database.


Ah yes, the old MySQL sucks argument based entirely on a bug fixed a decade ago


I'm not trying to pick a fight with you. Nor am I trying to say "MySQL sucks". I'm commiserating with you, or at least trying to, at your level.

You're the one bursting in here shouting "Postgres sucks!" "based entirely" on a lack of features added more than "a decade ago". You have your choices, we have ours. You'd rather have the convenience of features, even at the risk of losing your data; we'd rather have the peace of mind of knowing our data is safe, and build features that don't exist out of whatever else we have, however we can.

It's not always about what bugs or features exist or don't exist now. It's also about the nature of a project and how it's developed, and how it's going to evolve or otherwise behave in the future. After all, we're talking about databases here; we trust our data to them.


Not sure what you’re talking about wrt. access control, but PostgreSQL replication works just fine for most people, and the lack of in-place upgrades isn’t really a problem unless you’re working on a massive scale with no downtime tolerance.

You run things at an unusual scale, you have unusual problems.


Having to edit a flat file for access control is absurd. Perhaps two decades ago, when it was common, sure - but everyone else figured out how to do without it. It's also not really about scale when every other database implementation makes this painless (even Oracle!)

Postgres still lacks basically everything that other replication implementations do, if they have to mention log shipping by rsync etc they've already got it wrong.

For clarity they only recently figured out "streaming" replication which again, is what everyone else has been doing for years.


Annoyingly can't reply to your reply below but:

Replication can't (unless it's changed recently) be granted via DDL, and isn't propagated either - but otherwise sure

It's not just that, though; the general behaviour is counter to how real-world deployments work - for example, basically anyone at scale moves away to something else because the admin and operational overhead is way too much


> Replication can't (unless it's changed recently) be granted via ddl, and isn't propagated either - but otherwise sure

It can, and is, at least as of the five-year-old (and now unsupported) release 10.


"Client authentication for replication is controlled by a pg_hba.conf record specifying replication in the database field. For example, if the standby is running on host IP 192.168.1.100 and the account name for replication is foo, the administrator can add the following line to the pg_hba.conf file on the primary:"

https://www.postgresql.org/docs/current/warm-standby.html#ST...
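The example line that the quoted passage refers to (for the host and user above) has roughly this shape, per the same documentation page:

```
# TYPE  DATABASE     USER  ADDRESS            METHOD
host    replication  foo   192.168.1.100/32   md5
```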


Again, not sure what you’re talking about here. `pg_hba.conf`? Normally you don’t need to modify that regularly (if ever).

Normal access control can be done with GRANTs, just like MySQL or Oracle.

As for streaming replication, that was added in PostgreSQL 9.0, six major versions ago. Sure, it was late to the party, but it’s not exactly a recent addition.
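To illustrate the GRANT-based day-to-day access control mentioned here (role, database, and password values are invented placeholders):

```sql
-- Routine access control, no flat-file editing involved
CREATE ROLE app_reader LOGIN PASSWORD 'changeme';
GRANT CONNECT ON DATABASE mydb TO app_reader;
GRANT USAGE ON SCHEMA public TO app_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_reader;

-- Since 9.1, replication permission is a role attribute set via DDL
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';
```

`pg_hba.conf` then only governs which hosts and auth methods may connect at all, which is typically a set-once concern.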


Ah for some reason I can reply now:

Regardless, being two decades late to the party is unacceptable - it's still not complete and makes real-world management a chore. I still admire the purist attitude, and it has its place, but you can't then also claim it's a useful competitor - which is what they do.

Unfortunately it's a Perl/Postgres-vs-the-world thing - both are irrelevant.


That they were late to the party doesn’t really matter _now_. It might have mattered five years ago.

As for the config, one might argue that reconfiguring replication isn’t a common task, usually something you do when setting up the cluster and never mess with again.

I understand that there are pain points running PostgreSQL at massive scale. But 99.9% of projects needing a database never reach that scale. And once they do, they probably need to re-engineer big chunks of the application anyway.


Well this is one aspect - an application should not have to care about the underlying DB topology; if it does, then the DB is wrong. My main issue, tbh, is that other databases implement the inevitable, like proxying or sharding, while Postgres leaves it to horrific Perl scripts from the last century. It's really not admissible as a production solution


> while postgres leaves it to horrific perl scripts from the last century

Your knowledge of what Postgres supports and how to administrate it seems to be a few years out of date. There's plenty of examples in this thread alone.


> PostgreSQL 9.0, six major versions ago

Correction: thirteen major versions ago, approx. thirteen years ago. 9.x were all major releases.


In my next backend, I want to use serverless functions (Cloudflare), serverless Postgres (Neon), and nothing more.


We will certainly make it possible. There is something else to consider: the software development lifecycle and native support for previews, schema PRs, and deploys. There's a lot to build to give a developer real confidence that a PR won't break anything and won't regress performance. We will make Vercel + Neon support this extremely well.


The graphic is misleading:

- You can easily do prerendering and asset packing in the frontend build step

- You don't need MVC on the backend

- You can use a third party for authorization

- Frontend can just be a React app

- VirtualDOM doesn't add any complexity in addition to the frontend View

- You don't need MVC on the frontend

- Frontend can fetch data from the API and cache it with cache-control header

The simplest setup is actually to host a React app from an S3 bucket and use Lambda / Cloud Functions to respond to network requests.


> The simplest setup is actually to host a React app from an S3 bucket and use Lambda / Cloud Functions to respond to network requests.

You'd have to pry my monolith-on-a-vps out of my cold dead hands before you can call that the simplest setup.


If you’re going to “use for everything” why not be more creative?

Use DynamoDB for everything? Use SQLite + S3 for everything. Heck use Redis for everything.

I understand that Postgres is insanely flexible. It supports pretty much any type of data storage you want.

But if you’re going to focus all your developer knowledge on one product, consider whether another DB might be better.


> Use Postgres with TimescaleDB as a data warehouse.

How does this stack compare to Snowflake or Redshift?


Redshift is just a heavily customized Postgres


It uses Postgres wire protocol sure, and may even contain some Postgres code; otherwise this is like saying “a tank is just a heavily customised car”


C++ is heavily customized C. The heavy customization makes Redshift a columnar database, more ideal for querying large amounts of data quickly. How does Timescale help Postgres in this area?


Timescale is built around a concept they call "hypertables", which automatically partition data into a set of smaller tables segmented by time range. Timescale exposes the time-series data as if it was a single table, but behind the scenes is managing queries against the individual table partitions and automatically creating new partitions as data is inserted.

By tuning the chunk sizes so their data fits in memory, many common queries gain a lot of efficiency. It's built around some assumptions of time-series data: Most inserts and queries are for recent data and are generally ordered.

I've had great experience with TimescaleDB for small-medium time-series loads such as sensor or analytics data; I've found it's pretty plug-and-play and have used it to store tables with ~1B time-series rows of geospatial data, sensor values, etc.
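A minimal sketch of that setup using TimescaleDB's documented `create_hypertable` API (table and column names invented; chunk interval shown only to illustrate the tuning knob mentioned above):

```sql
-- Requires the TimescaleDB extension to be installed
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE sensor_data (
  time   timestamptz NOT NULL,
  sensor int NOT NULL,
  value  double precision
);

-- Convert to a hypertable: reads and writes go through 'sensor_data'
-- as if it were one table, while storage is chunked by time range.
SELECT create_hypertable('sensor_data', 'time',
                         chunk_time_interval => interval '1 day');
```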


Hot take but I prefer dynamo for storing persistent state. Much easier to scale.


Apples and oranges. DynamoDB is wicked fast for what it does. If you know your use cases ahead of time, you can figure out a data storage strategy where queries and secondary indexes can work wonderfully. $deity help you when the use cases change though. Rearchitect->dump->transform->reload.

I really like DynamoDB. Its functional overlap with a fully-featured relational database is very narrow though. No graph queries or joins or advanced data validation or...basically anything on this list: https://www.sql-workbench.eu/dbms_comparison.html

Postgres can be made to work like DynamoDB: simple queries with no joins, table partitioning and tablespaces for I/O parallelization, etc.

DynamoDB barely scratches the surface the other way around. PartiQL is the limit and it's a very basic limit.
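A sketch of the "Postgres worked like DynamoDB" shape described above — single-table design with partition/sort keys and schemaless values (all names invented):

```sql
-- DynamoDB-style single-table layout on Postgres
CREATE TABLE kv (
  pk    text  NOT NULL,  -- partition key
  sk    text  NOT NULL,  -- sort key
  value jsonb NOT NULL,  -- schemaless item body
  PRIMARY KEY (pk, sk)
);

-- Point read
SELECT value FROM kv WHERE pk = 'user#42' AND sk = 'profile';

-- Range query on the sort key, no joins needed
SELECT value FROM kv WHERE pk = 'user#42' AND sk LIKE 'order#%';
```

The difference is that when requirements change, the full relational feature set is still one `JOIN` away, whereas DynamoDB requires the rearchitect-dump-reload cycle described above.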


I love Postgres as much as the next dev, but why are COUNTs so slow?


Because of how MVCC is implemented in Postgres. However, it's not terribly hard to add efficient count tracking that works with this implementation of MVCC, using row triggers. I use it often in production.

This might change in the future with the zHeap storage backend.
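A sketch of the trigger-based count tracking pattern being described (not the commenter's actual code; all names invented). One caveat: a single counter row serializes concurrent writers on that table, so very hot tables may want multiple counter rows summed at read time:

```sql
-- Counter table, seeded once per tracked table
CREATE TABLE row_counts (tbl text PRIMARY KEY, n bigint NOT NULL);
INSERT INTO row_counts VALUES ('items', 0);

CREATE FUNCTION track_count() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE row_counts SET n = n + 1 WHERE tbl = TG_TABLE_NAME;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE row_counts SET n = n - 1 WHERE tbl = TG_TABLE_NAME;
  END IF;
  RETURN NULL;  -- AFTER triggers ignore the return value
END $$ LANGUAGE plpgsql;

CREATE TRIGGER items_count
  AFTER INSERT OR DELETE ON items
  FOR EACH ROW EXECUTE FUNCTION track_count();

-- Now an O(1) lookup instead of a full scan:
SELECT n FROM row_counts WHERE tbl = 'items';
```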


Ouch. That first image has really bad kerning. It hurts my eyes.


I'll correct it... just use the filesystem for everything.


Not read it. But. I second the title.


I don't know if I'd want to iterate with a datastore that requires me to define a schema.


You might want to look into tools for "evolutionary databases", or "database migration" tools as they are better known today.

Btw, your comment strikes me as similar to saying "I don't want to iterate with a strongly typed language".


I think, strongly typed is okay. Same goes for statically typed.

Just nominally typed isn't my thing.


Proper databases like Postgres are pretty strongly and statically typed.

Eg. try inserting a string into an int column and watch it complain. It even supports custom types out of the box, including enums.
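For example (error message approximate; exact wording varies by version):

```sql
CREATE TYPE mood AS ENUM ('happy', 'sad');
CREATE TABLE t (n int, m mood);

INSERT INTO t VALUES ('not a number', 'happy');
-- ERROR:  invalid input syntax for type integer: "not a number"

INSERT INTO t VALUES (42, 'grumpy');
-- ERROR:  invalid input value for enum mood: "grumpy"
```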


You can use a spoon as a knife; it will just take you more effort.

You could probably use Magic: The Gathering cards as a computer, since they're Turing complete, but it would just be inconvenient as hell.



