
Tangentially related: my personal open source project (ntfy.sh) is starting to outgrow a one-node, SQLite-backed setup, and I am now faced with the decision we are talking about here (something I would not have imagined, I might add): do I use some SQLite-replication thing, or do I rewrite it all to work with an RDBMS and deal with managing that? It's quite the annoying choice, so I really wish there was a great SQLite-backed solution, hehe.



If by "outgrow" you mean you want to improve concurrent read/write performance, availability, and throughput via replication, then I would -- personally, for peace of mind and ease -- look at just buying a managed RDBMS from someone with predictable pricing and using that. Buy PlanetScale, Aurora, Neon, CockroachDB Cloud, etc and let them handle it. PlanetScale and Aurora in particular have plans that let you fix the I/O costs and only pay for compute and storage, which is pretty attractive. They also use standard MySQL/Postgres, so you can migrate in/out to other providers later. It does mean you must spend time on the migration though, and spend time understanding the implications of your new storage choice (latency, failure modes, etc.)

If you just want peace of mind and better disaster handling, and your current uptime, concurrent workload split, and performance are fine -- maybe your hosting provider eats it for an hour or two and you don't want to be hamstrung -- I would suggest just using a tool like Litestream, replicating the DB to S3, and setting up a hot standby server in some alternative region that you can fail over to and that actively synchronizes the working set. Always keep them up to date with deployment automation. Set up some alerting, and if a failure happens, terminate the main instance with prejudice, let the replica catch up by replaying any latest changes, and point your load balancer at it (Cloudflare tunnels are a good low-tech way to do that). This might imply a small downtime window for the hot failover.

For bonus points, you can automate this whole task and actively perform it regularly, swapping between servers on a regular cadence -- thus turning the design from having a primary/standby into simply having two interchangeable systems that swap roles. Then you have good confidence in hot-failover disaster recovery. E.g., just do this whole dance every week on Monday at 1am UTC, and you can have confidence it works and will stay working.
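To make the Litestream piece concrete, here's a minimal sketch of what the primary and the standby could run. The database path, bucket name, and config location are placeholders I made up, not anything ntfy-specific, and you'd want to double-check the flags against the Litestream docs for the version you install:

    # /etc/litestream.yml on the primary (paths and bucket are hypothetical):
    #
    #   dbs:
    #     - path: /var/lib/ntfy/cache.db
    #       replicas:
    #         - url: s3://example-ntfy-backups/cache.db

    # Primary: continuously stream WAL changes to S3
    litestream replicate -config /etc/litestream.yml

    # Standby (or during failover): pull the latest snapshot + WAL from S3
    # before starting the app against the restored file
    litestream restore -if-replica-exists -o /var/lib/ntfy/cache.db \
      s3://example-ntfy-backups/cache.db

The standby restore plus the load balancer/tunnel flip is essentially the whole failover; the deployment automation mentioned above is what keeps both boxes running the same app version.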

I do not know what other architectural constraints you might have. These are just suggestions. Good luck!


I know it's heresy, but could you defer the decision by sharding? E.g. customers whose email hashes to a positive number go to instance A; the rest go to instance B. That might buy you some time.
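As a rough illustration (the hostnames and the parity-of-an-FNV-hash rule are just stand-ins for "hashes to a positive number", nothing ntfy-specific), the routing could be as small as:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // shardFor picks one of two hypothetical ntfy instances based on a
    // stable hash of the customer's email address.
    func shardFor(email string) string {
        h := fnv.New32a()
        h.Write([]byte(email))
        if h.Sum32()%2 == 0 {
            return "https://a.ntfy.example.com" // "instance A"
        }
        return "https://b.ntfy.example.com" // "instance B"
    }

    func main() {
        fmt.Println(shardFor("alice@example.com"))
    }

The catch is that anything that needs to query across customers (admin views, metrics) now has to fan out to both instances.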


Can you throw more RAM and CPU at the problem instead?


I have and it works really well. This is less about SQLite not being able to handle it -- because it does -- and more about reliability and availability of the overall system. I don't want one server being down to affect the entire system.


https://github.com/maxpert/marmot is SQLite replication that's eventually consistent and doesn't require a master server.


Fascinating, thanks. I'll check it out.


Congrats, that's a great position to be in.


> some SQLite-replication thing

This feels like an uncharitable interpretation of the maturity of projects like LiteFS/rqlite/dqlite.


The vagueness of this comment and its perceived derogatory nature are merely a result of my ignorance around these technologies. It was not meant the way you perceived it.

That said, I looked at two of the three a while ago:

- rqlite looks great and mature, though it has an HTTP interface, so I'll need to rewrite my database layer anyway to use it. At that point I might as well switch to Postgres.

- dqlite seemed like a failed Canonical experiment to me last time I checked, but looking again I think I may have had the wrong impression.


There is also Litestream.

And: the SQLite homepage runs on -- you guessed it -- SQLite. Maybe they have a doc or blog post describing how they manage their traffic.



