What is the difference between one server with many TB vs multiple servers with ...

Drdrdrq · on Jan 26, 2020

With multiple servers you can add space to each of them. With a single one there is a much lower limit to what you can do - that's the idea behind vertical vs. horizontal scalability. That, and systems with multiple nodes can be made more reliable than single-node servers.

heipei · on Jan 27, 2020

One server is still going to run out eventually. To give you a very concrete example: the dedicated boxes I use at Hetzner have 2x1TB NVMe SSD by default. I can order additional disks for sure, but even so you'll struggle to get above a few TB of NVMe per box. Adding another box, on the other hand, is cheap and easy.

Plus, a single server with many TB is a big single-point-of-failure, and if you want to scale it (vertically) you still have to take it down.

cjp222 · on Jan 27, 2020

If the data is active, it's not enough to just throw more storage on the server. More storage means more of other things: memory, I/O bandwidth, processing power. You could keep adding those as well, but eventually it's faster and much cheaper to add additional servers.

scurvy · on Jan 27, 2020

Backups are easier/faster? A machine with a 5TB table will take forever to dump with a single thread, but 5 servers with 1TB shards will dump it more quickly.
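To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes each server dumps at the same fixed throughput and that the shards dump in parallel without sharing a bottleneck; the 100 MB/s figure and the `dump_hours` helper are made up for illustration, not measurements.

```python
def dump_hours(total_tb: float, shards: int, mb_per_s: float) -> float:
    """Wall-clock hours to dump total_tb split evenly across shards.

    Assumes one dump thread per shard, all shards running in parallel,
    and the same (hypothetical) throughput on every server.
    """
    per_shard_mb = total_tb * 1_000_000 / shards  # 1 TB = 1,000,000 MB (decimal)
    return per_shard_mb / mb_per_s / 3600


# Hypothetical throughput of 100 MB/s per dump thread.
single = dump_hours(total_tb=5, shards=1, mb_per_s=100)
sharded = dump_hours(total_tb=5, shards=5, mb_per_s=100)

print(f"1 x 5TB: {single:.1f} h")   # 1 x 5TB: 13.9 h
print(f"5 x 1TB: {sharded:.1f} h")  # 5 x 1TB: 2.8 h
```

Under those assumptions the speedup is simply the number of shards. In practice a shared backup target or network link would eat into that.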

AkshatM · on Jan 27, 2020

Vertical scaling (getting a beefier machine) is less preferable than horizontal scaling in the case of cloud providers, because a) it usually costs more to upgrade instance types than to run multiple smaller instances, and b) eventually you hit a ceiling of available instance types as you grow.

But you can still make this work. In our case, we ended up going the first route (vertical), and we put our Postgres database on AWS EBS as the block store, which is easy to resize dynamically without incurring downtime or other issues.

The downside of a vertical scaling approach is, well, you don't get HA "for free". You have to manually configure followers and standby nodes for the sole machine. You have to worry about your own replication. If failover and management are abstracted out for you, as in RDS, then it's easy to live with - otherwise it's very painful.

tl;dr go with managed DBs if you just need a DB and don't particularly have to optimise query performance outside of DB config.

d_t_w · on Jan 26, 2020

Availability is one; most distributed systems also replicate.
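As a rough illustration of why replication helps availability, here is a minimal sketch. It assumes every replica is up the same fraction of the time and that replicas fail independently of each other, which real systems only approximate; the 99% figure and the `replicated_availability` helper are hypothetical.

```python
def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` nodes is up.

    Assumes independent failures: the data is unavailable only when
    every replica is down at the same time.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - node_availability) ** replicas


# Hypothetical node that is up 99% of the time.
for n in (1, 2, 3):
    print(n, replicated_availability(0.99, n))
```

With those assumptions, one node gives 99%, two give about 99.99%, and three about 99.9999%. Correlated failures (same rack, same bad deploy) make the real numbers worse.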