
5.7 TB is small by database standards. I work at a much smaller company and deal with "proportionally" much more data.

I don't know if it would be worth the engineering effort to try to archive old data in a way that still makes it transparently accessible to users who go looking for it--especially when modern databases ought to be able to scale up and out without manual archiving.




5.7 TB for an OLTP database is small?! I must be living in a different world. Obviously I know you can go that big, but I thought the number of use-cases would be limited.


Why does my browser routinely eat 8GB when it used to require only 32MB 25 years ago? Because it can. Web services likewise come up with features and data to fill databases.

For $8/hr you can rent a DB with 500 GB of memory and 64 cores, complete with redundancy, automated backups, and failover. For the hourly rate of an Oracle consultant you can rent a DB with 2TB of memory for the day.

Bear in mind that many of these workloads are trivially shardable (e.g. any table keyed off customer ID) and can be scaled across hundreds of DBs as required.
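
A minimal sketch of what that routing can look like, assuming a PostgreSQL driver and placeholder shard DSNs (the table and column names are invented for illustration):

    import hashlib

    import psycopg2  # assumed driver; the DSNs below are placeholders

    # Hypothetical connection strings, one per shard.
    SHARD_DSNS = [
        "dbname=app_shard_0 host=db0.internal",
        "dbname=app_shard_1 host=db1.internal",
        "dbname=app_shard_2 host=db2.internal",
    ]

    def shard_for(customer_id: int) -> int:
        """Deterministically map a customer id to a shard index."""
        digest = hashlib.sha1(str(customer_id).encode()).hexdigest()
        return int(digest, 16) % len(SHARD_DSNS)

    def fetch_orders(customer_id: int):
        # Everything keyed off customer_id lives on one shard, so this stays a
        # plain single-database query.
        conn = psycopg2.connect(SHARD_DSNS[shard_for(customer_id)])
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, total FROM orders WHERE customer_id = %s",
                (customer_id,),
            )
            return cur.fetchall()

The modulo mapping is only for illustration; real setups usually use a directory table or consistent hashing so you can add shards without reshuffling everything.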


If data is shardable, that doesn't mean it is trivial. In your example of sharding by user, a simple message sent between users in the app becomes a very non-trivial dance to do reliably.


If you recognize that there isn't a good automated solution for the DB to smartly join messages, then it becomes a fairly straightforward problem once again.

E.g. the simple solution is to denormalize the table and have each message keyed by recipient. In a dating app you'll roughly double your message count this way, and even in most messenger apps person-to-person messages likely make up the largest share.
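
A sketch of that denormalized write, with an invented messages table keyed by the owning user and placeholder shard DSNs:

    import hashlib

    import psycopg2  # assumed driver; DSNs are placeholders

    SHARD_DSNS = ["dbname=msg_shard_0", "dbname=msg_shard_1", "dbname=msg_shard_2"]

    def shard_conn(user_id: int):
        idx = int(hashlib.sha1(str(user_id).encode()).hexdigest(), 16) % len(SHARD_DSNS)
        return psycopg2.connect(SHARD_DSNS[idx])

    def send_message(sender_id: int, recipient_id: int, body: str) -> None:
        # Write one copy per participant, each landing on that user's shard, so
        # reading an inbox is always a single-shard query on owner_id.
        for owner_id in {sender_id, recipient_id}:
            with shard_conn(owner_id) as conn, conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO messages (owner_id, sender_id, recipient_id, body) "
                    "VALUES (%s, %s, %s, %s)",
                    (owner_id, sender_id, recipient_id, body),
                )

The cost is the roughly doubled message count mentioned above; the win is that reads never cross shards.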

A smarter solution is to key each conversation by a unique key in a sharded table, and then store, in the sharded users table, the set of "conversations" each user is engaged in. Fetching the messages for a user then becomes a simple 2-query process - fetch/filter the conversations, then fetch the messages. No duplication of messages, and likely just a few extra bytes per message for the key.
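
A rough sketch of that two-query fetch, again with invented shard lists and schema:

    import hashlib

    import psycopg2  # assumed driver; DSNs are placeholders

    USER_SHARDS = ["dbname=users_0", "dbname=users_1"]
    CONVO_SHARDS = ["dbname=convos_0", "dbname=convos_1", "dbname=convos_2"]

    def pick(shards, key):
        idx = int(hashlib.sha1(str(key).encode()).hexdigest(), 16) % len(shards)
        return psycopg2.connect(shards[idx])

    def messages_for(user_id: int, limit: int = 50):
        # Query 1: the user's shard holds the set of conversation ids they belong to.
        with pick(USER_SHARDS, user_id) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT conversation_id FROM user_conversations WHERE user_id = %s",
                (user_id,),
            )
            convo_ids = [row[0] for row in cur.fetchall()]

        # Query 2: each conversation's messages live on that conversation's shard.
        # (In practice you'd batch the ids per shard instead of looping one by one.)
        messages = []
        for convo_id in convo_ids:
            with pick(CONVO_SHARDS, convo_id) as conn, conn.cursor() as cur:
                cur.execute(
                    "SELECT id, sender_id, body, sent_at FROM messages "
                    "WHERE conversation_id = %s ORDER BY sent_at DESC LIMIT %s",
                    (convo_id, limit),
                )
                messages.extend(cur.fetchall())
        return messages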

It would be great if the DB could manage the above application-side sharded join internally, but we're unfortunately a few steps away from that today.


It doesn't matter how you arrange the data: the moment you need to commit to 2 shards transactionally, you either have consistency trouble or performance trouble.

Both your schemas require writes to at least 2 shards transactionally.
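
To make that concrete, here is a naive dual write under the denormalized scheme (connections and schema invented): if the process dies between the two commits, one participant's shard has the message and the other's never will.

    def naive_dual_write(sender_conn, recipient_conn, sender_id, recipient_id, body):
        # Commit 1: the copy on the sender's shard.
        with sender_conn, sender_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO messages (owner_id, body) VALUES (%s, %s)",
                (sender_id, body),
            )
        # A crash right here leaves the two shards permanently inconsistent unless
        # you add two-phase commit, an outbox table, or an idempotent retry queue --
        # which is exactly the consistency-vs-performance trade-off above.
        # Commit 2: the copy on the recipient's shard.
        with recipient_conn, recipient_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO messages (owner_id, body) VALUES (%s, %s)",
                (recipient_id, body),
            )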


Then you don't do chats on your primary RDBMS.


It's huge by database standards. I worked at large multinationals and dealt with some of their largest databases.

5.7 TB is enormous by database standards. There is no way you can get good query performance on a 5.7 TB database without solid physical partitioning and heavily optimized queries, and most normal companies, even with a 200-500 GB database, use data marts to get good performance without a super complex architecture and geniuses working for you in the DB admin department.

The more I think about it, the more I think that 5.7 TB would be unusably huge. With that much data, most won't even bother to partition; the DB will be broken into several (or hundreds of) smaller databases.


5TB is not that large and it's not that difficult to get good performance. We operated a 50TB single instance of MySQL for 5 years with a tiny team before migrating to Vitess, and it was basically zero maintenance. We did partition our largest table, which just requires a little extra SQL, but is otherwise transparent.
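
For anyone curious what that "little extra SQL" might look like, here is a sketch of MySQL range partitioning by date (table, columns, and connection details are invented, not the parent's actual setup):

    import mysql.connector  # assumed client library; connection params are placeholders

    DDL = """
    CREATE TABLE events (
        id BIGINT NOT NULL AUTO_INCREMENT,
        created_at DATETIME NOT NULL,
        payload JSON,
        PRIMARY KEY (id, created_at)  -- the partition column must appear in every unique key
    )
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
    """

    conn = mysql.connector.connect(host="db.internal", user="app", database="app")
    cur = conn.cursor()
    cur.execute(DDL)  # queries that filter on created_at get partition pruning;
    cur.close()       # everything else reads and writes as if it were one table
    conn.close()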


You guys are talking past each other, as your workloads appear to be different. With a traditional RDBMS, size usually wouldn't be a bottleneck, as long as you can add enough disks to a single server.

Write operations per second, that's the metric I would care about. A 50TB instance with a low volume of write operations can be zero maintenance, while a 500GB instance with a high volume of write operations can be a real pain.


Thank you for validating my troubles with a write-heavy ~500GB db in the midst of a TB-measuring contest :)



