
I wonder what the standard for "big" is in this context. For example, I always thought that a few million rows was very big, but I only recently learned that it's a size an RDBMS such as PostgreSQL can handle with no problem.



For RDBMS we typically consider millions to be tiny to small. A billion is somewhere on the boundary of medium to big.


Whatever "big" is defined as, at minimum it should mean your data can't fit into RAM on a high-end server.

There's also the threshold where your indexes don't fit into RAM. And the threshold where your data no longer fits into PCIe SSDs on a single server. (The combined bandwidth of the SSDs will rival the RAM, but with more latency.)
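The claim that aggregate SSD bandwidth rivals RAM can be sanity-checked with rough numbers. This is a sketch with assumed figures (drive count, per-drive and per-channel throughput are illustrative, not from the comment):

```python
# Rough bandwidth comparison (all figures are assumptions): aggregate
# sequential read across many PCIe 4.0 NVMe drives vs. one server's DRAM.
nvme_gbps = 7        # assumed per-drive sequential read, PCIe 4.0 x4, GB/s
drives = 16          # assumed drive count in a dense storage server
ssd_total = nvme_gbps * drives

dram_channels = 8    # assumed memory channels per socket
dram_gbps = 25       # assumed per-channel throughput (~DDR4-3200), GB/s
ram_total = dram_channels * dram_gbps

print(f"SSD aggregate: {ssd_total} GB/s, RAM: {ram_total} GB/s")
# SSD aggregate lands in the same ballpark as RAM bandwidth,
# though per-access latency is orders of magnitude worse.
```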


These days I’d probably describe “big” as “doesn’t make sense to use SQL anymore”.

Qualitatively, I think it becomes “big” when you have to leave the space of generic “it just works” technologies and you have to start doing bespoke optimizations. It’s amazing how far you can get these days without having to go custom.


These days terabytes is a medium-sized database. A trillion rows of indexed data will fit on a single cloud VM and be reasonably performant.

I think a good definition of "large" is "several times larger than will fit on a practical modern server". Servers with a petabyte or more of fast attached storage are increasingly common, so that threshold is pretty high. Machine-generated data models (sensing, telemetry, et al) routinely exceed this threshold though.
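The "trillion rows on a single VM" claim is easy to check with back-of-envelope arithmetic. A minimal sketch, assuming illustrative row and index-entry sizes (neither figure comes from the comment):

```python
# Does a trillion indexed rows fit on one large server? All sizes assumed.
rows = 10**12
row_bytes = 100      # assumed average row size, including tuple overhead
index_bytes = 40     # assumed per-row cost of one B-tree index entry

table_tb = rows * row_bytes / 1e12   # terabytes
index_tb = rows * index_bytes / 1e12
total_tb = table_tb + index_tb

print(f"table: {table_tb:.0f} TB, index: {index_tb:.0f} TB, "
      f"total: {total_tb:.0f} TB")
# -> table: 100 TB, index: 40 TB, total: 140 TB
# Well under the petabyte of fast attached storage mentioned above.
```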


> A trillion rows of indexed data will fit on a single cloud VM and be reasonably performant.

Creating and subsequently updating an index on a trillion rows is a very nontrivial task performance-wise. Which DB would you suggest for this?


You just have to plan and do it in the background at those data sizes.


> just have to plan and do it in the background

This can mean building some nontrivial infra, so the task is much more complicated than just using a cloud VM.

My point is that the bottleneck, and the reason you need a cluster and not a single VM, is likely CPU and IOPS, not data storage.


> this can mean building some nontrivial infra

As someone who does this all the time on 10B+ record tables, not really? If you don't have the spare resources to build the occasional index, your DB is under-provisioned and you're close to falling over, cluster or not.


A 10B+ record table and trillions of indexed rows on a single machine are very different tasks.


Ah, I forgot the grandparent comment said trillion. Yeah, that's an order of magnitude I would distribute across a cluster if we were in active development and indexes were changing, etc. If you're storing that much data you should be able to afford it :)


I think 1 PB at least.



