If my business depended on it? I can click a few buttons and have an 8 TiB Supermicro server on my doorstep in a few days if I wanted to colo it. EC2 High Memory instances offer 3, 6, 9, 12, 18, and 24 TiB of memory in a single instance if that's the kind of service you want. Azure Mv2 also goes from 2,850 to 11,400 GiB.
BigQuery and Snowflake are software. They come with a SQL engine, data governance, integration with your LDAP, and auditing.
Loading data into Snowflake isn't over-engineering.
What you described is over-engineering.
No business is passing 6 TB of data around on their laptops.
I personally don't, but our compute cluster at work has around 50,000 CPU cores. I can request specific configurations through LSF, and there are at least 100 machines with over 4 TB of RAM, and that was 3 years ago. By now there are probably machines with more than that. Those machines are usually reserved for specific tasks that I don't do, but if I really needed one I could get approval.
Right, which is why you can mmap way more data than you have RAM and treat it as though you do have that much RAM.
It'll be slower, perhaps by a lot, but most "big data" stuff is already so goddamned slow that mmap probably still beats it, while being immeasurably simpler and cheaper.
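A minimal sketch of the idea in Python. The 1 MiB file here is just a stand-in; the same code works unchanged for files far larger than RAM, because the mapping is lazy and pages are only faulted in as you touch them (`madvise` is Unix-only, hence the guard):

```python
import mmap
import os

# Tiny stand-in file; in real use this could be far larger than RAM,
# since mmap only faults pages in as they are accessed.
path = "demo.bin"
with open(path, "wb") as f:
    f.write(b"A" * (1 << 20))  # 1 MiB

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The mapping behaves like an in-memory bytes object.
    first = mm[0:4]   # slicing works like bytes
    size = len(mm)    # full file size, regardless of physical RAM
    # Optional hint for a front-to-back scan (Unix, Python 3.8+).
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
    mm.close()

os.remove(path)
print(first, size)
```

The whole trick is that line one of the read path never changes whether the file is 1 MiB or 6 TB; the kernel's page cache does the tiering for you.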
Really depends on the shape of the data. mmap can be suboptimal in many cases.
For CSV it flat-out doesn't matter what you do, since the format is so inefficient and needs to be read start to finish. But something like Parquet probably benefits from explicit read syscalls, since it's block-based and highly structured, so you can predict the read patterns much better than the kernel can.
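For illustration only (the offsets below are made up, not real Parquet metadata): with a block-based format the footer tells you the exact byte range of each column chunk, so you can hint and read it directly with a positioned read instead of hoping the kernel's readahead guesses right. A sketch using Unix-only `os.posix_fadvise` and `os.pread`:

```python
import os

# Fake "column chunk" layout for illustration; a real Parquet reader
# would get these (offset, length) pairs from the file footer.
path = "blocks.bin"
with open(path, "wb") as f:
    f.write(b"\x00" * 4096 + b"COLUMN-CHUNK" + b"\x00" * 4096)

fd = os.open(path, os.O_RDONLY)
offset, length = 4096, 12  # known ahead of time from (hypothetical) metadata

# Tell the kernel we will want exactly this range soon (Unix only),
# then fetch it with a single positioned read.
if hasattr(os, "posix_fadvise"):
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
chunk = os.pread(fd, length, offset)

os.close(fd)
os.remove(path)
print(chunk)
```

With mmap you'd instead touch those pages and eat page faults; explicit reads let you batch and prefetch exactly the ranges the metadata says you need.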
The "(multiple times)" part probably means batching or streaming.
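As in, something like this generic streaming loop (a sketch, not anyone's actual pipeline): process the file in fixed-size chunks so the working set stays bounded no matter how large the input is:

```python
import os

def stream_chunks(path, chunk_bytes=64 * 1024 * 1024):
    """Yield the file in fixed-size pieces; memory use stays ~chunk_bytes."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                return
            yield chunk

# Toy usage: total the bytes of a small file without holding it whole.
path = "stream-demo.bin"
with open(path, "wb") as f:
    f.write(b"z" * 10_000)

total = sum(len(c) for c in stream_chunks(path, chunk_bytes=4096))
os.remove(path)
print(total)  # 10000
```

Swap the `sum` for whatever per-batch work you need; the point is that a 6 TB input only ever costs you one chunk of RAM at a time.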
But yeah, they might have that much RAM. At a rather small company I was at, we had a third of that in the virtualisation cluster. I routinely loaded customer databases in the hundreds of gigabytes into RAM to do bug triage and fixing.