In the distributed computing world, the rule of thumb is that you start scaling horizontally when your workload no longer fits in the memory of a single machine. So it depends on your workload and your hardware; there's no fixed number for what counts as a large dataset.
DuckDB itself doesn't have any baked-in limits. If the data fits in memory, single-machine compute is usually faster than distributed compute, and DuckDB is faster than Pandas and definitely faster than local Spark.
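To give a sense of what that looks like in practice, here's a minimal sketch using the DuckDB Python client to aggregate a local Parquet file on a single machine (the file name `orders.parquet` and the column names are made up for illustration):

```python
import duckdb

# In-memory DuckDB connection; no server, no cluster to set up.
con = duckdb.connect()

# Run an aggregation directly over a Parquet file on disk.
# Only the aggregated result is materialised as a Pandas DataFrame.
result = con.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_parquet('orders.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""").fetchdf()

print(result)
```

The same query in Pandas would typically mean loading the whole file into a DataFrame first, which is exactly the kind of workload where a single-machine columnar engine tends to win before you reach for a cluster.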