> Databricks has a product called Delta Lake that covers the infinitely scalable...

alexott · on May 27, 2022

No, it’s not just an API layer, but a significant add-on on top of existing tech… It brings transactions, faster queries, time travel, etc.

FridgeSeal · on May 27, 2022

It’s a Parquet file, with some JSON as metadata to define the version boundaries.

That’s all.

You too can have transactions if you implement your access layer appropriately.
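To make the "implement your access layer appropriately" idea concrete, here is a minimal sketch of an optimistic commit against a log directory on a local filesystem. The function names `try_commit` and `latest_version` are hypothetical, the zero-padded file naming only loosely mirrors what a `_delta_log` directory looks like, and it assumes the storage layer offers an atomic create-if-absent operation, which not every object store does.

```python
import json
import os


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to write commit `version`; return False if another writer got there first."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes the create fail if the file already exists, so at most
        # one writer can claim a given version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        # One JSON action per line (for example, a data file being added).
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def latest_version(log_dir: str) -> int:
    """Return the highest committed version, or -1 for an empty log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)
```

A writer that loses the race would re-read the log, check whether the winning commit conflicts with its own changes, and retry at the next version. This sketch leaves out that conflict check, along with crash safety for a partially written commit file.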

It’s a useful format, but let’s not pretend it’s any more magic than it actually is. I’ve also not noticed any improved performance above what partitioning would give you.

manish_gill · on May 27, 2022

Delta's metadata-based min/max statistics, Parquet row groups with their in-file min/max, and the right combination of bucketing and sorting allowed us to ignore partitioning entirely and drastically improve performance at PB scale. The metadata can be a very useful form of indexing and lets you skip a lot of data.

It's not magic, and you can understand what the _delta_log is doing fairly easily, but I can testify that a performance improvement over partitioning is achievable.
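As a rough illustration of how min/max statistics allow files to be skipped, the sketch below keeps one `(min, max)` pair per file and prunes every file whose range cannot overlap the query range. `FileStats` and `files_to_scan` are hypothetical names, and the file list is made-up example data, not taken from any real table.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lo, hi):
    """Return paths of files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Data sorted on the filter column gives narrow, non-overlapping ranges,
# which is what makes the pruning effective.
files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

print(files_to_scan(files, 120, 180))  # ['part-1.parquet']
```

If the data were not sorted or clustered on that column, each file's range would likely span most of the value domain and almost nothing would be pruned, which is why the sorting matters.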

alexott · on May 27, 2022

And some queries, like max(col), could be answered without touching the data…
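A sketch of that idea, assuming each live file has an entry whose `stats` carry a per-column `maxValues` map (the shape here is a simplification, and `max_from_metadata` is a hypothetical helper): the global maximum is just the maximum of the per-file maxima, so no data file has to be opened.

```python
from typing import Optional


def max_from_metadata(add_actions: list, column: str) -> Optional[int]:
    """Return max(column) from per-file stats, or None if the stats are incomplete."""
    maxima = []
    for action in add_actions:
        value = action.get("stats", {}).get("maxValues", {}).get(column)
        if value is None:
            # Stats missing for this file: the caller must fall back to a scan.
            return None
        maxima.append(value)
    return max(maxima, default=None)


# Made-up example entries, one per data file.
actions = [
    {"path": "part-0.parquet", "stats": {"maxValues": {"col": 99}}},
    {"path": "part-1.parquet", "stats": {"maxValues": {"col": 199}}},
]

print(max_from_metadata(actions, "col"))  # 199
```

The shortcut is only safe when the statistics are exact and present for every file that is still part of the table, hence the conservative `None` fallback.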

Epa095 · on May 27, 2022

>You too can have transactions if you implement your access layer appropriately.

Yes, we can all "just" make it ourselves, but it's still nice when others make it.

Personally, I very much like that it's conceptually easy and that it builds on an open source format (Parquet). I also expect there to be dragons, both small and large, in actually getting the ACID stuff right, so I am happy we can fight them together.

alexott · on May 27, 2022

Regarding performance: try version 1.2.x or higher. It includes data skipping, etc., from what was previously a Databricks-only implementation.

Regarding “that’s all”: can you give an estimate of how much time it would take to reimplement it from scratch on multiple cloud platforms?

riku_iki · on May 27, 2022

Maybe, but the "infinitely scalable storage" part comes from the lower-level layer.

Epa095 · on May 27, 2022

Of course, but there are many ways one could design the storage format so that it would not scale nicely with the underlying layer.

riku_iki · on May 27, 2022

Like how?