
It’s a parquet file, with some json as metadata to define the version boundaries.

That’s all.
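Concretely (table path made up for the example), it's a directory of parquet files plus newline-delimited JSON commits you can walk with nothing but the standard library; each line of a commit is one action (add, remove, metaData, protocol, commitInfo):

    # Sketch only; "my_table" is hypothetical.
    import glob, json

    for commit in sorted(glob.glob("my_table/_delta_log/*.json")):
        with open(commit) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:        # a parquet file entered this version
                    print(commit, "+", action["add"]["path"])
                elif "remove" in action:   # a parquet file left this version
                    print(commit, "-", action["remove"]["path"])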

You too can have transactions if you implement your access layer appropriately.
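The "appropriately" is the whole trick, though: version N belongs to whoever creates _delta_log/N.json first, so you need an atomic create-if-absent. A toy sketch on a local filesystem (hypothetical helper; object stores need extra machinery to get the same guarantee):

    import json, os

    def try_commit(table_path, version, actions):
        # "x" mode = create exclusively, so exactly one writer
        # can claim this version number.
        path = os.path.join(table_path, "_delta_log", "%020d.json" % version)
        try:
            with open(path, "x") as f:
                f.write("\n".join(json.dumps(a) for a in actions))
            return True
        except FileExistsError:
            return False  # lost the race: re-read the log and retry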

It’s a useful format, but let’s not pretend it’s any more magic than it actually is. I’ve also not noticed any performance improvement beyond what partitioning would give you.




Delta’s metadata-based min/max statistics and Parquet’s in-file row-group min/max, combined with the right bucketing and sorting, allowed us to ignore partitioning entirely and drastically improve performance at PB scale. The metadata can serve as a lightweight index that lets you skip a lot of work.

It's not magic, and you can understand what the _delta_log is doing fairly easily, but I can testify that the performance improvement over partitioning is achievable.
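As a rough illustration of the skipping (simplified: standard library only, ignores remove actions, table path and column invented): each "add" action can carry a stats JSON string with per-column minValues/maxValues, which is enough to prune files before any parquet is read:

    import glob, json

    def files_possibly_containing(table_path, col, value):
        # Keep only files whose [min, max] range for `col` can contain `value`.
        keep = []
        for commit in sorted(glob.glob(table_path + "/_delta_log/*.json")):
            with open(commit) as f:
                for line in f:
                    add = json.loads(line).get("add")
                    if not add or not add.get("stats"):
                        continue
                    stats = json.loads(add["stats"])
                    lo = stats["minValues"].get(col)
                    hi = stats["maxValues"].get(col)
                    if lo is not None and hi is not None and lo <= value <= hi:
                        keep.append(add["path"])
        return keep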


And some queries, like max(col), can be answered without touching the data…
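Right — if every live file carries stats, max(col) is just a fold over the log. A sketch (hypothetical helper; assumes the current snapshot's "add" actions are already in hand):

    import json

    def max_from_stats(adds, col):
        maxes = [json.loads(a["stats"])["maxValues"][col]
                 for a in adds if a.get("stats")]
        return max(maxes) if maxes else None  # None: fall back to scanning data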


>You too can have transactions if you implement your access layer appropriately.

Yes, we can all "just" build it ourselves; it's still nice when others have already done it.

Personally, I very much like that it's conceptually simple and that it builds on an open-source format (parquet). I also expect there to be dragons, both small and large, in actually getting the ACID parts right, so I am happy we can fight them together.


Regarding performance: try version 1.2.x or higher. It includes data skipping and other features from what was previously a Databricks-only implementation.

Regarding “that’s all”: can you estimate how much time it would take to reimplement it from scratch across multiple cloud platforms?



