
> Databricks has a product called Delta Lake that covers the infinitely scalable storage part

I think it is just an API layer on top of existing storage systems like S3 and HDFS: https://docs.delta.io/latest/delta-storage.html#amazon-s3




No, it’s not just an API layer, but a significant add-on on top of existing tech… It brings transactions, faster queries, time travel, etc.


It’s a parquet file, with some json as metadata to define the version boundaries.

That’s all.

You too can have transactions if you implement your access layer appropriately.

It’s a useful format, but let’s not pretend it’s any more magic than it actually is. I’ve also not noticed any improved performance above what partitioning would give you.
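
For anyone who wants to see what "some json as metadata to define the version boundaries" can look like, here is a minimal sketch in Python. It assumes a heavily simplified log where each commit is a few JSON lines with `add` and `remove` actions; the `commits` list and the `live_files` helper are made up for illustration, and real Delta commits carry more fields than this.

```python
import json

# Hypothetical, simplified commit log: one JSON-lines string per table version.
# Real commits carry more information (file sizes, statistics, schema, ...).
commits = [
    '{"add": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0001.parquet"}}',
    '{"add": {"path": "part-0002.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0003.parquet"}}',
]


def live_files(commits, version):
    """Replay commits 0..version and return the Parquet files visible at that version."""
    files = set()
    for commit in commits[: version + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# "Time travel" is just replaying the log up to an older version.
print(sorted(live_files(commits, 1)))
print(sorted(live_files(commits, 2)))
```

The sketch only covers reading a snapshot. The harder part of transactions, making sure two writers cannot both commit the same version number, is not shown here.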


Delta's metadata-based min/max statistics, Parquet row groups with their in-file min/max, and the right combination of bucketing and sorting allowed us to ignore partitioning entirely and drastically improve performance at PB scale. The metadata can work as a kind of index and lets you skip a lot of data.

It's not magic, and you can understand what the _delta_log is doing fairly easily, but I can testify that the performance improvement over partitioning can be achieved.
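
A rough sketch of the min/max skipping idea in Python, for illustration only. The `file_stats` list and the `files_to_scan` helper are hypothetical; in practice the statistics would come from the table log or the Parquet footers, not from a hand-written list.

```python
# Hypothetical per-file statistics for one column.
file_stats = [
    {"path": "part-0000.parquet", "min": 0, "max": 99},
    {"path": "part-0001.parquet", "min": 100, "max": 199},
    {"path": "part-0002.parquet", "min": 200, "max": 299},
]


def files_to_scan(file_stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in file_stats if f["max"] >= lo and f["min"] <= hi]


# A filter like `col BETWEEN 150 AND 180` only needs to open one file here.
print(files_to_scan(file_stats, 150, 180))
```

This only pays off when the data is sorted or clustered on the filtered column. If every file spans the full value range, nothing gets skipped.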


And some queries, like max(col), could be answered without touching the data…
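
As a sketch of how that could work, assuming every live file has an exact `max` statistic for the column (the `file_stats` list and `max_from_stats` helper are hypothetical):

```python
# Hypothetical per-file statistics; a missing "max" means no usable statistic.
file_stats = [
    {"path": "part-0000.parquet", "max": 99},
    {"path": "part-0001.parquet", "max": 199},
    {"path": "part-0002.parquet", "max": 299},
]


def max_from_stats(file_stats):
    """Return max(col) from metadata alone, or None if a data scan is still needed."""
    if not file_stats or any(f.get("max") is None for f in file_stats):
        return None  # Missing statistics: the caller has to fall back to reading the data.
    return max(f["max"] for f in file_stats)


print(max_from_stats(file_stats))
```

The shortcut is only safe when the statistics are exact and present for every file, which is why the sketch returns `None` instead of guessing.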


>You too can have transactions if you implement your access layer appropriately.

Yes, we can all "just" make it ourselves, but it's still nice when others make it.

Personally I very much like that it's conceptually easy and that it builds on an open source format (Parquet). I also expect there to be dragons, both small and large, in actually getting the ACID stuff right, so I am happy we can fight them together.


Regarding performance: try version 1.2.x or higher. It includes data skipping, etc., from what was previously a Databricks-only implementation.

Regarding “that’s all”: can you give an estimate of how much time it would take to reimplement it from scratch on multiple cloud platforms?


Maybe, but the "infinitely scalable storage" part comes from the lower-level layer.


Of course, but there are many ways one could design the storage format such that it would not scale nicely with the underlying layer.


Like how?



