
The author is suggesting that Databricks should sell a product with infinitely scalable storage that is queryable via SQL or Python... isn't that exactly what they're already doing?

Databricks has a product called Delta Lake that covers the infinitely scalable storage part. Here's a talk from a Delta Lake user that's writing hundreds of terabytes, up to a petabyte, of data a day to Delta Lake: https://youtu.be/vqMuECdmXG0

Databricks recently rewrote the Spark query engine in C++ (the new engine is called Photon) and provides a new SQL query interface for data-warehouse-type developers.

Databricks recently set the official TPC-DS benchmark record; see this blog post: https://databricks.com/blog/2021/11/02/databricks-sets-offic...

This article doesn't align with my view of reality.


I think the author isn't talking about changing the product, but the market positioning and how it's communicated to customers.


Yeah. The clickbait title suggests that big data tools themselves are going away; the actual content is that the needless hype these companies fold into their sales pitches is going away.


Delta is a fairly recent entrant, and when some of my coworkers evaluated it, it didn't really seem as compelling as Snowflake or BigQuery. Databricks was something our data science department liked, but it didn't have a compelling sell to our (much larger) analytics org.


Yeah, we moved away from Spark for a reason (you kinda have to babysit it or it'll crash on anything that's actually big data). Snowflake took care of things much better than that. Their lake offering is like the worst of both worlds.


Spark allows you to do distributed model training though, which is really nice when you need it.
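For example, a minimal PySpark ML sketch (the input path and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("distributed-training").getOrCreate()

    # hypothetical feature table; the fit() below runs across the whole cluster
    df = spark.read.parquet("s3a://my-bucket/features/")
    train = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)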


> Databricks was something our data science department liked, but it didn't have a compelling sell to our (much larger) analytics org.

So they’ll just have to suffer running their non-SQL workloads locally on corporate-issued laptops? Or they’re not really going to do data science at all?


Who's they here? The non-data-science analysts? They were running SQL in the cloud on other products. Non-SQL workloads were the domain of data engineers and data science in that org.

The data scientists were using Databricks, but data eng was instructed to try to replicate a few core pieces because it was $$$$ as far as "a way to run notebooks and such" went.


I don't think Delta is meant to compete with BigQuery. Although it's been a long time since I've used BigQuery, I recall upserts not being something you were able to do within BQ, and that's essentially the main selling point of Delta.
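For reference, an upsert in Delta looks roughly like this (a sketch; spark, the updates DataFrame, the path, and the id column are all assumed):

    from delta.tables import DeltaTable

    # merge new rows into an existing Delta table, keyed on id
    target = DeltaTable.forPath(spark, "s3a://my-bucket/events")
    (target.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())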


> Databricks has a product called Delta Lake that covers the infinitely scalable storage part

I think it is just an API layer on top of existing storage systems like S3 and HDFS: https://docs.delta.io/latest/delta-storage.html#amazon-s3
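From Spark's side it really is just another save format pointed at an object-store path. A minimal sketch (bucket name hypothetical; assumes the delta-spark package and S3 credentials are configured):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())

    # the "table" is just Parquet files plus a _delta_log/ directory in the bucket
    spark.range(10).write.format("delta").save("s3a://my-bucket/events")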


No, it’s not just an API layer, but a significant add-on on top of existing tech… It brings transactions, faster queries, time travel, etc.


It’s Parquet files, with some JSON as metadata to define the version boundaries.

That’s all.

You too can have transactions if you implement your access layer appropriately.

It’s a useful format, but let’s not pretend it’s any more magic than it actually is. I’ve also not noticed any improved performance above what partitioning would give you.
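You can see for yourself by reading the log. A minimal sketch (the local table path is hypothetical, and this ignores checkpoint files):

    import json, pathlib

    # _delta_log holds one newline-delimited JSON file of actions per commit
    log = pathlib.Path("/data/events/_delta_log")
    for commit in sorted(log.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            # action types: commitInfo, protocol, metaData, add, remove, ...
            print(commit.name, list(action.keys()))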


Delta's metadata-based min/max statistics, plus Parquet row groups and in-file min/max, combined with the right combination of bucketing and sorting, allowed us to actually ignore all partitioning and drastically improve performance at PB scale. The metadata can be a very useful way to do indexing; it allows you to skip a lot of reads.

It's not magic, and you can understand what the _delta_log is doing fairly easily, but I can testify that the performance improvement over partitioning can be achieved.
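The idea, in a sketch (hypothetical table with a ts column; a real reader also has to handle checkpoint files):

    import json, pathlib

    log = pathlib.Path("/data/events/_delta_log")  # hypothetical table
    live = {}
    for commit in sorted(log.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)

    # data skipping: drop every file whose per-file max rules out the predicate
    needed = [p for p, a in live.items()
              if a.get("stats")
              and json.loads(a["stats"])["maxValues"]["ts"] >= "2021-11-01"]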


And some queries, like max(col), could be answered without touching the data…
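e.g. with the live dict and per-file stats from the sketch above, and the same caveats:

    # global max(ts) straight from the add-action stats; no Parquet file is opened
    max_ts = max(json.loads(a["stats"])["maxValues"]["ts"]
                 for a in live.values() if a.get("stats"))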


>You too can have transactions if you implement your access layer appropriately.

Yes, we can all "just" make it ourselves; it's still nice when others make it.

Personally I very much like that it's conceptually easy and that it builds on an open-source format (Parquet). I also expect there to be dragons, both small and large, in actually getting the ACID stuff right, so I am happy we can fight them together.


Regarding performance: try version 1.2.x or higher; it includes data skipping, etc., from what was a Databricks-only implementation before.

Regarding “that’s all”: can you give an estimate of how much time it would take to reimplement it from scratch on multiple cloud platforms?


Maybe, but "infinitely scalable storage" part is because of lower lever layer.


Of course, but there are many ways one could make the storage format such that it would not scale nicely with the underlying layer.


Like how?



