The end of big data (benn.substack.com)
200 points by swyx on May 26, 2022 | hide | past | favorite | 114 comments



I've worked in this space in the past, and I think the big weakness of Databricks is that to be successful as an independent company, they need to charge a lot more than AWS/GCloud/etc. And while historically they had better performance for a lot of Spark jobs, that hasn't been the case for several years now.

What Databricks offers that you can't find in the essentially identical products from other cloud providers is a prettier UI and a much more integrated setup for advanced operations like upserts on a data lake, instead of having to manage your own Hudi or Iceberg setup. Their notebook is a better offering than the others', but the money in big data is in the long-running daily or weekly jobs, not one-off queries.

Historically, Databricks offered major performance improvements to their customers while not contributing all of the same query optimizations back to open source Spark, along with other major improvements like Dynamic Partition Pruning and Adaptive Query Execution. They've since done an about-face for the most part and started pushing these improvements into open source. But as I noted before, for these longer-running daily jobs you'd spend up to 10x less per instance hour in some cases running on a cloud provider's generic offering rather than paying for Databricks; the providers have made performance improvements of their own such that their price/performance almost universally beats Databricks.

The point is that Databricks is absolutely a very innovative company; it's just that it's impossible to compete against the bigger players. The one real lasting strength of Databricks is that Azure fully integrated them into their service instead of trying to compete directly, as AWS and GCloud have.


It all comes down to the price of pushing bits. The big cloud providers have brought that cost so low that you now have to provide an incredible amount of value with a SaaS offering for it to make financial sense.


In retrospect the Spark authors should have had a licence that forbids cloud providers from running their own hosted services.


It's questionable whether Spark would ever have been nearly as successful a product if they had done this.

Spark's success is that it came out of Berkeley and was initially better than traditional MapReduce in a number of ways, but more than that, it was able to spread because a lot of MapReduce product providers could push it to their clients precisely because of its Apache license.


> Azure fully integrated them [Databricks] into their service

Could you elaborate on this? To my knowledge (limited!) there isn't any specific integration. I thought they just offered a turn-key managed service, but all that seems to do is spin up a preconfigured VM scale set...


No. Azure Databricks is a first-class service - officially it's a Microsoft product, with all the UIs, diagnostic logging, encryption, consolidated billing, etc.


I tried it and it literally just spun up a pool of VMs.

First class in my mind would be if Databricks could push down compute to the Storage Account nodes or something…


The author is suggesting that Databricks should sell a product with infinitely scalable storage that's queryable via SQL or Python... isn't that exactly what they're doing?

Databricks has a product called Delta Lake that covers the infinitely scalable storage part. Here's a talk from a Delta Lake user that's writing hundreds of terabytes / up to a petabyte of data a day to Delta Lake: https://youtu.be/vqMuECdmXG0

Databricks recently rewrote the Spark query engine in C++ (called Photon) and provides a new SQL query interface for data warehouse type developers.

Databricks recently set the TPC-DS benchmark record, see this blog post: https://databricks.com/blog/2021/11/02/databricks-sets-offic...

This article doesn't align with my view of reality.


I think the author isn't talking about changing the product, but the market positioning and how it's communicated to customers.


Yeah. The clickbait title suggests that big data tools themselves are going away; the actual contents are that the needless hype for their companies, folded into their sales pitches, is going away.


Delta is a fairly recent entrant, and when some of my coworkers evaluated it, it didn't really seem as compelling as Snowflake or BigQuery. Databricks was something our data science department liked, but it didn't have a compelling sell to our (much larger) analytics org.


Yeah, we moved away from spark for a reason (you kinda have to babysit it or it’ll crash on anything actually big data). Snowflake took care of things much better than that. Their lake offering is like the worst of both worlds.


Spark allows you to do distributed model training though, which is really nice when you need it.


> Databricks was something our data science department liked, but it didn't have a compelling sell to our (much larger) analytics org.

So they’ll just have to suffer running their non-SQL workloads locally on corporate issued laptops? Or they’re not really going to do data science at all?


Who's they here? The non-data-science analysts? They were running SQL in the cloud on other products. Non-sql workloads were the domain of data engineers and data science in that org.

The data scientists were using Databricks, but data eng was instructed to try to replicate a few core pieces because it was $$$$ as far as "a way to run notebooks and such" went.


I don't think Delta is meant to compete with BigQuery. Although it's been a long time since I've used BigQuery, I recall upserts not being something you could do within BQ, and that's essentially the main selling point of Delta.
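For reference, the upsert in Delta is a MERGE. With the open source delta-spark package it looks roughly like this - a minimal sketch, with made-up paths and column names, assuming an existing SparkSession called spark with the Delta extensions enabled:

    from delta.tables import DeltaTable

    # "spark" is an existing SparkSession configured for Delta (assumption)
    target = DeltaTable.forPath(spark, "s3a://my-bucket/events_delta")  # existing Delta table
    updates = spark.read.parquet("s3a://my-bucket/events_new")          # new/changed rows

    (target.alias("t")
           .merge(updates.alias("u"), "t.event_id = u.event_id")
           .whenMatchedUpdateAll()      # overwrite rows that already exist
           .whenNotMatchedInsertAll()   # insert rows that don't
           .execute())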


> Databricks has a product called Delta Lake that covers the infinitely scalable storage part

I think it is just an API layer on top of existing storage systems like S3 and HDFS: https://docs.delta.io/latest/delta-storage.html#amazon-s3


No, it's not just an API layer, but a significant add-on on top of existing tech… It brings transactions, faster queries, time travel, etc.


It’s a parquet file, with some json as metadata to define the version boundaries.

That’s all.

You too can have transactions if you implement your access layer appropriately.

It's a useful format, but let's not pretend it's any more magic than it actually is. I've also not noticed any improved performance above what partitioning would give you.


Delta's metadata-based min/max statistics, plus Parquet row groups and in-file min/max, combined with the right bucketing and sorting, allowed us to drop partitioning entirely and drastically improve performance at PB scale. The metadata can be a very useful way to do indexing and lets you skip a lot of reads.

It's not magic, and you can understand what the _delta_log is doing fairly easily, but I can testify that the performance improvement over partitioning can be achieved.
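To make that concrete, a single "add" action in a _delta_log commit carries roughly the following - sketched here as a Python dict with invented values; the real "stats" field is stored as an escaped JSON string:

    # Abbreviated sketch of one "add" action from a _delta_log JSON commit file
    add_action = {
        "add": {
            "path": "part-00001-abc.snappy.parquet",
            "size": 123456789,
            "dataChange": True,
            "stats": {
                "numRecords": 1_000_000,
                "minValues": {"event_date": "2022-05-01", "user_id": 17},
                "maxValues": {"event_date": "2022-05-02", "user_id": 98432},
                "nullCount": {"event_date": 0, "user_id": 0},
            },
        }
    }
    # A reader compares a query's predicate against minValues/maxValues
    # and skips whole files whose ranges can't possibly match.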


And some queries, like max(col), can be answered without touching the data…


>You too can have transactions if you implement your access layer appropriately.

Yes, we could all "just" build it ourselves; it's still nice when others build it.

Personally I very much like that it's conceptually easy and that it builds on an open source format (Parquet). I also expect there to be dragons, both small and large, in actually getting the ACID stuff right, so I'm happy we can fight them together.


Regarding performance: try version 1.2.x or higher - it includes data skipping, etc. from what was previously a Databricks-only implementation.

Regarding "that's all" - can you give an estimate of how much time it would take to reimplement it from scratch on multiple cloud platforms?


Maybe, but the "infinitely scalable storage" part comes from the lower-level layer.


Of course, but there are many ways one could design the storage format such that it would not scale nicely with the underlying layer.


Like how?


I love Databricks; you can consider it the data engineer's heaven. Everything makes sense, works smoothly, and Delta tables are truly revolutionary. Now that they are open sourcing even the Delta table optimizations, they will really become the number one solution as a data engineering platform.

The only downside is their marketing team. Sadly it's full of passive-aggressive people who won't let your company's engineers work, because they'll bury you with hundreds of business questions just to resell a cool story internally.

I've seen projects and deals fall through because of this. I really can't understand why they try so desperately to waste their customers' time.


Yep, they absolutely nail the product. One platform for ETL, analytics, data science, etc. in your preferred language. Nice notebook interface, etc. I'm not sure why anyone would try to roll their own analytics stack.


> Now that they are open sourcing even Delta table optimizations

Has Databricks recently open sourced additional Delta table features that were previously only available with a paid license? I can't seem to find a relevant announcement.


https://delta.io/blog/2022-05-05-delta-lake-1-2-released/ - big part is open sourced and more is coming (see roadmap on GitHub: https://github.com/delta-io/delta/issues/920)


> But we never connected these dots during our evaluation because the story Databricks told was buzzwords. Snowflake’s was boring, and all we wanted—at least at first—was boring.

How do you describe a chocolate gâteau to someone who has only ever eaten rice?

> Databricks is a big, fast database that you can write SQL and Python against

It really isn’t. Spark is not a database, certainly not an RDBMS. This reminds me of the people in the office who used to call the computer on their desk “the hard drive”.

If you’ve never worked with tools like Pandas, or R, or SAS, you’re just not going to understand Spark and Databricks.


As someone who was a SAS developer first and later got into Pandas and R for a short time before going straight BigQuery SQL for the last 5+ years, please explain what you mean here because I don’t follow at all.


For the record, he's literally listed off most of my languages: I have years of experience in SAS, R, Python, and SQL, and I also have no idea what he's saying.

Also, for the record, that's largely how I think of Databricks, ideally: a big distributed database/storage layer against which I can write queries in my chosen language...

So I... admit equal confusion...


As someone who does understand the distinction, for most end-users, I think the distinction doesn’t matter. It’s like the culinary vs botanical categorization of a tomato.


The performance of different queries is different across a typical RDBMS and something like Spark/BigQuery.

A traditional database is efficient (i.e. cost effective) at doing a lot of the same query repeatedly, e.g. looking up a customer's account balance, whereas these query engines are good at doing infrequent queries with lots of complicated, expensive logic on very large datasets.

You could run simple account lookup queries in a CRUD app with Spark, but you'd be setting a lot of compute/money on fire, and your latency would probably be terrible.


There are a lot of optimizations in query engines as well - results caches, local caches of files stored in the cloud, data layout optimization and skipping (to avoid reading unnecessary files), etc.


TL;DR

Wikipedia: In computing, a database is an organized collection of data *stored* and accessed electronically.

Apache Spark (the thing that powers Databricks) doesn't store data, it only processes it.

---

I'll give this a go... A lot of this involves some massive simplifications but hopefully this might be helpful. Let's say you have a file like this:

    Customer_Name, Order_Amount
    Sarah, 15
    Billy, 10
    James, 20
    Mukesh, 18
    Kate, 42
If you want to find the total order amount in that file, you'd load it as a dataframe (pandas/R) and then sum the Order_Amount column. How about if you wanted to join this file up with another file:

    Customer_Name, Company_Name
    Sarah, Big Multinational Conglomerate
    Billy, EZ Groceries
    James, The Coffee Shop
    Mukesh, EZ Groceries
    Kate, The Coffee Shop
To find the Sum of Order_Amount by Company_Name you'd load both files, join the two dataframes together and sum by Company_Name.
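In pandas, for example, that whole flow is only a few lines - a minimal sketch, assuming the two files above are saved as orders.csv and companies.csv:

    import pandas as pd

    orders = pd.read_csv("orders.csv", skipinitialspace=True)
    companies = pd.read_csv("companies.csv", skipinitialspace=True)

    # Total of all orders across the whole file
    total_orders = orders["Order_Amount"].sum()

    # Sum of Order_Amount by Company_Name: join the two dataframes, then group and sum
    merged = orders.merge(companies, on="Customer_Name")
    by_company = merged.groupby("Company_Name")["Order_Amount"].sum()

    print(total_orders)
    print(by_company)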

Where is the data storage happening in this example? It's the files, right? The data is in the files and you've just run a query against those files. Once you close your Python interpreter your dataframes cease to exist -- but the files still live on.

---

What if you wanted to access this data from multiple computers very quickly because, let's say, you have some software that should show the user their company's current order amount? Enter the RDBMS (Relational Database Management System) a.k.a. "the database". Now multiple people/computers can run queries at the same time without needing to:

    1. read the files from disk
    2. load each file's data as a dataframe
    3. join the dataframes together
    4. sum the Order_Amount by the user's company
Instead they run a SQL query which handles all of that for them, as if by *magic*.
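For example, the whole join-and-sum above collapses into one statement that any client can run concurrently - a sketch using sqlite3 just to keep it self-contained, and assuming the two tables already exist in a made-up shop.db:

    import sqlite3

    conn = sqlite3.connect("shop.db")
    rows = conn.execute("""
        SELECT c.Company_Name, SUM(o.Order_Amount) AS total_order_amount
        FROM orders o
        JOIN companies c ON c.Customer_Name = o.Customer_Name
        GROUP BY c.Company_Name
    """).fetchall()

    # The database did the reading, joining and summing; rows is just the answer
    print(rows)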

Where is the data storage happening here? It's the database tables, right? After your query runs the data is still stored in the database tables. It doesn't disappear.

The "MS" part of RDBMS is the query execution part. The "SQL magic" that lets you do left joins and so on. It's not the data storage part ==> it's the query engine part.

---

Now scale this up to PETABYTES of data. Billions upon billions of rows of data. The pokey little 8 core CPU server your database runs on might not be able to handle processing all that data in any sort of reasonable time for a web browser based application (any longer than 3 seconds = the web application is broken).

As mentioned by MrPowers, Apache Spark solves this problem as a distributed processing query engine. The short version is:

    Stage 1. split all the data into thousands of "partitions" (literally split the files into thousands of pieces)
    Stage 2. run thousands of mini "Sum of Order_Amount" queries on each partition
    Stage 3. combine the results of the mini queries to get the final "Sum of Order_Amount"
The clever thing about distributed batch processing is Stage 2. You can run the thousands of mini queries on thousands of independent servers, meaning your total execution time for Stage 2 maxes out at the time to run a single mini query. The part Apache Spark deals with is Stages 2 + 3 ==> executing a query in a cleverly distributed way, meaning the query is processed much faster than on the pokey little 8 core CPU database server.

Where is the data storage happening here? Stage 1, right? The data lives in all those partitioned files.

But Apache Spark deals with Stages 2 + 3 ...?

>> Databricks is a big, fast database that you can write SQL and Python against

> It really isn’t. Spark is not a database, certainly not an RDBMS.

You could store all your partitioned data in thousands of different database instances. Or store each partition as individual files on the thousands of servers used to run the mini queries. You can store it in a partitioned parquet file format in an Amazon S3 bucket. In all of these cases Spark will read the data from those locations, process it, then give you your result.
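From the user's side, the Spark version of the earlier query looks almost identical to the pandas one; the partitioning and distribution across executors happens underneath. A rough sketch, with invented bucket paths:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders").getOrCreate()

    # Spark reads the (possibly thousands of) partitioned files from object storage...
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    companies = spark.read.parquet("s3a://my-bucket/companies/")

    # ...and spreads the join + aggregation across however many executors you have
    result = (orders.join(companies, "Customer_Name")
                    .groupBy("Company_Name")
                    .agg(F.sum("Order_Amount").alias("total_order_amount")))

    result.show()  # Spark itself stores nothing; the Parquet files remain the "storage"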

Much like loading that dataframe from the original simple file, the "Sum" query calculation is the data processing and not the data storage part. People often conflate "Database" with "Things I can run SQL queries against". The two are not the same thing.

Make sense? Anything unclear?

Addendum:

I can see why people think of these tools as databases. It's the Dunning-Kruger effect in action.

Because I know the ins and outs of what Databricks is (likely) using behind the scenes, I can't see it as a "database".

But for folks who have a less detailed understanding of the, pardon the pun, bricks and mortar behind it, it makes sense as “the do all the data storage stuff for me == database”.

Which is the right/wrong way to think about it? I have no idea. Depends what you want to do with it I would guess.


Heya, I learnt so much from reading your comment.

Just wanted to say thank you.

Do you have a blog? Your writing / teaching style is very effective.


That was a fantastic answer. Likewise, learned a lot. As someone stuck doing old-school RDBMS SQL and Python for APIs, what's the best resource for me to learn about Databricks and Spark?


This book has an excellent reputation for covering the foundations of data-intensive software:

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

https://www.amazon.com/Designing-Data-Intensive-Applications...


It could be a good second step, imho…


For a short intro I would point to “Learning Spark, 2nd ed.” - it was available for free from the Databricks site (maybe it still is; look via Google) and gives an intro to Spark and Delta Lake. “Data Analysis with Python and PySpark” is a bit deeper, but doesn't cover Delta.


Funnily enough, I've been looking for the best method to query terabytes of logs dumped into blob storage, and now I have a far better understanding of how that should be done than after watching literally hours of marketing-driven product demo videos...


Spark is a query engine, much like Trino/Presto are, but using primarily non-SQL languages and schema-less data. The ability to run SQL doesn’t make something a database.


Yea, agreed, Spark definitely isn't a database, it's a query engine.

I agree that the traditional Spark notebook query interface was a lot easier for people with pandas (or DataFrame experience) to understand. But their new SQL query interface is easily understandable for DBAs.


> the end of an over-hyped era

Hype is part of what drives technology and I don't see it going away any time soon - because a lot of people make a buck purely by milking hype cycles.

I've come to see there are two modes of intelligence. The ability to understand something and say 'yes' to it is complemented by bullshit detectors, a more mature ability to see through other people muddying the waters trying to look smart, and reject their nonsense.

Every cycle has its real innovators and hype-pedlars. At the height of blockchain madness a decade ago they had worked themselves into a frenzied fetish of near-supernatural woowooism. Once all the smoke and hullabaloo has cleared, and all the grandiose promises have fallen to the wayside, what remains is a core of genuinely useful and relatively simple core concepts that enter the canon, along with standards and protocols.

This line in TFA sums it up:

> I needed a database.

The same happened with "data". The hype problem there has always been a lack of telos, reminiscent of the Underpants Gnomes' flawed "profit?" scheme. Somewhere there is a crucial missing step, where we ask "Why?". And the answer is "don't worry, the data plus magical "AI" fairy dust will reveal why". There is a quite religious (faith-like) flavour to this.

I blame the NSA. The idea that "collect it all" is anything but a thuggish brute-force excuse to waste billions of dollars building data centres has led to a commercial obsession with "data" as a panacea.

It isn't "big data" per se that's a problem. Some applications, like in medicine or environmental science, absolutely thrive on large datasets and sophisticated analytics. But the fact is that in most applications it's mostly useless, burdensome, energy-consuming, and space-wasting; "data-hoarding" and over-analysis are pushed by those with sledgehammers to sell for cracking nuts.


> Once all the smoke and hullabaloo has cleared, and all the grandiose promises have fallen to the wayside, what remains is a core of genuinely useful and relatively simple core concepts that enter the canon, along with standards and protocols

has this happened with blockchain yet? I'm not seeing it.

Maybe I haven't been around the block enough because blockchain is the first hype tech I've seen that keeps not doing anything that existing tech can't do better. And yes I've looked, a lot.

The underlying tech sounds super promising and I hope someone figures something out. Not much luck so far.


> has this happened with blockchain yet? I'm not seeing it.

No one wants to admit it for some reason, but Bitcoin is the killer app for Blockchain.

> The underlying tech sounds super promising and I hope someone figures something out. Not much luck so far.

Plenty of luck for me at least. Bitcoin solved exactly the problem it told me it would when it was created - permissionless digital "cash" transactions that cannot be reversed. It's a very specific use case, but one I am quite happy exists when I need it.

Those who have never had their money stolen by the banking system or denied the ability to use their money as they see fit likely won't understand my take. Those that do are using the currency today.

All the other crud around blockchain is line noise hype. Cryptocurrency already solved exactly the problem it set out to, it's just not easy to get rich off it.


>Bitcoin is the killer app for Blockchain

If the killer app of the blockchain is Bitcoin, what does it say about the underlying technology that the "killer app" is useful only as an extremely volatile speculation vehicle, and an ecologically disastrous one at that, with a tertiary use case of paying for illegal goods and services?


It's actually primarily useful for buying legal goods that credit card companies won't touch, like research chemicals.

It's not good for crimes because Chainalysis can track the coin movements afterward - though maybe it's good enough. And porn doesn't seem to use it, maybe because the transaction costs are too high or it's too unethical even for them.

And it's not good for speculation because it turns out it's just correlated to US tech stocks.


> a tertiary use case of paying for illegal goods and services?

This is the use case mentioned. The OP is saying that everything else about it is useless.

Having the option to do something illegal should the need arise is a benefit to some people, and government control of the financial sector is not beloved by everyone.


But it seems like everyone is (mostly indirectly) impacted by the ransomware epidemic that Bitcoin has enabled.


Yup. That's considered an acceptable trade-off in the circles I'm talking about.


I think OP was saying the ledger mechanics are the useful part here, not the volatile speculation.


The ledger mechanics have no practical use beyond Bitcoin, which is itself useless except for the cases I outlined.


> Those who have never had their money stolen by the banking system or denied the ability to use their money as they see fit likely won't understand my take.

Neither I nor any of my friends or family members have ever had their money "stolen" by the banking system.

People I know have lost their wallet keys and lost significant value in their Bitcoin "investments" they made in just the last year.


> has this happened with blockchain yet? I'm not seeing it.

It's a unique situation in that hype is easily and definitely measurable: market cap. You would arguably need to subtract any use for real-world applications, i. e. the drug business using it, which can easily be approximated (to the closest full percentage point of market cap) by the formula x = 0.


> Take real-time products, for example. Most businesses have little use for true real-time experiences. But, all else being equal, real-time data is better than latent data.

Only if you stare at a chart on a dashboard. If you do serious analytics and collaborate with other people you need the data to be stable and not changing under your hands as you work with it.

Daily immutable snapshots produced by a daily batch are not only cheaper to produce but easier to work with and that's why the entire industry prefers batch.

The only use case for streaming is "standing queries", in which the query result is continuously updated as new rows come in, e.g. real-time anomaly detection in access logs, ML inference, fraud detection, plotting numbers on a dashboard.

All analysis is done offline in a batch setup. There is no "stream analytics". Stream processing only makes sense if the database is capable of hosting the serving layer of the model inference application.
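To be concrete about what a standing query looks like: in Spark Structured Streaming it is just a normal aggregation wired to a streaming source and sink. A rough sketch, with an invented path and schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("access-log-monitor").getOrCreate()

    # Continuously pick up new JSON access-log files as they land in object storage
    logs = (spark.readStream
                 .schema("ts TIMESTAMP, status INT, path STRING")
                 .json("s3a://my-bucket/access-logs/"))

    # The standing query: per-minute 5xx counts, continuously updated as rows arrive
    errors = (logs.filter(F.col("status") >= 500)
                  .groupBy(F.window("ts", "1 minute"))
                  .count())

    (errors.writeStream
           .outputMode("complete")
           .format("console")   # a real job would feed an alerting/serving layer instead
           .start())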


>If you do serious analytics and collaborate with other people you need the data to be stable and not changing under your hands as you work with it.

Apache Iceberg and Delta Lake take care of that. The fact that you're reading a particular snapshot of data does not prevent you from adding, updating or deleting data in the meantime.

Also, your view is super narrow and focuses only on traditional analytics.


I think this is a super narrow view of the world that applies to a lot of things but not everything. Any data use that involves reactive decision making will benefit from real-time streaming data. Financial services are replete with this, but so are monitoring applications, traffic analytics, advertising, etc. A lot of my career has been turning batch into real time, and the commercials were there.


Even then, there are specific forms of fraud for which streaming detection makes sense. If your upstream data is batch, there is little good reason to implement a streaming approach.


Just because most businesses historically were content with batch does not mean they will not benefit from streaming/real-time.

Batch is business as usual, while real-time is disruption.

Compare to banking: payments historically were batch (ACH), but real-time payments have disrupted it (Zelle, Venmo, Fedwire, etc.)


Living all day every day in enterprise land, Databricks' current growth has been set on fire by their integration into Azure. All the BigCos moving their legacy infra to Azure and finding out Databricks is available OOTB is fueling huge growth.

I have multiple friends from Oracle, IBM, Salesforce, Adobe, etc that are now AEs at Databricks making 3 or 4x total comp they were making previously.

Fun times!


The author is exactly right. Indeed, I work in government, and we're buying into Databricks while the people responsible for that don't even know what Spark is. The pitch that sold them the product is the one the author basically identified - infinitely scalable storage that is queryable - even though there are plenty of other products that offer that.

While government is an extreme example, I'm sure there are a lot of non-tech companies in a similar position. The people doing the procurement don't know anything about tech, but they do know that there are random databases all over the place that are a constant headache, and they want to centralise everything in a single platform. Secondly, they steadfastly refuse to hire competent tech people; if Databricks ends up being a very expensive replacement for an ops team, then that's a big win for any organisation that is unable, or refuses, to hire an ops team but has lots of money to burn.


Big data didn't lose. It became so common that now what was once big data is just data.


Agreed. The hype around Big Data is dead, but a decent portion of the tech that came into existence during this hype cycle is not. Things like Hadoop have their place in Fortune 150 and 500 companies; there are plenty of Hadoop jobs still out there. It's certainly not sexy anymore though.

Hadoop may be losing its foothold though; from what I've seen, these days people are getting by just running beefy Linux machines and doing everything in Python using Dask. Most people are not doing petabyte-scale data analysis.


I think this is the point the article actually makes, but I went into it expecting to see how it's been hard to extract business value from all this data, and then expected it to explain how the demand these easy-to-scale DBs induced resulted in even less value, but it didn't really touch on that.


The author seems to be suggesting that Databricks/Snowflake are selling "storage", "language integration", and "a better Postgres", which is like saying that Toyota sells an air conditioner, entertainment unit, and leather chairs. Technically right... but missing the car for the parts.


Databricks/Spark is overpriced garbage for Python-toting data scientists/engineers/analysts who don't know how to model data or write SQL. Exactly like Hadoop was 15 years ago for the Java-toting data scientists/engineers.


Spark is open source and free.

Databricks is their hosted version just like Aurora is a hosted PostgreSQL.

The fact that you don't seem to understand the difference makes me question whether you are qualified to be making statements that it is garbage and that we can simply write SQL. Having worked on many large big data projects, I can say SQL is often not the right tool for every use case.


That might be hyperbolic, but directionally I agree. The interesting part is that there are enough of those people out there in enterprise who fall into that "don't really know what they're doing but want to try something cool" bucket to keep Databricks flush with cash for years on end.


Yep. MongoDB earned $873M in revenue over the past 12 months servicing tech debt created between ‘10 - ‘16. Databricks will be around a long time for the same reason. It’s a viable business strategy for sure.


If one does not need petabytes of scale, but is otherwise interested in the data lineage / observability / workflow features being sold by Databricks, what would you suggest?

Some evaluation Criteria:

- Ease of maintenance and operation is almost paramount.

- It's fine if the solution never lives anywhere but a single virtual server that scales vertically (data might grow to a couple of TB, but not petabytes)

- Similarly, 20 9's of availability is not a criterion. If the machine fails and it takes an hour until someone goes and re-deploys, that's fine.

- Declarative, reproducible deployment with an easy upgrade story would be great

- Ideal if the deployment can be run locally for quick development



As a former "Python-toting data scientists/engineers/analysts who don't know how to model data or write SQL" I wholeheartedly agree.


Spark is many orders of magnitude faster than "properly modeled" data


Databricks includes pure SQL interfaces - Databricks SQL, for example. Or Delta Live Tables, which let you write your pipelines using only SQL…


It’s not really the end of “big data”. There is and will be tons of data to go around for the foreseeable future.

Instead, it feels more like the end of “data science as the sexiest job of the 21st century”.

DS is mostly running reports (so BI) or stats (perennially important niche) or manufacturing hype via “latest research”.

Engineering has been, is, and will continue to be the foundation for success.


Engineers usually think so. After spending some time with GitHub Copilot, as annoying as it is, I think we've got a decade, max, before the lower 3/4 of the software development trade is rendered unnecessary.


Downvote all you want but prove me wrong.


Hacker News doesn't work that way: We won't be able to reply to your comment in ten years when we know how it went.


I was using the word prove colloquially. That many professions are being automated is a given, and with NN people are trying extra hard to automate knowledge work that requires more trade-like knowledge than theoretical knowledge, like paraprofessionals and functionaries with domain knowledge. Recent developments in software lead me and many others to believe the automation of utilitarian development jobs is impending. Even Forbes thinks so and I guarantee you they hadn't seen GitHub Copilot when this article came out: https://www.forbes.com/sites/forbestechcouncil/2021/02/23/11...

So, given that enough industry analysts believe development will be heavily automated for it to make top-10 lists, and recent developments provide even stronger evidence (Copilot writes better comments on existing code than my coworkers), can someone provide a strong a priori argument, or maybe even empirical evidence, that utilitarian software development is somehow uniquely immune to this?


It may be the case that I wasn't being entirely serious.


No idea what the author is trying to say. It's just a clickbait title with a rant and no content.

"The end of big data" but then goes on to rant about how they bought "big data" products and ends with "Big Data is finally starting to live up to its potential" - this is below Medium quality.


Can someone summarize what the article wants to say?

I wasn't happy about spending 10 minutes on it, then realized I'd need another 10 minutes, would still be confused in the end, and would need to post this comment anyway.

So I just stopped there and wrote this comment.

RIP writing clarity and effective communication...


Author is trying to say something like:

If you want to sell people a Tesla, first and foremost you have to tell them it's a car and it behaves exactly like a car. If that is not a given, they won't buy it, no matter what bells and whistles it has in addition to getting you from A to B.

So if we go back to the topic of the article, it means: if you want to sell a big data platform, you must first make people believe it can replace their old traditional database, only faster.


I think it's trying to say that the way to sell the new thing, is first to say that you can do exactly the old thing except better (limitless storage, faster, or whatever). Then, once the customers have the new thing, they may find that they like some of the new features. But don't try to sell them on the new features, because that's not what they're looking for.


I felt the same reading this. Seems like everything analytics/data is like this these days. You have to read things multiple times over, either to translate the abstractions and ball-hiding into plain English, or until you recognize that there is no point and it's mostly stream-of-thought rambling. See especially: anything data mesh.


Random observation as this quote stood out: "Snowflake is worth $70 billion, even after its stock’s brutal five-month slide"

The article was written April 8, so ~6 weeks ago. As of May 26, Snowflake is worth $39B. Crazy how fast things can change.


If you liked it at 70, you should LOVE it at 39 :) Unfortunately, their latest earnings showed a significant decline in revenue growth as well, down from the often-cited 100% YoY number to 80%.


Databricks is not only a database, and Spark is not only for analytics.

Spark is a fast, distributed runtime for applications. It can be a platform powering an entire company/startup and its backend data pipeline.

RDBMSs like Postgres cannot compete, because SQL is about storage and querying (you need application code to do complex processing) and SQL is hard to scale (unless you want to shard your DB into a million pieces and have ways to manipulate millions of DB shards). Also - good luck shuffling data between thousands of RDBMS shards if you believe an RDBMS can replace Spark.

The problem with Databricks is that their offering is not differentiated enough from open source Spark, so their only value-add is "we will manage your Spark cluster - pay us instead of your ops/IT team."


We evaluated Databricks a few years ago. We didn't get how to start and do the simplest thing. For our company, which didn't know how to take the first step, it was daunting. Then Snowflake came along and it was so easy; it just worked, we understood it, and we could see how it could be extended. Maybe Databricks is more powerful, I have no idea; it was going to be way too much effort to find out. It was a nonstarter for us.


Maybe you can look again - things are changing very fast; a few years ago there was no DBSQL, Photon, Delta Sharing, etc.


I am out of that game, but adding a bunch of shiny new stuff does not mean it is as simple as Snowflake; it usually means the opposite. With Snowflake the onboarding was so clean, and then you can land and expand. Databricks was a massive suite of complicated stuff where you don't know where to start. Have they reoriented to fix this?


It became better


I wonder if the quality of other data processing frameworks is also getting 'good enough' for many use-cases, and they often have less complexity than Spark. With Hadoop, many companies bought clusters because they thought they needed them, but they rotted because people didn't actually have suitable workloads. Does Databricks have a similar risk? Just based on users of our Python reporting framework[0] we are starting to see Dask/Ray displacing some workloads that would have been in Spark a few years ago. How quickly things progress in that space (e.g. Arrow/DuckDB[1]) is pretty incredible to watch.

[0] https://github.com/datapane/datapane

[1] https://duckdb.org/


Databricks didn't just do $800M in revenue - they did it at over 100% growth. Shame they didn't IPO last year though! But I'm sure when they do they'll rake it in.


So the article is basically criticising Databricks for only being worth $39bn? And then it's titled "The End of Big Data"… what?


Eventually the majority realizes that 'big data' doesn't apply to them. It just doesn't work for what they need to do. I think it follows that the same reasoning applies to ML - works great for a few, but not at all for most.


Death of the narrative that Silicon Valley pushed to regular companies, making them believe they had a big data problem. Mostly, only the companies that invented the solutions had the problems; 99.99% of others didn't.


The challenge for Databricks and Snowflake is that big cloud vendors are catching up with their own serverless products such as Redshift Serverless [0].

And they now provide Spark engines as well with associated notebooks. What's the unique offering in Databricks seen in this light?

[0] https://aws.amazon.com/redshift/redshift-serverless/


100% clickbait title and opening statement. Is journalism that bad these days? Ironically, I quite like some of the points in the main content.


substack is basically a blogging platform. Are bloggers journalists? Are cats actually dogs?


I don't see how databricks doesn't end up like hortonworks, cloudera, or MapR.


Fair point. I don't think that AWS/GCP will just let Databricks eat their lunch; the big cloud providers will try hard to steal market share from DB.


>I don't think that AWS/GCP will just let Databricks eat their lunch.

Why not? The data is still being stored in S3/Cloud Storage and computed on EC2/GCE, so why should Amazon and Google care that they've been relegated to selling the pick-axes for the gold rush?


AWS already has EMR. If they added some of the delta lake features they could do that.

Azure seems to be at the ‘embrace’ stage with Databricks. I’m waiting for the ‘extend’ phase, and we know what comes after that.


The author has been using a great, modern data stack from the beginning of this article - which is not realistic for most enterprises, which still live in caves.


What about unstructured data? Or is the realization dawning that there cannot be any data that is truly unstructured?


So Databricks is a hosted Apache Spark? What is their data layer then - is it Postgres?


a bunch of Parquet files in S3 ;)


That's very reductionist, but yes. That, plus metadata JSON files, is Delta Lake.


Good writeup, but with a weak conclusion.


China



