I'm surprised by the amount of hate this post has received.
Also, no one talked about what Haoyi has accomplished here (and in the general Scala Community).
There are too many "icons" in the tech industry that are just techlebrities.
Li, OTOH, is one of the most brilliant engineers I've ever seen. He's one of the few people I truly admire, and I'm sure that most of the fires he describes in his post have already been put out, thanks to his initiatives.
Just wanted to bring something nice to the conversation :)
The libs are implemented with advanced Scala programming and expose elegant, Python-like interfaces. Li's ability to make beautiful abstractions is incredible. His book Hands-on Scala Programming shows you how to create different applications using his library ecosystem. It's a magnum opus that's worth reading, even if you're not interested in Scala.
Li's Principle of Least Power blog post is another great read. Scala would be a much more popular language if the community just used his libraries and followed his design patterns.
We used Databricks at my previous company (Skyscanner). The product worked well and was in general better than having our own cluster. Our dev enablement team built tooling around creating new DAGs, and spinning up a new project was relatively easy.
Personally, I found the experience much better than what the company used before, which was custom Spark clusters.
Quick aside. I just want to call out that I remember back in the day finding so many wonderful flight deals on Skyscanner... I think it has lost its edge, but for a time it was truly special.
What's that saying about software everyone hates vs. software no one uses?
Much as I have some reservations about 'Spark all the things', personally I think Databricks is world-changing in terms of how much it opened up pretty non-trivial applications to companies that would not be able to run a massive Spark cluster.
Interesting to hear his opinion on joining companies with lots of fires to put out. That feels pretty risky to me, since putting out fires is collaborative, and those who let the fires be set in the first place are probably still around.
As for me, I just don't need the stress. It's possible to have hard, rewarding problems to solve without them being fires that need to be put out ASAP. Life's too short.
On the subject of the quality of Databricks, I'm half impressed and half annoyed. As a source of ad hoc, notebook-based compute it's magical. But stuff like proper git integration simply took too long. Further, the lengths that data scientists at my org go through to leverage Databricks in MLOps pipelines could be minimized with a better design.
> However, that doesn't mean the fires have gone out.
You can say that again.
It kind of horrifies me to know that it used to be that much worse; Databricks as a company is the king of the MVP.
Does it work? Kinda?
Quick, let’s early access it with some suckers and get them to find all the bugs (like git integration that sometimes checks out your files and sometimes doesn’t). …or maybe just drop it in as a feature we never bothered to test actually worked (Azure SAS token authentication, for example…), or that just never really worked (R), or that was just too hard to do (Scala in credential passthrough clusters, or hell, even being able to explain the DBU/infra cost split in billing).
Time to move on to the next feature, whichever of the ten things you never originally provided (currently, but not limited to, user-level auth, ingestion, task DAGs, and source control, all at the same time).
After my 2 years working with the platform, my interest in how hard the problems are and how nice the folk there are is… pretty limited.
I think Databricks is not that well positioned. Every cloud vendor has cloned it already, and when going from exploring data to deploying code it is not that appealing. We had big vendors locking in enterprises in the past, and I am not so sure we want big vendors blocking things again, especially when most of the final product is just code.
One of our guiding principles, even though it is a SaaS, has been to do as much as possible on the client's side: using your own Kubernetes clusters for compute jobs, training, and even interactive notebooks, and using integrations with external data sources.
We're moving many things to the client's infra and automating it. This simplifies many things, including billing, given you won't need to budget for yet another SaaS, but also security policies, given you control your own clusters (granted, they could be on a public cloud provider, but all we need is an endpoint, a certificate authority and voilà).
Couldn't disagree more; developer experience, integration, and performance all still matter. I'm not super impressed with the major cloud offerings thus far.
The tech is kinda good but not great. Their strategy is probably going to slow it down, not speed it up. My opinion is that if a Snowflake or Firebolt or new startup sets out to be Spark-native or compatible in a truly deep way, then they can completely supplant Databricks.
Hey Will, I’m sorry it didn’t work out. We’ve actually adjusted really well to COVID (in no small part thanks to our amazing employees).
We care a lot about our hiring experience. I’d be happy to chat about this over email. Feel free to drop me a message at Bilal dot Aslam at databricks dot com.
Nothing fancy, way too small. We built it initially because we do a lot of consulting projects for enterprise clients and it took a toll on us. We either needed to hire people who knew it all (data, models, deployment, infra), or had people tapping on others' shoulders for support.
Incidentally, we use MLflow for automatic experiment tracking: this means you don't need to pollute notebooks with tracking code or learn the MLflow API. We detect params, metrics, and the model itself automatically and save them to S3. It's all done for multi-user, behind authentication.
Anyway, my contact information is in my profile. I'd be interested in what your problems are.
As I said, not a competitor of something huge like Databricks. My competitor is my colleague's laptop or a "powerful workstation with GPU" in some office.
Not to put too fine a point on it, but MLflow alone isn't MLOps; moreover, and somewhat surprisingly, Databricks doesn't really do MLOps well, despite having created one of its most useful components.
Spark may be a mature solution for truly big data, in a SQL-like fashion, 1TB and more. But I constantly see it being misused, even with datasets as small as 5GB. Maybe the valuation of the company reflects this 'growth' and 'adoption'. And data locality is a thing. You can't read terabytes from object storage (over HTTP). The batch-oriented MapReduce model is not going to be conducive to many ML algorithms where state needs to be passed around.
> But I constantly see it being misused, even with datasets as small as 5GB.
I witness frequent desire from engineers to use it because they see it as a competency/expertise that will unlock jobs at bigger, more lucrative companies. Also, startups kind of beg for it because the business keeps asking, "Will this tech scale 100x?" If you ask for a solution that scales 100x, and your problems aren't well-defined yet, and by the way it would be nice if it does streaming, too, since we might need that someday, your engineers are going to err on the side of using a big, complete solution.
If you can completely stay away from Python/pandas and get all your work done with typed languages like Scala/Java, that's good. A lot of scientists and non-CS folks are using Python/R. They need to avoid the mishmash of bringing in Spark and SQL for some bits and then getting back to Python/R. Native Python, especially, offers mature ways to handle data in the 100s of GB. Learning to incorporate Dask and Numba is going to be far easier than teaching all these folks distributed programming and spinning up Spark clusters, when that can be unnecessary in many cases.
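To make the "no cluster needed" point concrete, here is a minimal sketch of the underlying idea, not of Dask or Numba themselves: a one-pass, constant-memory group-by in plain standard-library Python. The function name `streaming_group_mean` and the `route`/`price` columns are made up for illustration, and the tiny in-memory CSV stands in for a file that would not fit in RAM.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(rows, key, value):
    """Per-group mean in a single pass, keeping only running totals in memory."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample data; in practice `rows` would be csv.DictReader over a large file,
# which yields one row at a time instead of loading everything.
sample = io.StringIO("route,price\nEDI-LHR,80\nEDI-LHR,120\nGLA-AMS,60\n")
result = streaming_group_mean(csv.DictReader(sample), "route", "price")
print(result)
```

Memory use grows with the number of distinct groups rather than the number of rows, which is roughly why a single machine can get through files far larger than its RAM. Dask applies the same chunk-at-a-time idea behind a pandas-like API.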
Depends on what for. I'm a Java guy at heart, but honestly, for quick little analysis, pandas is way faster, and I can barely code my way out of a paper bag in Python.
I even hate Python and would never use it ... but I can't find anything better than pandas for my crazy large time series and always bespoke questions from the biz.