Hacker News
Reflecting on Four Years at Databricks (lihaoyi.com)
158 points by yla92 on Oct 11, 2021 | 35 comments



I'm surprised by the amount of hate this post has received.

Also, no one talked about what Haoyi has accomplished here (and in the general Scala community). There are too many "icons" in the tech industry that are just techlebrities. Li, OTOH, is one of the most brilliant engineers I've ever seen. He's one of the few people I truly admire, and I'm sure that most of the fires he describes in his post have already been put out, thanks to his initiatives.

Just wanted to bring something nice to the conversation :)


Yeah he’s a special person. His book is a joy to read (and learn from!)


The article author is the creator of the "lihaoyi ecosystem" of Scala libraries: https://github.com/com-lihaoyi

The libs are implemented with advanced Scala programming and expose elegant, Python-like interfaces. Li's ability to make beautiful abstractions is incredible. His book Hands-on Scala Programming shows you how to create different applications using his library ecosystem. It's a magnum opus that's worth reading, even if you're not interested in Scala.

Li's Principle of Least Power blog post is another great read. Scala would be a much more popular language if the community just used his libraries and followed his design patterns.


I always got a warm feeling when coming across a library of his and using it - such a joy to code with.


> The people at Databricks seemed nice, smart, and I could get along with them. This may seem like a pretty low baseline...

You would be surprised how difficult it is to meet that seemingly low baseline.


We used Databricks at my previous company (Skyscanner). The product worked well and was in general better than having our own cluster. Our dev enablement team built tooling around creating new DAGs, and spinning up a new project was relatively easy.

Personally, I found the experience much better than what the company used before, which was custom Spark clusters.


Quick aside. I just want to call out that I remember back in the day finding so many wonderful flight deals on Skyscanner... I think it lost its edge but for a time was truly special


Surprised about the negativity in this thread. I think Databricks is astoundingly good tech.

Getting developers, analysts and data scientists on one platform and using their preferred languages is very well done.

Nice collaborative interface.

Massive cost avoidance in terms of building spark clusters and tooling.

Combining warehouses and lakes.

Spark is a bit tricky to use vs some of the alternatives, but I think they absolutely nail it.

Dinged me for a job at the qualification stage but I won't hold that against them.


What's that saying about software everyone hates vs. software no one uses?

Much as I have some reservations about 'Spark all the things', personally I think Databricks is world-changing in terms of how much it has opened up pretty non-trivial applications to companies that would not be able to run a massive Spark cluster.


> Surprised about the negativity in this thread. I think Databricks is astoundingly good tech.

Is it possible that others have not had the same good experience with it that you have?


Interesting to hear his opinion on joining companies with lots of fires to put out. That feels pretty risky to me, since putting out fires is collaborative, and those who let the fires be set in the first place are probably still around.

As for me, I just don't need the stress. It's possible to have hard, rewarding problems to solve without them being fires that need to be put out ASAP. Life's too short.

On the subject of the quality of Databricks, I'm half impressed and half annoyed. As a source of ad hoc, notebook-based compute it's magical. But stuff like proper git integration simply took too long. Further, the lengths that data scientists at my org go through to leverage Databricks in MLOps pipelines could be minimized with a better design.


> However, that doesn't mean the fires have gone out.

You can say that again.

It kind of horrifies me to know that it used to be that much worse; Databricks as a company is the king of the MVP.

Does it work? Kinda?

Quick, let’s early access it with some suckers and get them to find all the bugs (like git integration that sometimes checks out your files and sometimes doesn’t). …or maybe just drop it in as a feature we never bothered to test actually works (Azure SAS token authentication, for example…), or that just never really worked (R), or that was just too hard to do (Scala in credential passthrough clusters, or hell, even being able to explain the DBU/infra cost split in billing).

Time to move on to the next feature, whichever of the ten things you never originally provided (currently, but not limited to, user-level auth, ingestion, task DAGs, and source control, all at the same time).

After my 2 years working with the platform, my interest in how hard the problems are and how nice the folk there are is… pretty limited.

This is a pain-in-the-ass platform to work with.


I know many will disagree. But to me, when some software is written in Scala, most of the focus is on Scala and not on the software.


That seems very subjective. Why do you think that?


I think Databricks is not that well positioned. Every cloud vendor has cloned it already, and when going from exploring data to deploying code it is not that appealing. We had big vendors locking in enterprises in the past; I am not so sure we want big vendors blocking things again, especially when most of the final product is just code.


Yes. One of our gripes with many of the solutions was vendor lock-in in some way, be it in data, compute, or simply the APIs.

I wrote a comment above https://news.ycombinator.com/item?id=28830306

One of our guiding principles, even though it is a SaaS, has been to do as much as possible on the client's side: using your own Kubernetes clusters for compute jobs, training, and even interactive notebooks.

Using integrations with external data sources. We're moving many things to the client's infra and automating it. This simplifies many things, including billing given you won't need to budget for yet another SaaS, but also security policies given you control your own clusters (granted, they could be on a public cloud provider, but all we need is an endpoint, a certificate authority and voilà).


> Every Cloud vendor has cloned it already

Couldn't disagree more; developer experience, integration, and performance all still matter. I'm not super impressed with the major cloud offerings thus far.


The tech is kinda good but not great. Their strategy is probably going to slow it down, not speed it up. My opinion is that if Snowflake, Firebolt, or a new startup sets out to be Spark-native or Spark-compatible in a truly deep way, then they can completely supplant Databricks.


Hey Will, I’m sorry it didn’t work out. We’ve actually adjusted really well to COVID (in no small part thanks to our amazing employees).

We care a lot about our hiring experience. I’d be happy to chat about this over email. Feel free to drop me a message at Bilal dot Aslam at databricks dot com.


I think you replied to the wrong comment on hiring processes


The post was edited to remove comments about hiring


Is this supposed to be one of those Twitter style "clapbacks"? What a terrible look.


I mean, not really? I’m just a product manager at Databricks who cares about our customers and candidates. Take that for what you will, I guess.


Why did you bring up hiring? His comment doesn’t even mention it. Unless he edited it.


The post was edited


Is Will a customer, or a former candidate?


For a moment, my eyes switched two letters and the title I read was "Reflecting on Your Fears of Databricks".

I've never used their services and my company needs some MLOps, but the contents of this thread made me fear it. So, mission accomplished, dear eyes.


>I've never used their services and my company needs some MLOps, but the contents of this thread made me fear it. So, mission accomplished, dear eyes.

I'd say I wouldn't discourage you from looking at several solutions. There's an MLOps community and they have a Slack: https://mlops.community/

Also, a plug: I recently submitted a "Show HN: Use your own clusters to train, track, deploy, and monitor ML models (iko.ai)"

https://news.ycombinator.com/item?id=28777589

https://iko.ai

Nothing fancy, way too small. We built it initially because we do a lot of consulting projects for enterprise clients and it took a toll on us. We either needed to hire people who knew it all (data, models, deployment, infra), or had people tapping on others' shoulders for support.

Incidentally, we use MLflow for automatic experiment tracking: this means you don't need to pollute notebooks with tracking code or learn the MLflow API. We detect params, metrics, and the model itself automatically and save them to S3. It's all done for multiple users, behind authentication.

Anyway, my contact information is in my profile. I'd be interested in what your problems are.

As I said, not a competitor of something huge like Databricks. My competitor is my colleague's laptop or a "powerful workstation with GPU" in some office.


Not to put too fine a point on it, but MLflow alone isn't MLOps; moreover, and somewhat surprisingly, Databricks doesn't really do MLOps well, despite having created one of its most useful components.


Spark may be a mature solution for truly big data, processed in a SQL-like fashion, 1TB and more. But I constantly see it being misused, even with datasets as small as 5GB. Maybe the valuation of the company reflects this 'growth' and 'adoption'. And data locality is a thing. You can't read terabytes from object storage (over HTTP). The batch-oriented MapReduce model is not going to be conducive to many ML algorithms where state needs to be passed around.


> But I constantly see it being misused, even with datasets as small as 5GB.

I witness frequent desire from engineers to use it because they see it as a competency/expertise that will unlock jobs at bigger, more lucrative companies. Also, startups kind of beg for it because the business keeps asking, "Will this tech scale 100x?" If you ask for a solution that scales 100x, and your problems aren't well-defined yet, and by the way it would be nice if it does streaming, too, since we might need that someday, your engineers are going to err on the side of using a big, complete solution.


I'd take a lazy, typed data manipulation language over pandas all day


If you can completely stay away from Python/pandas and get all your work done with typed languages like Scala/Java, that's good. A lot of scientists and non-CS folks are using Python/R. They need to avoid the mishmash of bringing in Spark and SQL for some bits and then getting back to Python/R. Native Python, especially, offers mature ways to handle data in the hundreds of GB. Learning to incorporate Dask and Numba is going to be far easier than teaching all these folks distributed programming and spinning up Spark clusters, when that can be unnecessary in many cases.


Depends on what for. I'm a Java guy at heart, but honestly, for quick little analyses, pandas is way faster, and I can barely code my way out of a paper bag in Python.

I even hate Python and would never use it ... but I can't find anything better than pandas for my crazy large time series and always bespoke questions from the biz.


Databricks' IPO is upcoming, with a valuation of ~$40B.

But I predict it will be sold to either Microsoft, Amazon or Salesforce within a year or so, maybe sooner.



