A lot of people using PySpark are moving to Dask for significantly faster performance. Dask is also built for Kubernetes, which is a huge deployment win.
Spark is still potentially faster for SQL-like workloads due to the existence of a query optimizer. Dask works at a different level of abstraction and does not have a query optimizer.
It’s not apples to oranges with respect to my point though.
Most operations on dataframe-like objects can be described as SQL operations. Spark supports these operations, and Catalyst can optimize the resulting query plans.
You are correct that Dask does not optimize for this: Dask operations are more primitive, so it lacks the right level of abstraction for query optimization and can only do task-graph optimization. Which reinforces my point that if you have a SQL-like workload on dask.dataframe, chances are Dask will not outperform Spark.
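To make the comparison concrete, here is a rough sketch of the same aggregation in both APIs (the parquet path and column names are invented): Spark runs the plan through Catalyst, while Dask compiles it to a task graph of pandas calls.

```python
# Hypothetical sketch: the same group-by aggregation in PySpark and dask.dataframe.
# The parquet path and column names are made up for illustration.
from pyspark.sql import SparkSession
import dask.dataframe as dd

spark = SparkSession.builder.getOrCreate()

# PySpark: the logical plan is rewritten by the Catalyst optimizer before execution.
sdf = spark.read.parquet("sales.parquet")
sdf.groupBy("region").sum("revenue").explain()  # shows the optimized physical plan

# Dask: the same operation becomes a task graph of pandas calls, optimized only
# at the task-graph level, never as a query plan.
ddf = dd.read_parquet("sales.parquet")
ddf.groupby("region")["revenue"].sum().compute()
```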
IMHO I don't agree with you.
Spark's SQL is a consequence of the need to work across languages - Scala and Python. So SQL gives the abstraction necessary to be usable in both places. It is also the language in which data scientists communicate with Spark production engineers.
Every Spark production engineer I know translates the SQL written by data scientists back into high-performance RDD code.
That's the advantage of Dask - there is no SQL abstraction needed. Pandas Dataframes are already the lingua franca of data scientists.. in fact, orders of magnitude more than SQL ever will be.
TL;DR: Dask doesn't need SQL because the people who will push Dask to production are already far more comfortable with dataframes than they ever will be with SQL.
You may still argue that Spark RDDs are faster than Dask (and you may indeed be right), but not having a SQL engine is not a problem for Dask.
> Spark's SQL is a consequence of the need to work across languages - Scala and Python
Not really. The Spark API has equivalent calls in both Scala and Python, with the Scala API being the superset. Spark SQL is a high-level abstraction that is internally mapped to these operations.
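As a hedged illustration of that mapping (the table and column names are invented): the SQL string and the DataFrame calls describe the same operations and go through the same analysis and optimization.

```python
# Hypothetical sketch: Spark SQL and the DataFrame API resolve to equivalent plans.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("events.parquet")  # made-up path
df.createOrReplaceTempView("events")

# SQL string...
via_sql = spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

# ...and the equivalent DataFrame calls; both pass through Catalyst.
via_api = df.groupBy("user_id").agg(F.count("*").alias("n"))

via_sql.explain()
via_api.explain()
```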
> Every Spark production engineer I know translates the SQL written by data scientists back into high-performance RDD code.
This would be very unusual and rarely advisable with Spark > 2.0. Spark DataFrames are generally more memory-efficient, type-safe and performant than RDDs, so most data engineers work directly with Spark DataFrames, dropping to RDDs only in specific situations that require more control.
If you know data engineers who are translating SQL into RDDs (outside of rare circumstances), you might want to advise them to move to Spark > 2.0 and change their paradigms. They might be working with older Spark idioms [1], having missed the shift that happened around 2.0 and all the work that has been done since.
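For illustration only (the data here is invented), the kind of translation in question looks roughly like this; post-2.0, the DataFrame version is normally the one to keep.

```python
# Hypothetical sketch: the same per-key average as a DataFrame job and as
# hand-written RDD code. The data is made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
rows = [("a", 1.0), ("a", 3.0), ("b", 2.0)]

# DataFrame version: benefits from Catalyst planning and Tungsten's compact row format.
df = spark.createDataFrame(rows, ["key", "value"])
df.groupBy("key").agg(F.avg("value")).show()

# RDD version: plain objects and no query optimization; only worth it when you
# need control the DataFrame API doesn't give you.
rdd = spark.sparkContext.parallelize(rows)
averages = (rdd.mapValues(lambda v: (v, 1))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda s: s[0] / s[1])
               .collect())
print(averages)
```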
> That's the advantage of Dask - there is no SQL abstraction needed.
SQL is only a language for accessing the dataframe abstraction (Spark DataFrames, pandas DataFrames, etc.) -- the fact that it is higher-level means it is amenable to certain types of optimization.
If you take the full set of dataframe operations and restrict it to the subset that SQL supports (group-bys, wheres, pivots, joins, window functions, etc.), you can apply query optimization.
Dask does not restrict itself to that subset, and hence allows more powerful, lower-level manipulations of data, but it therefore also cannot perform SQL-level query optimization, only task-graph-level optimizations.
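As a rough sketch of what that optimization buys (paths and columns are invented): a filter written after a join can be pushed below the join by Catalyst, something a task-graph optimizer has no visibility into.

```python
# Hypothetical sketch: Catalyst can push a filter below a join even though the
# filter is written after the join. Paths and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("orders.parquet")
users = spark.read.parquet("users.parquet")

# Written as join-then-filter...
joined = orders.join(users, "user_id").filter(users["country"] == "NZ")

# ...but the optimized plan typically applies the country filter before the join.
joined.explain()
```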
This Spark vs Dask comparison on the Dask website provides more details [2].
> Pandas DataFrames are already the lingua franca of data scientists - in fact, orders of magnitude more so than SQL ever will be.
I wonder if this is where our misunderstanding lies -- I sense that you might be thinking of SQL strictly as the syntax, whereas I use SQL as shorthand for a set of mathematical operations on tabular structures -- the two are equivalent on that subset.
The examples are all in Rust, so it's very hard to make a non-toy demo. Usually if one uses the RDD API, one has some sort of library code that's already in Java or Python, and it's impractical to port that code just to make the job run. Or, more likely, somebody will write an initial version using that code and then port/optimize the job later.
Native dependencies usually mean you'll need Docker. Spark pre-dates Docker and only relatively recently added the Kubernetes runner, which makes dockerized jobs easy. Historically it hasn't been easy to run a job in a containerized environment with the native dependencies you need. You can ship native deps with your job, but that's not easy, especially if you need a rebuild for each job.
The main advantage of Spark is flexibility and interoperability. You save time by not having to write something optimized on day 1 (for something you might throw away). And you get SQL support, which Beam and Hadoop don't have (certainly not for Python). There are lots of benchmarks where Spark SQL is not the winner, but the point is that Spark will help you save development time.
The author of the repo here. It is still very much in the POC stage. It will definitely have Python APIs in the future; one of the primary reasons to choose a native language is to enable better Python integration. I intend to keep the APIs almost identical to Spark's, so that it will be easy to migrate. It is still too early to promise this, but it is the objective.
This sounds too good to be true. If it were this easy to be orders of magnitude faster than Spark on the JVM, why haven't the Spark developers ported Spark to native code already?
The author of the repo here. It is definitely not orders of magnitude faster; I don't think I claimed that anywhere. But yes, the JVM can be a problem for in-memory computing in big data processing. Spark itself tried to address this: that is what the Tungsten engine does. It works around heavyweight Java objects by laying data out off-heap as native types via sun.misc.Unsafe. That is why DataFrames are generally much faster than RDDs (which typically hold Java objects), and why only certain native types are allowed in DataFrames.

This project was just for exploring the feasibility of implementing the engine itself in a native language; closure serialization can be a nightmare here. If it actually translated to even 2-4x better performance than Spark (which is itself very difficult to achieve, considering the years of optimization that have gone into Spark), it could be a good alternative and reduce cloud costs a bit, especially if the Python APIs remain compatible. Spark DataFrames are already highly optimized. Therefore I just thought of open-sourcing it, and if others see the benefits, it will automatically grow with the help of the community. It is still a long, long way from Spark-level maturity; Spark is a very large ecosystem built on top of an already big Hadoop ecosystem.
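For reference, a minimal sketch of the off-heap side of this from PySpark; the configuration keys are standard Spark settings, and the size value is an arbitrary placeholder.

```python
# Minimal sketch: turning on Spark's off-heap (Tungsten-managed) memory.
# The configuration keys are standard Spark settings; the size is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "2g")
         .getOrCreate())
```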
> if others see the benefits, it will automatically grow with the help of the community
There's nothing automatic about it: you or someone else will need to put a lot of work into leading the community, merging pull requests, debugging, etc.
I didn't mean it that way. Instead of sitting idly on my laptop, it might at least be useful for someone, and if it really proves to be beneficial, then people might contribute to it. Yeah, I'm not denying your point: it does require a huge effort from some people to get it to a mature, production-ready stage.
I know that Spark has had a lot of work put into it, but my personal experience with it has been pretty negative. I've spent a lot of time at my job trying to tune it for our workflows (extremely deep queries), with only moderate success. I've just POC'd a custom SQL execution engine that was 200x faster than Spark for the same workflows. Now, our requirements are pretty non-standard, but I find it pretty easy to believe these benchmarks.
McSherry et al.'s paper "Scalability! But at what COST?" is worth reading. A single-threaded implementation on a single core often outperforms Spark on the graph workloads they measure.
The best rule of thumb I'm aware of is: unless your computation can't fit on a single machine, or your jobs are likely to fail before completing because of the size and duration involved, you are generally better off without Spark or similar systems. And if sampling can get you back onto a single machine, then you're really better off.
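A minimal sketch of that sampling escape hatch, assuming a dataset too large for memory (the path, fraction, and columns are made up): use Dask only to cut the data down, then work in plain pandas on one machine.

```python
# Hypothetical sketch: sample a large dataset down to something that fits in
# memory, then continue in plain pandas on a single machine.
import dask.dataframe as dd

ddf = dd.read_parquet("big_dataset.parquet")   # made-up path
sample = ddf.sample(frac=0.01).compute()       # now an ordinary pandas DataFrame
print(sample.groupby("category")["value"].mean())  # made-up columns
```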
In my experience too, distributed code introduces a lot of redundancy, and it takes a lot of data before it beats a single-threaded, single-machine implementation. Check out McSherry's Timely Dataflow; it is truly an amazing piece of work.
It is indeed my opinion too. For non-standard workflows, handcrafted code will most likely beat generic frameworks (though not in every case). I have conflicting thoughts about this. Industries nowadays are very fast-moving; they generally can't afford to build custom solutions for each use case, so they tend to pick generic frameworks. But I have seen many managers pick the wrong tools for the job and vastly overestimate their future needs. Everyone thinks they are going to process petabytes of data, so they decide to use these generic distributed frameworks from the beginning to prepare for future scale. It rarely happens. Most of the time they end up spending money on the cloud, because making something distributed comes with a lot of redundancy to provide fault tolerance, yet for data up to a few TB it is still not as performant as a single machine. Even here, if you take that parquet example, my hand-coded Rust beats the Rust RDD version by 4x. I guess we can't change this attitude, so it is better to aim at improving these libraries.
I completely agree that we need better generic libraries. I was mostly commenting that I really believe there are huge wins to be had in the "generic distributed execution engine" space, and that people shouldn't be intimidated by the work that has already gone into Spark.
I'm kind of surprised it took this long for someone to do this. It was clear very early on that the JVM was a bad match for what Spark was trying to do.
I attended a talk at Strata a few years back by a Spark committer who talked about how Spark was stretching JVM memory allocation far past what the JVM was originally designed for. Do a couple of searches for "spark JVM OOM" and you'll see discussions of similar issues.
I'm wondering how much could be gained if one used all possible optimizations: e.g. analyzing the data-flow graph (expressed using a DSL) and generating native per-node programs, using CPU thread pinning and a user-space network stack (like ScyllaDB does [1]).
Spark (and Apache Spark) is a trademark of the Apache Software Foundation. If the title were SPARK in all caps, I'd understand, but how often do you read articles about SPARK where the name is written as "Spark"?
I like to see people re-implement things and share their (better) results. Even if the results aren't better than the battle-tested existing solutions, at least we can learn something in the process.
Scylla [1], for instance, is a C++ rewrite and drop-in replacement of the JVM-based Cassandra, and from what I've read it is fairly stable and performs much faster.
I assume you mean Graal's native-image mode, i.e. ahead-of-time compiled.
JIT-based execution still beats AOT thanks to the continuous-profiling advantage; it costs a bit more in resources, but for long-running processes the JIT wins.
Thanks for that. Now that I've had a chance to read through it, a question:
The examples seem to be implemented in pure Rust. No one is going to port their Spark jobs to Rust in the short term. Have you evaluated performance with Python etc.?
If you're still seeing significant speedups, you might want to bottle this up and seek VC funding, because a managed service along the lines of 'Databricks but 10x faster' would certainly get traction.
It is in a very early POC stage and distributed mode is pretty basic, but it is moving faster than I expected. Python integration is definitely one of the primary objectives, as I suspect no one is going to learn Rust for this (although I feel it is not that hard). In fact, it can have a better integration story with Python than Spark, since Rust has good C interop. Regarding performance: it is pretty good from what I have seen for CPU-intensive tasks, and once the block manager is implemented with compression and other optimizations like Spark's, shuffle tasks will also improve. There are more allocations here than I would prefer, just to keep the code in safe Rust as much as possible, and there is still plenty of optimization possible. I am doing this in my free time only. I feel it is too early to compare with Spark given how many features Spark has. Maybe in a couple of months, after it matures a bit and if there is enough traction, we can look for sponsors.
This is an economy where content competes for clicks, not clicks competing for content. The author of that content wants me to see it, Medium doesn't want me to see it. I don't care enough to try to circumvent their arrangement.
Given the number of votes on my root comment, it seems neither do most people.
You can also use Reader Mode on Safari, which not only avoids the modals and popups but gets rid of the top and bottom bars as well. Long-click on the Reader Mode button and you can set it to always use it on medium.com.
I haven't actually monetized it. I think that without Medium distribution it would be limited to my followers; that is the only reason I switched on distribution. I have decided to just use GitHub for my future blog posts.
Nice, but I can't find any reason to choose Spark over modern distributed SQL databases (CockroachDB, CitusDB, TiDB, etc., or cloud-vendor-specific SQL DBs).
Spark is specifically useful for querying streaming data. How would a distributed database help with that? You'd have to build your own stream executor on top of it.
Agreed. At the same time, building a stream-execution pipeline is not rocket science. I am not saying modern distributed SQL databases are exact replacements or clones of Spark. I am saying that with a little more help from the application server, they are much more capable than Spark.
You can use the following options individually or in combination.
Option 1: the PipelineDB extension (PostgreSQL).
Option 2: Service Broker in commercial SQL databases, or building a push/pull queue if it isn't supported. There are many libraries in every programming language that try to do this. Also see option 4.
Option 3: using CDC or replication for synchronous or asynchronous streamed computation on a single- or multi-node cluster.
Option 4: transducers. For example, you can compose many SQL functions or procedures to act on a single chunk of data instead of always doing async streamed computation after each stage of transformation (see the sketch after this list).
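A minimal sketch of that composition idea, written in Python rather than SQL just to show the shape (every function name and the chunk format are invented):

```python
# Hypothetical sketch of option 4: compose per-chunk transformations instead of
# running a separate async stage after each transformation. The functions and
# chunk format are invented; the same idea applies to composed SQL functions.
from functools import reduce

def clean(chunk):
    return [r for r in chunk if r is not None]

def enrich(chunk):
    return [{**r, "score": len(r.get("text", ""))} for r in chunk]

def aggregate(chunk):
    return {"n": len(chunk), "total_score": sum(r["score"] for r in chunk)}

def compose(*steps):
    return lambda chunk: reduce(lambda acc, step: step(acc), steps, chunk)

pipeline = compose(clean, enrich, aggregate)
print(pipeline([{"text": "hello"}, None, {"text": "spark vs dask"}]))
```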
How would you efficiently keep continuously updated complex metrics (aggregations, windowed functions, etc.) on top of unbounded/streaming data using a database? I'm not saying Spark is the ideal solution, but there is a set of problems that requires tools such as Spark.
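For comparison, this is the kind of thing Spark Structured Streaming makes fairly direct; a rough sketch where the broker, topic, and fields are all invented:

```python
# Hypothetical sketch: a continuously updated windowed aggregation over a Kafka
# stream using Spark Structured Streaming. Broker, topic, and fields are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Keep a per-user sum over 5-minute windows, continuously updated as data arrives.
metrics = (events
           .withWatermark("ts", "10 minutes")
           .groupBy(window(col("ts"), "5 minutes"), col("user_id"))
           .agg({"amount": "sum"}))

query = metrics.writeStream.outputMode("update").format("console").start()
```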
tl;dr we use it for a similar set of tasks that one would use Airflow for.
Unlike Airflow, it lends itself to microbatching and streaming, plus a bunch of housekeeping items ticked off that Airflow never got around to. With a bit of devops engineering time, you can have Prefect manage the size of your worker cluster on k8s and scale it up/down with ingest demand, etc.
I'll say one thing though. The Prefect website used to be a lot more technical and explicit about what it is and isn't. Now it's mostly sales gobbledegook. Maybe not a good sign; I've seen this happen before with Dremio.
Do you run Dask on k8s? I have been concerned that Dask does not leverage the Kubernetes HPA for autoscaling, but instead chooses to run an external scheduler.
Spark stacks inevitably end up with PySpark though. It's rework for people who have already committed to Spark, sure, and for bigger projects that committed to Spark the change isn't justifiable. But for a greenfield project, choosing Spark is just silly today.
Say, for example, I am using PostgreSQL 12 + the CitusDB extension:
Data cleaning -> PL/pgSQL and various built-in functions for transforming the data (or new UDFs if required at all)
Processing -> PostgreSQL parallel processing on the local node, plus the CitusDB extension for distributed computing and sharding
Analytics -> Many options here: materialized views, triggers, streaming computation with the PipelineDB extension, or logical replication for stream computation
ML -> PG supports a variety of statistics functions. It also supports the PL/R and PL/Python extensions to interface with ML libraries (see the sketch below)
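A minimal sketch of the PL/Python route, assuming the plpython3u extension is available; the table, function, and "model" are all invented, and the whole thing is driven from Python via psycopg2:

```python
# Hypothetical sketch: exposing a Python function to SQL via PL/Python.
# Assumes the plpython3u extension can be installed; names are made up.
import psycopg2

ddl = """
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION predict_score(features float8[])
RETURNS float8 AS $$
    # Toy "model": a fixed linear combination standing in for a real library call.
    weights = [0.5, -0.2, 1.3]
    return sum(w * x for w, x in zip(weights, features))
$$ LANGUAGE plpython3u;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute("SELECT id, predict_score(features) FROM observations LIMIT 10")
    print(cur.fetchall())
```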
Yeah that's not going to work for what people call analytics workloads today.
PG is great, but it's not suitable as a feature store and sure as hell not suitable for fanning out ML workloads. In a modern ML stack, PG might play the role of the slow but reliable master store that the rest of the ML pipeline feeds off.
> I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.
There's no question about this. If you can express your task in terms of PG on a single instance, then you probably should.
When you get to more complex tasks, like running input through GloVe and pushing n-grams to a temporal store, PG offers very little - which is fine, it's not at all what PG is designed for. Inter-node IO eclipses single-node perf, which is why Spark is used despite being terribly inefficient (although in the case of Spark, it's so inefficient that for intermediate-sized workloads you'd actually be better off vertically scaling a single node and using something else). PG won't help at all with these tasks.
Also, that smorgasbord of extensions the GP listed isn't offered by any cloud vendor as a managed service AFAIK, meaning you must roll and manage your own. Depending on your needs, that might be a showstopper.
> Also, I think typical scenario is to resolve embeddings in your model code or data input pipeline.
Correct. PG has no place in this workload other than being the final store for the model output, and even then you'd be using a column store like Redshift or ClickHouse. PG isn't even suitable for the n-gram counters, because its ingest rates are way too slow to keep up with a fanned-out model spitting out millions of n-grams per second, on top of everything else going on in the pipeline.
You -could- probably do it all in PG. But that'd be a silly esoteric challenge exercise and not something anyone would try on a project. I am sure you recognise that.
A typical Twitter post will have about 50 2-/3-/4-grams. Let's ignore skipgrams. The Twitter decahose will throw about 600 of these at you per second. That's 30k barebones n-grams per second just to keep up with the decahose.
But you have a year's worth of historical data that you want to work with. If you're able to process 1M n-grams per second, it'll take well over a week to get through that. You probably want to get closer to 10M/s if you're tweaking your model and want to iterate reasonably quickly. Of course there are ways to optimise all that, batch it, and whatnot, but basically any big-data task that needs to work on historical data and iterate on models quickly ends up with Kafka clusters piping millions of messages per second to keep those iteration times productive.
Ultimately this post is about Spark, and the comment that started this was someone listing PG 'replacements' for traditional ML pipeline components. If you need Spark, you're at scales where PG has no place.
That's why I mentioned scale in my first comment. For sub-TB data sizes, with a 16-core CPU and an NVMe RAID (you can get such a machine for less than $1k nowadays), PG will be just fine.
Also, in a typical ML pipeline, as I mentioned, you can generate n-grams in the input function of your model (the Dataset API in TF); you don't need to store them anywhere.
Train a set of sklearn models, one per random partition of the data (computed distributed), then combine those models by averaging and evaluate them all against an even larger dataset. How do you do that in SQL?
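For reference, this is roughly what that workload looks like sketched in PySpark (the data and model choice are invented); the question is how a SQL database would express the same thing.

```python
# Hypothetical sketch of the workload described above: fit one sklearn model per
# partition in parallel, then average the coefficients. Data and model are made up.
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import SGDClassifier

def fit_partition(rows):
    rows = list(rows)
    if not rows:
        return
    X = np.array([r[0] for r in rows])
    y = np.array([r[1] for r in rows])
    model = SGDClassifier().fit(X, y)
    yield (model.coef_, model.intercept_)

spark = SparkSession.builder.getOrCreate()
data = [(np.random.rand(3), np.random.randint(0, 2)) for _ in range(10000)]

params = (spark.sparkContext
          .parallelize(data, numSlices=8)
          .mapPartitions(fit_partition)
          .collect())

# Combine the per-partition models by simple averaging.
avg_coef = np.mean([c for c, _ in params], axis=0)
avg_intercept = np.mean([b for _, b in params], axis=0)
print(avg_coef, avg_intercept)
```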
Sharding the table can help scale the problem across many machines, and as I mentioned earlier you can use the PL/R or PL/Python language extensions to lift all sorts of ML functions into SQL functions.
ML models - I already mentioned how to lift R and Python functions into SQL functions. Even if you are not using PostgreSQL, many other databases help you with lifting and interfacing with existing ML libraries through FFI.
Spark is still in between YARN and Kubernetes.