It's not apples to oranges with respect to my point, though.
Most operations on dataframe-like objects can be expressed as SQL operations. Spark supports these operations, and Catalyst can optimize the resulting query plans.
You are correct that Dask does not optimize for this: Dask operations are more primitive, so it does not have the right level of abstraction for query optimization, only task-graph optimization. Which reinforces my point: if you have a SQL-like workload on Dask dataframes, chances are Dask will not outperform Spark.
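To make that concrete, here is a rough sketch (the file path and column names are made up) of the same SQL-like aggregation written against both APIs -- Spark sees a relational query it can optimize, Dask sees a graph of pandas-style tasks:

```python
# Illustrative only -- "events.parquet" and the column names are made up.
# Spark DataFrame version: Catalyst sees the whole query and can optimize
# the plan (e.g. push the filter down before the aggregation).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
sdf = spark.read.parquet("events.parquet")
(sdf.filter(F.col("country") == "US")
    .groupBy("user_id")
    .agg(F.sum("amount").alias("total"))
    .show())

# Dask version: the same logical operations, expressed as pandas-style calls.
# Dask builds a task graph from them and (as discussed above) optimizes at
# the task level rather than at the level of the relational query.
import dask.dataframe as dd

ddf = dd.read_parquet("events.parquet")
print((ddf[ddf["country"] == "US"]
       .groupby("user_id")["amount"]
       .sum()
       .compute()))
```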
IMHO, I don't agree with you.
Spark's SQL is a consequence of the need to work across languages - Scala and Python. SQL gives an abstraction that is usable in both places. It is also the language in which data scientists communicate with Spark production engineers.
Every Spark production engineer I know translates the SQL written by data scientists back into high-performance RDD code.
That's the advantage of Dask - there is no SQL abstraction needed. Pandas dataframes are already the lingua franca of data scientists -- in fact, orders of magnitude more so than SQL ever will be.
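To illustrate with a toy example (made-up column names): the Dask dataframe API deliberately mirrors pandas, so the code a data scientist already writes locally scales out with almost no changes:

```python
# Toy example -- the data and column names are invented. The point is only
# that the Dask code is (deliberately) almost identical to the pandas code.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"user": ["a", "b", "a", "c"], "amount": [1.0, 2.0, 3.0, 4.0]})

# Local pandas:
local_totals = pdf.groupby("user")["amount"].sum()

# The same thing on a Dask dataframe (here just the pandas frame split into
# two partitions; in practice it would come from read_parquet / read_csv):
ddf = dd.from_pandas(pdf, npartitions=2)
dask_totals = ddf.groupby("user")["amount"].sum().compute()

print(local_totals)
print(dask_totals)
```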
TL;DR: Dask doesn't need SQL, because the people who will push Dask to production are already far more comfortable with dataframes than they ever will be with SQL.
You may still argue that Spark RDDs are faster than Dask (and you may indeed be right), but not having a SQL engine is not a problem for Dask.
> Spark's SQL is a consequence of the need to work across languages - Scala and Python
Not really. The Spark API has equivalent calls in both Scala and Python, with Scala being the superset. Spark SQL is a high-level abstraction that is internally mapped to these same operations.
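For example (made-up table and column names), these two snippets produce the same logical plan, and both go through Catalyst:

```python
# Illustrative only -- "sales.parquet", the view name and the columns are made up.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sql-vs-api").getOrCreate()
df = spark.read.parquet("sales.parquet")
df.createOrReplaceTempView("sales")

# The SQL syntax...
via_sql = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# ...and the DataFrame API in Python (the Scala API has the same calls).
# Both compile down to the same logical plan and are optimized by Catalyst.
via_api = df.groupBy("region").agg(F.sum("amount").alias("total"))
```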
> Every Spark production engineer I know translates the SQL written by data scientists back into high-performance RDD code.
This would be very unusual and rarely advisable with Spark > 2.0. Spark DataFrames are generally more memory-efficient and performant than RDDs in most situations (Catalyst optimizes the query plan, and Tungsten manages memory off-heap), so most data engineers work directly in DataFrames, dropping to RDDs only in specific situations that require more control.
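As a sketch of what I mean (made-up columns): the DataFrame version is what I'd expect a data engineer to ship today, and the RDD version is what "dropping down" looks like:

```python
# Sketch only -- "sales.parquet" and the column names are made up.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
df = spark.read.parquet("sales.parquet")

# DataFrame version: goes through Catalyst and Tungsten, so Spark can
# optimize the plan and generate efficient code for the aggregation.
totals_df = df.groupBy("region").agg(F.sum("amount").alias("total"))

# Equivalent RDD version: the lambdas are opaque to Spark, so it cannot
# optimize across them -- usually only worth it when you need row-level
# control the DataFrame API doesn't give you.
totals_rdd = (
    df.rdd
      .map(lambda row: (row["region"], row["amount"]))
      .reduceByKey(lambda a, b: a + b)
)
```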
If you know data engineers who are translating SQL into RDDs (outside of rare circumstances), you might want to advise them to move to Spark > 2.0. They may be working with the older Spark paradigm [1], having missed the shift that happened around 2.0, and missing out on all the work that has been done since.
> That's the advantage of Dask - there is no SQL abstraction needed.
SQL is only a language for accessing the dataframe abstraction (Spark DataFrames, pandas dataframes, etc.) -- the fact that it is higher-level means it is amenable to certain kinds of optimization.
If you take the full set of dataframe operations and restrict it to the subset that SQL supports (group-bys, wheres, pivots, joins, window functions, etc.), you can apply query optimization.
Dask does not restrict itself to that subset, which allows more powerful, lower-level manipulations of the data, but it therefore also cannot perform SQL-level query optimization, only task-graph-level optimizations.
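A rough sketch (made-up files and columns) of what the query optimizer buys you: in Spark, a filter written after a join can still be pushed below the join by Catalyst, whereas the analogous Dask code is executed essentially as written:

```python
# Sketch only -- the Parquet files and column names are invented.
# Spark: the filter is written *after* the join, but Catalyst sees the whole
# relational plan and can push the predicate below the join (and down to the
# Parquet scan).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("pushdown").getOrCreate()
orders = spark.read.parquet("orders.parquet")
users = spark.read.parquet("users.parquet")

plan = (orders.join(users, "user_id")
              .filter(F.col("country") == "US"))
plan.explain()  # the optimized plan typically shows the filter applied before the join

# Dask: the merge tasks sit before the filter tasks in the graph, and there
# is no relational optimizer to reorder them.
import dask.dataframe as dd

d_orders = dd.read_parquet("orders.parquet")
d_users = dd.read_parquet("users.parquet")
d_plan = d_orders.merge(d_users, on="user_id")
d_plan = d_plan[d_plan["country"] == "US"]
```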
This Spark vs Dask comparison on the Dask website provides more details [2].
> Pandas dataframes are already the lingua franca of data scientists -- in fact, orders of magnitude more so than SQL ever will be.
I wonder if this is where our misunderstanding lies -- I sense that you might be thinking of SQL strictly as the syntax, whereas I use SQL as shorthand for a set of mathematical operations on tabular structures -- and on that subset the two are equivalent.
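A toy example of what I mean -- the same logical operation in the two syntaxes (the data here is invented, and SQLite just stands in for any SQL engine):

```python
# Toy data; SQLite is only used here as a convenient in-memory SQL engine.
import sqlite3
import pandas as pd

df = pd.DataFrame({"region": ["US", "US", "EU"], "amount": [1, 2, 3]})

# The SQL phrasing...
con = sqlite3.connect(":memory:")
df.to_sql("sales", con, index=False)
via_sql = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", con
)

# ...and the dataframe phrasing -- the same GROUP BY / SUM underneath.
via_pandas = df.groupby("region", as_index=False)["amount"].sum()

print(via_sql)
print(via_pandas)
```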
That's apples to oranges - because Dask does not expose a SQL syntax that needs a query optimizer.
Also, PySpark has the additional issue of serialization between Python and the JVM. It turns out that just getting rid of that is a huge performance boost.
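A sketch of where that cost shows up most clearly (made-up file and column names): a Python UDF forces every row across the Python/JVM boundary, while the equivalent built-in expression stays entirely in the JVM:

```python
# Sketch only -- "sales.parquet" and the column name are made up.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-overhead").getOrCreate()
df = spark.read.parquet("sales.parquet")

# Python UDF: each row is serialized from the JVM to a Python worker and back.
add_tax_udf = F.udf(lambda amount: amount * 1.2, DoubleType())
slow = df.withColumn("with_tax", add_tax_udf(F.col("amount")))

# Built-in column expression: evaluated inside the JVM, no Python round-trip.
fast = df.withColumn("with_tax", F.col("amount") * 1.2)
```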