A lot of people using PySpark are moving to Dask for significantly faster performance. Dask is also built for Kubernetes, which is a huge deployment win.
Spark is still potentially faster for SQL-like workloads due to the existence of a query optimizer. Dask works at a different level of abstraction and does not have a query optimizer.
It’s not apples to oranges with respect to my point though.
Most operations on dataframe-like objects can be described as SQL operations. Spark supports these operations, and Catalyst can optimize the resulting query plans.
You are correct that Dask does not optimize for this: Dask operations are more primitive, so it lacks the right level of abstraction for query optimization and can only do task-graph optimization. Which reinforces my point that if you have a SQL-like workload on dask.dataframe, chances are Dask will not outperform Spark.
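To make the comparison concrete, here is a rough sketch of the same aggregation in both APIs (the parquet path and column names are invented): Spark runs the plan through Catalyst, while Dask compiles it to a task graph of pandas calls.

```python
# Hypothetical sketch: the same group-by aggregation in PySpark and dask.dataframe.
# The parquet path and column names are made up for illustration.
from pyspark.sql import SparkSession
import dask.dataframe as dd

spark = SparkSession.builder.getOrCreate()

# PySpark: the logical plan is rewritten by the Catalyst optimizer before execution.
sdf = spark.read.parquet("sales.parquet")
sdf.groupBy("region").sum("revenue").explain()  # shows the optimized physical plan

# Dask: the same operation becomes a task graph of pandas calls, optimized only
# at the task-graph level, never as a query plan.
ddf = dd.read_parquet("sales.parquet")
ddf.groupby("region")["revenue"].sum().compute()
```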
IMHO I don't agree with you.
Spark's SQL is a consequence of the need to work across languages - Scala and Python. So SQL gives the abstraction necessary to be usable in both places. It is also the language in which data scientists communicate with Spark production engineers.
Every Spark production engineer I know translates the SQL written by data scientists back into high-performance RDD code.
That's the advantage of Dask - there is no SQL abstraction needed. Pandas Dataframes are already the lingua franca of data scientists.. in fact, orders of magnitude more than SQL ever will be.
TL;DR: Dask doesn't need SQL because the people who will push Dask to production are already far more comfortable with dataframes than they ever will be with SQL.
You may still argue that Spark RDDs are faster than Dask (and you may indeed be right), but not having a SQL engine is not a problem for Dask.
> Spark's SQL is a consequence of the need to work across languages - Scala and Python
Not really. The Spark API has equivalent calls in both Scala and Python, with the Scala API being the superset. Spark SQL is a high-level abstraction that is internally mapped to these operations.
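As a hedged illustration of that mapping (the table and column names are invented): the SQL string and the DataFrame calls describe the same operations and go through the same analysis and optimization.

```python
# Hypothetical sketch: Spark SQL and the DataFrame API resolve to equivalent plans.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("events.parquet")  # made-up path
df.createOrReplaceTempView("events")

# SQL string...
via_sql = spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

# ...and the equivalent DataFrame calls; both pass through Catalyst.
via_api = df.groupBy("user_id").agg(F.count("*").alias("n"))

via_sql.explain()
via_api.explain()
```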
> Every Spark production engineer I know translates the SQL written by data scientists back into high-performance RDD code.
This would be very unusual and rarely advisable with Spark > 2.0. Spark DataFrames are generally more memory-efficient, type-safe and performant than RDDs, so most data engineers work directly with Spark DataFrames, dropping to RDDs only in specific situations that require more control.
If you know data engineers who are translating SQL into RDDs (outside of rare circumstances), you might want to advise them to move to Spark > 2.0 and change their paradigms. They might be working with older Spark idioms [1], having missed the shift that happened around 2.0 and all the work that has been done since.
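For illustration only (the data here is invented), the kind of translation in question looks roughly like this; post-2.0, the DataFrame version is normally the one to keep.

```python
# Hypothetical sketch: the same per-key average as a DataFrame job and as
# hand-written RDD code. The data is made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
rows = [("a", 1.0), ("a", 3.0), ("b", 2.0)]

# DataFrame version: benefits from Catalyst planning and Tungsten's compact row format.
df = spark.createDataFrame(rows, ["key", "value"])
df.groupBy("key").agg(F.avg("value")).show()

# RDD version: plain objects and no query optimization; only worth it when you
# need control the DataFrame API doesn't give you.
rdd = spark.sparkContext.parallelize(rows)
averages = (rdd.mapValues(lambda v: (v, 1))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda s: s[0] / s[1])
               .collect())
print(averages)
```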
> That's the advantage of Dask - there is no SQL abstraction needed.
SQL is only a language for accessing the dataframe abstraction (Spark DataFrames, pandas DataFrames, etc.) -- the fact that it is higher-level means it is amenable to certain types of optimization.
If you take the full set of dataframe operations and restrict it to the subset that SQL supports (group-bys, wheres, pivots, joins, window functions, etc.), you can apply query optimization.
Dask does not restrict itself to that subset, and hence allows more powerful, lower-level manipulations of data, but it therefore also cannot perform SQL-level query optimization, only task-graph-level optimizations.
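As a rough sketch of what that optimization buys (paths and columns are invented): a filter written after a join can be pushed below the join by Catalyst, something a task-graph optimizer has no visibility into.

```python
# Hypothetical sketch: Catalyst can push a filter below a join even though the
# filter is written after the join. Paths and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("orders.parquet")
users = spark.read.parquet("users.parquet")

# Written as join-then-filter...
joined = orders.join(users, "user_id").filter(users["country"] == "NZ")

# ...but the optimized plan typically applies the country filter before the join.
joined.explain()
```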
This Spark vs Dask comparison on the Dask website provides more details [2].
> Pandas DataFrames are already the lingua franca of data scientists - in fact, orders of magnitude more so than SQL ever will be.
I wonder if this is where our misunderstanding lies -- I sense that you might be thinking of SQL strictly as the syntax, whereas I use SQL as shorthand for a set of mathematical operations on tabular structures -- the two are equivalent on that subset.
The examples are all in Rust, so it's very hard to make a non-toy demo. Usually if one uses the RDD API, one has some sort of library code that's already in Java or Python, and it's impractical to port that code just to make the job run. Or, more likely, somebody will write an initial version using that code and then port/optimize the job later.
Native dependencies usually mean you'll need Docker. Spark pre-dates Docker and only relatively recently added the Kubernetes runner, which makes dockerized jobs easy. Historically it hasn't been easy to run a job in a containerized environment with the native dependencies you need. You can ship native deps with your job, but that's not easy, especially if you need a rebuild for each job.
The main advantage of Spark is flexibility and interoperability. You save time by not having to write something optimized on day 1 (for something you might throw away). And you get SQL support, which Beam and Hadoop don't have (certainly not for Python). There are lots of benchmarks where Spark SQL is not the winner, but the point is that Spark will help you save development time.
The author of the repo here. It is still very much in the POC stage. It will definitely have Python APIs in the future; one of the primary reasons to choose a native language is to enable better Python integration. I intend to keep the APIs almost identical to Spark's, so that it will be easy to migrate. It is still too early to promise this, but it is the objective.
This sounds too good to be true. If it were this easy to be orders of magnitude faster than Spark on the JVM, why haven't the Spark developers ported Spark to native code already?
The author of the repo here. It is definitely not orders of magnitude faster; I don't think I claimed that anywhere. But yes, the JVM can be a problem for in-memory computing in big data processing. Spark itself tried to address this: that is what the Tungsten engine does. It works around heavyweight Java objects by laying data out off-heap as native types via sun.misc.Unsafe. That is why DataFrames are generally much faster than RDDs (which typically hold Java objects), and why only certain native types are allowed in DataFrames.

This project was just for exploring the feasibility of implementing the engine itself in a native language; closure serialization can be a nightmare here. If it actually translated to even 2-4x better performance than Spark (which is itself very difficult to achieve, considering the years of optimization that have gone into Spark), it could be a good alternative and reduce cloud costs a bit, especially if the Python APIs remain compatible. Spark DataFrames are already highly optimized. Therefore I just thought of open-sourcing it, and if others see the benefits, it will automatically grow with the help of the community. It is still a long, long way from Spark-level maturity; Spark is a very large ecosystem built on top of an already big Hadoop ecosystem.
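For reference, a minimal sketch of the off-heap side of this from PySpark; the configuration keys are standard Spark settings, and the size value is an arbitrary placeholder.

```python
# Minimal sketch: turning on Spark's off-heap (Tungsten-managed) memory.
# The configuration keys are standard Spark settings; the size is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "2g")
         .getOrCreate())
```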
> if others see the benefits, it will automatically grow with the help of the community
There's nothing automatic about it: you or someone else will need to put a lot of work into leading the community, merging pull requests, debugging, etc.
I didn't mean it that way. Instead of sitting idly on my laptop, it might at least be useful for someone, and if it really proves to be beneficial, then people might contribute to it. Yeah, I'm not denying your point: it does require a huge effort from some people to get it to a mature, production-ready stage.
I know that Spark has had a lot of work put into it, but my personal experience with it has been pretty negative. I've spent a lot of time at my job trying to tune it for our workflows (extremely deep queries), with only moderate success. I've just POC'd a custom SQL execution engine that was 200x faster than Spark for the same workflows. Now, our requirements are pretty non-standard, but I find it pretty easy to believe these benchmarks.
McSherry et al.'s paper "Scalability! But at what COST?" is worth reading. A single-threaded implementation on a single core often outperforms Spark on the graph workloads they measure.
The best rule of thumb I'm aware of is: unless your computation can't fit on a single machine, or your jobs are likely to fail before completing because of the size and duration involved, you are generally better off without Spark or similar systems. And if sampling can get you back onto a single machine, then you're really better off.
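A minimal sketch of that sampling escape hatch, assuming a dataset too large for memory (the path, fraction, and columns are made up): use Dask only to cut the data down, then work in plain pandas on one machine.

```python
# Hypothetical sketch: sample a large dataset down to something that fits in
# memory, then continue in plain pandas on a single machine.
import dask.dataframe as dd

ddf = dd.read_parquet("big_dataset.parquet")   # made-up path
sample = ddf.sample(frac=0.01).compute()       # now an ordinary pandas DataFrame
print(sample.groupby("category")["value"].mean())  # made-up columns
```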
In my experience too, distributed code introduces a lot of redundancy, and it takes a lot of data before it beats a single-threaded, single-machine implementation. Check out McSherry's Timely Dataflow; it is truly an amazing piece of work.
It is indeed my opinion too. For non-standard workflows, handcrafted code will most likely beat generic frameworks (though not in every case). I have conflicting thoughts about this. Industries nowadays are very fast-moving; they generally can't afford to build custom solutions for each use case, so they tend to pick generic frameworks. But I have seen many managers pick the wrong tools for the job and vastly overestimate their future needs. Everyone thinks they are going to process petabytes of data, so they decide to use these generic distributed frameworks from the beginning to prepare for future scale. It rarely happens. Most of the time they end up spending money on the cloud, because making something distributed comes with a lot of redundancy to provide fault tolerance, yet for data up to a few TB it is still not as performant as a single machine. Even here, if you take that parquet example, my hand-coded Rust beats the Rust RDD version by 4x. I guess we can't change this attitude, so it is better to aim at improving these libraries.
I completely agree that we need better generic libraries. I was mostly commenting that I really believe there are huge wins to be had in the "generic distributed execution engine" space, and that people shouldn't be intimidated by the work that has already gone into Spark.
I'm kind of surprised it took this long for someone to do this. It was clear very early on that the JVM was a bad match for what Spark was trying to do.
I attended a talk at Strata a few years back by a Spark committer who talked about how Spark was stretching JVM memory allocation far past what the JVM was originally designed for. Do a couple of searches for "spark JVM OOM" and you'll see discussions of similar issues.
I'm wondering how much could be gained if one used all possible optimizations: e.g. analyzing the data-flow graph (expressed using a DSL) and generating native per-node programs, using CPU thread pinning and a user-space network stack (like ScyllaDB does [1]).
Spark (and Apache Spark) is a trademark of the Apache Software Foundation. If the title were SPARK in all caps, I'd understand, but how often do you read articles about SPARK where the name is written as "Spark"?
I like to see people re-implement things and share their (better) results. Even if the results aren't better than the battle-tested existing solutions, at least we can learn something in the process.
Scylla [1], for instance, is a C++ rewrite and drop-in replacement of the JVM-based Cassandra, and from what I've read it is fairly stable and performs much faster.
I assume you mean Graal's native-image mode, i.e. ahead-of-time compiled.
JIT-based execution still beats AOT thanks to the continuous-profiling advantage; it costs a bit more in resources, but for long-running processes the JIT wins.
Thanks for that. Now that I've had a chance to read through it, a question:
The examples seem to be implemented in pure Rust. No one is going to port their Spark jobs to Rust in the short term. Have you evaluated performance with Python etc.?
If you're still seeing significant speedups, you might want to bottle this up and seek VC funding, because a managed service along the lines of 'Databricks but 10x faster' would certainly get traction.
It is in a very early POC stage and distributed mode is pretty basic, but it is moving faster than I expected. Python integration is definitely one of the primary objectives, as I suspect no one is going to learn Rust for this (although I feel it is not that hard). In fact, it can have a better integration story with Python than Spark, since Rust has good C interop. Regarding performance: it is pretty good from what I have seen for CPU-intensive tasks, and once the block manager is implemented with compression and other optimizations like Spark's, shuffle tasks will also improve. There are more allocations here than I would prefer, just to keep the code in safe Rust as much as possible, and there is still plenty of optimization possible. I am doing this in my free time only. I feel it is too early to compare with Spark given how many features Spark has. Maybe in a couple of months, after it matures a bit and if there is enough traction, we can look for sponsors.
This is an economy where content competes for clicks, not clicks competing for content. The author of that content wants me to see it, Medium doesn't want me to see it. I don't care enough to try to circumvent their arrangement.
Given the number of votes on my root comment, it seems neither do most people.
You can also use Reader Mode on Safari, which not only avoids the modals and popups but gets rid of the top and bottom bars as well. Long-click on the Reader Mode button and you can set it to always use it on medium.com.
I haven't actually monetized it. I think that without Medium distribution it would be limited to my followers; that is the only reason I switched on distribution. I have decided to just use GitHub for my future blog posts.
Nice, but I can't find any reason to choose Spark over modern distributed SQL databases (CockroachDB, CitusDB, TiDB, etc., or cloud-vendor-specific SQL DBs).
Spark is specifically useful for querying streaming data. How would a distributed database help with that? You'd have to build your own stream executor on top of it.
Agreed. At the same time, building a stream-execution pipeline is not rocket science. I am not saying modern distributed SQL databases are exact replacements or clones of Spark. I am saying that with a little more help from the application server, they are much more capable than Spark.
You can use the following options individually or in combination.
Option 1: the PipelineDB extension (PostgreSQL).
Option 2: Service Broker in commercial SQL databases, or building a push/pull queue if it isn't supported. There are many libraries in every programming language that try to do this. Also see option 4.
Option 3: using CDC or replication for synchronous or asynchronous streamed computation on a single- or multi-node cluster.
Option 4: transducers. For example, you can compose many SQL functions or procedures to act on a single chunk of data instead of always doing async streamed computation after each stage of transformation (see the sketch after this list).
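A minimal sketch of that composition idea, written in Python rather than SQL just to show the shape (every function name and the chunk format are invented):

```python
# Hypothetical sketch of option 4: compose per-chunk transformations instead of
# running a separate async stage after each transformation. The functions and
# chunk format are invented; the same idea applies to composed SQL functions.
from functools import reduce

def clean(chunk):
    return [r for r in chunk if r is not None]

def enrich(chunk):
    return [{**r, "score": len(r.get("text", ""))} for r in chunk]

def aggregate(chunk):
    return {"n": len(chunk), "total_score": sum(r["score"] for r in chunk)}

def compose(*steps):
    return lambda chunk: reduce(lambda acc, step: step(acc), steps, chunk)

pipeline = compose(clean, enrich, aggregate)
print(pipeline([{"text": "hello"}, None, {"text": "spark vs dask"}]))
```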
How would you efficiently keep continuously updated complex metrics (aggregations, windowed functions, etc.) on top of unbounded/streaming data using a database? I'm not saying Spark is the ideal solution, but there is a set of problems that requires tools such as Spark.
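For comparison, this is the kind of thing Spark Structured Streaming makes fairly direct; a rough sketch where the broker, topic, and fields are all invented:

```python
# Hypothetical sketch: a continuously updated windowed aggregation over a Kafka
# stream using Spark Structured Streaming. Broker, topic, and fields are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Keep a per-user sum over 5-minute windows, continuously updated as data arrives.
metrics = (events
           .withWatermark("ts", "10 minutes")
           .groupBy(window(col("ts"), "5 minutes"), col("user_id"))
           .agg({"amount": "sum"}))

query = metrics.writeStream.outputMode("update").format("console").start()
```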
tl;dr we use it for a similar set of tasks that one would use Airflow for.
Unlike Airflow, it lends itself to microbatching and streaming, plus a bunch of housekeeping items ticked off that Airflow never got around to. With a bit of devops engineering time, you can have Prefect manage the size of your worker cluster on k8s and scale it up/down with ingest demand, etc.
I'll say one thing though. The Prefect website used to be a lot more technical and explicit about what it is and isn't. Now it's mostly sales gobbledegook. Maybe not a good sign; I've seen this happen before with Dremio.
Do you run Dask on k8s? I have been concerned that Dask does not leverage the Kubernetes HPA for autoscaling, but instead chooses to run an external scheduler.
Spark stacks inevitably end up with PySpark though. It's rework for people who have already committed to Spark, sure, and for bigger projects that committed to Spark the change isn't justifiable. But for a greenfield project, choosing Spark is just silly today.
Say, for example, I am using PostgreSQL 12 + the CitusDB extension:
Data cleaning -> PL/pgSQL and various built-in functions for transforming the data (or new UDFs if required at all)
Processing -> PostgreSQL parallel processing on the local node, plus the CitusDB extension for distributed computing and sharding
Analytics -> Many options here: materialized views, triggers, streaming computation with the PipelineDB extension, or logical replication for stream computation
ML -> PG supports a variety of statistics functions. It also supports the PL/R and PL/Python extensions to interface with ML libraries (see the sketch below)
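A minimal sketch of the PL/Python route, assuming the plpython3u extension is available; the table, function, and "model" are all invented, and the whole thing is driven from Python via psycopg2:

```python
# Hypothetical sketch: exposing a Python function to SQL via PL/Python.
# Assumes the plpython3u extension can be installed; names are made up.
import psycopg2

ddl = """
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION predict_score(features float8[])
RETURNS float8 AS $$
    # Toy "model": a fixed linear combination standing in for a real library call.
    weights = [0.5, -0.2, 1.3]
    return sum(w * x for w, x in zip(weights, features))
$$ LANGUAGE plpython3u;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute("SELECT id, predict_score(features) FROM observations LIMIT 10")
    print(cur.fetchall())
```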
Yeah that's not going to work for what people call analytics workloads today.
PG is great, but it's not suitable as a feature store and sure as hell not suitable for fanning out ML workloads. In a modern ML stack, PG might play the role of the slow but reliable master store that the rest of the ML pipeline feeds off.
> I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.
There's no question about this. If you can express your task in terms of PG on a single instance, then you probably should.
When you get to more complex tasks, like running input through GloVe and pushing n-grams to a temporal store, PG offers very little - which is fine, it's not at all what PG is designed for. Inter-node IO eclipses single-node perf, which is why Spark is used despite being terribly inefficient (although in the case of Spark, it's so inefficient that for intermediate-sized workloads you'd actually be better off vertically scaling a single node and using something else). PG won't help at all with these tasks.
Also, that smorgasbord of extensions the GP listed isn't offered by any cloud vendor as a managed service AFAIK, meaning you must roll and manage your own. Depending on your needs, that might be a showstopper.
> Also, I think typical scenario is to resolve embeddings in your model code or data input pipeline.
Correct. PG has no place in this workload other than being the final store for the model output, and even then you'd be using a column store like Redshift or ClickHouse. PG isn't even suitable for the n-gram counters, because its ingest rates are way too slow to keep up with a fanned-out model spitting out millions of n-grams per second, on top of everything else going on in the pipeline.
You -could- probably do it all in PG. But that'd be a silly esoteric challenge exercise and not something anyone would try on a project. I am sure you recognise that.
A typical Twitter post will have about 50 2-/3-/4-grams. Let's ignore skipgrams. The Twitter decahose will throw about 600 of these at you per second. That's 30k barebones n-grams per second just to keep up with the decahose.
But you have a year's worth of historical data that you want to work with. If you're able to process 1M n-grams per second, it'll take well over a week to get through that. You probably want to get closer to 10M/s if you're tweaking your model and want to iterate reasonably quickly. Of course there are ways to optimise all that, batch it, and whatnot, but basically any big-data task that needs to work on historical data and iterate on models quickly ends up with Kafka clusters piping millions of messages per second to keep those iteration times productive.
Ultimately this post is about Spark, and the comment that started this was someone listing PG 'replacements' for traditional ML pipeline components. If you need Spark, you're at scales where PG has no place.
That's why I mentioned scale in my first comment. For sub-TB data sizes, with a 16-core CPU and an NVMe RAID (you can get such a machine for less than $1k nowadays), PG will be just fine.
Also, in a typical ML pipeline, as I mentioned, you can generate n-grams in the input function of your model (the Dataset API in TF); you don't need to store them anywhere.
Train a set of sklearn models, one per random partition of the data (computed distributed), then combine those models by averaging and evaluate them all against an even larger dataset. How do you do that in SQL?
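For reference, this is roughly what that workload looks like sketched in PySpark (the data and model choice are invented); the question is how a SQL database would express the same thing.

```python
# Hypothetical sketch of the workload described above: fit one sklearn model per
# partition in parallel, then average the coefficients. Data and model are made up.
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import SGDClassifier

def fit_partition(rows):
    rows = list(rows)
    if not rows:
        return
    X = np.array([r[0] for r in rows])
    y = np.array([r[1] for r in rows])
    model = SGDClassifier().fit(X, y)
    yield (model.coef_, model.intercept_)

spark = SparkSession.builder.getOrCreate()
data = [(np.random.rand(3), np.random.randint(0, 2)) for _ in range(10000)]

params = (spark.sparkContext
          .parallelize(data, numSlices=8)
          .mapPartitions(fit_partition)
          .collect())

# Combine the per-partition models by simple averaging.
avg_coef = np.mean([c for c, _ in params], axis=0)
avg_intercept = np.mean([b for _, b in params], axis=0)
print(avg_coef, avg_intercept)
```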
Sharding the table can help scale the problem across many machines, and as I mentioned earlier you can use the PL/R or PL/Python language extensions to lift all sorts of ML functions into SQL functions.
ML models - I already mentioned how to lift R and Python functions into SQL functions. Even if you are not using PostgreSQL, many other databases help you with lifting and interfacing with existing ML libraries through FFI.
Spark is still in between YARN and Kubernetes.