Spark as a Compiler: Joining a Billion Rows per Second on a Laptop (databricks.com)
251 points by rxin on May 23, 2016 | 53 comments



To put this into context I would recommend reading 'MonetDB/X100: Hyper-Pipelining Query Execution' [0]. Vectorized execution has been sort of an open secret in the database industry for quite some time now.

For me, it is particularly interesting to read about the Spark achievements. I was part of the similar Hive effort (the Stinger initiative [1]) and I contributed some parts of Hive's vectorized execution [2]. I see the same solution that was applied to Hive now being applied to Spark:

- move to a columnar, highly compressed storage format (Parquet, for Hive it was ORC)

- implement a vectorized execution engine

- code generation instead of plan interpretation. This is particularly interesting to me because for Hive this was discussed at the time but not adopted (ORC and vectorized execution had, justifiably, higher priority).

Looking at the numbers presented in the OP, it looks very nice. Aggregates, Filters, Sort, Scan ('decoding') show big improvements (I would expect these; they are exactly what vectorized execution is best at). I like that Hash-Join also shows significant improvement; it is obvious their implementation is better than the one I did in HIVE-4850, of which I'm not too proud. The SM/SMB join is not affected, no surprise there.

I would like to see a separation of how much of the improvement comes from vectorization vs. how much from code generation. I get the feeling that, the way they did it, these cannot be separated. I think there are no vectorized plans/operators to compare the code generation against; they implemented both simultaneously. I'm speculating, but I guess the new whole-stage code generation generates vectorized code, so there is no vectorized execution without code generation.
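To make the distinction concrete, here is a toy sketch of the vectorized side (the class and field names are made up for illustration; this is not Hive's actual VectorizedRowBatch API): a filter is evaluated over a whole batch of column values in one tight loop over primitive arrays, instead of once per row object.

    // Illustrative only: a "price > threshold" filter evaluated a batch at a
    // time over primitive arrays. Names are invented for this sketch.
    final class ColumnBatch {
        static final int SIZE = 1024;
        final long[] price = new long[SIZE];   // one primitive array per column
        final int[] selected = new int[SIZE];  // indices of rows that survive the filter
        int selectedCount;
    }

    final class GreaterThanFilter {
        void apply(ColumnBatch batch, long threshold) {
            int out = 0;
            for (int i = 0; i < ColumnBatch.SIZE; i++) {
                if (batch.price[i] > threshold) {
                    batch.selected[out++] = i;
                }
            }
            batch.selectedCount = out;
        }
    }

The win comes from the per-batch loop: no virtual call per row, predictable branches, and the JIT can often auto-vectorize it.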

All in all, congrats to the Databricks team. This will have a big impact.

[0] http://oai.cwi.nl/oai/asset/16497/16497B.pdf [1] http://hortonworks.com/blog/100x-faster-hive/ [2] https://issues.apache.org/jira/browse/HIVE-4160


Right now, vectorization is ONLY used in the Parquet reader and the in-memory cache; other operators still work in one-row-at-a-time mode, but are compiled into a single (or a few) Java function(s), so the intermediate results of these operators can be held in CPU registers or on the stack. It's easier to be highly performant and to support complicated data types and expressions compared to vectorized execution, so we (the Databricks team) picked whole-stage code generation rather than vectorized execution.
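Roughly, here is a hand-written toy of what a whole-stage generated function amounts to (Spark actually emits Java source and compiles it with Janino, and the real generated code is far less tidy): filter, projection and aggregation fused into one loop whose intermediates stay in local variables.

    import java.util.Iterator;

    // Toy stand-in for generated code: the pipeline
    //   scan -> filter(price > 100) -> project(price * qty) -> sum
    // fused into a single function. "price" and "revenue" never leave local
    // variables, so there is no per-operator row object or virtual next()
    // call between the steps.
    final class FusedStage {
        static long run(Iterator<long[]> scan) {    // each long[] is {price, qty}
            long sum = 0;                           // aggregation buffer
            while (scan.hasNext()) {
                long[] row = scan.next();
                long price = row[0];
                if (price <= 100) continue;         // inlined filter
                long revenue = price * row[1];      // inlined projection
                sum += revenue;                     // inlined aggregate
            }
            return sum;
        }
    }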

Disclaimer: I worked on whole-stage codegen and other performance-related stuff.


For anyone looking for info about column oriented DBs, I can recommend [0] as a great place to start. It covers the basic optimizations and compares some older solutions at a fairly high level.

[0] http://db.csail.mit.edu/pubs/abadi-column-stores.pdf


How much of this work gets the working set off of the JVM heap? Are they generating JVM bytecode? Compact heap layout is the next big win.


The OP points to SPARK-12795 [0] and it's all open source. They generate Java source code. You can read more at the prototype pull request: https://github.com/apache/spark/pull/10735 (I couldn't find a spec doc). If I understand correctly, they insert a `WholeStageCodegen` [1] operator into the plan:

    /**
     * WholeStageCodegen compiles a subtree of plans that support codegen together into a single
     * Java function.
     *
     * Here is the call graph for generating Java source (plan A supports codegen, but plan B does not):
     *
     *   WholeStageCodegen       Plan A               FakeInput        Plan B
     * =========================================================================
     *
     * -> execute()
     *     |
     *  doExecute() --------->   inputRDDs() -------> inputRDDs() ------> execute()
     *     |
     *     +----------------->   produce()
     *                             |
     *                          doProduce()  -------> produce()
     *                                                   |
     *                                                doProduce()
     *                                                   |
     *                         doConsume() <--------- consume()
     *                             |
     *  doConsume()  <--------  consume()
     *
     * SparkPlan A should override doProduce() and doConsume().
     *
     * doCodeGen() will create a CodeGenContext, which will hold a list of variables for input,
     * used to generate code for BoundReference.
     */
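To make the produce/consume dance above a bit more concrete, here is a stripped-down toy of the protocol (my own names, not Spark's actual CodegenSupport trait): the leaf operator produces the driving loop, and each parent consumes a row by splicing its per-row code into the loop body.

    // Toy model of produce()/consume(): each operator contributes a fragment
    // of Java source, and the fragments nest into a single loop.
    abstract class CodegenOp {
        CodegenOp parent;                      // operator that consumes our rows
        abstract String produce();             // emit the code that drives the loop
        abstract String consume(String row);   // emit the code handling one row
    }

    final class ScanOp extends CodegenOp {     // leaf: owns the while-loop
        @Override String produce() {
            return "while (input.hasNext()) {\n"
                 + "  InternalRow row = input.next();\n"
                 + parent.consume("row")
                 + "}\n";
        }
        @Override String consume(String row) {
            throw new UnsupportedOperationException("leaf operators only produce");
        }
    }

    final class FilterOp extends CodegenOp {   // wraps the parent's code in a predicate
        @Override String produce() { return null; /* delegates to its child's loop */ }
        @Override String consume(String row) {
            return "  if (" + row + ".getLong(0) > 100) {\n"
                 + parent.consume(row)
                 + "  }\n";
        }
    }

    final class SinkOp extends CodegenOp {     // stage output, e.g. append to a buffer
        @Override String produce() { return null; }
        @Override String consume(String row) {
            return "    append(" + row + ");\n";
        }
    }

Wire scan.parent = filter and filter.parent = sink, and scan.produce() returns one fused loop with the predicate and the output call nested inside it, which is exactly the shape of the doProduce()/doConsume() arrows above.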



[0] https://issues.apache.org/jira/browse/SPARK-12795 [1] https://github.com/apache/spark/blob/0e70fd61b4bc92bd744fc44...


>> This style of processing, invented by columnar database systems such as MonetDB and C-Store

True, since MonetDB came out of the Netherlands around 1993, but KDB+ came out in 1998, well over 8 years before C-Store.

K4, the language used in KDB, is 230 times faster than Spark/Shark and uses 0.2GB of RAM vs. 50GB of RAM for Spark/Shark, yet there is no mention of it in the article. It seems a strange omission for such a sensational-sounding title [1]. I don't understand why big data startups don't try to remake the success of KDB instead of reinventing bits and pieces of the same tech, resulting in slower DB operations and more RAM usage.

[1] http://kparc.com/q4/readme.txt


Because Spark is far, far more than just storing data and doing basic queries. It is also an analytics platform (e.g. machine learning/modelling) and a framework for building complex applications on top of Hadoop.


KDB+/Q/K is all about analytics too. It is used for more than just financial time series data: power utility usage and resourcing, for one.

Hadoop is also about distributed computing, but if the process requires 250x as much RAM, you're going to have a very high TCO regardless of how cheap RAM or servers can be.

J is an opensource APL-derived language (though different in some important ways), but it is not as fast as KDB+/Q/K. J is truly an array-based language, whereas K is list-based, so K has more in common with Lisp in that singular respect.

Kerf is a new language being worked on by one of the creators of the Kona language, an opensource version of the K3 language [1].

It seems these types of articles ignore non-opensource solutions even if they are more efficient in many ways, including TCO. How many man-years need to be spent trying to duplicate an existing solution, only to still not end up with a better one?

[1] https://github.com/kevinlawler/kerf


I often feel like the one-eyed man trying to describe color and shape to a world of blind people when describing the power of these programming systems. I realize everyone wants free stuff, but folks seem to find it difficult to comprehend how much more powerful and performant some of the pay-for systems, in particular the array(ish) systems, are than stuff like Spark. In principle, peer review should make all open source systems superior. In practice, the people most capable of making something like Spark fast work for Oracle, Microsoft, or Kx on database and machine learning engines people pay money for. While much software development is just hooking components together, there is still such a thing as software engineering, and those skills are extremely rare. Even looking at humble stuff like file systems... I thought ext4 was pretty good until reading of the wonders of ZFS. Thanks for mentioning Kerf. I'll also mention Nial, which has had some exciting developments lately.


I completely agree with you. Compared to the bloatware of the Hadoop stack, K looks and feels like lightning. But when evaluated on its own, K is less magical. Impenetrable code notwithstanding, its design is fairly simple (a good thing, TM) and not that hard to beat. If one could delay code-gen even further, it would be possible to inline and compose even bigger chunks of operations into code, giving more speedup. A well-JIT'ed numpy or Judy should (famous last words) be up there with K.

Since it has not been mentioned yet, let me point out that Barry Jay's Fish should be a part of this conversation. Sadly it's quite hard to find. I don't know why those pages have disappeared.


You obviously have no idea what you're talking about. You can do all of this in Kx systems as well, and it runs a lot faster than Spark.

I beat a Databricks stack (hand-tuned by one of the authors of big-DF) running on a large cluster, using one Jd node, by factors of "a lot." And K is faster than J.


Interesting, I've never heard of kdb. Probably because the pricing is absurd, based on the little research I have done so far:

http://quant.stackexchange.com/questions/3156/is-there-any-t...

$25k for a 2-core setup is a little crazy. They should definitely add a free tier if they want more adoption and more people investigating their tech. Spark is open source and free.


Regardless of price, the tech has been around for quite a few years, and it beats the tech mentioned in this article and in most similar articles making similar claims.

As I replied below, I am just surprised that similar technology has not been created by the opensource community that is closer in performance to KDB+.

$25k may seem absurd to you, but if it delivers results 230x faster using 1/250th of the RAM, decision makers in business see the TCO (total cost of ownership) as being worth it. It all depends on your needs and how they are met.

Jd is the commercial database for the opensource language J; it is also fast, and I believe it costs less than KDB, so there are other faster options.

If free as in opensource is your criterion, and not an actual cost-benefit analysis, you only have those free choices; but in business time is money, so the 'free' option would be more costly.


I've been personally very impressed with Spark's RDD API for easily parallelizing tasks that are "embarrassingly parallel". However, I have found the DataFrames API to not always work as advertised, and thus I'm very skeptical of the benchmarks.

A prime example of this: I was using some very basic windowing functions, and due to the data shuffling (the data wasn't naturally partitioned) it seemed to be very buggy, and it was not clear why stuff was failing. I ended up rewriting the same section of code using Hive, and it had both better performance and didn't seem to have any odd failures. I realize this stuff will improve, but I'm still skeptical.
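For reference, the kind of 'basic windowing' I mean looks roughly like this in the DataFrame API (the schema and column names are made up for the example); every row for a given partition key has to be shuffled to a single executor before the window can be evaluated, which is where things got flaky for us:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;
    import static org.apache.spark.sql.functions.*;

    final class WindowingExample {
        // Hypothetical schema (sensorId, ts, value). lag() and row_number()
        // need all rows of a sensorId together, so an unpartitioned dataset
        // forces a full shuffle before the window is evaluated.
        static DataFrame addPrevValue(DataFrame df) {
            WindowSpec w = Window.partitionBy("sensorId").orderBy("ts");
            return df
                .withColumn("prevValue", lag(col("value"), 1).over(w))
                .withColumn("rowNum", row_number().over(w));
        }
    }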


I have been extensively using the dataframe/SQL API and I just love it. Most of the issues I have had stemmed from the cluster / Spark configuration and not the API itself. Using SQL is so much more intuitive than using multiple joins, selects, filters, etc. on an RDD.


So I did find it useful for doing additional exploratory aggregations once the data was already cleaned and denormalized. My comment was more directed at the upfront initial data processing (in our case, extracting time series data out of a large number of files).

I did hit issues w/ multiple joins and shuffling though. Have you not hit issues w/ shuffling?

I was using Spark 1.5.1 for the record.


Have you tried tuning Spark's memory parameters?


How difficult was the bug reporting process?

That is a good metric for a project.


Bug reporting is simple, as per any Apache project, i.e. JIRA.

But let's be clear here: Spark is not a 'real' open source project. It's a dictatorship run by Databricks (in the nicest possible way). There are 414 outstanding pull requests, some of which came from the likes of Intel, which added subquery support over a year ago. It never got merged.

The best metric for any open source project is how easily can you contribute and improve the product.


I totally agree that's the best metric. I totally disagree with your characterization of Spark.

I showed up on the mailing list, said "hey, here's some stuff that was useful for me, let me know if it's useful for anyone else." A committer said, "cool, send a PR". Couple of rounds of code review and it was merged. Later on, bigger features required a lot more discussion, but as long as I was willing to follow up, I got my changes in.

Spark has on the order of 1,000 contributors. I'd call that a "real" open source project.


Please keep the contributions coming!


It's interesting to see that as further work is done on Spark (and I'm pleased they're actually improving the system), it behaves more and more like a database.


Reading about databases always makes me sad about the enterprise database I work with.


You can read about the awesome theory behind relational algebra. It gave me a great appreciation for how awesome SQL databases actually are (and why they work the way they do).
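For a small taste: a query like SELECT name FROM Emp WHERE dept = 'db' is just the algebra expression below, and the optimizer's whole job is rewriting such expressions into cheaper equivalent ones (e.g. pushing a selection below a join).

    \pi_{\text{name}}\big(\sigma_{\text{dept} = \text{'db'}}(\text{Emp})\big)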


Databases are one of the most fun computer science topics. And I don't mean _databases_ from the perspective of an admin or even an application developer, but from the perspective of a developer of the internals. I love trying to read the code of complex systems like databases or language compilers.


Do you have some examples of interesting code? I'm curious now!



I find the postgres source code very interesting, e.g.

https://github.com/postgres/postgres/tree/master/src/backend...

SQLite is another good one to read and play around with.


CPython is an interesting read. The internal type system (PyObject) is pretty cool. It also uses type punning instead of a discriminated union.


Any good books on that?


Transaction Processing: Concepts and Techniques [0] by Jim Gray and Andreas Reuter. Not so much relational algebra, but everything you need to know to implement a classic relational engine.

[0] https://www.amazon.com/Transaction-Processing-Concepts-Techn...


The original paper[0] on relational databases by Edgar Codd is quite accessible.

I also loved the MOOC by Prof. Jennifer Widom[1]

[0] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

[1] http://online.stanford.edu/course/databases-self-paced


In addition to the great links from siblings, it might be interesting to read this Stonebraker paper on the cyclical nature of databases and how, ultimately, they all reduce to a few core paradigms. [0]

[0] http://pages.cs.wisc.edu/~anhai/courses/764-sp07-anhai/datam...


This Java project gives you a glimpse at how it all works. It was extracted out of a now-defunct DB vendor (I think?). https://calcite.apache.org/


Yeah, it came out of the Eigenbase project. (http://luciddb.sourceforge.net/ and http://www.eigenbase.org). DB internals are indeed quite fun. (I worked on LucidDB(https://github.com/LucidDB/luciddb) for a time.)


I really liked 'Database in Depth' by C.J. Date.


anything from C. J. Date


Granted, this is in-memory computing and not intended for persistent storage. There's nothing stopping you from using Spark's SQL drivers (which support ODBC) to load things into memory, unless your database does not support it.


They support JDBC out of the box.

ODBC works via a JDBC bridge, though those are often commercial and vary in quality.
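For example, pulling a table out of an external RDBMS into a DataFrame goes roughly like this (URL, table and driver are placeholders, and the matching JDBC driver jar has to be on the classpath):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Load one table from an external database over JDBC and cache it in
    // memory. Connection details here are placeholders for the example.
    final class JdbcLoad {
        static DataFrame load(SQLContext sqlContext) {
            Map<String, String> options = new HashMap<>();
            options.put("url", "jdbc:postgresql://dbhost:5432/warehouse");
            options.put("dbtable", "public.orders");
            options.put("driver", "org.postgresql.Driver");

            return sqlContext.read()
                .format("jdbc")
                .options(options)
                .load()
                .cache();
        }
    }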


I work with SAP HANA as an enterprise database, and it works quite well with Spark: code push-down, query compilation, all sorts of fun stuff.

Depends on the enterprise database, is my point.


Where do you delineate workloads between HANA and Spark?

(I'm just curious...used to work the internal SAP HANA dev team)


Spark is not a database ;-)


Would you please tell that to the architects I work with? We're reinventing relational databases over the top of Spark.


Post their linkedin profiles, I would be happy to drop in a word :)


But it is a SQL parser and execution engine.

You could easily turn it into a database by managing a set of flat files.
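A minimal sketch of that idea with the 1.x API (path, table and column names are made up): point an SQLContext at a directory of Parquet files, register it as a temp table, and you have a SQL engine over flat files.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // A "database" of one Parquet directory: register it as a temp table and
    // query it with plain SQL. The path and names are placeholders.
    public final class FlatFileSql {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("flat-file-sql").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            DataFrame events = sqlContext.read().parquet("/data/events");
            events.registerTempTable("events");

            DataFrame daily = sqlContext.sql(
                "SELECT event_date, COUNT(*) AS n FROM events GROUP BY event_date ORDER BY event_date");
            daily.show();

            sc.stop();
        }
    }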


Typesafe plans to add support for vectorization to the Scala compiler for Akka. Also, the JVM 8 JIT compiler does it to some extent already.


That would be great to have in Scala Native [1].

[1] https://github.com/scala-native/scala-native


Where'd you read this? I'd love to take a look.



This was a video with Martin Odersky. Not sure where.


I'll have to benchmark Spark 2.0 against Flink; it seems like it could be faster than Flink now. It does depend on whether the dataset they used was living in memory before they ran the benchmark, but these are still some pretty impressive numbers, and some of the optimizations they made with Tungsten sound similar to what Flink was doing.


Really depends on the use case. If you're trying to do streaming, Spark will introduce some latency because it's based on micro-batching. Flink and Storm are better for low-latency scenarios.


I agree that Flink is better for low latency, but I was also under the impression that Flink's DataSet API was faster for batch processing than Spark's. Now that they're claiming Spark 2.0 may be up to 10x faster than 1.6, I'll have to take another look.



