The new Structured Streaming API looks pretty interesting. I have the impression that many Apache projects are trying to address the problems that arise with the lambda architecture. When implementing such a system, you have to worry about maintaining two separate systems: one for low-latency stream processing, and the other for batch-style processing of large amounts of data.
Samza and Storm mostly focus on streaming, while Spark and MapReduce traditionally deal with batch. Spark leverages its core competency of dealing with batch data, and treats streams like mini-batches, effectively treating everything as batch.
And I imagine in the following snippet, the author is referring to Apache Flink, among other projects:
> One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.
My understanding is that Structured Streaming also treats everything like batch, but can recognize that the code is being applied to a stream and apply some optimizations for low-latency processing. Is this what's going on?
The longer answer is that this is about how to logically think about the semantics of computation using a declarative API, and the actual physical execution (e.g. incrementalization, record at a time processing, batching) is then handled by the optimizer.
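To make that concrete, here is a minimal sketch of the "same declarative query, different execution" idea. The paths, schema, and column names are made up for illustration; the point is only that the batch and streaming versions share the same logical query and the planner decides how to run each one:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// Batch version: read a static directory of JSON events and aggregate.
val staticEvents = spark.read.json("/data/events")           // hypothetical path
val staticCounts = staticEvents.groupBy($"eventType").count()

// Streaming version: the same logical query over an unbounded input.
// The optimizer decides how to incrementalize the aggregation.
val streamingEvents = spark.readStream
  .schema(staticEvents.schema)
  .json("/data/events-incoming")                             // hypothetical path
val streamingCounts = streamingEvents.groupBy($"eventType").count()

val query = streamingCounts.writeStream
  .outputMode("complete")   // keep the full running aggregate
  .format("console")
  .start()
```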
Given that all the big data talk lately has been about GPU computing/TensorFlow, I'm glad to see that this Spark update shows in-memory computing is still viable. (Much cheaper to play with too!)
The key feature for me is machine learning functions in R, which otherwise lacks parallelizable and scalable options. (Without resorting to black magic, anyway.)
Neural nets are fantastic, but not all problems either require or are best solved by neural nets. Sometimes you want to understand more deeply how certain variables contribute to the prediction. That's not as trivial to do with neural nets, but in a simple regression you can just look at the coefficients and have some idea of what matters most (assuming you've normalized everything properly).
This is just one advantage of the more traditional models that Spark includes out of the box. Just because neural nets are awesome doesn't mean there aren't big advantages to the older stuff. Also, often it's not the algorithm that matters, it's how you prepare the data and how much data you have in hand. After all is said and done, in many cases the fancy model has similar performance to the boring model.
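As a rough sketch of the "just look at the coefficients" point above, here is what a plain linear regression looks like with Spark's ML pipeline API. The input path and feature column names are hypothetical; the only claim is that you can normalize, fit, and read the per-feature weights directly:

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical training DataFrame with numeric columns and a "label" column.
val raw = spark.read.parquet("/data/training")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "visits"))  // hypothetical feature names
  .setOutputCol("rawFeatures")

val scaler = new StandardScaler()                  // normalize so coefficients are comparable
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val assembled = assembler.transform(raw)
val scaled = scaler.fit(assembled).transform(assembled)

val model = new LinearRegression().setLabelCol("label").fit(scaled)
println(model.coefficients)   // per-feature weights you can actually inspect
println(model.intercept)
```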
Very happy to see Spark developing yet more abilities to make ML on large datasets seamless and (relatively) painless.
Honest question: what does "in memory" mean exactly? Do traditional databases not process data using memory? Doesn't Spark spill to disk when it reaches the limit of system memory? The in-memory thing always puzzled me.
The difference is that every variable is sharded by default. So if you have 100 machines with 8 GB of memory each, you can keep an 800 GB array in memory. Some traditional databases can do this for one operation at a time, like querying a sharded index in Mongo or doing a range query in Cassandra. But usually you want to do a whole pipeline of these operations. In that case, your computation engine should own the memory for efficiency, and seamlessly redistribute based on access patterns. Thus, Spark and friends.
You can "cache" a dataset in-memory and continue to perform operations on the dataset without having to persist it to disk, unlike MapReduce. Traditional databases can do a lot of that too, but aren't as scalable.
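In code, that caching idea is just one call. A minimal sketch (assuming an existing SparkSession `spark` and a made-up input path):

```scala
val events = spark.read.parquet("/data/events")

events.cache()   // keep the dataset in cluster memory across the operations below

// Both of these reuse the cached partitions instead of re-reading from disk.
val byUser = events.groupBy("userId").count()
val errorsOnly = events.filter("level = 'ERROR'")

byUser.show()
errorsOnly.count()
```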
There are multiple (advanced) projects to get the next 10-100X of perf beyond this via in-memory & streaming GPU computing over columnar data. And... they're taking RDDs in as native formats, and some even target SQL as the scripting lang :)
FWIW, my interest here is what becomes possible for sub-second computations. We do interactive analytics, and GPUs become exciting for that, including in the distributed case :) Similar story for stuff like VR, real-time ML, etc.
It often makes sense to just use Spark for event processing to create a smaller set of features that are used as input to more robust ML libraries like scikit-learn.
I like the direction Spark is heading. I am happy to see that they look at Spark the same way compiler developers look at programming languages. There are huge optimizations to be made in this space. It's insane how inefficient our current systems are when it comes to big-data processing.
I think, judging from experiments published last year based on Spark, the inefficiency stems from the assumption that CPU cycles are abundant relative to RAM and bandwidth. So nobody focuses on optimizing the program itself as much as on reducing memory or bandwidth usage.
Seeing the talk today, I think the Spark team is very aware that CPU cycles haven't become more abundant, while I/O has. There was an entire slide to this effect. I don't know enough about the space to say more than that, but it seems it's definitely on their mind. One of the things I remember being mentioned is operating on data in a much more efficient binary representation than the native Java memory model, allowing better use of cache and such.
For me, the biggest improvement is the unified typed Dataset API [1]. The current Dataset API gave us a lot of flexibility and type-safety and the new API lets us use it as DataFrame API instead of converting to RDD and reinventing the wheel, like aggregators [2].
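For a feel of what that buys you, here is a minimal sketch of a typed Aggregator used directly on a Dataset, no RDD round-trip needed. The `Purchase` case class and input path are hypothetical:

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical record type for the typed Dataset.
case class Purchase(userId: String, amount: Double)

// A typed Aggregator: sum of purchase amounts, usable with the Dataset API.
object SumAmount extends Aggregator[Purchase, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, p: Purchase): Double = acc + p.amount
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder = Encoders.scalaDouble
  def outputEncoder = Encoders.scalaDouble
}

import spark.implicits._
val purchases = spark.read.json("/data/purchases").as[Purchase]  // hypothetical path
val total = purchases.select(SumAmount.toColumn).first()
```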
The micro-benchmarks are impressive. E.g., to join 1 billion records: Spark 1.6 ~ 61 sec; Spark 2.0 ~ 0.8 sec.
I assume results such as this are due to various optimizations under the Tungsten rubric (code generation, manual memory management), which rely on the sun.misc.Unsafe API.
Some of the performance gain was coming from the use of Unsafe in earlier versions of Spark (e.g. Spark 1.5). However, the massive gain you are seeing in Spark 2.0 is not coming from Unsafe. It is coming from an idea we call "whole-stage code generation", which eliminates virtual function calls and keeps intermediate data in CPU registers as much as possible (versus L1/L2/L3 cache or memory).
We will be writing a deep dive blog post about this in the next week or two to talk more about this idea.
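In the meantime, one rough way to see whether whole-stage code generation kicks in is to look at the physical plan for a query shaped roughly like the billion-record join mentioned above (this is not the exact benchmark code, just a sketch):

```scala
// Operators fused into a generated stage show up prefixed with '*' in explain().
val joined = spark.range(1000L * 1000 * 1000)
  .join(spark.range(1000L * 1000 * 1000), "id")
  .groupBy()
  .count()

joined.explain()   // look for the '*'-prefixed (WholeStageCodegen) operators
```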
Excellent, I'm glad that the "big data" world is starting to look at database literature in terms of how it does execution, as there is much to be learned.
Most of these systems are extremely inefficient (looking at you Hadoop), when they don't really have to be. Efficient code generation should be table stakes for any serious processing framework IMO.
Is there a video of a real live example of how Spark helped solve a specific problem? I've tried quite a few times to wrap my head around what Spark helps you solve.
In theory, Spark lets you seamlessly write parallel computations without sacrificing expressivity. You perform collections-oriented operations (e.g. flatMap, groupBy) and the computation gets magically distributed across a cluster (alongside all necessary data movement and failure recovery).
In practice, Spark seems to perform reasonably well on smaller in-memory datasets and on some larger benchmarks under the control of Databricks. My experience has been pretty rough for legitimately large datasets (can't fit in RAM across a cluster) -- mysterious failures abound (often related to serialization, fat in-memory representations, and the JVM heap).
The project has been slowly moving toward an improved architecture for working with larger datasets (see Tungsten and DataFrames), so hopefully this new release will actually deliver on the promise of Spark's simple API.
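To illustrate the collections-oriented style mentioned above (flatMap, groupBy and friends), here is a minimal word-count-shaped sketch; the input path is made up, and the shuffle and data movement happen behind the scenes:

```scala
// Assuming an existing SparkSession `spark`.
val lines = spark.sparkContext.textFile("hdfs:///data/docs/*.txt")

val counts = lines
  .flatMap(_.split("\\s+"))    // reads like ordinary Scala collections code
  .map(word => (word, 1L))
  .reduceByKey(_ + _)          // distributed across the cluster for you

counts.take(10).foreach(println)
```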
* distributed machine learning tasks using the built-in algorithms (although note that some of them, e.g. LDA, just fall over with not-even-that-big datasets)
* as a general fabric for doing parallel processing, like crunching terabytes of JSON logs into Parquet files, doing random transformations of the Common Crawl
As a developer, it's really convenient to spin up ~200 cores on AWS spot instances for ~$2/hr and get fast feedback as I iterate on an idea.
It originally billed itself as a replacement for Hadoop and MapReduce, as an in-memory data processing pipeline. It is typical in MR programs to create many sequential MR jobs and save the output between successive jobs to HDFS; Spark keeps that intermediate data in memory instead, so it can solve these use cases. Since its early days, it has continued to build on its capabilities.
So, real-world use cases? Any MR use case should be doable by Spark. There are plenty of companies using Spark to create analytics from streams, and some are using it for its ML capabilities (sentiment analysis, recommendation engines, linear models, etc.).
I apologize if my comment isn't as specific as you're looking for, but I know of people who use it for exactly the scenarios I've outlined above. We are probably going to use it as well, but I don't have a use case to share just yet (at least nothing concrete at the moment). Hopefully this gives you some idea of where Spark fits.
I think your question is oriented towards X being a business problem.
Netflix has users (say 100M) who have been liking some movies (say 100k). The question is: for every user, find movies they would like but have not yet seen.
The dataset in question is large, and you have to answer this question with data regarding every user-movie pair (that would be 1e13 pairs). A problem of this size needs to be distributed across a cluster.
Spark lets you express computations across this cluster, letting you explore the problem. Spark also provides a quite rich machine learning toolset [1], among which is ALS-WR [2], which was developed specifically for a competition organised by Netflix and got great results [3].
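As a rough sketch of what that looks like with Spark's ML API (the ratings path and column names are hypothetical):

```scala
import org.apache.spark.ml.recommendation.ALS

// Hypothetical ratings DataFrame with columns: userId (Int), movieId (Int), rating (Double).
val ratings = spark.read.parquet("/data/ratings")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setRank(10)
  .setRegParam(0.1)

val model = als.fit(ratings)
val predictions = model.transform(ratings)   // adds a "prediction" column per user-movie pair
predictions.show(5)
```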
We use Spark essentially as a distributed programming framework for data processing: anything you can do on a small dataset on a single server, you can do on a huge dataset with 20 servers or 2000 servers, with minimal extra development.
I had hundreds of gigabytes of JSON logs with many variations in the schema and a lot of noise that had to be cleaned. There were also some joins and filtering that had to be done between each datapoint and an external dataset.
The data does not fit in memory, so otherwise you would need to write special-purpose code to parse this data, clean it, and do the join, all without making your app crash.
Spark makes this straightforward (especially with its DataFrame API): you just point to the folder where your files are (or an AWS/HDFS/... URI) and write a couple of lines to define the chain of operations you want to do and save the result in a file or just display it. Spark will then run these operations in parallel by splitting the data, processing it and then joining it back (simplifying).
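Roughly the shape of that pipeline, as a sketch; the bucket, column names, and filter conditions here are made up:

```scala
import spark.implicits._

val logs = spark.read.json("s3a://my-bucket/logs/")            // schema inferred across variations
val lookup = spark.read.parquet("/data/reference-dataset")     // hypothetical external dataset

val cleaned = logs
  .filter($"userId".isNotNull && $"eventType" =!= "noise")     // drop malformed / noisy records
  .join(lookup, Seq("userId"))                                 // enrich with the external dataset
  .select("userId", "eventType", "timestamp", "country")

cleaned.write.parquet("/data/cleaned-logs")
```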
I don't know about videos, but I used Spark in my last job to solve problems of "we want to run this linear algebra calculation on x00000 user profiles and have it not take forever". For me the big selling point is that it lets you write code that can be read as ordinary Scala but which runs on a cluster. As much as anything else, it's practical to get the statistician to review the code and say "yes, that is implementing the calculation I asked you to implement" in a way that wouldn't be practical with more "manual" approaches to running calculations in parallel on a cluster.