> While Google has previously published papers describing some of its technologies, Google decided to take a different approach with Dataflow. Google open-sourced the SDK and model alongside commercialization of the idea and ahead of publishing papers on the topic.
A large number of ASF projects in the Big Data space are inspired by Google's publications. Good to see Google finally taking the lead and coming out with code.
Seems like this would duplicate a rather large chunk of Apache Crunch, which implements Google Flume nearly exactly as far as the public API is concerned. As far as I can tell, Google Dataflow is also a variation on top of Google Flume. It would be helpful if they could explain why this project would not be redundant under the Apache umbrella.
Apache explicitly allows multiple overlapping projects. See, for example, Apache Storm, Spark, and Flink.
Also worth noting is that the implementation matters at least as much as the API.
In this case there are substantial differences: Dataflow is a DSL, not a Java API, and it is designed for streaming data. It is unclear whether Crunch handles the streaming case, but it talks a lot about Map/Reduce, which makes me think that isn't the primary use case.
It's quite different from the original Flume paper, as Dataflow supports expressing streaming computations in addition to batch in a unified set of APIs. It's like a combination of Flume and MillWheel.
Yes, I like how the paper is written, and the work is solid IMO, having used most of the predecessors. The annoying thing is footnote 1, where they explain that they co-opted the term "dataflow model".
That's just plain dumb, since that term is used in so many other settings, especially big data settings (TensorFlow is also a dataflow system, but both the data types and the model are completely different, etc.). So I can see why someone would just dismiss it as yet another framework, sort of like all these JavaScript frameworks that are the hot thing one year and legacy code the next.
If they had used a more specific name, I think it would help the perception a lot. The main point is batch vs streaming, which is still a big pain point, and being precise about event time vs processing time.
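The event-time vs processing-time distinction can be shown with a toy example (plain Java, not the Dataflow SDK; all names here are made up for illustration). The same late-arriving record lands in different windows depending on which time domain you bucket by:

```java
// Toy illustration of the two time domains in stream processing:
// event time = when the event actually happened (carried in the record),
// processing time = when the system happens to observe it.
public class TimeDomains {
    static final long WINDOW_MS = 60_000; // fixed 1-minute windows

    // Window assignment by the timestamp embedded in the event itself.
    static long eventTimeWindow(long eventTimestampMs) {
        return eventTimestampMs / WINDOW_MS;
    }

    // Window assignment by arrival time at the processor.
    static long processingTimeWindow(long arrivalTimestampMs) {
        return arrivalTimestampMs / WINDOW_MS;
    }

    public static void main(String[] args) {
        // An event generated at t=30s but delivered late, at t=90s.
        long generatedAt = 30_000, arrivedAt = 90_000;
        System.out.println(eventTimeWindow(generatedAt));      // window 0
        System.out.println(processingTimeWindow(arrivedAt));   // window 1
    }
}
```

The whole point of being precise about the distinction is that, with late or out-of-order data, only event-time windowing gives you correct per-window results; processing-time windowing silently misattributes the record.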
The ASF proposal contains a few different components (SDK and runners), all of which have lived on GitHub for a while (the proposal has links if you're interested). If accepted as an incubating project, the code would still live on GitHub like other Apache Incubator projects.
tl;dr/ELI5
The goal is to grow an open community, have consistent (and open) processes, and be part of a larger OSS ecosystem. When projects live on GitHub, the code is open but project processes, etc. may not be.
Some of the best material I've read recently comes from the Confluent blog, especially Martin Kleppmann. The views are tilted toward Kafka and Samza since the founders are the same people, though they are both Apache projects. The article that blew my mind was "Turning the Database Inside Out": https://martin.kleppmann.com/2015/03/04/turning-the-database... . It doesn't encompass the full space, but the architectural implications when combined with CQRS/Event Sourcing models are huge.
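The core idea in that article — treat the append-only log as the source of truth and derive table state by replaying it — fits in a few lines. A minimal sketch (plain Java; the `Event`/`materialize` names are my own, not from the article):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// "Turning the database inside out" in miniature: the log is primary,
// and the "table" is just a materialized view folded from it.
public class EventLogView {
    // A hypothetical change event: set key to value.
    static class Event {
        final String key, value;
        Event(String key, String value) { this.key = key; this.value = value; }
    }

    // Rebuild the current table state by replaying the log in order.
    static Map<String, String> materialize(List<Event> log) {
        Map<String, String> view = new HashMap<>();
        for (Event e : log) view.put(e.key, e.value); // last write wins
        return view;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        log.add(new Event("user:1", "alice"));
        log.add(new Event("user:2", "bob"));
        log.add(new Event("user:1", "alice-renamed")); // later update
        System.out.println(materialize(log));
        // user:1 reflects the latest event; user:2 is unchanged
    }
}
```

Once you see state this way, adding a new view (a cache, a search index, a second denormalization) is just another consumer replaying the same log, which is where the CQRS/Event Sourcing implications come from.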
Samza's architecture and API embody a lot of the important ideas at a lower level than Storm; while it may not be the easiest to use in practice, the concepts and documentation translate.
There are literally hundreds of dataflow streaming implementations. It's quite sad that Google has named their system "Dataflow", which effectively search-bombs every other dataflow-like system out there. Many of these systems are dataflow in semantics only and more transaction-like under the hood — quite nice for simple checkpointing/fault tolerance, but not so great for performance. There's also some interesting work from Dongarra's lab on dataflow task sets (very OmpSs-like). The earliest system was by Jack Dennis. Either way, I hope my blog post helps answer your questions.
It's a unified batch+stream processing model, delivered as an SDK. The idea is you can code up your pipeline in Dataflow and have your choice of where to run it — Spark, Flink, etc.
Google Cloud Dataflow is a fully managed service that executes Dataflow pipelines and adds nice value on top, like fault tolerance and auto-optimization.
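The pipeline/runner split can be sketched roughly like this (plain Java; `MiniPipeline`, `Runner`, and `LOCAL` are hypothetical names for illustration, not the actual Dataflow SDK API — the real SDK's pipeline description is far richer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of the design: a pipeline is only a *description* of transforms;
// a runner decides how to execute that description.
public class MiniPipeline {
    final List<Function<List<String>, List<String>>> transforms = new ArrayList<>();

    MiniPipeline apply(Function<List<String>, List<String>> t) {
        transforms.add(t);
        return this; // chainable, like the SDK's apply() style
    }

    // Each execution engine (local, Spark, Flink, the managed service)
    // would supply its own Runner against the same pipeline description.
    interface Runner {
        List<String> run(MiniPipeline p, List<String> input);
    }

    // A trivial in-process runner: just fold the transforms over the input.
    static final Runner LOCAL = (p, input) -> {
        List<String> data = input;
        for (Function<List<String>, List<String>> t : p.transforms) {
            data = t.apply(data);
        }
        return data;
    };

    public static void main(String[] args) {
        MiniPipeline p = new MiniPipeline()
            .apply(d -> d.stream().map(String::toUpperCase).collect(Collectors.toList()))
            .apply(d -> d.stream().filter(s -> s.startsWith("A")).collect(Collectors.toList()));
        System.out.println(LOCAL.run(p, List.of("apple", "banana", "avocado")));
    }
}
```

The point of the separation is that the pipeline never names its engine; swapping Spark for Flink (or for the managed service) means swapping the runner, not rewriting the transforms.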
Some other posts on the announcement:
http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and...
http://blog.cloudera.com/blog/2016/01/spark-dataflow-joins-g...
http://data-artisans.com/dataflow-proposed-as-apache-incubat...
http://blog.cask.co/2016/01/cask-anticipates-googles-dataflo...