> While Google has previously published papers describing some of its technologies, Google decided to take a different approach with Dataflow. Google open-sourced the SDK and model alongside commercialization of the idea and ahead of publishing papers on the topic.
A large number of ASF projects in the Big Data space are inspired by Google's publications. Good to see Google finally taking the lead and coming out with code.
Seems like this would duplicate a rather large chunk of Apache Crunch, which implements Google Flume nearly exactly as far as the public API is concerned. As far as I can tell, Google Dataflow is also a variation on top of Google Flume. It would be helpful if they could explain why this project would not be redundant under the Apache umbrella.
Apache explicitly allows multiple overlapping projects. See, for example, Apache Storm, Spark, and Flink.
Also worth noting is that the implementation matters at least as much as the API.
In this case there are substantial differences: Dataflow is a DSL, not a Java API, and it is designed for streaming data. It is unclear whether Crunch handles the streaming case, but it talks a lot about Map/Reduce, which makes me think that isn't the primary use case.
It's quite different from the original Flume paper, as Dataflow supports expressing streaming computations in addition to batch in a unified set of APIs. It's like a combination of Flume and MillWheel.
Yes, I like how the paper is written, and the work is solid IMO, having used most of the predecessors. The annoying thing is footnote 1, where they explain that they co-opted the term "dataflow model".
That's just plain dumb, since that term is used in so many other settings, especially big data settings (TensorFlow is also a dataflow system, but both the data types and the model are completely different, etc.). So I can see why someone would just dismiss it as yet another framework, sort of like all these JavaScript frameworks that are the hot thing one year and legacy code the next.
If they had used a more specific name, I think it would help the perception a lot. The main point is batch vs streaming, which is still a big pain point, and being precise about event time vs processing time.
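The event-time vs processing-time distinction can be shown with a toy example (plain Java, not the Dataflow SDK; all names here are made up for illustration). The same late-arriving record lands in different windows depending on which time domain you bucket by:

```java
// Toy illustration of the two time domains in stream processing:
// event time = when the event actually happened (carried in the record),
// processing time = when the system happens to observe it.
public class TimeDomains {
    static final long WINDOW_MS = 60_000; // fixed 1-minute windows

    // Window assignment by the timestamp embedded in the event itself.
    static long eventTimeWindow(long eventTimestampMs) {
        return eventTimestampMs / WINDOW_MS;
    }

    // Window assignment by arrival time at the processor.
    static long processingTimeWindow(long arrivalTimestampMs) {
        return arrivalTimestampMs / WINDOW_MS;
    }

    public static void main(String[] args) {
        // An event generated at t=30s but delivered late, at t=90s.
        long generatedAt = 30_000, arrivedAt = 90_000;
        System.out.println(eventTimeWindow(generatedAt));      // window 0
        System.out.println(processingTimeWindow(arrivedAt));   // window 1
    }
}
```

The whole point of being precise about the distinction is that, with late or out-of-order data, only event-time windowing gives you correct per-window results; processing-time windowing silently misattributes the record.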
The ASF proposal contains a few different components (SDK and runners), all of which have lived on GitHub for a while (the proposal has links if you're interested). If accepted as an incubating project, the code would still live on GitHub like other Apache Incubator projects.
tl;dr/ELI5
The goal is to grow an open community, have consistent (and open) processes, and be part of a larger OSS ecosystem. When projects live on GitHub, the code is open but project processes, etc. may not be.
Some of the best material I've read recently comes from the Confluent blog, especially Martin Kleppmann. The views are tilted toward Kafka and Samza since the founders are the same people, though they are both Apache projects. The article that blew my mind was "Turning the Database Inside Out": https://martin.kleppmann.com/2015/03/04/turning-the-database... . It doesn't encompass the full space, but the architectural implications when combined with CQRS/Event Sourcing models are huge.
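The core idea in that article — treat the append-only log as the source of truth and derive table state by replaying it — fits in a few lines. A minimal sketch (plain Java; the `Event`/`materialize` names are my own, not from the article):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// "Turning the database inside out" in miniature: the log is primary,
// and the "table" is just a materialized view folded from it.
public class EventLogView {
    // A hypothetical change event: set key to value.
    static class Event {
        final String key, value;
        Event(String key, String value) { this.key = key; this.value = value; }
    }

    // Rebuild the current table state by replaying the log in order.
    static Map<String, String> materialize(List<Event> log) {
        Map<String, String> view = new HashMap<>();
        for (Event e : log) view.put(e.key, e.value); // last write wins
        return view;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        log.add(new Event("user:1", "alice"));
        log.add(new Event("user:2", "bob"));
        log.add(new Event("user:1", "alice-renamed")); // later update
        System.out.println(materialize(log));
        // user:1 reflects the latest event; user:2 is unchanged
    }
}
```

Once you see state this way, adding a new view (a cache, a search index, a second denormalization) is just another consumer replaying the same log, which is where the CQRS/Event Sourcing implications come from.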
Samza's architecture and API embody a lot of the important ideas at a lower level than Storm; while it may not be the easiest to use in practice, the concepts and documentation translate.
There are literally hundreds of dataflow streaming implementations. It's quite sad that Google has named their system "Dataflow", which effectively search-bombs every other dataflow-like system out there. Many of these systems are dataflow in semantics only and more transaction-like under the hood — quite nice for simple checkpointing/fault tolerance, but not so great for performance. There's also some interesting work from Dongarra's lab on dataflow task sets (very OmpSs-like). The earliest system was by Jack Dennis. Either way, I hope my blog post helps answer your questions.
It's a unified batch+stream processing model, delivered as an SDK. The idea is you can code up your pipeline in Dataflow and have your choice of where to run it — Spark, Flink, etc.
Google Cloud Dataflow is a fully managed service that executes Dataflow pipelines and adds nice value on top, like fault tolerance and auto-optimization.
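The pipeline/runner split can be sketched roughly like this (plain Java; `MiniPipeline`, `Runner`, and `LOCAL` are hypothetical names for illustration, not the actual Dataflow SDK API — the real SDK's pipeline description is far richer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of the design: a pipeline is only a *description* of transforms;
// a runner decides how to execute that description.
public class MiniPipeline {
    final List<Function<List<String>, List<String>>> transforms = new ArrayList<>();

    MiniPipeline apply(Function<List<String>, List<String>> t) {
        transforms.add(t);
        return this; // chainable, like the SDK's apply() style
    }

    // Each execution engine (local, Spark, Flink, the managed service)
    // would supply its own Runner against the same pipeline description.
    interface Runner {
        List<String> run(MiniPipeline p, List<String> input);
    }

    // A trivial in-process runner: just fold the transforms over the input.
    static final Runner LOCAL = (p, input) -> {
        List<String> data = input;
        for (Function<List<String>, List<String>> t : p.transforms) {
            data = t.apply(data);
        }
        return data;
    };

    public static void main(String[] args) {
        MiniPipeline p = new MiniPipeline()
            .apply(d -> d.stream().map(String::toUpperCase).collect(Collectors.toList()))
            .apply(d -> d.stream().filter(s -> s.startsWith("A")).collect(Collectors.toList()));
        System.out.println(LOCAL.run(p, List.of("apple", "banana", "avocado")));
    }
}
```

The point of the separation is that the pipeline never names its engine; swapping Spark for Flink (or for the managed service) means swapping the runner, not rewriting the transforms.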
Some other posts on the announcement:
http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and...
http://blog.cloudera.com/blog/2016/01/spark-dataflow-joins-g...
http://data-artisans.com/dataflow-proposed-as-apache-incubat...
http://blog.cask.co/2016/01/cask-anticipates-googles-dataflo...