Google proposes its Dataflow batch/stream tech to the Apache Incubator (apache.org)
191 points by crb on Jan 20, 2016 | 34 comments





> While Google has previously published papers describing some of its technologies, Google decided to take a different approach with Dataflow. Google open-sourced the SDK and model alongside commercialization of the idea and ahead of publishing papers on the topic.

A large number of ASF projects in the Big Data space are inspired by Google's publications. Good to see Google finally taking the lead and coming out with code.


Seems like this would duplicate a rather large chunk of Apache Crunch, which implements Google Flume nearly exactly as far as public API is concerned. As far as I can tell, Google Dataflow is also a variation on top of Google Flume. It would be helpful if they could elucidate why this project would not be redundant under the Apache umbrella.


Apache explicitly allows multiple overlapping projects. See, for example, Apache Storm/Spark/Flink.

Also worth noting is that the implementation matters at least as much as the API.

In this case there are substantial differences: Dataflow is a DSL, not a Java API, and it is designed for streaming data. It is unclear whether Crunch handles the streaming case, but it talks a lot about MapReduce, which makes me think streaming isn't the primary use case.


Apache Ant + Ivy vs Apache Maven vs Apache Buildr
Apache Tapestry vs Apache Wicket vs Apache MyFaces

those jump to mind, but I'm sure there are plenty of other overlaps in the Apache sphere. They certainly embrace it.


It's quite different from the original Flume paper, as Dataflow supports expressing streaming computations in addition to batch in a unified set of APIs. It's like a combination of Flume and MillWheel.


This sentence captures everything that I hate about the Apache projects and the Hadoop ecosystem in particular.


Yes, competition is bad and people should never try to create better ways of doing the same thing.


Note how the creator of Apache Crunch is on the Dataflow committer list.


What kind of problems do you see with redundancy?


Basically it looks like it's nearly the same API with minor variations.


I was skeptical too, until I read the paper. The motivation is actually explained really well. It's different from Flume.


Are you referring to the Dataflow paper at http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf ?


Yes, I like how the paper is written, and the work is solid IMO, having used most of the predecessors. The annoying thing is footnote 1, where they explain that they co-opted the term "dataflow model".

That's just plain dumb, since that term is used in so many other settings, especially big-data settings (TensorFlow is also a dataflow system, but both the data types and the model are completely different, etc.). So I can see why someone would dismiss it as yet another framework, sort of like all the JavaScript frameworks that are the hot thing one year and legacy code the next.

If they had used a more specific name, I think it would help the perception a lot. The main point is batch vs. streaming, which is still a big pain point, and being precise about event time vs. processing time.
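To make the event-time vs. processing-time distinction concrete, here is a minimal sketch in plain Python (not the Dataflow SDK; the function names and window size are my own, purely for illustration). Events are bucketed by when they *occurred*, so out-of-order arrival doesn't change the answer:

```python
from collections import defaultdict

WINDOW_SIZE = 60  # fixed one-minute windows, in seconds

def window_for(event_time):
    """Assign an event to a fixed window by event time (when it occurred),
    regardless of processing time (when it happens to arrive)."""
    return event_time - (event_time % WINDOW_SIZE)

def count_per_window(events):
    """events: iterable of (event_time, value) pairs, possibly out of order."""
    counts = defaultdict(int)
    for event_time, _value in events:
        counts[window_for(event_time)] += 1
    return dict(counts)

# Arrival order differs from event order, but each event still lands
# in the window where it actually occurred.
events = [(5, "a"), (130, "b"), (59, "c"), (61, "d")]
```

A processing-time system would instead bucket by arrival time, so the same late data would shift between windows depending on delivery delays.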


Can anyone ELI5 what it means for an open source project to become an Apache project? Why doesn't Google just push the code on Github?


Going to start backwards on this one. :)

The ASF proposal contains a few different components (SDK and runners), all of which have lived on GitHub for a while (the proposal has links if you're interested). If accepted as an incubating project, the code would still live on GitHub like other Apache Incubator projects.

tl;dr/ELI5 The goal is to grow an open community, have consistent (and open) processes, and be part of a larger OSS ecosystem. When projects live on GitHub, the code is open but project processes, etc. may not be.


It is about control.

Donating something to the ASF (Apache Software Foundation) means you lose control and ownership of the copyright.


They have pushed it to GitHub. I think they are giving it to Apache to be more vendor-neutral.


It's about who maintains it.


Does that mean that Google will stop maintaining the project?


There are several organizations on the proposal, including Google, all of whom will still be actively involved in the project if it's accepted.


A lot of Apache projects were submitted by vendors and are still maintained by them; going to the ASF just means anyone can get involved and contribute.


What are the best resources to learn about streaming, dataflow, etc.? Not necessarily the Google implementations, but the core concepts behind them.


Some of the best material I've read recently comes from the Confluent blog, esp. Martin Kleppmann. The views are tilted toward Kafka and Samza since the founders are the same people, though both are Apache projects. The article that blew my mind was "Turning the Database Inside Out": https://martin.kleppmann.com/2015/03/04/turning-the-database... . It doesn't encompass the full space, but the architectural implications when combined with CQRS/event-sourcing models are huge.

Samza's architecture and API embodies a lot of the important ideas at a lower level than Storm; while it may not be the easiest to use in practice, the concepts and documentation translate.
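The core idea behind "turning the database inside out" fits in a few lines: treat an append-only event log as the source of truth, and derive current state as a materialized view by replaying it. A toy Python sketch (the event shapes are hypothetical, not Samza's or Kafka's API):

```python
# The append-only log is the source of truth; the "table" is just a
# materialized view derived by replaying it from the beginning.
log = [
    ("set", "user:1", "alice"),
    ("set", "user:2", "bob"),
    ("set", "user:1", "alicia"),  # later event supersedes earlier state
    ("delete", "user:2", None),
]

def materialize(log):
    """Replay the log to rebuild current state. Any number of independent
    views (caches, indexes, aggregates) can be derived from the same log."""
    view = {}
    for op, key, value in log:
        if op == "set":
            view[key] = value
        elif op == "delete":
            view.pop(key, None)
    return view
```

The architectural payoff is that downstream consumers can each build their own view from the same log, and rebuild it from scratch after a schema change, which is roughly what the Kleppmann article argues for.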


Cool, thank you! Def need to pick up a copy of Kleppmann's book.


Wrote this a while back on streaming and dataflow: http://www.jonathanbeard.io/blog/2015/09/19/streaming-and-da...

There are literally hundreds of data-flow streaming implementations. Quite sad that Google has named their system "Dataflow", which search-bombs every other dataflow-like system out there. Many of these systems are data-flow in semantics only and more transaction-like under the hood: quite nice for simple checkpointing/fault tolerance, but not so great for performance. There's also some interesting work from Dongarra's lab on data-flow task sets (very OmpSs-like). The earliest system was by Jack Dennis. Either way, I hope my blog post helps answer your questions.


very cool, thanks much!


Can we hope to see a Google-like search engine open-sourced? I'm just waiting for that day to come.


An O'Reilly post also released today references the Apache Dataflow submission: https://www.oreilly.com/ideas/the-world-beyond-batch-streami...


The author is one of the DataFlow committers.


It would be awesome to have the code portable across various big data engines.


Where does Dataflow stand? Is it only a wrapper, trying to define a standard API for combining stream producers, datastores, and stream engines?


It's a unified batch+stream processing model, delivered as an SDK. The idea is that you code up your pipeline in Dataflow and have your choice of where to run it: Spark, Flink, etc.

Google Cloud Dataflow is a fully-managed service that executes Dataflow pipelines and has nice value adds on top like fault tolerance and auto-optimization.

(Disclosure: I work on BigQuery, not Dataflow)
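The "write the pipeline once, pick a runner later" idea can be sketched in a few lines of plain Python (this is not the actual Dataflow SDK; the class and method names here are made up for illustration):

```python
class Pipeline:
    """A pipeline is just a recorded chain of transforms; nothing executes
    until a runner is handed the pipeline. That separation is what lets the
    same pipeline target different engines (e.g. Spark, Flink, or a managed
    service)."""
    def __init__(self):
        self.transforms = []

    def apply(self, fn):
        self.transforms.append(fn)
        return self  # allow chaining, roughly like the SDK's fluent style

class DirectRunner:
    """Executes the pipeline eagerly in-process, like a simple batch engine.
    A streaming runner could instead pull from an unbounded source and apply
    the same recorded transforms element by element."""
    def run(self, pipeline, data):
        for fn in pipeline.transforms:
            data = [fn(x) for x in data]
        return data

# The pipeline definition never mentions the engine that will run it.
p = Pipeline().apply(str.upper).apply(lambda s: s + "!")
```

Swapping `DirectRunner` for a hypothetical `SparkRunner` or `FlinkRunner` would leave the pipeline definition untouched, which is the portability claim being made above.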



