Show HN: Interactive map for architecting big data pipelines (insightdataengineering.com)
144 points by ddrum001 on June 26, 2017 | 29 comments



This is very useful.

I wish it had some information about supported languages. Most of the processing systems are JVM-based and require that you write your program in a JVM language. Some have Python support. But I have yet to encounter one that allows you to write your pipelines in Go, Rust, or JavaScript, for example. One notable exception is Storm, which supports pluggable runners, including one that talks to an external program over standard I/O. My impression is that aside from Python, today's pipelines require a large amount of JVM buy-in, something I'm personally not interested in.
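For the curious: Storm's external-program runner speaks a simple JSON-over-stdio framing ("multilang"), where each message is a JSON object followed by a line containing only `end`. Here's a minimal, self-contained sketch of that framing in Python (the function names are my own, not Storm's):

```python
import json

# Storm's multilang protocol frames each message as one JSON object
# followed by a line containing only the sentinel "end".

def write_message(stream, msg):
    """Serialize one multilang-style message to a text stream."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()

def read_message(stream):
    """Read lines up to the 'end' sentinel and decode the JSON payload."""
    lines = []
    for line in stream:
        if line.strip() == "end":
            break
        lines.append(line)
    return json.loads("".join(lines))
```

A real component (like a streamparse bolt) loops on `read_message` over stdin and answers with `write_message` on stdout, which is what makes any language with JSON and stdio a viable runner.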

I'd also love some kind of metric for "aliveness". For example, my impression is that Storm was hot for about a week, and then Spark and Flink happened, and now nobody is talking about it, and Twitter itself has apparently replaced it with Heron.


Storm is very much alive. Many of its users are simply running it reliably in production now. At my company, we are well past our trillionth production tuple running through Storm.

Also note that unlike Spark, Storm is a pure open source project that does not have a major commercial entity marketing its use cases. Hortonworks has put a little marketing effort behind it, but otherwise, it's just a mature & active Apache infrastructure project. Storm 2.0 is coming out soon and features a slew of performance- and reliability-improving enhancements.

But as for marketing buzz, Google has commercial reasons for you to use Beam and Dataflow, for example. And likewise Databricks for Spark.

It's probably a good idea to pick production large-scale data infrastructure on a metric other than recency of marketing buzz.

-$0.02 from one of the original authors of streamparse, the Python API for Storm


Thanks, that's helpful. Is building a pipeline with Java, consisting entirely of shell spouts, a viable option? Are there downsides to not using the Java API?


I agree, though while the Storm IRC channel is a bit of a ghost town, the Google group is fairly active.


If you're looking for something that doesn't constrain you to a particular language take a look at Pachyderm. It's built around containers so you can run any code you want. I designed it with JVM-phobes like you (and me) in mind.

https://github.com/pachyderm/pachyderm


Cool, thanks! I haven't looked closely at it, but the "version controlled" part is something I don't need/want. Does it get in the way at all? I'm mostly looking for incremental, semi-real-time streaming processing, not something where you shoot off a big job on a dataset and get back results.


The version control semantics of the system are pretty crucial for some of the features you describe wanting. Pachyderm supports incremental operations on stream-like datasets. But what's going on under the hood is that the dataset is being version controlled and thus the system can tell which data has changed and only process the new data. Hope that helps, I'd be happy to chat more about your specific use case. Shoot me an email at jdoliner@pachyderm.io
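To illustrate the idea (not Pachyderm's actual implementation): if inputs are tracked by content hash, a run can skip anything whose hash it has already seen and process only the new or modified data. A toy sketch, with all names being my own:

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def incremental_process(inputs, seen_hashes, process):
    """Apply `process` only to inputs whose content hash is new.

    `inputs` maps a name to its bytes; `seen_hashes` maps names to the
    hash last processed. Returns results for changed inputs plus the
    updated hash map (the 'commit' recorded by this run).
    """
    results = {}
    new_hashes = dict(seen_hashes)
    for name, data in inputs.items():
        h = content_hash(data)
        if seen_hashes.get(name) != h:  # new or modified datum only
            results[name] = process(data)
            new_hashes[name] = h
    return results, new_hashes
```

The version-control layer is what supplies `seen_hashes` for free: each commit knows exactly which files changed, so incremental streaming falls out of the storage model.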


You can also use Spark's pipe to call external programs.
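For context, `RDD.pipe` feeds each partition's elements as newline-delimited records to an external command's stdin and turns its stdout lines into the resulting RDD. A pure-Python simulation of that contract for a single partition (not PySpark itself; the helper names are mine, and I use a Python one-liner as a portable stand-in for a grep/awk/Go binary):

```python
import subprocess
import sys

def pipe_partition(lines, command):
    """Mimic RDD.pipe for one partition: feed newline-delimited records
    to an external command and collect its stdout lines as the output."""
    proc = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# A stand-in 'external program' that uppercases its stdin:
upcase = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read().upper())"]
```

The cost relative to a native JVM pipeline is the line-based serialization and one subprocess per partition, which is worth keeping in mind for the ergonomics question below.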


Ergonomically speaking, how practical is that? Do you get all the benefits and performance of an equivalent Java pipeline?


Do you see any reason Golang would be less suited for those tasks?


Wow, this is awesome. What a simple yet useful idea.

This format lends itself to data processing, but I think it would be really nice to apply it to a variety of workflows. For example, you could model the software deployment process across different languages and frameworks. It could be a good complement to StackShare.

A bit of constructive feedback: I'm not a stickler for UX or design, but maybe spruce up the gray boxes a bit. I've never been a designer though, so take that for what you will.


If you're aiming to be comprehensive, then you may want to add Onyx under streaming processors. It's not as popular as the options you've listed though, so I understand why it might be left off.

http://www.onyxplatform.org


Thanks, we've included it under Unified Batch Processing because its support for large window sizes enables batch-style processing.


This is awesome. Great aggregation of so many buzzwords and brand names that I've heard over the years. Nice job!

Keep it simple and hierarchical. I suggest additional filters for each component of the data engineering flow that can discern unique features or commonalities.


Thanks, we tried to keep it simple and streamlined. If you know of any features or common patterns that would be helpful, let us know.


Interesting that Microsoft's only showing in this map is for Azure Blob Storage.


Batch processing: Azure Data Lake (https://azure.microsoft.com/en-us/solutions/data-lake/)

Stream processing: Azure Stream Analytics (https://azure.microsoft.com/en-us/services/stream-analytics/)

SQL Server is mentioned, but Azure Cosmos DB should be as well (https://azure.microsoft.com/en-us/services/cosmos-db/)


Fair point - the set of technologies is based on the teams we work closest with, which admittedly have a bias towards open source and Linux. So far, our map is far from comprehensive, so we appreciate the suggestions (exactly what we were looking for by posting a Show HN).

To that point, we just added Cosmos DB, and plan to add others soon.


Email me if you want to talk architectural patterns or Microsoft products or both.

Details are in my profile.


SQL DW / APS

Analysis Services

Event Hub

IoT Hub

And hosting for a lot of the open source items in the original post.

It seems more a survey of tools the author knows or likes.


StreamSets Data Collector is another useful open-source ingest tool. I'm biased, but people seem to like it.


Strange that Apache Flink and Google Dataflow don't figure in the Stream Processing list.


We put Dataflow and Flink in "Unified Processing" since they can handle both batch and streaming (as opposed to tools that only handle streaming).

We might add them explicitly to Streaming as well though.


Storm also has a Scala API, but it is filtered out when selecting Stream Processing and Scala.


Why is Flink not listed under the Stream Processing frameworks?


Flink is listed under `Unified Processing` as it supports both batch and streaming (Kappa Architecture)
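The Kappa idea being referenced: batch is just a stream whose window covers the whole (bounded) dataset. A toy tumbling-window sketch makes the degenerate case concrete (my own illustration, not Flink code):

```python
from collections import defaultdict

def tumbling_window_sum(events, window_size):
    """Group (timestamp, value) events into fixed-size windows and sum
    each window. With window_size >= the full time range, everything
    lands in one window: the Kappa view of batch as a special case of
    streaming."""
    windows = defaultdict(int)
    for ts, value in events:
        windows[ts // window_size] += value
    return dict(windows)
```

Shrink the window and you get streaming aggregates; grow it past the dataset's time span and the same code is a batch job.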


What do you think of Arctic for the Data point?


Kind of cool, but there are only two entries from Azure that don't appear elsewhere.

Kind of useless for us on Azure.


Citus DB is missing.



