I work for a company that could greatly benefit from an out-of-the-box distribut...

pyrophane · on July 13, 2021

I last looked into this a couple of years ago, so this might be slightly out of date.

I think most popular options for high-volume self-hosted distributed stream processing solutions are still Spark, Flink, and Kafka Streams.

Kafka streams is simpler, as it is basically just a framework on top of Kafka itself, so if you already use Kafka for streaming data and don't have complex needs, it might be a good option.

Spark and Flink are similar. Both support both batch processing (on top of Hadoop, for example) and stream processing. Spark has better tooling, but Flink has more sophisticated support for streaming window functions. Spark also uses "micro-batches" instead of being truly real-time, so there will be a bit more latency when doing streaming with Spark, if that matters.

--

Another interesting project is Beam, which provides a unified way of writing jobs that can then be run on different engines that support it (both Flink and Spark do, as well as Dataflow on Google).

Apache hosts a lot of projects in this category. Most (like Storm) I would probably not pick up for a greenfield project today. Also, these things come with some significant operational overhead, so make sure you really need them. Stream processing at scale is hard. The compelling use case for these things is when you need to do window aggregations on a lot of streaming data and get results in real-time.

emmelaich · on July 14, 2021

Databricks if you don't want to do it yourself. (which is Apache Spark)

I have no relationship with Databricks.

random314 · on July 13, 2021

Look at Flink or Heron. Other choices like storm aren't any good.

elric · on July 14, 2021

While I appreciate your taking the time to make suggestions, these suggestions aren't very useful without context. I'm not saying that it's up to you to provide the context. But I don't know much about Flink or Heron, and looking at their respective websites doesn't tell me whether they'd be a good fit for a specific use case.

At this point, all of these frameworks would probably benefit from a flowchart (or questionnaire tool) that can guide someone towards an informed decision. "Do you need redundancy?" - "Can you afford to lose some messages in situation XYZ?" - "How many events/sec do you want to process?" - "How much hardware can you throw at the problem?" etc.

random314 · on July 15, 2021

The situation is actually more complex than you are suggesting. A check box comparison is not useful. I have worked in considerable depth in the streaming space and my comments are based off the design docs of both systems. You should read Twitter's heron paper and Apache Flink design docs.

For eg, storm or samza might check all the boxes, but the design of the system is poor enough that the performance will suck. For older versions of Storm, you should be able to write a multithreaded app on a single machine that outperforms a storm cluster.

himoacs · on July 13, 2021

I recommend checking out Solace. They have been in business for 20 years. It's not open source though but packs all the enterprise features you would need.

haik90 · on July 13, 2021

Had a lot trouble running solace on vm and docker, spend a lot time try to find root cause memory leak (happen every few months)

I still don't get why they use VPN terms for event broker

himoacs · on July 13, 2021

That's a shame. I work at Solace so shoot me a message and I will show you how to set it up if you are interested.

Solace's first product was hardware appliances which are still used for high throughput and low latency usecases. Concept of VPN was used to set up isolated virtual brokers so different teams can have their own environments on a shared hardware appliance.

The concept was ported over to software as well and is extremely useful in an enterprise environment. It allows different teams to have their own virtual brokers but not have to pay for or manage multiple brokers.

Nullabillity · on July 13, 2021

> I work at Solace

That recontextualizes your previous post quite a bit...

jsight · on July 13, 2021

https://xkcd.com/927/