Apache Heron: A realtime, distributed, fault-tolerant stream processing engine

bob1029 · on July 13, 2021

If you are interested in hyper-scale event processing but you want to learn it from first principles, I strongly recommend following Martin Thompson's talks. Here is an example of one on cluster consensus:

> https://www.youtube.com/watch?v=GFfLCGW_5-w

and another on event log architecture:

> https://www.youtube.com/watch?v=RlwO6CJbJjQ

After digging through all of this material and playing around with LMAX Disruptor & Raft, I have been able to develop a really good understanding of how to build these sorts of systems on my own. Fun constraints like "Only one thread actually mutates anything, and it's the same one over and over" make for incredibly elegant implementation opportunities. Not having to constantly hunt down exotic thread-safe data structures means that you can focus on building actual value.
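A minimal sketch of that single-writer constraint, for anyone who wants to see it in code. It uses a plain `queue.Queue` from the Python standard library as the hand-off, not a Disruptor ring buffer, and `SingleWriterLedger`, `submit`, and the account name are hypothetical names made up for illustration:

```python
import queue
import threading


class SingleWriterLedger:
    """Producers only enqueue events; one dedicated thread applies them."""

    def __init__(self):
        self._events = queue.Queue()
        self._balances = {}  # touched only by the writer thread, so no locks
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def submit(self, account, delta):
        # Safe to call from any thread: it never touches _balances directly.
        self._events.put((account, delta))

    def _run(self):
        while True:
            event = self._events.get()
            if event is None:  # sentinel: stop the writer
                break
            account, delta = event
            self._balances[account] = self._balances.get(account, 0) + delta

    def close(self):
        # Enqueue the sentinel, wait for the writer to drain, return a snapshot.
        self._events.put(None)
        self._writer.join()
        return dict(self._balances)


def demo(producers=4, events_each=1000):
    ledger = SingleWriterLedger()

    def produce():
        for _ in range(events_each):
            ledger.submit("acct-1", 1)

    threads = [threading.Thread(target=produce) for _ in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ledger.close()
```

The state dictionary is an ordinary `dict` because only the writer thread ever reads or mutates it while running; the queue is the only shared structure.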

Latency is the biggest devil you will dance with in this arena, so almost everything you do will be oriented around mitigating that effect. That means latency both on the network and inside the CPU/memory/storage; it applies at every level.

runT1ME · on July 13, 2021

>Only one thread actually mutates anything, and it's the same one over and over

Sounds great in theory and usually results in a fantastic average case, but you get a hot partition and suddenly you can't work share and things go south. It's not a perfect solution.

bob1029 · on July 13, 2021

If you have a single serialized+synchronous business context in which activities must be processed, you literally have no other option than to use a single physical core if you want to go fast.

The trick is to minimize the raw amount of bytes that must enter into that synchronous context. Maybe the account disclosure PDFs can be referred to by some GUID token in AWS S3, but the actual decimal account balance/transaction facts should be included in the event data stream.

One other option is to talk to the business and see if you can break their 1 gigantic synchronous context into multiple smaller ones that can progress independently.

random314 · on July 14, 2021

There is no working around Amdahl's law unless you redesign your app.
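For reference, Amdahl's law says that if a fraction s of the work is strictly serial, the best speedup on n workers is 1 / (s + (1 - s) / n). A small sketch (the function name `amdahl_speedup` is just an illustrative choice):

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)
```

With a fully serial context (`serial_fraction=1.0`) the result is 1.0 no matter how many workers you add, and for any nonzero serial fraction the speedup can never exceed `1 / serial_fraction`.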

doteka · on July 13, 2021

How many distributed stream processing engines is the Apache foundation planning to collect? At this point it seems like there are more projects that do this (if you squint a bit) than companies with a serious use case for that type of architecture.

elric · on July 13, 2021

I work for a company that could greatly benefit from an out-of-the-box distributed stream processing engine (we've been rolling our own for over a decade). At this point, it's pretty much impossible to pick one. All similar Apache tools have similar looking web pages, promising similar benefits, similar use cases, etc. What are the differentiating factors? At which point does it make sense to pick Heron over one of the others? Or vice versa?

pyrophane · on July 13, 2021

I last looked into this a couple of years ago, so this might be slightly out of date.

I think the most popular options for high-volume self-hosted distributed stream processing solutions are still Spark, Flink, and Kafka Streams.

Kafka Streams is simpler, as it is basically just a framework on top of Kafka itself, so if you already use Kafka for streaming data and don't have complex needs, it might be a good option.

Spark and Flink are similar. Both support both batch processing (on top of Hadoop, for example) and stream processing. Spark has better tooling, but Flink has more sophisticated support for streaming window functions. Spark also uses "micro-batches" instead of being truly real-time, so there will be a bit more latency when doing streaming with Spark, if that matters.

--

Another interesting project is Beam, which provides a unified way of writing jobs that can then be run on different engines that support it (both Flink and Spark do, as well as Dataflow on Google).

Apache hosts a lot of projects in this category. Most (like Storm) I would probably not pick up for a greenfield project today. Also, these things come with some significant operational overhead, so make sure you really need them. Stream processing at scale is hard. The compelling use case for these things is when you need to do window aggregations on a lot of streaming data and get results in real-time.
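To make "window aggregations" concrete, here is a minimal in-memory sketch of a tumbling (fixed, non-overlapping) window count. It is plain Python, not the API of Spark, Flink, or any other engine; the event shape `(timestamp_seconds, key)` and the function name are assumptions for illustration, and it ignores the hard parts (late data, distribution, state recovery):

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count keys per fixed window.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns {window_start: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}
```

For example, with 10-second windows, events at t=0, 3, and 9 land in the window starting at 0, while an event at t=10 starts a new one.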

emmelaich · on July 14, 2021

Databricks (which is Apache Spark) if you don't want to do it yourself.

I have no relationship with Databricks.

random314 · on July 13, 2021

Look at Flink or Heron. Other choices like Storm aren't any good.

elric · on July 14, 2021

While I appreciate your taking the time to make suggestions, these suggestions aren't very useful without context. I'm not saying that it's up to you to provide the context. But I don't know much about Flink or Heron, and looking at their respective websites doesn't tell me whether they'd be a good fit for a specific use case.

At this point, all of these frameworks would probably benefit from a flowchart (or questionnaire tool) that can guide someone towards an informed decision. "Do you need redundancy?" - "Can you afford to lose some messages in situation XYZ?" - "How many events/sec do you want to process?" - "How much hardware can you throw at the problem?" etc.

random314 · on July 15, 2021

The situation is actually more complex than you are suggesting. A check box comparison is not useful. I have worked in considerable depth in the streaming space and my comments are based off the design docs of both systems. You should read Twitter's Heron paper and the Apache Flink design docs.

For example, Storm or Samza might check all the boxes, but the design of the system is poor enough that the performance will suck. For older versions of Storm, you should be able to write a multithreaded app on a single machine that outperforms a Storm cluster.

himoacs · on July 13, 2021

I recommend checking out Solace. They have been in business for 20 years. It's not open source, but it packs all the enterprise features you would need.

haik90 · on July 13, 2021

Had a lot of trouble running Solace on VMs and Docker, and spent a lot of time trying to find the root cause of a memory leak (it happened every few months).

I still don't get why they use VPN terms for an event broker.

himoacs · on July 13, 2021

That's a shame. I work at Solace so shoot me a message and I will show you how to set it up if you are interested.

Solace's first product was hardware appliances, which are still used for high-throughput and low-latency use cases. The concept of a VPN was used to set up isolated virtual brokers so different teams can have their own environments on a shared hardware appliance.

The concept was ported over to software as well and is extremely useful in an enterprise environment. It allows different teams to have their own virtual brokers but not have to pay for or manage multiple brokers.

Nullabillity · on July 13, 2021

> I work at Solace

That recontextualizes your previous post quite a bit...

jsight · on July 13, 2021

https://xkcd.com/927/

BenoitP · on July 13, 2021

The long term trend for data access is having a reactive component; stream processing engines allow you to have that.

On the other hand databases have a huge lock-in power (been trying to strangle an Oracle myself for over 5 years now). It is lucrative to be in the database business.

I'd say that every project with a claim of some improvement can, and should try to establish itself in the market; and that having it join the Apache foundation is a great way to get some brand recognition on the cheap.

----

Also, Heron is not that new. It was developed at Twitter to replace Storm, IIRC.

skywhopper · on July 13, 2021

But none of them spell out for what use cases they excel over the other options. Since they are all under the ASF banner, it’s impossible to know which is “better” for me. They are all just “great stream processing engines”. But surely they must have diverging properties for given use cases. But none of the pages even attempt to say how they differ. Just “try it out!!”

mumblemumble · on July 13, 2021

Years ago, I found a scientific paper that evaluated them all on a fairly detailed rubric. As I recall, it turned out that they're all, in actual fact, crap stream processing engines. You just need to pick the one that's crap at things you don't need to do.

(I could frame it in a more glass half full way, but I find that the pessimistic way of looking at it helps a lot with trimming down options when you have far, far, far too many options.)

brandmeyer · on July 13, 2021

Citation needed. And not in the snarky way - I'd really like to read that paper :)

fizwhiz · on July 13, 2021

I'm guessing they were referring to "Scalability! But at what COST?"

https://www.usenix.org/system/files/conference/hotos15/hotos...

mumblemumble · on July 13, 2021

I've long since lost track of it. It would be 5 or 6 years out of date at this point, anyway, so probably not very useful for informing decisions anymore.

monocasa · on July 13, 2021

I've just accepted that mindset as a part of the art of engineering. Shifting the inherent crappiness of a domain around to where it doesn't matter as much for your use case.

phamilton4 · on July 14, 2021

I started using Apache Storm in 2012 and was really surprised at the number of Apache projects for stream processing back in 2015 or so when I did some research for moving off of Storm. I remember Flink, Samza, Spark, Flume and NiFi(?) all being probably suitable options. I remember thinking exactly what you said!

I guess there are enough users to keep the communities alive?

_6pvr · on July 14, 2021

*Enough companies reinventing the wheel to spin off new Apache distributed stream processing libraries.

ww520 · on July 13, 2021

It really comes down to the latency and throughput requirements. Projects are architected differently for different latency and throughput expectations.

In event processing there's a continuum of expected latencies from batch processing to realtime. Batch processing typically means running reports over large volumes of events (good for throughput). Hadoop is a good example. On the other end, sub-second realtime reporting is possible with Heron/Storm. Spark is kind of in the middle with hybrid mini-batching. Reportedly Twitter has used Heron/Storm to track word counts across all tweets to find trending topics, where the latency between a new tweet coming in and the word counts being updated over the whole network is in the 100s of milliseconds.
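As a rough single-process illustration of the trending-words idea (this is not how Twitter's pipeline is actually built, and `SlidingWordCount` is a hypothetical name), a sliding-window word counter can keep counts current by adding words as messages arrive and subtracting them once they age out. The sketch assumes timestamps arrive in non-decreasing order:

```python
from collections import Counter, deque


class SlidingWordCount:
    """Word counts over the last `window_seconds` of messages."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, words) in arrival order
        self._counts = Counter()

    def add(self, timestamp, text):
        words = text.lower().split()
        self._events.append((timestamp, words))
        self._counts.update(words)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop messages that have fallen out of the window and undo their counts.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, old_words = self._events.popleft()
            self._counts.subtract(old_words)
        self._counts = +self._counts  # unary plus discards zero counts

    def top(self, n):
        return self._counts.most_common(n)
```

Each update touches only the new message and the expired ones, which is what keeps the latency low compared with recomputing counts over the whole window.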

aasasd · on July 13, 2021

Apparently it's another one from Twitter.

EdwardDiego · on July 13, 2021

At least they had the decency to not open source their Bookkeeper backed pub/sub system when they gave up on it and switched to Kafka...

...then some of the employees involved moved to Yahoo, built it again, and then open sourced it as Pulsar. Then moved on to form a Confluent v2 to sell their Kafka v2 (now with even more Zookeeper!)

mumblemumble · on July 13, 2021

I had a problem, so I invented Summingbird, and now I can relatively inexpensively invent new problems on an annual basis.

jarym · on July 13, 2021

I wonder this too - I guess enough software engineering teams had similar problems with no solutions and each set about building their own? Not sure how else this proliferation happened.

ubertaco · on July 13, 2021

Oh, is this that "next-gen Storm" project that Twitter built? Seems like they've finally given it to Apache, like they did with Storm after buying the company that built it.

Based on a quick Wikipedia skim, looks like the answer is "yes". That explains this bullet point:

> Heron is API compatible with Apache Storm and hence no code change is required for migration.

squarecog · on July 13, 2021

Donated to Apache in 2018. https://blog.twitter.com/engineering/en_us/topics/open-sourc... (and open-sourced in 2016, having started development in 2014).

aynyc · on July 13, 2021

Sometimes, I feel like Apache foundation is like Thanos, collecting all the distributed engines and watching the IT world burn.

From the beginning, Heron was envisioned as a new kind of stream processing system, built to meet the most demanding of technological requirements, to handle even the most massive of workloads, and to meet the needs of organizations of all sizes and degrees of complexity. Amongst these requirements:

    The ability to process billions of events per minute
    Extremely low end-to-end latency
    Predictable behavior regardless of scale and in the face of issues like extreme traffic spikes and pipeline congestion
    Simple administration, including:
        The ability to deploy on shared infrastructure
        Powerful monitoring capabilities
        Fine-grained configurability
    Easy debuggability

I can't wait for my next startup interview. We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?

manishsharan · on July 13, 2021

Not just startups. I interviewed with a large financial institution recently and they had their heart set on Kafka. From what I learnt, they did not need Kafka -- the message order was not important, there was only one producer and one consumer and the volume of data was not extra-ordinary -- but they wanted to be on Kafka as that was the enterprise direction. I might have come across as an old fart when I asked why they could not use their existing MQ infrastructure.

mrweasel · on July 13, 2021

Not that I'm not impressed by Kafka and its stability, performance and scalability, but I see the same behaviour from our customers.

They specifically want Kafka, there's no real reason other than they need a queue, which Kafka actively states that it's not. At that point it gets really tricky to reason with the developers about why they might be better served by something else. Generally speaking it's not much of an issue, because Kafka will deal with workloads just fine, it's just weird. I have seen one customer use Kafka as a database, that works less well.

We do see the same with Kubernetes. The developers pick Kubernetes and at that point it's too late. They specifically want Kubernetes even if you could more easily solve the problem with Nomad, Docker-Compose, plain old VMs or EC2, depending on the problem.

victor106 · on July 13, 2021

This is the exact same problem we are facing as well. Our clients just want to use it without knowing or thinking what problem they are solving.

Confluent itself says that for a workload of 38Mbps/sec RabbitMQ has 5 times less latency than Kafka and is 100x simpler to manage.

https://www.confluent.io/blog/kafka-fastest-messaging-system...

How do you teach someone to look at the problems first and then pick the tools?

cle · on July 13, 2021

> How do you teach someone to look at the problems first and then pick the tools?

By understanding what the person cares about. Everyone knows "pick the right tool for the problem". Not everyone uses such a simple calculus because life isn't that simple. People have their own agendas, backgrounds, experiences, career growth desires, personal lives, etc., that are all part of their personal objective function. If you want to convince someone that your tools are better, show that your tools have a higher payoff for their personal objective function. This is way more than a mere product question. In a team setting it's even harder, because you have to balance it across multiple people simultaneously.

eternalban · on July 13, 2021

30 not 38. Also, 30K messages/s not 30Mbps.

"Due to CPU bottlenecks, we were not able to drive a throughput higher than 38K messages/s, and any attempt to measure latency at this rate showed significant degradation in performance clocking a p99 latency of almost two seconds."

mumblemumble · on July 13, 2021

I'm about to pick Kubernetes even though a different solution would theoretically be a much better fit for my needs. This is entirely because some other tools I'm looking at play nicely with Kubernetes out of the box. If I picked something else, I'd end up writing my own glue code.

I wonder if something similar is happening with Kafka?

nobleach · on July 13, 2021

My last gig was Kubernetes, and aside from all the hate it gets here, (You're not Facebook, you don't need Facebook scale) it was a very pleasant experience. So pleasant in fact that when I moved on to my next job (Amazon EC2 VMs) it was pretty painful. They were running an old version of Amazon Linux and hadn't been updated in years. The versions of some runtimes were impossible to update due to glibc being out of date. Our immediate answer was, can we at least get to a Docker solution? ECS/Fargate provided a nice middle ground. But I'll admit, once you start getting into running multiple replicas, it's nice to have the other stuff that Kubernetes affords you.

snowzach · on July 13, 2021

I pick Kubernetes because I want to manage software and not manage servers. I think it gets a lot of hate because people look at helm charts that are designed to support all possible software configurations and they are quite confusing. If you break it down to the basic pieces of configuration it's not really any more complicated than say a docker-compose file. Just a little more verbose.

nobleach · on July 13, 2021

Oh definitely. Helm/Kustomize.... Yamls all over the place. It can be awful. But then how nice it is to have a cluster with load balancing, a nice API Gateway, services spinning up and down gracefully. While that is certainly achievable in other ways, this one has been my favorite. (I used to deploy WARs to JBoss and have zero downtime... while it was possible, it was horrible)

tannhaeuser · on July 13, 2021

AKA Resume-driven development, the enemy of all engineering efforts and reason

handrous · on July 13, 2021

There's also strong pressure to take whatever the "safe" choice seems to be. If everyone's jumping on some technology, and your manager already mentioned it, even, then you will catch no blame even if it's entirely terrible. Advocate for something else, and win, and congrats, now even if it's way better every little hiccup and difficulty is your fault.

shipp02 · on July 13, 2021

To experience it on a smaller scale, get a family member (non tech savvy) who uses Windows/macOS to use Linux for a few days. Everything that is not the same as it was on Windows/macOS will be your fault even if the Linux way is better/faster/cleaner than Windows.

cle · on July 13, 2021

Framing it like this is unproductive IMO. There's merit to picking technologies that help you and your teammates grow as engineers. You can take it too far...we should accept that it's a messy process to find the balance that lets us be productive now while helping us be more productive in the future.

As an engineering leader sometimes that even means knowing that people are making the wrong decision, and letting them do it anyway, and then helping them learn from it.

manishsharan · on July 13, 2021

> Resume-driven development

Lol. Love it. I shall be using it from now on in similar contexts.

himoacs · on July 13, 2021

ha! It's funny because it's true.

kermatt · on July 13, 2021

> which Kafka actively states that it's not

Where do the docs state that? Might be a useful link to keep handy.

mrweasel · on July 13, 2021

Hmm, that might be me remembering wrong. At least I can't find it. Sorry, I may be wrong.

They do go to great lengths to avoid calling Kafka a queue. Nowhere does it directly state that Kafka is not a queue. The docs just never talk about Kafka as being a queue.

cyberfart · on July 13, 2021

Summary > Latency section [1]

https://www.confluent.io/blog/kafka-fastest-messaging-system...

BeefWellington · on July 13, 2021

It's the same with any new shiny technology it seems.

The specific thing I have experience with is in analytics/relational databases. Suddenly around 8-10 years ago it became imperative for every client I was dealing with to migrate their RDBMSes to Hadoop/Hive setups, even when their largest dataset was only about 120M rows denormalized.

They were trading three servers (primary, backup, DR) for sometimes 15 to 20. Queries that MSSQL was handling sub-second were suddenly taking 45s on Hive. It was utter madness and was as far as I can tell driven by good salespeople, FOMO, and the feeling of importance of being able to say your company is running Big Data(TM?).

I saw maybe one implementation (of dozens) that actually stayed in use for any length of time.

igetspam · on July 13, 2021

We've been using Kafka for a while. We're thinking about ditching it for something AMQP. For all its bells and whistles, it's just not really all that exciting. We're probably spending 5x more on Kafka than we would on something much simpler.

EdwardDiego · on July 13, 2021

Kafka is only exciting when you're wanting to stream a metric shit ton of data without things falling over.

If you're using it, you're constrained by its engineering decisions, so you need to be sure it's a worthwhile tradeoff.

jdmichal · on July 13, 2021

We're doing the same. Actively moving away from Kafka, which was only ever used as a message queue anyway, to an actual message queue. We might move back at some point in the future to rebuild an actual event-driven processing model. But for now, it's not worth the additional complexity for the scale and systems we actually have.

himoacs · on July 13, 2021

Which product are you moving to?

himoacs · on July 13, 2021

Just to clarify...AMQP is a protocol so you will need a broker that supports it. The two options I see for you are RabbitMQ and Solace. Both support AMQP.

marcelm72 · on July 13, 2021

Apache Pulsar seems to have decent AMQP support also.

bencyoung · on July 13, 2021

The main thing Kafka is great at is scaling out via partitions with its natural stickiness, while maintaining excellent ordering guarantees. This means it's really easy to create a load of reliable in-memory processing where each thread processes a single partition of data.

When scaling out on a standard message queue you generally have greedy consumers, which means you can't assume stickiness, or you have to create your own partitioning structures. It makes it great for realtime apps...
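A minimal sketch of what that stickiness buys you, in plain Python rather than a Kafka client. The key is hashed to pick a partition, so every event for a given key lands in the same partition, in order, and a single worker can own that partition's in-memory state. Kafka's Java client hashes keys with murmur2; `zlib.crc32` below is only a stand-in, and `partition_for` and `route` are hypothetical helper names:

```python
import zlib


def partition_for(key, num_partitions):
    # Stable hash: the same key always maps to the same partition.
    # (crc32 is a stand-in here, not the hash Kafka actually uses.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to its partition, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

With greedy consumers pulling from one shared queue, two events for the same key can end up on different workers, which is why the per-key in-memory state trick does not carry over without extra partitioning logic.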

If there were cheaper alternatives I think people would use them but it does definitely have powerful benefits.

derefr · on July 13, 2021

What would you recommend as a "pipe with MQ-ish semantics" to use when "the message order was not important, there was only one producer and one consumer and the volume of data was not extra-ordinary", but you do still want the core premise of an MQ — reliable message delivery in the face of faults in the producer and/or consumer?

We actually have a use-case that exactly matches this. One service makes [stuff], the other service consumes [stuff], both services are "immutable infrastructure" with no local stable state storage, and [stuff] is individually too small and frequent to be affordable with IaaS managed-MQ per-message costs — but batching messages into reasonable chunks before send means potentially losing up-to-a-batch worth of messages if the producer dies.
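To put the batching trade-off in concrete terms, here is a small sketch with hypothetical names (`BatchingProducer`, `send`, `at_risk`). Whatever sits in the in-memory buffer has not been handed to the durable transport yet, so that is exactly what a producer crash would lose:

```python
class BatchingProducer:
    """Buffers messages and hands them to `send` in chunks of `batch_size`."""

    def __init__(self, send, batch_size):
        self._send = send  # callable that durably delivers a list of messages
        self._batch_size = batch_size
        self._buffer = []

    def publish(self, message):
        self._buffer.append(message)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()

    def at_risk(self):
        # Messages that would be lost if the process died right now.
        return len(self._buffer)
```

A smaller `batch_size` (or a time-based flush, not shown) shrinks the loss window at the cost of more per-message overhead, which is the tension described above.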

zo1 · on July 13, 2021

Not OP, but I would recommend RabbitMQ. It's surprisingly simple to set up, and if your load isn't huge it'll hum along just fine without clusters or other bells and whistles.

neoCrimeLabs · on July 13, 2021

> I can't wait for my next startup interview. We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?

This reminds me of something I experienced.

Back in 2000 I worked at a company that hired me to take over their brand new fully redundant web infrastructure that was architected and built by a firm on a $10,000,000+ contract.

They couldn't understand why the site only served 1 page every 2 seconds. To be very clear: _1_ page every _2_ seconds.

The short story was the development company didn't have anyone who understood databases. So they were using the Oracle cluster as a key-value store, and then parsing/sorting XML files on the application servers on demand.

This was a $500,000/yr Oracle license, on redundant dedicated $(million) Sun hardware, with huge high speed disk arrays. They were almost sitting idle at peak load... of 1 page every 2 seconds.

Bonus story: The developers didn't understand why their dynamic uploading of files to their app servers only worked 1/x of the time, where x was the number of app servers in the cluster. They were dropping the files locally on the app-servers and wondering why the other app-servers didn't know about them.

It took a dry erase board and about 30 minutes before they truly understood that the filesystems on those devices were not shared and why. (note, it was never part of the design specification their own team created)

I'm happy that I didn't have to explain ephemeral containers to them.

rad_gruchalski · on July 13, 2021

> We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?

Sure, if you have the budget to run it, can do! Feel free to reach out, email in the profile ;)

gilbetron · on July 14, 2021

> Sometimes, I feel like Apache foundation is like Thanos

Soon: https://thanos.io/

yamrzou · on July 13, 2021

I've worked with a bunch of stream processing engines a few years ago (Samza, Kafka Streaming, Spark Streaming, Storm and Flink), and did a comparison between them as part of my internship.

IMO, Apache Flink is the most complete project for those use cases. It is well maintained and the devs are very helpful when asked on the mailing lists.

uDontKnowMe · on July 13, 2021

Totally agreed! Flink is way underrated.

saryant · on July 13, 2021

Has it improved? We tried to ship a fairly large product on Flink a couple years ago and it was a nightmare. Its threading model was guaranteed to blow up in your face and running it reliably on a noisy network was near impossible.

pixelmonkey · on July 13, 2021

Wonder why this is getting posted today in particular?

The quick summary here is that this was a clean-house rewrite of Apache Storm done by an internal team at Twitter. As an open source project history refresher, Apache Storm was originally built by a startup called Backtype, and the project was led by Nathan Marz, the technical founder of Backtype. Then, Backtype was acquired by Twitter, and Storm became a major component for large-scale stream processing (of tweets, tweet analytics, and other things) at Twitter.

I wrote a summary of the "interesting bits" of Apache Storm here:

https://blog.parse.ly/storm/

However, at a certain point, Nathan Marz left Twitter, and a different group of engineers tried to rethink Storm inside Twitter. There was also a lot of work going on around Apache Mesos at the time. Heron is kind of a merger of their "rethinking" of Storm while also making it possible to manage Storm-like Heron clusters using Mesos.

But, I don't think Heron really took off. Meanwhile, Storm got very, very stable in the 1.x series, and then had a clean-house rewrite from Clojure to Java in the 2.x series, mainly to improve performance even more. The last stable/major Storm release was in 2020.

Storm provides a stream processing programming API, a multi-lang wire protocol, and a cluster management approach. But certain cluster computing problems can probably be better solved at the infrastructure layer today. (For example, Storm was developed before the whole container + docker + k8s focus in cloud ops.) That said, it's still a very powerful system; on my team, we process 75K+ events per second across hundreds of vCPU cores and thousands of Python processes with sub-second latencies by combining Storm and Kafka with our open source Python project, streamparse.

https://github.com/Parsely/streamparse

The core problems Storm solves: modeling data processing as a computation graph; high-speed network communication between threads, processes, and nodes; message delivery guarantees and retry capabilities; tunable parallelism; built-in monitoring and logging; and much more.
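For readers who haven't seen the "computation graph" model, here is a toy single-process sketch using Python generators. The spout/bolt vocabulary is Storm's, but the functions below are hypothetical and are not the Storm or streamparse API; there is no parallelism, networking, or message acking here:

```python
def sentence_spout(sentences):
    # Source node: emits raw tuples into the graph.
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    # Processing node: one sentence in, many words out.
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    # Stateful node: emits the running count for each word it sees.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def run_topology(sentences):
    # Wire the nodes together: spout -> split -> count.
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        final[word] = count
    return final
```

In a real cluster each node would run as many parallel tasks spread across machines, and the framework would handle routing, retries, and monitoring between them.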

(Also, I'd be remiss if I didn't mention -- if you're interested in stream processing and distributed computing, we are hiring Python Data Engineers to work on a stack involving Storm, Spark, Kafka, Cassandra, etc.) -- https://www.parse.ly/careers/python_data_engineer

squarecog · on July 13, 2021

I was in charge of the Twitter data platform team at the time we developed Heron and deprecated Storm. The Mesos component of your retelling is not quite right. Take a look at this comment I wrote around the time we started talking about Heron, addressing the same misconception: https://news.ycombinator.com/item?id=10056479

pixelmonkey · on July 13, 2021

Good context! I encourage others to read squarecog's linked comment.

didibus · on July 14, 2021

Does Twitter now rely on Heron then or is there a new in-house replacement? Are some teams still using Storm?

squarecog · on July 14, 2021

Now? No idea. At the time I left in 2016, all Storm usage was replaced by Heron, that was the point of making it API-compatible.

AtlasBarfed · on July 13, 2021

Is there a discussion as to the decision to dump Clojure --> Java for "performance reasons"?

I'm not even a Clojure user, but my impression was that it was pretty performant. I remember a discussion that they didn't even really need the JVM invokedynamic because they were doing pretty well without it, so that made me think it was close to pure JVM speed.

pixelmonkey · on July 13, 2021

A lot of Storm was written in Java, but the "core" was written in Clojure. There wasn't so much a "decision" to dump Clojure as much as a community "opportunity" to do so. My understanding is that Alibaba, one of Storm's production adopters, did a clean-house port from Clojure to Java, which they called jstorm. They then donated/offered that implementation to the Apache Storm project, and the project decided to base the Storm 2.x line on it. So Storm 1.x still has the Clojure core, with lineage to the original Backtype release, but 2.x is sourced from jstorm. A big focus of Storm 2.x was high-scale performance, latency, and backpressure management. I also heard that some folks in the open source Storm community suspected it might be easier to find contributors/committers for Storm if it were implemented in Java. Meanwhile, Heron sprung up as a performance-focused Storm alternative with API compatibility to Storm, before Storm 2.x took shape.

Wonnk13 · on July 13, 2021

Sometimes I poke fun at the front-end folks and all the new frameworks they're constantly chasing.

Lately it's been starting to feel the same for distributed systems. How many streaming engines are there now under Apache? Four?

haggy · on July 13, 2021

Depends how you group them. If we're talking battle tested then yea maybe four but I believe there are many more that don't have widespread adoption

JasonFruit · on July 13, 2021

I read this as a realtime, distributed, fault-tolerant steam engine, which was hard to imagine.

conjectures · on July 13, 2021

Tbh I would be more excited about that headline.

dikei · on July 13, 2021

IMHO, if you need stream processing, start with Apache Flink. It not only offers a much easier-to-use API compared to Storm and Heron, but also has a superior execution model for time-based, exactly-once, stateful stream processing.

pixelmonkey · on July 13, 2021

Flink is pretty cool, but the Flink Python API and Beam Python APIs are pretty atrocious, and some of the higher-level APIs (like Flink SQL) are pretty hard to grok. I kicked the tires on Flink and tried to love it, but I couldn't get there. Storm (which is Heron's inspiration and is still better than Heron, IMO) is a lot simpler conceptually. But it's pretty clear Flink was built to avoid the need to combine Storm Streaming + Spark Batch for "Lambda Architecture" style setups.

uDontKnowMe · on July 13, 2021

Python's cool, but "right tool for the job" and all that. Just dive in with the Java or Scala APIs and you'll have a better time :-).

pixelmonkey · on July 13, 2021

Thanks for the suggestion, but, pragmatically, it's not an option. My colleagues and I haven't written data processing, distributed systems, NLP/ML/datasci, or web tier code in Java since 2006, and we don't plan on starting now.

We've used Python at scale on petabyte-sized production data, multi-billion-request API tiers, and high-concurrency low-latency data processing all the while.

Java is the right tool for writing system-level code in some isolated contexts, but Python is the right tool for the job for a huge number of important use cases, with those use cases growing by the day.

I love Java (and appreciate Kotlin and Clojure), but with parallel compute power cheaper by the day, Python's focus on code simplicity, open source ecosystem, and programmer happiness continues to win the day.

manishsharan · on July 13, 2021

If you are familiar with Flink, would you mind sharing your use cases? why Flink and not Kafka ? I have experience with Kafka but I have not had a reason to investigate Flink for my use cases.

PhoenixReborn · on July 13, 2021

> why Flink and not Kafka ?

I assume you mean "Kafka Streams" when you say this, as Kafka is just an event bus and Flink can be used to read messages from it.

The biggest advantage of Flink (IMO) is that you can write the Flink logic once, and reuse it for both batch and stream processing. So if I write a Flink job that consumes a Kafka stream and produces some aggregated outputs, that same job can be run against my data lake in S3/GCS/Azure Blob Storage, etc.

Kafka Streams does not support batch processing, or working on top of anything other than Kafka. Flink supports building on top of other message buses like Pulsar as well: https://flink.apache.org/news/2019/11/25/query-pulsar-stream...

dikei · on July 14, 2021

Actually, unified batch and stream processing in Flink is a bit of false advertising. In Flink, stream and batch have different APIs (DataStream vs. DataSet), so reusing logic is not really practical unless you abstract out everything, in which case you might as well use Spark, which is faster than the DataSet API. Flink's developers are trying to get rid of the DataSet API and move to the DataStream and Table APIs for everything, but it's not done yet.

dikei · on July 14, 2021

Advantages of Flink over Kafka Streams

* Flink works with message queue tech other than Kafka, like Amazon SQS, Pulsar, etc.

* Back when I last read about Kafka Streams, Flink had better support for stateful processing: the entire state of execution can be safely snapshotted into storage and resumed at any time.

* Shuffling data in Kafka Streams is slower, because it has to send data to a new topic in the broker instead of sending it directly between compute nodes.

Advantages of Kafka Streams over Flink

* Kafka Streams deployment is simple: just start the jar like any other Java program, and scale up by running the jar multiple times. Flink, on the other hand, needs a mature orchestration framework like YARN, Mesos, or K8s; trying to manage a Flink deployment without one is very painful.

* Flink requires central, persistent storage like HDFS or S3 for its checkpoint mechanism; Kafka Streams doesn't.
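
As a rough illustration of the snapshot-and-resume point in the first list, here is a minimal Python sketch. It is not Flink's API: `CountingJob`, `snapshot`, and `restore` are hypothetical names, and a real engine would write the snapshot to durable storage (the HDFS or S3 mentioned above) rather than keep it in memory. The only idea shown is that operator state and input position are saved together, so a restart can replay from the saved position without double counting.

```python
import copy
from collections import Counter


class CountingJob:
    """Toy stateful job: counts events per key and tracks its input offset."""

    def __init__(self):
        self.counts = Counter()
        self.offset = 0  # index of the next event to read from the log

    def process(self, log, upto):
        # Consume events from the current offset up to (not including) `upto`.
        while self.offset < upto:
            self.counts[log[self.offset]] += 1
            self.offset += 1

    def snapshot(self):
        # State and input position are captured together.
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    @classmethod
    def restore(cls, snap):
        job = cls()
        job.counts = copy.deepcopy(snap["counts"])
        job.offset = snap["offset"]
        return job


log = ["a", "b", "a", "c", "a", "b"]

job = CountingJob()
job.process(log, upto=3)
checkpoint = job.snapshot()

job.process(log, upto=5)  # progress made after the checkpoint...
del job                   # ...is lost when the worker "crashes"

# Resume from the checkpoint and replay the rest of the log.
recovered = CountingJob.restore(checkpoint)
recovered.process(log, upto=len(log))
final_counts = dict(recovered.counts)
print(final_counts)
```

Because the offset is stored inside the snapshot, the events processed between the checkpoint and the crash are simply read again after recovery, and each event ends up counted once.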

fnord77 · on July 13, 2021

we had a terrible experience with flink. we were able to replace it all with kafka streams in short order. I wouldn't recommend flink for anything.

saryant · on July 13, 2021

I had the same experience. It felt like a distributed system written by people who’d never run one in production before.

loremipsium · on July 13, 2021

hadoop, kafka, storm, spark, flink, samza, confluent. This tastes an awful lot like JavaScript framework hell.

EdwardDiego · on July 13, 2021

- Hadoop is an ecosystem.

- Kafka is a distributed log.

- Storm, Samza & Flink are stream processing engines.

- Spark is a Map/Reduce framework that uses memory to cache computations to provide some performance increase over other disk-based frameworks. It can also do some streaming computations if you squint hard enough.

- Confluent is a company that sells an enterprise Kafka.

Not really sure the comparison you made is apt.

shipp02 · on July 13, 2021

Kafka's homepage[1] advertises stream processing as a feature

> Built-in Stream Processing
>
> Process streams of events with joins, aggregations, filters, transformations, and more, using event-time and exactly-once processing.

[1]:https://kafka.apache.org/

thinkharderdev · on July 13, 2021

"Built-in" is an odd word choice there. Kafka Streams is a framework for building stream-processing applications on top of Kafka topics.

EdwardDiego · on July 13, 2021

Kafka Streams is a streaming framework that uses Kafka's already existing features to implement itself: resilience and parallelisation are implemented using consumer groups, exactly-once using Kafka's transactions and idempotence, and topics (as well as RocksDB) are used to store state for stateful aggregations, etc.

So unlike Flink, Storm, Spark, Heron etc. it's only useful with Kafka.

fnord77 · on July 13, 2021

"kafka streams" is an add on product to kafka

sivakon · on July 13, 2021

You forgot gearpump

http://incubator.apache.org/projects/gearpump.html

ternaryoperator · on July 14, 2021

It appears gearpump was retired in 2018

querulous · on July 13, 2021

and beam: beam.apache.org

diehunde · on July 14, 2021

I find it odd that so many people complain about having too many alternatives for streaming systems. If you are not an advanced user of these systems, of course you won't be able to choose the right one, and it won't be your job anyway. Think of it this way: we have dozens of databases out there, with new ones coming out every week, and nobody complains about that. That's because there are way more people who understand the differences and the use cases, so it makes sense.

majormajor · on July 13, 2021

I would love to know more about how they do stateful stuff, since I haven't found anything that can compare to Flink there, even though Flink has really poor quality-of-life stuff compared to some other options (e.g. Flink serialization setup pains vs. Beam coders). But Heron's docs kinda trail off here:

> Non-idempotent stateful topologies are stateful topologies that do not apply processing logic along the model of "multiply by zero" and thus cannot provide effectively-once semantics. An example of a non-idempotent

https://heron.incubator.apache.org/docs/heron-delivery-seman...

(I'm puzzled by their idempotent vs non-idempotent stateful topology description, because if something is mutating an internal state upon receiving events, it will likely be non-idempotent by design... unless they just mean "idempotent stateful" here to refer to keeping track of source/output position state and such.)

(They also say that they can only support state storage in ZK or the local FS, which feels like a likely non-starter compared to Flink for some of my use cases.)
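
One way to read the idempotent vs. non-idempotent distinction is in terms of what happens when an event is redelivered after a failure. The sketch below is an illustration in plain Python, not Heron's API; `apply_max`, `apply_sum`, and `run` are hypothetical names, and this is only one interpretation of what the docs mean. An update like "keep the maximum" yields the same state however many times an event is replayed, while a running sum does not.

```python
def apply_max(state, event):
    # Idempotent: applying the same event twice leaves the state unchanged.
    return max(state, event)


def apply_sum(state, event):
    # Non-idempotent: every redelivery changes the state again.
    return state + event


def run(update, initial, events):
    """Fold a sequence of events into a state using `update`."""
    state = initial
    for event in events:
        state = update(state, event)
    return state


events = [3, 7, 5]
replayed = [3, 7, 7, 5]  # the event 7 is delivered twice after a retry

max_once = run(apply_max, 0, events)
max_replayed = run(apply_max, 0, replayed)
sum_once = run(apply_sum, 0, events)
sum_replayed = run(apply_sum, 0, replayed)

print(max_once, max_replayed)  # equal: the replay is harmless
print(sum_once, sum_replayed)  # differ: the replay double counts
```

Under this reading, an idempotent stateful topology can tolerate at-least-once delivery and still look "effectively once", whereas the sum would need deduplication or transactional state to get the same guarantee.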

dikei · on July 14, 2021

Flink serialization became much less painful when we switched every value type in our job to Avro. It requires a bit more code compared to using Java POJOs or Scala case classes, but you can actually add more fields to an existing data type without breaking anything now :)
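
The reason adding a field is safe comes down to Avro's schema-resolution rule: when the reader's schema has a field with a default and the written data lacks that field, the reader uses the default. The sketch below is a simplified model of that one rule in plain Python, not the Avro library; the `resolve` function, the `User` schema, and its fields are all made up for illustration.

```python
# Hypothetical reader schema: "country" was added in v2 with a default.
READER_SCHEMA_V2 = {
    "name": "User",
    "fields": [
        {"name": "id"},
        {"name": "email"},
        {"name": "country", "default": "unknown"},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field missing from old data: fall back to the declared default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field {name!r} and no default")
    return resolved


old_record = {"id": 1, "email": "a@example.com"}  # written with v1
new_record = {"id": 2, "email": "b@example.com", "country": "NZ"}

old_resolved = resolve(old_record, READER_SCHEMA_V2)
new_resolved = resolve(new_record, READER_SCHEMA_V2)
print(old_resolved)
print(new_resolved)
```

The same rule is why a new field without a default would break readers of old data: there is nothing to fill the gap with.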

rayrrr · on July 13, 2021

For once almost all commenters so far are thinking the same thing I am. Yet another stream processor from Apache. Might as well create the acronym now. YASP.

EdwardDiego · on July 13, 2021

Not so much "from" Apache, as "sent to Apache to die with dignity after the originating big company got bored of it"

CSDude · on July 13, 2021

I really wonder what people actually use stream processing for, like very concrete examples. My best examples would only go as far as filtering a stream over a time window to compute an aggregate. My job does not require anything more, it's always basic ETL, but I really need to hear specific examples where it's useful for others. Been a long time fan of Apache Flink.

thinkharderdev · on July 13, 2021

Standard use cases for stream processing are:

1. Enriching event streams. Say you have a stream of log records with an IP address field. You want to enrich with a geo-location before sending the logs to Elasticsearch.

2. Windowed aggregation. Maybe you have an application that is emitting "login" events and you want to detect login attempts from different IP addresses within X minutes of each other.

3. Joining multiple event streams. You have multiple different event streams and you want to join them together using some common join key (maybe session ID or something like that) to compute a metric that aggregates all of them.

There are plenty of more esoteric use cases as well.
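
To make the windowed-aggregation case (item 2) concrete, here is a small Python sketch of the login example. It is engine-agnostic and deliberately naive: `detect_multi_ip_logins` is a hypothetical name, the event tuples are made up, and it assumes events arrive in timestamp order, which a real stream processor would not get for free (that is what watermarks and event-time handling are for).

```python
from collections import defaultdict, deque


def detect_multi_ip_logins(events, window_seconds):
    """Yield (user, timestamp, ips) when a user logs in from an IP that
    differs from another login seen within the last `window_seconds`.

    `events` is an iterable of (timestamp, user, ip), ordered by timestamp.
    """
    recent = defaultdict(deque)  # user -> deque of (timestamp, ip)
    for ts, user, ip in events:
        window = recent[user]
        # Evict logins that have fallen out of the window.
        while window and ts - window[0][0] > window_seconds:
            window.popleft()
        other_ips = {seen_ip for _, seen_ip in window if seen_ip != ip}
        if other_ips:
            yield user, ts, sorted(other_ips | {ip})
        window.append((ts, ip))


logins = [
    (0, "alice", "10.0.0.1"),
    (60, "bob", "10.0.0.2"),
    (120, "alice", "192.168.1.9"),  # 2 minutes after a different IP
    (1000, "bob", "172.16.0.5"),    # far outside a 5-minute window
]

alerts = list(detect_multi_ip_logins(logins, window_seconds=300))
print(alerts)
```

The per-user deque is the "state" a stream processor would have to keep, partition by key, and checkpoint; the eviction loop is the window.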

CSDude · on July 13, 2021

They are the obvious ones, I'm looking for more advanced & specific ones.

infinite8s · on July 14, 2021

Operational analytics when dealing with real world systems (transportation logistics, tracking machine states on factory floors, sensor data fusion, real-time operational dashboards for capital markets, etc).

Mehdi2277 · on July 14, 2021

Streaming model training for recommendation systems like TikTok, Facebook, YouTube, etc. If you want models to adapt quickly to new events, so that they do better on very new content/users, you will want to use a streaming pipeline. I have seen Flink used for model training at large scale here. Model evaluation can also be done in a streaming manner alongside training, and needs to run in parallel for model monitoring.

The lowest latency model refresh time I've come across is ~5 minutes. Going lower is likely to be a lot of data transfers for model synchronization (sync the serving model with training model) for little value.
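
A toy sketch of the shape of such a pipeline, in plain Python: the model takes one gradient step per incoming event, and every so often the current weights are handed to the serving side. Everything here is an assumption for illustration (`train_streaming`, the one-weight linear model, the synthetic stream, the sync interval); real recommendation models and their synchronization are far more involved.

```python
def train_streaming(events, learning_rate, sync_every):
    """Online SGD for a one-weight linear model y ~ w * x.

    Returns the final training weight and the list of weights that
    were pushed to serving along the way.
    """
    w = 0.0
    served = []
    for i, (x, y) in enumerate(events, start=1):
        error = y - w * x
        w += learning_rate * error * x  # one gradient step per event
        if i % sync_every == 0:
            served.append(w)  # snapshot handed to the serving side
    return w, served


# Synthetic stream where the true relationship is y = 2x.
stream = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0] * 10]
w, served = train_streaming(stream, learning_rate=0.1, sync_every=10)
print(w, len(served))
```

The `sync_every` knob is the refresh-latency trade-off described above: syncing more often keeps serving fresher, at the cost of moving weights around more frequently.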

elchief · on July 13, 2021

So, AWS Pelican coming in 3 months?

biggestlou · on July 13, 2021

Everyone, please note that this is NOT an announcement of Heron entering the Apache Foundation. This happened several years ago, I believe in 2017 (when I was working on it).

antpls · on July 13, 2021

Since no one mentioned it in the comments so far, please add Hazelcast Jet to the long list of stream processors : https://github.com/hazelcast/hazelcast-jet

karmasimida · on July 13, 2021

Anyone use this in production except Twitter? Just curious.

fmakunbound · on July 13, 2021

Is it another corporate dumping at Apache or does Twitter seem to continue to support it as an open source project there?

aetherspawn · on July 14, 2021

Uhh, a triple-take: read this twice as Apache Heroin, and I’m not a druggy.

Is it apt? Reader exercise.

cblconfederate · on July 13, 2021

For a nonprofit, apache seems to have an unhealthy affection for solving scaling problems that only a few giant rich companies have

kristjansson · on July 13, 2021

You’ve got the flow backwards. Giant rich companies contribute projects that solve their problems to the ASF

detaro · on July 13, 2021

That criticism doesn't really make sense, given that Apache the non-profit isn't developing the software.

cblconfederate · on July 13, 2021

But it is funding it

detaro · on July 13, 2021

The Apache Foundation isn't funding development in any meaningful way, no.

uDontKnowMe · on July 13, 2021

What does the Apache Foundation do/provide/give these projects if no funding is involved? Is it just a governance thing, like "if this project falls into an abandoned state, we'll oversee the process of transferring ownership to any credible parties interested in taking over"?

detaro · on July 13, 2021

Mostly a governance thing, yes. "Community infrastructure, coordination and legal framework in a box".