Apache Heron: A realtime, distributed, fault-tolerant stream processing engine (apache.org)
170 points by yagizdegirmenci on July 13, 2021 | 120 comments



If you are interested in hyper-scale event processing but you want to learn it from first principles, I strongly recommend following Martin Thompson's talks. Here is an example of one on cluster consensus:

> https://www.youtube.com/watch?v=GFfLCGW_5-w

and another on event log architecture:

> https://www.youtube.com/watch?v=RlwO6CJbJjQ

After digging through all of this material and playing around with LMAX Disruptor & Raft, I have been able to develop a really good understanding of how to build these sorts of systems on my own. Fun constraints like "Only one thread actually mutates anything, and it's the same one over and over" make for incredibly elegant implementation opportunities. Not having to constantly hunt down exotic thread-safe data structures means that you can focus on building actual value.
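
A minimal sketch of that single-writer constraint, in Python for brevity. This is not how the Disruptor is implemented (that is a Java library built around a pre-allocated ring buffer); the `SingleWriter` class and the account/delta event shape are made up for illustration. Many threads submit events, but only one thread ever touches the state:

```python
import queue
import threading


class SingleWriter:
    """All mutations of `balances` happen on one dedicated thread."""

    def __init__(self):
        self.balances = {}            # touched only by the writer thread
        self._inbox = queue.Queue()   # the only thread-safe structure needed
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, account, delta):
        # Any thread may enqueue an event; none of them mutate state directly.
        self._inbox.put((account, delta))

    def _run(self):
        while True:
            event = self._inbox.get()
            if event is None:         # sentinel: stop the writer
                break
            account, delta = event
            self.balances[account] = self.balances.get(account, 0) + delta

    def close(self):
        self._inbox.put(None)
        self._thread.join()


writer = SingleWriter()
producers = [
    threading.Thread(target=lambda: [writer.submit("acct-1", 1) for _ in range(1000)])
    for _ in range(4)
]
for p in producers:
    p.start()
for p in producers:
    p.join()
writer.close()
print(writer.balances["acct-1"])  # 4000
```

The state itself is a plain dict with no locks; the queue is the only point of coordination.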

Latency is the biggest devil you will dance with in this arena, so almost everything you do will be oriented around mitigating its effect. That means latency both on the network and inside the CPU, memory, and storage; it applies at every level.


> Only one thread actually mutates anything, and it's the same one over and over

Sounds great in theory and usually results in a fantastic average case, but get one hot partition and suddenly you can't share the work and things go south. It's not a perfect solution.


If you have a single serialized+synchronous business context in which activities must be processed, you literally have no other option than to use a single physical core if you want to go fast.

The trick is to minimize the raw number of bytes that must enter that synchronous context. Maybe the account disclosure PDFs can be referred to by some GUID token in AWS S3, but the actual decimal account balance/transaction facts should be included in the event data stream.

One other option is to talk to the business and see if you can break their 1 gigantic synchronous context into multiple smaller ones that can progress independently.


There is no working around Amdahl's law unless you redesign your app.
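
For anyone who wants numbers, a small sketch of the bound Amdahl's law gives. The 90% parallel fraction is an arbitrary example value, not a measurement of any system discussed here:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 10% of the work stuck in a serialized context, adding workers
# approaches a 10x speedup but never reaches it.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Shrinking the serial fraction (the redesign) is the only thing that moves the ceiling.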


How many distributed stream processing engines is the Apache Foundation planning to collect? At this point it seems like there are more projects that do this (if you squint a bit) than companies with a serious use case for that type of architecture.


I work for a company that could greatly benefit from an out-of-the-box distributed stream processing engine (we've been rolling our own for over a decade). At this point, it's pretty much impossible to pick one. All similar Apache tools have similar looking web pages, promising similar benefits, similar use cases, etc. What are the differentiating factors? At which point does it make sense to pick Heron over one of the others? Or vice versa?


I last looked into this a couple of years ago, so this might be slightly out of date.

I think most popular options for high-volume self-hosted distributed stream processing solutions are still Spark, Flink, and Kafka Streams.

Kafka Streams is simpler, as it is basically just a framework on top of Kafka itself, so if you already use Kafka for streaming data and don't have complex needs, it might be a good option.

Spark and Flink are similar. Both support batch processing (on top of Hadoop, for example) as well as stream processing. Spark has better tooling, but Flink has more sophisticated support for streaming window functions. Spark also uses "micro-batches" instead of being truly real-time, so there will be a bit more latency when doing streaming with Spark, if that matters.

--

Another interesting project is Beam, which provides a unified way of writing jobs that can then be run on different engines that support it (both Flink and Spark do, as well as Dataflow on Google).

Apache hosts a lot of projects in this category. Most (like Storm) I would probably not pick up for a greenfield project today. Also, these things come with some significant operational overhead, so make sure you really need them. Stream processing at scale is hard. The compelling use case for these things is when you need to do window aggregations on a lot of streaming data and get results in real-time.
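
To make "window aggregation" concrete, here is a single-process sketch of a tumbling (fixed, non-overlapping) window count. It is only an illustration of the concept: the function name and event format are assumptions, and the hard parts that Flink, Spark, and friends exist for (out-of-order events, state recovery, distribution across machines) are deliberately left out.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    """
    counts = Counter()
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window, identified by its start.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0.5, "login"), (3.2, "click"), (9.9, "click"), (10.0, "click"), (14.1, "login")]
print(tumbling_window_counts(events, window_seconds=10))
# {(0, 'login'): 1, (0, 'click'): 2, (10, 'click'): 1, (10, 'login'): 1}
```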


Databricks (which is Apache Spark) if you don't want to do it yourself.

I have no relationship with Databricks.


Look at Flink or Heron. Other choices like Storm aren't any good.


While I appreciate your taking the time to make suggestions, these suggestions aren't very useful without context. I'm not saying that it's up to you to provide the context. But I don't know much about Flink or Heron, and looking at their respective websites doesn't tell me whether they'd be a good fit for a specific use case.

At this point, all of these frameworks would probably benefit from a flowchart (or questionnaire tool) that can guide someone towards an informed decision. "Do you need redundancy?" - "Can you afford to lose some messages in situation XYZ?" - "How many events/sec do you want to process?" - "How much hardware can you throw at the problem?" etc.


The situation is actually more complex than you are suggesting. A check box comparison is not useful. I have worked in considerable depth in the streaming space and my comments are based on the design docs of both systems. You should read Twitter's Heron paper and the Apache Flink design docs.

For example, Storm or Samza might check all the boxes, but the design of the system is poor enough that the performance will suck. For older versions of Storm, you should be able to write a multithreaded app on a single machine that outperforms a Storm cluster.


I recommend checking out Solace. They have been in business for 20 years. It's not open source, but it packs all the enterprise features you would need.


Had a lot of trouble running Solace on VMs and Docker, and spent a lot of time trying to find the root cause of a memory leak (it happened every few months).

I still don't get why they use VPN terms for an event broker.


That's a shame. I work at Solace so shoot me a message and I will show you how to set it up if you are interested.

Solace's first product was hardware appliances, which are still used for high-throughput and low-latency use cases. The concept of a VPN was used to set up isolated virtual brokers so different teams can have their own environments on a shared hardware appliance.

The concept was ported over to software as well and is extremely useful in an enterprise environment. It allows different teams to have their own virtual brokers but not have to pay for or manage multiple brokers.


> I work at Solace

That recontextualizes your previous post quite a bit...



The long-term trend for data access is having a reactive component; stream processing engines allow you to have that.

On the other hand databases have a huge lock-in power (been trying to strangle an Oracle myself for over 5 years now). It is lucrative to be in the database business.

I'd say that every project with a claim of some improvement can, and should try to establish itself in the market; and that having it join the Apache foundation is a great way to get some brand recognition on the cheap.

----

Also, Heron is not that new. It was developed at Twitter to replace Storm, IIRC.


But none of them spell out for what use cases they excel over the other options. Since they are all under the ASF banner, it’s impossible to know which is “better” for me. They are all just “great stream processing engines”. But surely they must have diverging properties for given use cases. But none of the pages even attempt to say how they differ. Just “try it out!!”


Years ago, I found a scientific paper that evaluated them all on a fairly detailed rubric. As I recall, it turned out that they're all, in actual fact, crap stream processing engines. You just need to pick the one that's crap at things you don't need to do.

(I could frame it in a more glass half full way, but I find that the pessimistic way of looking at it helps a lot with trimming down options when you have far, far, far too many options.)


Citation needed. And not in the snarky way - I'd really like to read that paper :)


I'm guessing they were referring to "Scalability! But at what COST?"

https://www.usenix.org/system/files/conference/hotos15/hotos...


I've long since lost track of it. It would be 5 or 6 years out of date at this point, anyway, so probably not very useful for informing decisions anymore.


I've just accepted that mindset as a part of the art of engineering. Shifting the inherent crappiness of a domain around to where it doesn't matter as much for your use case.


I started using Apache Storm in 2012 and was really surprised at the number of Apache projects for stream processing back in 2015 or so when I did some research for moving off of Storm. I remember Flink, Samza, Spark, Flume and NiFi(?) all being probably suitable options. I remember thinking exactly what you said!

I guess there are enough users to keep the communities alive?


*Enough companies reinventing the wheel to spinoff new Apache distributed stream processing libraries.


It really comes down to the latency and throughput requirements. Projects are architected differently for different latency and throughput expectations.

In event processing there's a continuum of expected latencies from batch processing to realtime. Batch processing typically means running reports over a large volume of events (good for throughput). Hadoop is a good example. On the other end, sub-second realtime reporting is possible with Heron/Storm. Spark is kind of in the middle with hybrid mini-batching. Reportedly Twitter has used Heron/Storm to track word counts in all the tweets to find trending topics, where the latency between a new tweet coming in and the word counts being updated over the whole network is in the hundreds of milliseconds.
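
A toy, single-process version of that kind of trending word count, just to show the shape of the computation. It is not Twitter's topology: the `TrendingWords` class is invented for illustration and assumes timestamps arrive in non-decreasing order, which a real stream does not guarantee.

```python
from collections import Counter, deque


class TrendingWords:
    """Word counts over the last `window_seconds` of text."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.counts = Counter()
        self._events = deque()   # (timestamp, word), oldest first

    def add(self, timestamp, text):
        for word in text.lower().split():
            self._events.append((timestamp, word))
            self.counts[word] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop words that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, word = self._events.popleft()
            self.counts[word] -= 1
            if self.counts[word] == 0:
                del self.counts[word]

    def top(self, n):
        return self.counts.most_common(n)


trending = TrendingWords(window_seconds=60)
trending.add(0, "heron storm")
trending.add(30, "storm flink")
trending.add(70, "flink flink")
print(trending.top(2))  # [('flink', 3), ('storm', 1)]
```

Each update is incremental, which is why results can stay fresh at sub-second latency instead of waiting for a batch job to rescan everything.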


Apparently it's another one from Twitter.


At least they had the decency to not open source their BookKeeper-backed pub/sub system when they gave up on it and switched to Kafka...

...then some of the employees involved moved to Yahoo, built it again, and then open sourced it as Pulsar. Then they moved on to form a Confluent v2 to sell their Kafka v2 (now with even more ZooKeeper!)


I had a problem, so I invented Summingbird, and now I can relatively inexpensively invent new problems on an annual basis.


I wonder this too - I guess enough software engineering teams had similar problems with no solutions and each set about building their own? Not sure how else this proliferation happened.


Oh, is this that "next-gen Storm" project that Twitter built? Seems like they've finally given it to Apache, like they did with Storm after buying the company that built it.

Based on a quick Wikipedia skim, looks like the answer is "yes". That explains this bullet point:

> Heron is API compatible with Apache Storm and hence no code change is required for migration.


Donated to Apache in 2018. https://blog.twitter.com/engineering/en_us/topics/open-sourc... (and open-sourced in 2016, having started development in 2014).


Sometimes, I feel like the Apache Foundation is like Thanos, collecting all the distributed engines and watching the IT world burn.

From the beginning, Heron was envisioned as a new kind of stream processing system, built to meet the most demanding of technological requirements, to handle even the most massive of workloads, and to meet the needs of organizations of all sizes and degrees of complexity. Amongst these requirements:

    The ability to process billions of events per minute
    Extremely low end-to-end latency
    Predictable behavior regardless of scale and in the face of issues like extreme traffic spikes and pipeline congestion
    Simple administration, including:
        The ability to deploy on shared infrastructure
        Powerful monitoring capabilities
        Fine-grained configurability
    Easy debuggability

I can't wait for my next startup interview. We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?
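
Back-of-the-envelope arithmetic for that hypothetical requirement, treating 10KB as 10 KiB:

```python
messages_per_hour = 25
bytes_per_message = 10 * 1024   # assumption: "10KB" read as 10 KiB

bytes_per_second = messages_per_hour * bytes_per_message / 3600
seconds_between_messages = 3600 / messages_per_hour

print(f"{bytes_per_second:.1f} bytes/s")                        # 71.1 bytes/s
print(f"one message every {seconds_between_messages:.0f} s")    # one message every 144 s
```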


Not just startups. I interviewed with a large financial institution recently and they had their heart set on Kafka. From what I learnt, they did not need Kafka -- the message order was not important, there was only one producer and one consumer and the volume of data was not extra-ordinary -- but they wanted to be on Kafka as that was the enterprise direction. I might have come across as an old fart when I asked why they could not use their existing MQ infrastructure.


Not that I'm not impressed by Kafka and its stability, performance and scalability, but I see the same behaviour from our customers.

They specifically want Kafka; there's no real reason other than they need a queue, which Kafka actively states that it's not. At that point it gets really tricky to reason with the developers about why they might be better served by something else. Generally speaking it's not much of an issue, because Kafka will deal with workloads just fine; it's just weird. I have seen one customer use Kafka as a database, and that works less well.

We do see the same with Kubernetes. The developers pick Kubernetes and at that point it's too late. They specifically want Kubernetes even if you could more easily solve the problem with Nomad, Docker Compose, plain old VMs or EC2, depending on the problem.


This is the exact same problem we are facing as well. Our clients just want to use it without knowing or thinking what problem they are solving.

Confluent itself says that for a workload of 38Mbps/sec RabbitMQ has 5 times less latency than Kafka and is 100x simpler to manage.

https://www.confluent.io/blog/kafka-fastest-messaging-system...

How do you teach someone to look at the problems first and then pick the tools?


> How do you teach someone to look at the problems first and then pick the tools?

By understanding what the person cares about. Everyone knows "pick the right tool for the problem". Not everyone uses such a simple calculus because life isn't that simple. People have their own agendas, backgrounds, experiences, career growth desires, personal lives, etc., that are all part of their personal objective function. If you want to convince someone that your tools are better, show that your tools have a higher payoff for their personal objective function. This is way more than a mere product question. In a team setting it's even harder, because you have to balance it across multiple people simultaneously.


30 not 38. Also, 30K messages/s not 30Mbps.

"Due to CPU bottlenecks, we were not able to drive a throughput higher than 38K messages/s, and any attempt to measure latency at this rate showed significant degradation in performance clocking a p99 latency of almost two seconds."


I'm about to pick Kubernetes even though a different solution would theoretically be a much better fit for my needs. This is entirely because some other tools I'm looking at play nicely with Kubernetes out of the box. If I picked something else, I'd end up writing my own glue code.

I wonder if something similar is happening with Kafka?


My last gig was Kubernetes, and despite all the hate it gets here ("You're not Facebook, you don't need Facebook scale"), it was a very pleasant experience. So pleasant in fact that when I moved on to my next job (Amazon EC2 VMs) it was pretty painful. They were running an old version of Amazon Linux that hadn't been updated in years. The versions of some runtimes were impossible to update due to glibc being out of date. Our immediate question was: can we at least get to a Docker solution? ECS/Fargate provided a nice middle ground. But I'll admit, once you start getting into running multiple replicas, it's nice to have the other stuff that Kubernetes affords you.


I pick Kubernetes because I want to manage software and not manage servers. I think it gets a lot of hate because people look at helm charts that are designed to support all possible software configurations and they are quite confusing. If you break it down to the basic pieces of configuration it's not really any more complicated than say a docker-compose file. Just a little more verbose.


Oh definitely. Helm/Kustomize.... Yamls all over the place. It can be awful. But then how nice it is to have a cluster with load balancing, a nice API Gateway, services spinning up and down gracefully. While that is certainly achievable in other ways, this one has been my favorite. (I used to deploy WARs to JBoss and have zero downtime... while it was possible, it was horrible)


AKA Resume-driven development, the enemy of all engineering efforts and reason


There's also strong pressure to take whatever the "safe" choice seems to be. If everyone's jumping on some technology, and your manager already mentioned it, even, then you will catch no blame even if it's entirely terrible. Advocate for something else, and win, and congrats, now even if it's way better every little hiccup and difficulty is your fault.


To experience it on a smaller scale, get a (non-tech-savvy) family member who uses Windows/macOS to use Linux for a few days. Everything that is not the same as it was on Windows/macOS will be your fault, even if the Linux way is better/faster/cleaner.


Framing it like this is unproductive IMO. There's merit to picking technologies that help you and your teammates grow as engineers. You can take it too far...we should accept that it's a messy process to find the balance that lets us be productive now while helping us be more productive in the future.

As an engineering leader sometimes that even means knowing that people are making the wrong decision, and letting them do it anyway, and then helping them learn from it.


> Resume-driven development

Lol. Love it. I shall be using it from now on in similar contexts.


ha! It's funny because it's true.


> which Kafka actively states that it's not

Where do the docs state that? Might be a useful link to keep handy.


Hmm, that might be me remembering wrong. At least I can't find it. Sorry, I may be wrong.

They do go to great lengths to avoid calling Kafka a queue. Nowhere do they directly state that Kafka is not a queue; the docs just never talk about Kafka as being a queue.



It's the same with any new shiny technology it seems.

The specific thing I have experience with is in analytics/relational databases. Suddenly around 8-10 years ago it became imperative for every client I was dealing with to migrate their RDBMSes to Hadoop/Hive setups, even when their largest dataset was only about 120M rows denormalized.

They were trading three servers (primary, backup, DR) for sometimes 15 to 20. Queries that MSSQL was handling sub-second were suddenly taking 45s on Hive. It was utter madness and was as far as I can tell driven by good salespeople, FOMO, and the feeling of importance of being able to say your company is running Big Data(TM?).

I saw maybe one implementation (of dozens) that actually stayed in use for any length of time.


We've been using Kafka for a while. We're thinking about ditching it for something AMQP. For all its bells and whistles, it's just not really all that exciting. We're probably spending 5x more on Kafka than we would on something much simpler.


Kafka is only exciting when you're wanting to stream a metric shit ton of data without things falling over.

If you're using it, you're constrained by its engineering decisions, so you need to be sure it's a worthwhile tradeoff.


We're doing the same. Actively moving away from Kafka, which was only ever used as a message queue anyway, to an actual message queue. We might move back at some point in the future to rebuild an actual event-driven processing model. But for now, it's not worth the additional complexity for the scale and systems we actually have.


Which product are you moving to?


Just to clarify...AMQP is a protocol so you will need a broker that supports it. The two options I see for you are RabbitMQ and Solace. Both support AMQP.


Apache Pulsar seems to have decent AMQP support also.


The main thing Kafka is great at is scaling out via partitions with its natural stickiness, while maintaining excellent ordering guarantees. This means it's really easy to create a load of reliable in-memory processing where each thread processes a single partition of data.

When scaling out on a standard message queue you generally have greedy consumers, which means you can't assume stickiness, or you have to create your own partitioning structures. Kafka's model makes it great for realtime apps...

If there were cheaper alternatives I think people would use them but it does definitely have powerful benefits.
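
A sketch of why key-based partitioning gives you that stickiness. The hash here (`zlib.crc32`) is only a stand-in chosen for the example; Kafka's default partitioner uses a different hash function, and the partition count and keys are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4


def partition_for(key: str) -> int:
    # Stand-in hash; Kafka's default partitioner uses a different hash function.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def route(events):
    """Append each (key, value) event to the log of its key's partition."""
    partitions = defaultdict(list)
    for key, value in events:
        partitions[partition_for(key)].append((key, value))
    return partitions


events = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-c", 1), ("user-a", 3)]
partitions = route(events)

# Every event for a key sits in one partition, in the order it was produced,
# so a single thread per partition can keep that key's state in plain memory.
p = partitions[partition_for("user-a")]
print([v for k, v in p if k == "user-a"])  # [1, 2, 3]
```

With greedy consumers pulling from one shared queue, nothing guarantees that the three `user-a` events land on the same worker, which is the difference being described.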


What would you recommend as a "pipe with MQ-ish semantics" to use when "the message order was not important, there was only one producer and one consumer and the volume of data was not extra-ordinary", but you do still want the core premise of an MQ — reliable message delivery in the face of faults in the producer and/or consumer?

We actually have a use-case that exactly matches this. One service makes [stuff], the other service consumes [stuff], both services are "immutable infrastructure" with no local stable state storage, and [stuff] is individually too small and frequent to be affordable with IaaS managed-MQ per-message costs — but batching messages into reasonable chunks before send means potentially losing up-to-a-batch worth of messages if the producer dies.


Not OP, but I would recommend RabbitMQ. It's surprisingly simple to set up, and if your load isn't huge it'll hum along just fine without clusters or other bells and whistles.


> I can't wait for my next startup interview. We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?

This reminds me of something I experienced.

Back in 2000 I worked at a company that hired me to take over their brand new, fully redundant web infrastructure that was architected and built by a firm on a $10,000,000+ contract.

They couldn't understand why the site only served 1 page every 2 seconds. To be very clear: _1_ page every _2_ seconds.

The short story was the development company didn't have anyone who understood databases. So they were using the Oracle cluster as a key-value store, and then parsing/sorting XML files on the application servers on demand.

This was a $500,000/yr Oracle license, on redundant dedicated $(million) Sun hardware, with huge high-speed disk arrays. They were almost sitting idle at peak load... of 1 page every 2 seconds.

Bonus story: The developers didn't understand why their dynamic uploading of files to their app servers only worked 1/X of the time, where X was the number of app servers in the cluster. They were dropping the files locally on the app servers and wondering why the other app servers didn't know about them.

It took a dry erase board and about 30 minutes before they truly understood that the filesystems on those devices were not shared and why. (note, it was never part of the design specification their own team created)

I'm happy that I didn't have to explain ephemeral containers to them.


> We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?

Sure, if you have the budget to run it, can do! Feel free to reach out, email in the profile ;)


> Sometimes, I feel like Apache foundation is like Thanos

Soon: https://thanos.io/


I've worked with a bunch of stream processing engines a few years ago (Samza, Kafka Streaming, Spark Streaming, Storm and Flink), and did a comparison between them as part of my internship.

IMO, Apache Flink is the most complete project for those use cases. It is well maintained and the devs are very helpful when asked on the mailing lists.


Totally agreed! Flink is way underrated.


Has it improved? We tried to ship a fairly large product on Flink a couple of years ago and it was a nightmare. Its threading model was guaranteed to blow up in your face, and running it reliably on a noisy network was nearly impossible.


Wonder why this is getting posted today in particular?

The quick summary here is that this was a clean-house rewrite of Apache Storm done by an internal team at Twitter. As an open source project history refresher, Apache Storm was originally built by a startup called Backtype, and the project was led by Nathan Marz, the technical founder of Backtype. Then, Backtype was acquired by Twitter, and Storm became a major component for large-scale stream processing (of tweets, tweet analytics, and other things) at Twitter.

I wrote a summary of the "interesting bits" of Apache Storm here:

https://blog.parse.ly/storm/

However, at a certain point, Nathan Marz left Twitter, and a different group of engineers tried to rethink Storm inside Twitter. There was also a lot of work going on around Apache Mesos at the time. Heron is kind of a merger of their "rethinking" of Storm while also making it possible to manage Storm-like Heron clusters using Mesos.

But, I don't think Heron really took off. Meanwhile, Storm got very, very stable in the 1.x series, and then had a clean-house rewrite from Clojure to Java in the 2.x series, mainly to improve performance even more. The last stable/major Storm release was in 2020.

Storm provides a stream processing programming API, a multi-lang wire protocol, and a cluster management approach. But certain cluster computing problems can probably be better solved at the infrastructure layer today. (For example, Storm was developed before the whole container + docker + k8s focus in cloud ops.) That said, it's still a very powerful system; on my team, we process 75K+ events per second across hundreds of vCPU cores and thousands of Python processes with sub-second latencies by combining Storm and Kafka with our open source Python project, streamparse.

https://github.com/Parsely/streamparse

The core problems Storm solves: modeling data processing as a computation graph; high-speed network communication between threads, processes, and nodes; message delivery guarantees and retry capabilities; tunable parallelism; built-in monitoring and logging; and much more.

(Also, I'd be remiss if I didn't mention -- if you're interested in stream processing and distributed computing, we are hiring Python Data Engineers to work on a stack involving Storm, Spark, Kafka, Cassandra, etc.) -- https://www.parse.ly/careers/python_data_engineer


I was in charge of the Twitter data platform team at the time we developed Heron and deprecated Storm. The Mesos component of your retelling is not quite right. Take a look at this comment I wrote around the time we started talking about Heron, addressing the same misconception: https://news.ycombinator.com/item?id=10056479


Good context! I encourage others to read squarecog's linked comment.


Does Twitter now rely on Heron then or is there a new in-house replacement? Are some teams still using Storm?


Now? No idea. At the time I left in 2016, all Storm usage was replaced by Heron, that was the point of making it API-compatible.


Is there a discussion as to the decision to dump clojure --> java for "performance reasons"?

I'm not even a clojure user, but my impression was that it was pretty performant. I remember a discussion that they didn't even really need the JVM invokedynamic because they were doing pretty well without it, so that made me think it was close to pure JVM speed.


A lot of Storm was written in Java, but the "core" was written in Clojure. There wasn't so much a "decision" to dump Clojure as much as a community "opportunity" to do so. My understanding is that Alibaba, one of Storm's production adopters, did a clean-house port from Clojure to Java, which they called jstorm. They then donated/offered that implementation to the Apache Storm project, and the project decided to base the Storm 2.x line on it. So Storm 1.x still has the Clojure core, with lineage to the original Backtype release, but 2.x is sourced from jstorm. A big focus of Storm 2.x was high-scale performance, latency, and backpressure management. I also heard that some folks in the open source Storm community suspected it might be easier to find contributors/committers for Storm if it were implemented in Java. Meanwhile, Heron sprung up as a performance-focused Storm alternative with API compatibility to Storm, before Storm 2.x took shape.


Sometimes I poke fun at the front-end folks and all the new frameworks they're constantly chasing.

Lately it's been starting to feel the same for distributed systems. How many streaming engines are there now under Apache? Four?


Depends how you group them. If we're talking battle tested then yea maybe four but I believe there are many more that don't have widespread adoption


I read this as a realtime, distributed, fault-tolerant steam engine, which was hard to imagine.


Tbh I would be more excited about that headline.


IMHO, if you need stream processing, start with Apache Flink. It not only offers a much easier-to-use API compared to Storm and Heron, but also has a superior execution model for time-based, exactly-once, stateful stream processing.


Flink is pretty cool, but the Flink Python API and Beam Python APIs are pretty atrocious, and some of the higher-level APIs (like Flink SQL) are pretty hard to grok. I kicked the tires on Flink and tried to love it, but I couldn't get there. Storm (which is Heron's inspiration and is still better than Heron, IMO) is a lot simpler conceptually. But it's pretty clear Flink was built to avoid the need to combine Storm Streaming + Spark Batch for "Lambda Architecture" style setups.


Python's cool, but "right tool for the job" and all that. Just dive in with the Java or Scala APIs and you'll have a better time :-).


Thanks for the suggestion, but, pragmatically, it's not an option. My colleagues and I haven't written data processing, distributed systems, NLP/ML/datasci, or web tier code in Java since 2006, and we don't plan on starting now.

We've used Python at scale on petabyte-sized production data, multi-billion-request API tiers, and high-concurrency low-latency data processing all the while.

Java is the right tool for writing system-level code in some isolated contexts, but Python is the right tool for the job for a huge number of important use cases, with those use cases growing by the day.

I love Java (and appreciate Kotlin and Clojure), but with parallel compute power cheaper by the day, Python's focus on code simplicity, open source ecosystem, and programmer happiness continues to win the day.


If you are familiar with Flink, would you mind sharing your use cases? why Flink and not Kafka ? I have experience with Kafka but I have not had a reason to investigate Flink for my use cases.


> why Flink and not Kafka ?

I assume you mean "Kafka Streams" when you say this, as Kafka is just an event bus and Flink can be used to read messages from it.

The biggest advantage of Flink (IMO) is that you can write the Flink logic once, and reuse it for both batch and stream processing. So if I write a Flink job that consumes a Kafka stream and produces some aggregated outputs, that same job can be run against my data lake in S3/GCS/Azure Blob Storage, etc.

Kafka Streams does not support batch processing, or working on top of anything other than Kafka. Flink supports building on top of other message buses like Pulsar as well: https://flink.apache.org/news/2019/11/25/query-pulsar-stream...


Actually, unified batch and stream processing in Flink is a bit of false advertising. In Flink, stream and batch have different APIs (DataStream vs DataSet), so reusing logic is not really practical unless you abstract out everything, in which case you might as well use Spark, which is faster than the DataSet API. Flink's developers are trying to get rid of the DataSet API and move to the DataStream and Table APIs for everything, but it's not done yet.


Advantages of Flink over Kafka Streams:

* Flink works with message queue technologies other than Kafka, such as Amazon SQS, Pulsar, etc.

* Back when I last read about Kafka Streams, Flink had better support for stateful processing: the entire state of execution can be safely snapshotted to storage and resumed at any time.

* Shuffling data in Kafka Streams is slower, because it has to send data to a new topic on the broker instead of sending it directly between compute nodes.

Advantages of Kafka Streams over Flink:

* Kafka Streams deployment is simple: just start the jar like any other Java program, and scale up by running the jar multiple times. Flink, on the other hand, needs a mature orchestration framework like YARN, Mesos, or K8s; trying to manage a Flink deployment without one is very painful.

* Flink requires central, persistent storage like HDFS or S3 for its checkpoint mechanism; Kafka Streams doesn't.
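On the snapshot-and-resume point, a toy sketch of the general idea in plain Python may help. This is not Flink's checkpointing implementation or API; `CountingOperator` and its methods are hypothetical. The assumption being illustrated is that a checkpoint stores the operator state together with the input offset, so a restart can replay from that offset without double counting.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CountingOperator:
    """Hypothetical stateful operator: word counts plus the input offset reached."""
    offset: int = 0
    counts: Dict[str, int] = field(default_factory=dict)

    def process(self, log: List[str], upto: int) -> None:
        # Consume records from the current offset up to (not including) `upto`.
        for word in log[self.offset:upto]:
            self.counts[word] = self.counts.get(word, 0) + 1
        self.offset = max(self.offset, min(upto, len(log)))

    def snapshot(self) -> str:
        # State and offset are serialized together so they stay consistent.
        return json.dumps({"offset": self.offset, "counts": self.counts})

    @classmethod
    def restore(cls, blob: str) -> "CountingOperator":
        data = json.loads(blob)
        return cls(offset=data["offset"], counts=data["counts"])


log = ["a", "b", "a", "c", "a"]
op = CountingOperator()
op.process(log, 3)
checkpoint = op.snapshot()   # a real system would write this to durable storage
op.process(log, 4)           # progress made after the checkpoint, lost in a "crash"

recovered = CountingOperator.restore(checkpoint)
recovered.process(log, len(log))  # replay from the checkpointed offset
```

The replayable input log is what makes this work, which is roughly why a durable source and durable checkpoint storage tend to come as a pair.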


we had a terrible experience with flink. we were able to replace it all with kafka streams in short order. I wouldn't recommend flink for anything.


I had the same experience. It felt like a distributed system written by people who’d never run one in production before.


hadoop, kafka, storm, spark, flink, samza, confluent

This tastes an awful lot like javascript framework hell


- Hadoop is an ecosystem.

- Kafka is a distributed log.

- Storm, Samza & Flink are stream processing engines.

- Spark is a Map/Reduce framework that uses memory to cache computations to provide some performance increase over other disk-based frameworks. It can also do some streaming computations if you squint hard enough.

- Confluent is a company that sells an enterprise Kafka.

Not really sure the comparison you made is apt.


Kafka's homepage[1] advertises stream processing as a feature

> Built-in Stream Processing

> Process streams of events with joins, aggregations, filters, transformations, and more, using event-time and exactly-once processing.

[1]:https://kafka.apache.org/


"Built-in" is an odd word choice there. Kafka Streams is a framework for building stream-processing applications on top of Kafka topics.


Kafka Streams is a streaming framework that uses Kafka's already existing features to implement itself - so resilience and parallelisation are implemented using consumer groups, exactly-once using Kafka's transactions and idempotence, topics are used (as well as RocksDB) to store state for stateful aggregations, etc.

So unlike Flink, Storm, Spark, Heron etc. it's only useful with Kafka.


"kafka streams" is an add on product to kafka



It appears gearpump was retired in 2018


and beam: beam.apache.org


I find it odd that so many people are complaining about having too many alternatives for streaming systems. If you are not an advanced user of these systems, of course you won't be able to choose the right one, and it won't be your job anyway. Think of it this way: we have dozens of databases out there, with new ones coming out every week, and nobody complains about that. That's because there are way more people who understand the differences and the use cases, so it makes sense.


I would love to know more about how they do stateful stuff, since I haven't found anything that can compare to Flink there, even though Flink has really poor quality-of-life stuff compared to some other options (e.g. Flink serialization setup pains vs Beam coders). But the Heron docs kinda trail off here:

> Non-idempotent stateful topologies are stateful topologies that do not apply processing logic along the model of "multiply by zero" and thus cannot provide effectively-once semantics. An example of a non-idempotent

https://heron.incubator.apache.org/docs/heron-delivery-seman...

(I'm puzzled by their idempotent vs non-idempotent stateful topology description, because if something is mutating an internal state upon receiving events, it will likely be non-idempotent by design... unless they just mean "idempotent stateful" here to refer to keeping track of source/output position state and such.)

(They also say that they can only support state storage in ZK or the local FS, which feels like a likely non-starter compared to Flink for some of my use cases.)
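To illustrate the puzzle, here is a small plain-Python sketch of one possible reading of "idempotent" vs "non-idempotent" state updates. This is my interpretation, not something the Heron docs confirm, and all names are hypothetical. The replay models at-least-once delivery, where the last event arrives twice.

```python
def apply_increment(state: dict, event: dict) -> dict:
    # Non-idempotent: applying the same event twice changes the result.
    state["count"] = state.get("count", 0) + 1
    return state


def apply_last_seen(state: dict, event: dict) -> dict:
    # Idempotent: applying the same event twice leaves the state unchanged.
    state["last_seen"] = max(state.get("last_seen", 0), event["ts"])
    return state


def run(update, stream) -> dict:
    state = {}
    for event in stream:
        update(state, event)
    return state


events = [{"ts": 10}, {"ts": 20}]
replayed = events + [events[-1]]  # at-least-once: the last event is delivered twice

counter_ok = run(apply_increment, events)
counter_replayed = run(apply_increment, replayed)
last_seen_ok = run(apply_last_seen, events)
last_seen_replayed = run(apply_last_seen, replayed)
```

Under this reading, only the `max`-style update gives the same answer with or without the duplicate; a counter needs deduplication or transactional state to get there.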


Flink serialization became much less painful when we switched every value type in our job to Avro. It requires a bit more code compared to using Java POJOs or Scala case classes, but you can actually add more fields to an existing data type without breaking anything now :)
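For anyone wondering why adding a field is safe: the rough idea is that a newly added field carries a default, so records written before the change can still be read. Below is a plain-Python imitation of that resolution step. It is not the Avro library API, and `SCHEMA_V2` and `resolve` are hypothetical names.

```python
_NO_DEFAULT = object()  # marker for fields that must be present in the data

# Hypothetical "v2" reader schema: `country` was added later, with a default.
SCHEMA_V2 = {
    "user_id": _NO_DEFAULT,
    "country": "unknown",
}


def resolve(record: dict, reader_schema: dict) -> dict:
    """Read a record with the reader's schema, filling defaults for missing fields."""
    out = {}
    for name, default in reader_schema.items():
        if name in record:
            out[name] = record[name]
        elif default is not _NO_DEFAULT:
            out[name] = default
        else:
            raise ValueError(f"missing field without default: {name}")
    return out


old_record = {"user_id": 42}                  # written before `country` existed
new_record = {"user_id": 7, "country": "NZ"}  # written with the v2 schema

old_as_v2 = resolve(old_record, SCHEMA_V2)
new_as_v2 = resolve(new_record, SCHEMA_V2)
```

The same job can then read old state and new records side by side, which is the part that plain POJO or case class serialization tends to make painful.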


For once almost all commenters so far are thinking the same thing I am. Yet another stream processor from Apache. Might as well create the acronym now. YASP.


Not so much "from" Apache, as "sent to Apache to die with dignity after the originating big company got bored of it"


I really wonder what people actually use stream processing for, like very concrete examples. My best examples would only go as far as filtering a stream over a time window to compute an aggregate. My job does not require anything more, it's always basic ETL, but I really need to hear specific examples where it's useful for others. Been a long time fan of Apache Flink.


Standard use cases for stream processing are:

1. Enriching event streams. Say you have a stream of log records with an IP address field. You want to enrich with a geo-location before sending the logs to Elasticsearch.

2. Windowed aggregation. Maybe you have an application that is emitting "login" events and you want to detect login attempts from different IP addresses within X minutes of each other.

3. Joining multiple event streams. You have multiple different event streams and you want to join them together using some common join key (maybe session ID or something like that) to compute a metric that aggregates all of them.

There are plenty of more esoteric use cases as well.
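As a rough sketch of the windowed login example (item 2), here is a plain-Python version with no streaming framework involved. It assumes events arrive in timestamp order, which real engines do not guarantee; `suspicious_logins` and the event layout are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Login = Tuple[int, str, str]  # (timestamp_seconds, user, ip)


def suspicious_logins(
    events: Iterable[Login], window_s: int
) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (user, earlier_ip, new_ip, ts) when a user logs in from a
    different IP within `window_s` seconds of an earlier login."""
    recent: Dict[str, Deque[Tuple[int, str]]] = defaultdict(deque)
    for ts, user, ip in events:
        window = recent[user]
        # Evict logins that have fallen out of the time window.
        while window and ts - window[0][0] > window_s:
            window.popleft()
        # Report at most one alert per incoming login.
        for _, seen_ip in window:
            if seen_ip != ip:
                yield (user, seen_ip, ip, ts)
                break
        window.append((ts, ip))


events = [
    (0, "alice", "192.0.2.1"),
    (60, "alice", "192.0.2.2"),    # different IP 60 s later: alert
    (100, "bob", "192.0.2.3"),
    (1000, "alice", "192.0.2.1"),  # earlier logins have expired: no alert
]
alerts = list(suspicious_logins(events, window_s=300))
```

The per-user deque is the keyed state; a stream processor's job is mostly to keep that state partitioned, fault tolerant, and correct when events show up late or out of order.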


They are the obvious ones, I'm looking for more advanced & specific ones.


Operational analytics when dealing with real world systems (transportation logistics, tracking machine states on factory floors, sensor data fusion, real-time operational dashboards for capital markets, etc).


Streaming model training for recommendation systems like tiktok, facebook, youtube, etc. If you want models to adapt quickly to new events, so that they do better on very new content/users, you will want to use a streaming pipeline. I have seen flink used for model training at large scale here. Model evaluation can also be done in a streaming manner alongside training, and it needs to run in parallel for model monitoring.

The lowest latency model refresh time I've come across is ~5 minutes. Going lower is likely to be a lot of data transfers for model synchronization (sync the serving model with training model) for little value.


So, AWS Pelican coming in 3 months?


Everyone, please note that this is NOT an announcement of Heron entering the Apache Foundation. This happened several years ago, I believe in 2017 (when I was working on it).


Since no one has mentioned it in the comments so far, please add Hazelcast Jet to the long list of stream processors: https://github.com/hazelcast/hazelcast-jet


Anyone use this in production except Twitter? Just curious.


Is it another corporate dumping at Apache or does Twitter seem to continue to support it as an open source project there?


Uhh, a triple-take: read this twice as Apache Heroin, and I’m not a druggy.

Is it apt? Reader exercise.


For a nonprofit, apache seems to have an unhealthy affection for solving scaling problems that only a few giant rich companies have


You’ve got the flow backwards. Giant rich companies contribute projects that solve their problems to the ASF


That criticism doesn't really make sense, given that Apache the non-profit isn't developing the software.


But it is funding it


The Apache Foundation isn't funding development in any meaningful way, no.


What does the Apache Foundation do/provide/give these projects if no funding is involved? Is it just a governance thing, like "if this project falls into an abandoned state, we'll oversee the process of transferring ownership to any credible parties interested in taking over"?


Mostly a governance thing, yes. "Community infrastructure, coordination and legal framework in a box".



