IMHO, if you need stream processing, start with Apache Flink. It not only offers a much easier-to-use API compared to Storm and Heron, but also has a superior execution model for time-based, exactly-once, stateful stream processing.
Flink is pretty cool, but the Flink Python API and Beam Python APIs are pretty atrocious, and some of the higher-level APIs (like Flink SQL) are pretty hard to grok. I kicked the tires on Flink and tried to love it, but I couldn't get there. Storm (which is Heron's inspiration and is still better than Heron, IMO) is a lot simpler conceptually. But it's pretty clear Flink was built to avoid the need to combine Storm Streaming + Spark Batch for "Lambda Architecture" style setups.
Thanks for the suggestion, but, pragmatically, it's not an option. My colleagues and I haven't written data processing, distributed systems, NLP/ML/datasci, or web tier code in Java since 2006, and we don't plan on starting now.
We've used Python at scale on petabyte-sized production data, multi-billion-request API tiers, and high-concurrency, low-latency data processing all along.
Java is the right tool for writing system-level code in some isolated contexts, but Python is the right tool for the job for a huge number of important use cases, with those use cases growing by the day.
I love Java (and appreciate Kotlin and Clojure), but with parallel compute power cheaper by the day, Python's focus on code simplicity, open source ecosystem, and programmer happiness continues to win the day.
If you are familiar with Flink, would you mind sharing your use cases? Why Flink and not Kafka? I have experience with Kafka but I have not had a reason to investigate Flink for my use cases.
I assume you mean "Kafka Streams" when you say this, as Kafka is just an event bus, and Flink can be used to read messages from it.
The biggest advantage of Flink (IMO) is that you can write the Flink logic once, and reuse it for both batch and stream processing. So if I write a Flink job that consumes a Kafka stream and produces some aggregated outputs, that same job can be run against my data lake in S3/GCS/Azure Blob Storage, etc.
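To make the "write it once" idea concrete, here's a toy sketch in plain Python (not Flink's actual API; `word_counts` and the inputs are made up for illustration). The point is that the aggregation logic doesn't care whether its input is a bounded collection read from a data lake or records arriving from a consumer:

```python
# Conceptual sketch of batch/stream-agnostic logic. This is NOT Flink
# code; the function name and sample data are invented for illustration.

from collections import Counter
from typing import Iterable

def word_counts(records: Iterable[str]) -> Counter:
    """Aggregation logic written once, independent of the data source."""
    counts: Counter = Counter()
    for line in records:
        counts.update(line.split())
    return counts

# "Batch" run: a bounded collection (stand-in for files in S3/GCS).
batch = ["a b a", "b c"]
result = word_counts(batch)
print(result)  # Counter({'a': 2, 'b': 2, 'c': 1})

# A "stream" run would drive the same function from an unbounded
# iterator (stand-in for a Kafka consumer), emitting partial results
# incrementally instead of returning once at the end.
```

In real Flink the source connector changes (Kafka vs. a filesystem), but the transformation pipeline is meant to be reusable, which is the claim above.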
Actually, unified batch and stream processing in Flink is a bit of false advertising. In Flink, stream and batch have different APIs (DataStream vs DataSet), so reusing logic is not really practical unless you abstract everything out, in which case you might as well use Spark, which is faster than the DataSet API. Flink's developers are trying to get rid of the DataSet API and move everything to the DataStream and Table APIs, but that migration isn't done yet.
* Flink works with message queue tech other than Kafka, like Amazon SQS, Pulsar, etc.
* Back when I last read about Kafka Streams, Flink had better support for stateful processing: the entire state of execution can be safely snapshotted to storage and resumed at any time.
* Kafka Streams shuffles data more slowly, because it has to send data through a new topic on the broker instead of sending it directly between compute nodes.
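The snapshot-and-resume point can be sketched in a few lines of stdlib Python (again, a toy illustration, not Flink's checkpointing implementation; `process`, the event list, and the checkpoint interval are all made up):

```python
# Toy illustration of checkpointed stateful processing (NOT Flink code).
# Operator state is periodically serialized so that, after a crash,
# processing can resume from the last snapshot instead of from scratch.

import pickle

def process(events, state=None, checkpoint_every=3, save=lambda s: None):
    """Count events per key, snapshotting state every N events."""
    state = dict(state or {})
    for i, key in enumerate(events, start=1):
        state[key] = state.get(key, 0) + 1
        if i % checkpoint_every == 0:
            # In a real system this snapshot goes to durable storage
            # (e.g. S3/HDFS); here we just hand the bytes to a callback.
            save(pickle.dumps(state))
    return state

snapshots = []
state = process(["a", "b", "a", "c"], save=snapshots.append)
# After a failure, restore from the last snapshot and continue:
restored = pickle.loads(snapshots[-1])
```

Flink's actual mechanism (asynchronous barrier snapshots across a distributed job graph) is far more involved, but this is the basic contract the bullet above refers to.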
Advantages of Kafka Streams over Flink:
* Kafka Streams deployment is simple: just start the jar like any other Java program, and scale up by running the jar multiple times. Flink, on the other hand, needs a mature orchestration framework like YARN, Mesos, or K8s; trying to manage a Flink deployment without one is very painful.
* Flink requires central, persistent storage like HDFS or S3 for its checkpoint mechanism; Kafka Streams doesn't.