IMHO, if you need stream processing, start with Apache Flink. It not only offers a much easier-to-use API compared to Storm and Heron, but also has a superior execution model for time-based, exactly-once, stateful stream processing.
Flink is pretty cool, but the Flink Python API and Beam Python APIs are pretty atrocious, and some of the higher-level APIs (like Flink SQL) are pretty hard to grok. I kicked the tires on Flink and tried to love it, but I couldn't get there. Storm (which is Heron's inspiration and is still better than Heron, IMO) is a lot simpler conceptually. But it's pretty clear Flink was built to avoid the need to combine Storm Streaming + Spark Batch for "Lambda Architecture" style setups.
Thanks for the suggestion, but, pragmatically, it's not an option. My colleagues and I haven't written data processing, distributed systems, NLP/ML/datasci, or web tier code in Java since 2006, and we don't plan on starting now.
We've used Python at scale on petabyte-sized production data, multi-billion-request API tiers, and high-concurrency, low-latency data processing all along.
Java is the right tool for writing system-level code in some isolated contexts, but Python is the right tool for the job for a huge number of important use cases, with those use cases growing by the day.
I love Java (and appreciate Kotlin and Clojure), but with parallel compute power cheaper by the day, Python's focus on code simplicity, open source ecosystem, and programmer happiness continues to win the day.
If you are familiar with Flink, would you mind sharing your use cases? Why Flink and not Kafka? I have experience with Kafka but I have not had a reason to investigate Flink for my use cases.
I assume you mean "Kafka Streams" when you say this, as Kafka is just an event bus, and Flink can be used to read messages from it.
The biggest advantage of Flink (IMO) is that you can write the Flink logic once, and reuse it for both batch and stream processing. So if I write a Flink job that consumes a Kafka stream and produces some aggregated outputs, that same job can be run against my data lake in S3/GCS/Azure Blob Storage, etc.
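To make the "write it once" idea concrete, here's a toy sketch in plain Python (not Flink's actual API; `word_counts` and the inputs are made up for illustration). The point is that the aggregation logic doesn't care whether its input is a bounded collection read from a data lake or records arriving from a consumer:

```python
# Conceptual sketch of batch/stream-agnostic logic. This is NOT Flink
# code; the function name and sample data are invented for illustration.

from collections import Counter
from typing import Iterable

def word_counts(records: Iterable[str]) -> Counter:
    """Aggregation logic written once, independent of the data source."""
    counts: Counter = Counter()
    for line in records:
        counts.update(line.split())
    return counts

# "Batch" run: a bounded collection (stand-in for files in S3/GCS).
batch = ["a b a", "b c"]
result = word_counts(batch)
print(result)  # Counter({'a': 2, 'b': 2, 'c': 1})

# A "stream" run would drive the same function from an unbounded
# iterator (stand-in for a Kafka consumer), emitting partial results
# incrementally instead of returning once at the end.
```

In real Flink the source connector changes (Kafka vs. a filesystem), but the transformation pipeline is meant to be reusable, which is the claim above.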
Actually, unified batch and stream processing in Flink is a bit of false advertising. In Flink, stream and batch have different APIs (DataStream vs DataSet), so reusing logic is not really practical unless you abstract everything out, in which case you might as well use Spark, which is faster than the DataSet API. Flink's developers are trying to get rid of the DataSet API and move everything to the DataStream and Table APIs, but that migration isn't done yet.
* Flink works with message queue tech other than Kafka, like Amazon SQS, Pulsar, etc.
* Back when I last read about Kafka Streams, Flink had better support for stateful processing: the entire state of execution can be safely snapshotted to storage and resumed at any time.
* Kafka Streams shuffles data more slowly, because it has to send data through a new topic on the broker instead of sending it directly between compute nodes.
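The snapshot-and-resume point can be sketched in a few lines of stdlib Python (again, a toy illustration, not Flink's checkpointing implementation; `process`, the event list, and the checkpoint interval are all made up):

```python
# Toy illustration of checkpointed stateful processing (NOT Flink code).
# Operator state is periodically serialized so that, after a crash,
# processing can resume from the last snapshot instead of from scratch.

import pickle

def process(events, state=None, checkpoint_every=3, save=lambda s: None):
    """Count events per key, snapshotting state every N events."""
    state = dict(state or {})
    for i, key in enumerate(events, start=1):
        state[key] = state.get(key, 0) + 1
        if i % checkpoint_every == 0:
            # In a real system this snapshot goes to durable storage
            # (e.g. S3/HDFS); here we just hand the bytes to a callback.
            save(pickle.dumps(state))
    return state

snapshots = []
state = process(["a", "b", "a", "c"], save=snapshots.append)
# After a failure, restore from the last snapshot and continue:
restored = pickle.loads(snapshots[-1])
```

Flink's actual mechanism (asynchronous barrier snapshots across a distributed job graph) is far more involved, but this is the basic contract the bullet above refers to.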
Advantages of Kafka Streams over Flink:
* Kafka Streams deployment is simple: just start the jar like any other Java program, and scale up by running the jar multiple times. Flink, on the other hand, needs a mature orchestration framework like YARN, Mesos, or K8s; trying to manage a Flink deployment without one is very painful.
* Flink requires central, persistent storage like HDFS or S3 for its checkpoint mechanism; Kafka Streams doesn't.