I've deployed it with a pretty minimal config, and I've written a consumer and a producer from the minimal Python examples.
It all just sort of works, what am I missing beyond this? Is this the basics? :)
I saw consumer groups and whatnot, but didn't bother. FWIW this is actually running a lot of processes, and it all just works. I'm also using Sentry to send errors/exceptions, so I know when the consumers or the producer aren't working.
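For reference, the "minimum python example" setup usually looks roughly like this (a sketch assuming kafka-python, with a made-up topic name and local broker):

```python
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer side: serialize dicts as JSON and send them to a topic.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
producer.send('events', {'id': 1, 'payload': 'hello'})
producer.flush()  # block until the message is actually delivered

# Consumer side: read the topic from the beginning and print each message.
consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
for message in consumer:
    print(message.value)
```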
Why does Kafka allow a consumer to process the same messages as another consumer by default anyway? With RMQ or others, this "consumer group" behaviour is just how it works by default.
A Kafka topic is divided into partitions. A consumer group can have many consumers. The consumers inside the same group will be assigned to consume a subset of the partitions.
Say you have a topic with 10 partitions. Consumer group A has 10 consumers and Consumer group B has 1 consumer.
Each consumer in group A will be assigned to a single partition, thus processing different messages.
The single consumer in group B will be assigned to all 10 partitions, processing the same messages as group A.
Kafka is a distributed log, so this pattern is useful for replicating data across downstream applications.
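A quick sketch of what that looks like in code (assuming kafka-python and a topic named 'orders', which are my own placeholder choices): the only difference between the two groups is the group_id string.

```python
from kafka import KafkaConsumer

# Run 10 copies of this process with GROUP = "A": the 10 partitions of
# "orders" are split between them, so each message is processed once per group.
# Run a single copy with GROUP = "B": it is assigned all 10 partitions and
# sees every message again, independently of group A.
GROUP = "A"

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id=GROUP,
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```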
Different defaults. RMQ and others also store messages differently, so the sensible defaults for RMQ are different from the ones for Kafka. Head to head, Kafka does less processing on the messages and leaves that work to the consumers, while Rabbit does more work on the broker and less on the consumers. Depending on the requirements, one of the two is better suited than the other, but both can do almost the same kind of work.
Because Kafka is not a message queue, but can be used as one. Multiple consumers on a topic is the default behavior because it’s very useful for the typical use cases in Kafka. I like to think of Kafka as bringing “micro services” to data processing. If I want order data to be both shown real time on a dashboard as well as aggregated for reporting purposes I’d build two different consumers for the same topic. One which sends the results off to the order dashboard via websocket and one which performs a rolling aggregation using a streaming query most likely. Maybe you now need to alert the sales team when an order comes in over a certain amount so they can give the customer the white glove treatment. Just add another consumer which is a ksql query for orders over $$$ which then get dropped into the high value orders topic. There you can have a consumer fire off the email to the sales team. Want to do other things when a high value order comes in? Add another consumer.
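That high-value filter could be a ksql query, but the same shape in plain Python (a sketch assuming kafka-python; the topic names and the $10,000 threshold are made up) is just another consumer with its own group that republishes to a second topic:

```python
from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id='high-value-filter',   # its own group, independent of the dashboard/reporting consumers
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

for message in consumer:
    order = message.value
    if order.get('amount', 0) >= 10_000:          # arbitrary "white glove" threshold
        producer.send('high-value-orders', order)  # downstream consumer emails the sales team
```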
I believe the default in Kafka is that if you do not specify a consumer group, it makes up a random one for you. It is an optional field.
Not sure about the 'why' of it. If it were me designing that sort of behavior, I would think the idea was short-lived processes that do a small amount of work and then disappear. For that idea, consumer groups would not make sense, as you would just want to start at the front or end of the topic and blast through it anyway. Once you add the idea of two different things happening, or gangs of processes working from the same events with partitions, then consumer groups make a lot more sense to add in. But retrofitting it back in would be awful? So the default is a 'random' group id? The rationale is probably written down in the KIPs somewhere.
That's the basics, yes. You have a plethora of things coming next. One is "Windowing", mentioned in the article; it's well explained there, and maybe it looks simple, but when you start with it, it takes some time to wrap your mind around it.
The other thing in the Kafka world is stateful transformations, which you would normally do using Java's Flink. The closest in Python is Faust (the fork) [0]. What are stateful aggregations? Something like doing SQL on top of a topic: group_by, count, reduce, and joins. So similar to SQL that you have kSQL [1].
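A rough sketch of what a stateful aggregation looks like in Faust (the faust-streaming fork); the app, topic, and field names here are made up:

```python
import faust

app = faust.App('orders-agg', broker='kafka://localhost:9092')

class Order(faust.Record):
    account_id: str
    amount: float

orders_topic = app.topic('orders', value_type=Order)

# A Table is changelog-backed state, so the counts survive worker restarts.
order_counts = app.Table('order_counts', default=int)

@app.agent(orders_topic)
async def count_orders(orders):
    # group_by repartitions the stream so all orders for an account
    # land on the same worker before we count them.
    async for order in orders.group_by(Order.account_id):
        order_counts[order.account_id] += 1
```

The windowed variants of such tables (tumbling/hopping windows) are where the "Windowing" topic mentioned above comes in.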
Consumer groups IMO fall under basic usage. If you need to scale, take a look at them, and at what partitions and replicas are; with that in mind, you'll be OK.
Do you have auto-commit on for your consumers? If yes, which tends to be the case for most basic examples, then you'll have "at most once" semantics, and you may experience unwanted data loss* should your consumers fail.
If not, then you'll have to manually commit the offsets and your consumer will have "at least once" semantics, meaning that you'll have to deal with duplicate messages (you should be designing idempotent pipelines, anyway).
Whether you have at-least-once or at-most-once semantics is more related to when you commit relative to your other logic than to whether you do it automatically or manually. But for common cases where you "do something" per message, like an HTTP call or incrementing a number in Redis, autocommit will give you at-least-once (it commits after the next poll call, which happens after processing the items).
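Concretely, the manual-commit, at-least-once variant looks something like this (a sketch assuming kafka-python; process() is just a placeholder for your side effect):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id='order-processor',
    enable_auto_commit=False,   # we commit offsets ourselves
    auto_offset_reset='earliest',
)

def process(raw_value):
    ...  # placeholder for the HTTP call / Redis increment / etc.

for message in consumer:
    process(message.value)  # do the side effect first...
    consumer.commit()       # ...then commit; a crash in between means the message
                            # is redelivered, so process() should be idempotent
```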