Apache Kafka Beyond the Basics: Windowing

claytongulick · on Feb 9, 2023

Queues like Kafka can be great for a certain class of solution, but I see them over used.

The justification is usually a sort of handwave about scaling, but I tend to be skeptical of that argument.

I remember in the early 2000's writing a system that handled all of the consumer equipment orders and returns for Verizon, pick pack and ship and advanced billing integration with the ILEC phone bill, credit systems, IVR, etc... On a PII class server running SQL server stored procedures and a .net UI.

In 2009, I was processing a TB per month of 837 claims on even worse hardware.

Not that long ago, I saw a post on HN where a Casio calculator was handling the front page traffic.

My point is that even modest hardware has the capability to handle massive loads when the software is designed properly.

Kafka and it's ilk are great for certain use cases, but it's normally a mistake to start out with it unless you're really sure you know what you're doing.

CQRS and data-bus/event sourced systems can be tricky to get right, and you lose a lot when you go to async processing, especially when submitting transactions from a user-facing app.

Inevitably you have to create some sort of polling or server push mechanism to update the UI, and the number of edge cases that can bite you are vast.

EdwardDiego · on Feb 9, 2023

Kafka's not a message queue, it's a distributed log.

The main advantage of Kafka is that you can:

1. Write a shitload of data to it fast

2. Experience a cluster failure or degradation with low risk of data loss

3. Retain absolute ordering, with caveats, if you're careful.

4. Run large numbers of concurrent consumers without imposing significant overhead on the cluster

That's pretty much it. I've made a good career the last few years out of Kafka, but there are _so_ many companies using it that don't need it, and what strikes me as passing odd is when I've been consulting on their Kafka adoption and tell them that, they don't want to hear it.

Which is really weird IMO because I stand to make far more money by encouraging them to use it unnecessarily because...

...Kafka is a complex solution to a complicated problem, and often people using it don't have the sheer scale of data that would necessitate taking on that complexity.

> In 2009, I was processing a TB per month of 837 claims on even worse hardware.

Kafka starts to shine when you're creating X TiB of data a day, and would like not to lose it.

It's best mentally modelled as a series of rather resilient tubes, that way people can stop thinking it's just a super duper scalable message queue, and they can just drop it in as a black box and get all the semantics they're used to from ActiveMQ / RabbitMQ etc.

Even Pulsar, which tries to emulate MQ semantics, isn't a drop-in replacement, at all. And it has even more moving parts than Kafka.

tstrimple · on Feb 9, 2023

I've had the most luck at enterprises with large legacy foundations that they are desperate to modernize. These companies have tried to migrate to the cloud. They have tried building replacements for their legacy systems. They have tried paying millions of dollars to consulting companies to build a one size fits all replacement for their legacy mess. Kafka has been the fastest way to start taming some of the ridiculous amounts of data going through these systems that are so convoluted that no one knows where all the bodies are buried. I've found it gives a very good layer of separation where you can still interface with and get data from these legacy systems while also enabling modern application development with said data.

Company after company has spent years trying to just classify and normalize all of their data. These big data warehouse style environments always end up so brittle and take so much longer to get anything useful out of compared to the sort of immediate and incremental improvement you can get with a Kafka style migration away from these older systems. You're right that it's overkill for so many companies though. I'm curious to know where you've seen success with Kafka.

EdwardDiego · on Feb 9, 2023

> I'm curious to know where you've seen success with Kafka.

Much the same as you, as a replacement for brittle data pipelines where multiple services are generating data, and then a cron moves it across a network file system, or the apps are pushing to something like ES directly, which has caused data loss when their ES wasn't configured correctly, etc. etc.

The nice thing about Kafka is that it decouples those dependencies, which gives you room to scale the ingestion side, and also allows for failures without data loss.

The one caveat though, is that Kafka becomes the dependency, but in my years of using Kafka in anger (since 0.8), I've only ever encountered full cluster unavailability a few times.

Memorable ones: * a sysop who decided that the FD limit for processes on the boxes running their self-managed Kafka should be a nice low number, which prevented Kafka from opening sufficient sockets

* a 3 node cluster, where to "prevent data loss", they were using acks=all and had configured replicas to be 3, and min insync replicas to be 3, and then a broker had gone down. Yeah, not great that.

* A 2.5 stretch cluster (brokers and ZKs in two AZs, tie-breaker ZK in a 3rd) suffered a gradual network partition when an AWS DC in Frankfurt overheated, and they ended up with two leaders for the same topic partition in each of their main AZs, that were still accepting writes as they could hit minISR within their own DCs. There was some other weirdness in how they interacted with the tie-breaker ZK. And when the network partition was ended, neither of them could be elected leader. Had to dump the records from both brokers from the time the partition began, roll the brokers to allow an unclean leader election, then roll them again to turn it off once the election had succeeded, and then the client got to sift through the data and figure out what to do with it - as they weren't relying on absolute ordering, it was reasonably trivial for them to just write it again IIRC.

So for absolutely essential do not lose data, if Kafka connectivity was lost, the app should make itself unready, and then fail over to writing any uncommitted data to disk for a side-car to upload to S3 or similar.

Kafka's great, but you have to code deliberately for it, which is where I saw the biggest mistakes occur - strapping Kafka to ActiveMQ with Camel and expecting Kafka to work the same as ActiveMQ...

zakki · on Feb 9, 2023

>> Kafka's not a message queue, it's a distributed log.

Is it in the same area as Graylog? How is it compared?

edude03 · on Feb 9, 2023

The logging platform? Graylog is backed by kafka for the reasons OP mentioned

EdwardDiego · on Feb 9, 2023

Log in the sense of the append-only data structure :)

Kinrany · on Feb 9, 2023

Thoughts on NATS?

codegangsta · on Feb 9, 2023

Disclaimer: I work on NATS and am employed by Synadia, maintainer of NATS.

NATS is much more lightweight than the alternatives, and has a much simpler operation story and can run on low resource hardware very easily.

Kafka fits a very specific use case, and NATS tries to solve a much larger set of distributed computing use cases, including micro-services, distributed KV, streaming, pub/sub, object store and more.

It’s a very different philosophy, but can be very powerful when it’s deployed as a utility inside an organization.

We’ve seen so many success stories, especially companies moving to the edge, we continue to see NATS used as a common transport for all kinds of use cases in finance, IoT, edge, Industry 4.0, and Retail

bsaul · on Feb 9, 2023

how robust is nats persistence layer ? strenght of kafka relies on its proven ability to maintain a high load while making sure you won't loose the events you're streaming.

jarema · on Feb 9, 2023

Disclaimer: I'm a NATS maintainer working in Synadia (the company behind NATS)

It is robust and performant, with subject-based querying, HA, scalability, streams, key-value store, mirrors and at-least-once and exactly-once semantics.

Used in production on a large scale by companies across many industries.

EdwardDiego · on Feb 10, 2023

I like it, spent some time poking LiftBridge, but have never had a chance to use either in anger.

nickpeterson · on Feb 9, 2023

This has always been true. Some tight C code running on a raspberry pi could probably handle most of the data needs for ordinary businesses today. But to acknowledge this reality makes CIO's sweat ;)

happymellon · on Feb 9, 2023

All the times I've used Kafka, it's felt like the microservices confusion. A technical solution that can be taken to the extreme, when really it's a solution to a political problem.

If we break our application into small enough chunks then we start running into internal consistency issues, so we will introduce Kafka to resolve this. I'm dealing with it at the moment where my customer decided to break their crud application to CUD and R because "best practice". Now they have Kafka synchronising the two applications because it's a terrible design.

bicijay · on Feb 9, 2023

You are talking about CQRS, which is far from a "terrible design", maybe not right for your customer?

happymellon · on Feb 9, 2023

I understand that it might help some people, and I am referring to my clients desire as terrible rather than the whole concept. However I have yet to actually see it used without it increasing complexity without demonstrable improvements. I see that Martin Fowler agrees.

https://martinfowler.com/bliki/CQRS.html

It falls into the "you are not Google" problem solving hole.

I'm dealing with a financial institution doing this. CQRS should be an evolution if you find it complex to model, use, and scale not the initial implementation. Now we have another location to lose data.

tstrimple · on Feb 9, 2023

What? Calling Kafka a queue seems to be a fundamental misunderstanding what it is as a tool. Why would you be polling in Kafka for real-time updates to a UI? That's not necessary at all, a topic subscriber will automatically pick up new messages and can send them via web socket. Kafka doesn't mean databases go away, so there's absolutely zero reason for any sort of issues submitting transactions from a user-facing app. It's almost like you're not really talking about Kafka.

ramenmeal · on Feb 9, 2023

He mentioned "or server push mechanism", aka websockets. I don't think they're trying to call kafka a queue, they're just talking about asynchronous architectures.

oweiler · on Feb 9, 2023

I agree that Kafka is overused. But IMHO Kafka's strength is not scaling but decoupling. Services can publish data without other interested parties being online. If you need this, is obviously another question.

happymellon · on Feb 9, 2023

That is true of most pub/sub systems though. Kafka has a lot of other "features" which is what makes it overkill for most situations.

xupybd · on Feb 9, 2023

I'm about to start implementing this sort of thing myself. We have a handful of webservices, an online store, a customer design tool and some third party connections to customers and suppliers. Then we have a factory of machines and operator stations that need to fetch and display information.

My thinking was to use something like Kafka, leaning towards NATS, as a big pub sub pipe.

I'm not looking for scaling. We are slow and don't need real time. What I need is uniformity. I don't want this connection to interface through FTP while that one is over a REST API. I'd prefer one connection for our main system and little interfaces to connect everything to the same pipe.

codegangsta · on Feb 9, 2023

Disclaimer: NATS maintainer

NATS definitely tries to solve for that end to end experience, where you can have NATS simultaneously driving websocket connections as well as backend connections in a secure and performant way. I’ve personally had a ton of fun making interactive applications with NATS, it scales reliably and is really fun to work with

edude03 · on Feb 9, 2023

> Queues like Kafka

Kafka is a distributed log. A sibling comment to this mentions this as well, but I'm repeating it because I actually think that's were most people using kafka go wrong. You _can_ use kafka as a queue, but it'll would feel like using a drill to hammer a nail, would kind of work but would make you think the tool just sucks.

> The justification is usually a sort of handwave about scaling, but I tend to be skeptical of that argument.

Kafka does help with scaling in so far as you can independently scale the number of producers and consumers but I find kafka the most useful in its ability to allow multiple consumers to process the same message(s) at different rates.

For example let's say you have an app that's producing messages like "{foo: 1, bar: 2 }" and you need to count how many foos there are and how many bars there are. You could queue the messages, then have 1 service that does both, or you could use kafka and have a FooCounter and BarCounter service. They each process the same message(s) but ignore the part of the payload they don't care about.

While this is really trivial example, I've used, and have seen this pattern used for processing actual logs for example.

lmm · on Feb 9, 2023

You don't need to scale for performance, but you do for reliability. SQL databases still tend to rely on having one server that never goes down, or at best having a replica and a semi-manual "promotion" procedure.

Kafka does what a datastore should and nothing more. It's what databases should have been. High performance is a consequence of that good design, but the good design is valuable in itself.

I would always start with Kafka. There are so many ways to shoot yourself in the foot with an SQL database that will bite you later, sometimes in a way that may corrupt your data permanently (e.g. read-modify-write logic), and you'll almost certainly end up having to change your data architecture to the one you'd've started with if you were using Kafka. Or you can just throw it in Kafka from day 1 and a lot of bad ideas just become impossible because the system gives you no way to do them.

I was also working on phone systems in the 2000's that ran at "web scale". We used MySQL, but we essentially built Kafka on top of MySQL - tables for every intermediate step in the process, and a way for the next step in the processing pipeline to mark which rows it had started/finished consuming.

paulmd · on Feb 9, 2023

> Queues like Kafka can be great for a certain class of solution, but I see them over used.

> The justification is usually a sort of handwave about scaling, but I tend to be skeptical of that argument.

actually this sort of solution is a great fit to GPGPU, where each thread has some kind of a "window" and will perform N aligned accesses in a stepwise fashion to compute its particular data. Aligned/coalesced accesses and "re-computation" are very cheap in GPGPU and that often means you lean on re-computation rather than notarization or other "stateful" approaches.

I think something like this would be a great fit to AVX-512 probably.

aprdm · on Feb 9, 2023

I've deployed it with the pretty minimum configs, I've wrote a consumer with the minimum python example, and a writer with the minimum python example.

It all just sort of works, what am I missing beyond this ? Is this the basics ? :)

I saw consumer groups and whatnot.. but didn't bother. FWIW this is actually running a lot of processes.. and it all just works.. also using Sentry to send errors/exceptions so I know when the consumers or the producer are not working

bsuvc · on Feb 9, 2023

You need consumer groups when you scale the number of consumers, so one consumer doesn't process the same records as another consumer.

Having different consumer groups let's you have multiple different consumers for different purposes that DO process the same records.

lloydatkinson · on Feb 9, 2023

Why does Kafka allow a consumer to process the same messages as another consumer by default anyway? With RMQ or others, this "consumer group" behaviour is just how it works be default.

reese_john · on Feb 9, 2023

A Kafka topic is divided into partitions. A consumer group can have many consumers. The consumers inside the same group will be assigned to consume a subset of the partitions.

Say you have a topic with 10 partitions. Consumer group A has 10 consumers and Consumer group B has 1 consumer.

Each consumer in group A will be assigned to a single partition,thus processing different messages.

The single consumer in group B will be assigned to all 10 partitions, processing the same messages as group A.

Kafka is a distributed log, so this pattern is useful for replicating data across downstream applications.

joking · on Feb 9, 2023

Different defaults, RMQ and others also had another way of storing the messages, so the sensible defaults for RMQ are different from the ones for Kafka. Head to head, Kafka does less processing on the messages and that work is left for the consumers, while rabbit does more work on the broker and less on the consumers. Depending on the requirements, one of the two is best suited than the other, but both can do almost the same kind of work.

tstrimple · on Feb 9, 2023

Because Kafka is not a message queue, but can be used as one. Multiple consumers on a topic is the default behavior because it’s very useful for the typical use cases in Kafka. I like to think of Kafka as bringing “micro services” to data processing. If I want order data to be both shown real time on a dashboard as well as aggregated for reporting purposes I’d build two different consumers for the same topic. One which sends the results off to the order dashboard via websocket and one which performs a rolling aggregation using a streaming query most likely. Maybe you now need to alert the sales team when an order comes in over a certain amount so they can give the customer the white glove treatment. Just add another consumer which is a ksql query for orders over $$$ which then get dropped into the high value orders topic. There you can have a consumer fire off the email to the sales team. Want to do other things when a high value order comes in? Add another consumer.

sumtechguy · on Feb 9, 2023

I believe the default of kafka is if you do not specify a consumer group it makes up a random one for you. It is an optional field.

Not sure 'why of it'. If it were me doing that sort of behavior I would think the idea was short lived processes that do a small amount of work then disappear. For that idea, consumer group would not make sense as you would just want to start at the front or end of the topic and blast thru it anyway. Once you add the idea of two different things happening or gangs of processes working from the same events with partitions then consumer groups makes a lot more sense to add in. But retro fitting it back in would be awful? So default is 'random' groupid? Probably in the pip's somewhere what the rational is.

woile · on Feb 9, 2023

That's the basics yes. You have a pletora of things coming next. One is "Windowing" mentioned in the article, it's well explained there, and maybe it looks simple, but when you start with it, takes some time to wrap your mind around it.

The other things in kafka world are stateful transformations, which you would normally do using Java's Flink. The closest in python is Faust (the fork) [0]. What are stateful aggregations? something like doing SQL on top of a topic: group_by, count, reduce, and joins. So similar to SQL that you have kSQL [1].

Consumer groups IMO falls under basic usage, if you need to scale, take a look at it, and what are partitions and replicas, with that in mind, you'll be ok.

[0]: https://github.com/faust-streaming/faust

[1]: https://www.confluent.io/blog/ksql-streaming-sql-for-apache-...

aprdm · on Feb 10, 2023

I throw my events in databases and then query them in the dbs themseveles through services when it makes sense to do so

reese_john · on Feb 9, 2023

Do you have auto-commit on for your consumers? If yes, which tends to be the case for most basic examples, then you'll have "at most once" semantics, and you may experience unwanted data loss* should your consumers fail.

If not, then you'll have to manually commit the offsets and your consumer will have "at least once" semantics, meaning that you'll have to deal with duplicate messages (you should be designing idempotent pipelines, anyway)

morelisp · on Feb 9, 2023

Whether you have at-least or at-most semantics is more related to when you commit relative to your other logic than whether you do it automatically or manually, but for common cases where you “do something” like an HTTP call or increment a number in Redis per message, autocommit will give you at-least (it will commit after the next poll call which happens after processing the items).

reese_john · on Feb 9, 2023

Good point, you can have both at-least and at-most once semantics.

pojzon · on Feb 9, 2023

Lets write a „How to upgrade self-managed Kafka with no downtimes”

Coz even Yelp says to „just in case schedule a maintenance window”

mau · on Feb 9, 2023

I don't know what were the issues Yelp was facing, I have upgraded several times my past Kafka clusters and really never experienced any issues. Normally the upgrade instructions are documented (e.g. https://kafka.apache.org/31/documentation.html#upgrade) and the regular rolling upgrade comes with no downtimes.

Besides this, operating Kafka never required much effort a part when we needed to re-balance partitions across brokers. Earlier versions of Kafka required to handle it with some external tools to avoid network congestions, but I think this is part of the past now.

On the other hand, Kafka still needs to be used carefully, especially you need to plan topics/partitions/replications/retention but that really depends by the application needs.

Hackbraten · on Feb 9, 2023

I used to work in a project where every other rolling upgrade (Amazon’s managed Kafka offering) would crash all the Streams threads in our Tomcat-based production app, causing downtime due to us having to restart.

The crash happens 10–15 minutes into the downtime window of the first broker. Absolutely no one has been able to figure out why, or even to reproduce the issue.

Running out of things to try, we resorted to randomly changing all sorts of different combinations of consumer group timeouts, which are imho poorly documented so no one really understands which means which anyway. Of course all that tweaking didn’t help either (gunshot debugging never does).

This has been going on for the last two years. As far as I know, the issue still persists. Everyone in that project is dreading Amazon’s monthly patch event.

sumtechguy · on Feb 9, 2023

Check the errors coming back on your poll/commit. The kafka stack should tell you when you can retry items. If it is in the middle of something sometimes it does not always fail nicely but you can retry and it is usually fine. Usually I see that sort of behavior if the whole cluster just 'goes away' (reboots, upgrades, etc). It will yeet out a network error and then just stop doing anything. You have to watch for it and recreate your kafka object (sometimes, sometimes retry is fine). If they are bouncing the whole cluster on you each broker can take a decent amount of time before they are alive again. So if you have 3 and they restart all 3 in quick succession all at once you will see some nasty behavior out of the kafka stack. You can fiddle your retries and timeouts. However, if that is lower than it takes for the cluster to come back you can end up with what looks like a busted kafka stream. I have seen it take anywhere from 3-10 mins for a single broker to restart sometimes (other times it is like 10 seconds). So depending on the upgrade/patch script that can be a decent outage. It goes really sideways if the cluster has a lot of volume to replicate between topics on each broker (your replica factor).

rehevkor5 · on Feb 9, 2023

Along similar lines, I found Flink's blog posts to be good reading. Some examples:

https://flink.apache.org/news/2015/12/04/Introducing-windows...

https://www.ververica.com/blog/session-windowing-in-flink

https://flink.apache.org/news/2020/07/30/demo-fraud-detectio...

bayesian_horse · on Feb 9, 2023

If you're developing something that involves windowing on a scalar timeseries, don't use kafka, use a timeseries database. Much much easier, probably even more performant.

ivanhoe · on Feb 9, 2023

Any recommendation on the time-series DB to try?

bayesian_horse · on Feb 9, 2023

Timescaledb is based on Postgres, always bet on Postgres...

BenoitP · on Feb 9, 2023

Does Timescaledb do streaming? I was under the impression that it was more focused on OLAP for time data.

I'm pretty sure Timescaledb is a very good product, but columnar vs row-based reactive streaming systems tend to very different needs.

bayesian_horse · on Feb 9, 2023

I guess there's different tradeoffs when you have a gazillion of these sensors or their output requires complicated parsing in the pipeline. But even then...

bayesian_horse · on Feb 9, 2023

I haven't used it much but it does do "continuous aggregates".

aynyc · on Feb 9, 2023

What does this means?

human-centered developer experience

luciacerchie · on Feb 9, 2023

Hi aynyc,

When I describe myself as interested in a human-centered developer experience, I'm making a reference to Don Norman (google Nielsen-Norman Group, or The Design of Everyday Things), who believed in human-centered design. I believe in not belittling the user/developer when they misuse an interface (website, CLI, API, what have you) but instead trying to understand what kind of internal motivation caused their behavior when it's unexpected to me. Making changes based on that leads to a good developer experience.

jjtheblunt · on Feb 9, 2023

programmer ergonomics: minimal cognitive load apis?

lloydatkinson · on Feb 9, 2023

https://developerexperience.io/articles/good-developer-exper...