Queues like Kafka can be great for a certain class of solution, but I see them overused.
The justification is usually a sort of handwave about scaling, but I tend to be skeptical of that argument.
I remember in the early 2000s writing a system that handled all of the consumer equipment orders and returns for Verizon: pick, pack, and ship, plus advanced billing integration with the ILEC phone bill, credit systems, IVR, etc. All on a PII-class server running SQL Server stored procedures and a .NET UI.
In 2009, I was processing a TB per month of 837 claims on even worse hardware.
Not that long ago, I saw a post on HN where a Casio calculator was handling the front page traffic.
My point is that even modest hardware has the capability to handle massive loads when the software is designed properly.
Kafka and its ilk are great for certain use cases, but it's normally a mistake to start out with them unless you're really sure you know what you're doing.
CQRS and data-bus/event sourced systems can be tricky to get right, and you lose a lot when you go to async processing, especially when submitting transactions from a user-facing app.
Inevitably you have to create some sort of polling or server-push mechanism to update the UI, and the number of edge cases that can bite you is vast.
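To make the polling half of that concrete, here's a minimal sketch of the pattern: the write path only enqueues the transaction, so the UI has to poll a status endpoint until the async consumer flips the state. All the names here (endpoints, topic, status store) are made up for illustration; a real system would keep the status in a database updated by the consumer.

```python
import uuid
from flask import Flask, jsonify

app = Flask(__name__)
status = {}  # stand-in for a real status store updated by the async consumer

@app.post("/orders")
def submit_order():
    order_id = str(uuid.uuid4())
    status[order_id] = "pending"
    # producer.produce("orders", key=order_id.encode(), value=...)  # enqueue only
    return jsonify({"id": order_id}), 202  # accepted, but not yet processed

@app.get("/orders/<order_id>")
def poll_order(order_id):
    # The UI polls here until the consumer marks the order "done" or "failed";
    # timeouts, retries, and duplicates are exactly the edge cases mentioned above.
    return jsonify({"state": status.get(order_id, "unknown")})
```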
Kafka's not a message queue, it's a distributed log.
The main advantage of Kafka is that you can:
1. Write a shitload of data to it fast
2. Experience a cluster failure or degradation with low risk of data loss
3. Retain absolute ordering (with caveats) if you're careful; see the sketch after this list.
4. Run large numbers of concurrent consumers without imposing significant overhead on the cluster
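As a rough sketch of points 2 and 3 with the confluent-kafka Python client (broker address, topic, and key are assumptions, not from this thread): acks=all trades latency for durability, and per-key ordering only holds if you keep the producer idempotent.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",               # wait for all in-sync replicas (point 2)
    "enable.idempotence": True,  # preserves per-key ordering across retries (point 3)
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")  # e.g. not enough in-sync replicas

# Records with the same key land on the same partition, so they stay ordered
# relative to each other; ordering across keys/partitions is not guaranteed.
for i in range(3):
    producer.produce("orders", key=b"customer-42",
                     value=f"event-{i}".encode(), on_delivery=on_delivery)

producer.flush()
```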
That's pretty much the whole list. I've made a good career out of Kafka these last few years, but there are _so_ many companies using it that don't need it, and what strikes me as passing odd is that when I've been consulting on their Kafka adoption and tell them so, they don't want to hear it.
Which is really weird IMO because I stand to make far more money by encouraging them to use it unnecessarily because...
...Kafka is a complex solution to a complicated problem, and often people using it don't have the sheer scale of data that would necessitate taking on that complexity.
> In 2009, I was processing a TB per month of 837 claims on even worse hardware.
Kafka starts to shine when you're creating X TiB of data a day, and would like not to lose it.
It's best mentally modelled as a series of rather resilient tubes. That way people can stop thinking it's just a super-duper scalable message queue that they can drop in as a black box and get all the semantics they're used to from ActiveMQ / RabbitMQ, etc.
Even Pulsar, which tries to emulate MQ semantics, isn't a drop-in replacement, at all. And it has even more moving parts than Kafka.
I've had the most luck at enterprises with large legacy foundations that they are desperate to modernize. These companies have tried to migrate to the cloud. They have tried building replacements for their legacy systems. They have tried paying millions of dollars to consulting companies to build a one-size-fits-all replacement for their legacy mess. Kafka has been the fastest way to start taming some of the ridiculous amounts of data going through these systems, which are so convoluted that no one knows where all the bodies are buried. I've found it gives a very good layer of separation where you can still interface with and get data from these legacy systems while also enabling modern application development with said data.
Company after company has spent years trying to just classify and normalize all of their data. These big data warehouse style environments always end up so brittle and take so much longer to get anything useful out of compared to the sort of immediate and incremental improvement you can get with a Kafka style migration away from these older systems. You're right that it's overkill for so many companies though. I'm curious to know where you've seen success with Kafka.
> I'm curious to know where you've seen success with Kafka.
Much the same as you, as a replacement for brittle data pipelines where multiple services are generating data, and then a cron moves it across a network file system, or the apps are pushing to something like ES directly, which has caused data loss when their ES wasn't configured correctly, etc. etc.
The nice thing about Kafka is that it decouples those dependencies, which gives you room to scale the ingestion side, and also allows for failures without data loss.
The one caveat, though, is that Kafka becomes the dependency. But in my years of using Kafka in anger (since 0.8), I've only ever encountered full cluster unavailability a few times.
Memorable ones:
* a sysop who decided that the FD limit for processes on the boxes running their self-managed Kafka should be a nice low number, which prevented Kafka from opening sufficient sockets
* a 3-node cluster where, to "prevent data loss", they were using acks=all and had configured replicas to 3 and min.insync.replicas to 3, and then a broker went down. Yeah, not great, that. (See the sketch after this list.)
* A 2.5 stretch cluster (brokers and ZKs in two AZs, tie-breaker ZK in a 3rd) suffered a gradual network partition when an AWS DC in Frankfurt overheated, and they ended up with two leaders for the same topic partition, one in each of their main AZs, both still accepting writes as they could hit min.insync.replicas within their own DCs. There was some other weirdness in how they interacted with the tie-breaker ZK. And when the network partition ended, neither of them could be elected leader. Had to dump the records from both brokers from the time the partition began, roll the brokers to allow an unclean leader election, then roll them again to turn it off once the election had succeeded, and then the client got to sift through the data and figure out what to do with it. As they weren't relying on absolute ordering, it was reasonably trivial for them to just write it again, IIRC.
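The acks=all outage above is just arithmetic: with a replication factor of 3 and min.insync.replicas also at 3, a single broker failure makes every acks=all write fail. A sketch of the usual fix, using confluent-kafka's AdminClient (topic name and partition count are made up):

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# min.insync.replicas one below the replication factor lets one broker die
# without blocking acks=all writes; setting both to 3 was the outage above.
admin.create_topics([NewTopic(
    "payments",
    num_partitions=6,
    replication_factor=3,
    config={"min.insync.replicas": "2"},
)])
```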
So for absolutely-must-not-lose data: if Kafka connectivity is lost, the app should make itself unready, and then fail over to writing any uncommitted data to disk for a side-car to upload to S3 or similar.
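A hedged sketch of that fallback (topic, spool path, and timeout are invented; the trade here is possible duplicates in exchange for no loss):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
SPOOL = "/var/spool/app/uncommitted.jsonl"  # a side-car ships this file to S3

def spill(payload: bytes):
    with open(SPOOL, "ab") as f:
        f.write(payload + b"\n")

def send_or_spill(record: dict):
    payload = json.dumps(record).encode()
    try:
        producer.produce("critical-events", value=payload)
    except BufferError:        # local queue full: brokers likely unreachable
        spill(payload)
        return
    if producer.flush(5) > 0:  # still undelivered after 5s: spill instead
        spill(payload)         # the record may also arrive later, so
                               # downstream must tolerate duplicates
        # ...and flip the app's readiness probe to unready here
```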
Kafka's great, but you have to code deliberately for it, which is where I saw the biggest mistakes occur - strapping Kafka to ActiveMQ with Camel and expecting Kafka to work the same as ActiveMQ...
Disclaimer: I work on NATS and am employed by Synadia, maintainer of NATS.
NATS is much more lightweight than the alternatives, has a much simpler operational story, and can run on low-resource hardware very easily.
Kafka fits a very specific use case, and NATS tries to solve a much larger set of distributed computing use cases, including micro-services, distributed KV, streaming, pub/sub, object store and more.
It’s a very different philosophy, but can be very powerful when it’s deployed as a utility inside an organization.
We’ve seen so many success stories, especially from companies moving to the edge, and we continue to see NATS used as a common transport for all kinds of use cases in finance, IoT, edge, Industry 4.0, and retail.
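For a feel of the programming model, here's a minimal pub/sub sketch with the nats-py client; the server address and subject names are assumptions for illustration:

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def handler(msg):
        print(f"{msg.subject}: {msg.data.decode()}")

    # Wildcard subscription: receives anything under factory.station.*
    await nc.subscribe("factory.station.*", cb=handler)

    await nc.publish("factory.station.3", b"job-started")
    await nc.flush()
    await asyncio.sleep(0.1)  # give the handler a moment to run
    await nc.drain()

asyncio.run(main())
```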
How robust is NATS's persistence layer? The strength of Kafka lies in its proven ability to maintain a high load while making sure you won't lose the events you're streaming.
Disclaimer: I'm a NATS maintainer working at Synadia (the company behind NATS)
It is robust and performant, with subject-based querying, HA, scalability, streams, key-value store, mirrors and at-least-once and exactly-once semantics.
Used in production on a large scale by companies across many industries.
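Persistence in NATS comes from JetStream rather than core pub/sub. A minimal sketch with nats-py (stream name, subjects, and replica count are made up; num_replicas=3 assumes a 3-node cluster):

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # A durable, replicated stream capturing every subject under events.>
    await js.add_stream(name="EVENTS", subjects=["events.>"], num_replicas=3)

    ack = await js.publish("events.orders", b"created")  # at-least-once publish
    print(ack.stream, ack.seq)  # acknowledged once persisted, with a sequence

    await nc.close()

asyncio.run(main())
```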
This has always been true. Some tight C code running on a Raspberry Pi could probably handle most of the data needs for ordinary businesses today. But to acknowledge this reality makes CIOs sweat ;)
All the times I've used Kafka, it's felt like the microservices confusion. A technical solution that can be taken to the extreme, when really it's a solution to a political problem.
If we break our application into small enough chunks then we start running into internal consistency issues, so we will introduce Kafka to resolve this.
I'm dealing with it at the moment, where my customer decided to break their CRUD application into CUD and R because "best practice". Now they need Kafka to synchronise the two applications, because it's a terrible design.
I understand that it might help some people, and I'm calling my client's design terrible rather than the whole concept. However, I have yet to actually see it used in a way that didn't increase complexity without demonstrable improvements. I see that Martin Fowler agrees.
It falls into the "you are not Google" problem solving hole.
I'm dealing with a financial institution doing this. CQRS should be an evolution you adopt if the system proves complex to model, use, and scale, not the initial implementation. Now we have another place to lose data.
What? Calling Kafka a queue seems to be a fundamental misunderstanding of what it is as a tool. Why would you be polling in Kafka for real-time updates to a UI? That's not necessary at all: a topic subscriber will automatically pick up new messages and can send them via websocket. Kafka doesn't mean databases go away, so there's absolutely zero reason for any sort of issues submitting transactions from a user-facing app. It's almost like you're not really talking about Kafka.
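A sketch of that push path, assuming confluent-kafka on the backend; broadcast() is a stand-in for whatever websocket layer you use, not a real API:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "ui-push",  # its own group, so it doesn't steal anyone's offsets
})
consumer.subscribe(["orders"])

def broadcast(payload: bytes):
    ...  # hypothetical: fan out to all open websocket connections

while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        broadcast(msg.value())  # push to browsers; no client-side polling
```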
He mentioned "or server push mechanism", aka websockets. I don't think they're trying to call kafka a queue, they're just talking about asynchronous architectures.
I agree that Kafka is overused. But IMHO Kafka's strength is not scaling but decoupling. Services can publish data without other interested parties being online. If you need this, is obviously another question.
I'm about to start implementing this sort of thing myself. We have a handful of webservices, an online store, a customer design tool and some third party connections to customers and suppliers. Then we have a factory of machines and operator stations that need to fetch and display information.
My thinking was to use something like Kafka, leaning towards NATS, as a big pub sub pipe.
I'm not looking for scaling. We are slow and don't need real time. What I need is uniformity. I don't want this connection to interface through FTP while that one is over a REST API. I'd prefer one connection for our main system and little interfaces to connect everything to the same pipe.
NATS definitely tries to solve for that end-to-end experience, where you can have NATS simultaneously driving websocket connections as well as backend connections in a secure and performant way. I’ve personally had a ton of fun making interactive applications with NATS; it scales reliably and is really fun to work with.
Kafka is a distributed log. A sibling comment to this mentions it as well, but I'm repeating it because I actually think that's where most people using Kafka go wrong. You _can_ use Kafka as a queue, but it would feel like using a drill to hammer a nail: it would kind of work, but would make you think the tool just sucks.
> The justification is usually a sort of handwave about scaling, but I tend to be skeptical of that argument.
Kafka does help with scaling insofar as you can independently scale the number of producers and consumers, but I find Kafka most useful for its ability to allow multiple consumers to process the same message(s) at different rates.
For example, let's say you have an app that's producing messages like "{foo: 1, bar: 2}" and you need to count how many foos there are and how many bars there are. You could queue the messages and then have one service that does both, or you could use Kafka and have a FooCounter and a BarCounter service. They each process the same message(s) but ignore the part of the payload they don't care about.
While this is a really trivial example, I've used, and have seen, this pattern for processing actual logs, for example.
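A minimal sketch of that with confluent-kafka, where FooCounter and BarCounter are just two consumer groups on the same topic (broker, topic, and everything beyond the commenter's FooCounter/BarCounter names are assumptions):

```python
import json
from confluent_kafka import Consumer

def run_counter(group_id: str, field: str):
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,            # each group tracks its own offsets
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["counts"])
    total = 0
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        payload = json.loads(msg.value())  # e.g. {"foo": 1, "bar": 2}
        total += payload.get(field, 0)     # ignore the field we don't care about
        print(f"{group_id}: {total}")

# Run run_counter("foo-counter", "foo") and run_counter("bar-counter", "bar")
# in separate processes: each sees every message, at its own pace.
```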
You don't need to scale for performance, but you do for reliability. SQL databases still tend to rely on having one server that never goes down, or at best having a replica and a semi-manual "promotion" procedure.
Kafka does what a datastore should and nothing more. It's what databases should have been. High performance is a consequence of that good design, but the good design is valuable in itself.
I would always start with Kafka. There are so many ways to shoot yourself in the foot with an SQL database that will bite you later, sometimes in a way that may corrupt your data permanently (e.g. read-modify-write logic), and you'll almost certainly end up having to change your data architecture to the one you'd've started with if you were using Kafka. Or you can just throw it in Kafka from day 1 and a lot of bad ideas just become impossible because the system gives you no way to do them.
I was also working on phone systems in the 2000s that ran at "web scale". We used MySQL, but we essentially built Kafka on top of MySQL: tables for every intermediate step in the process, and a way for the next step in the processing pipeline to mark which rows it had started/finished consuming.
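Roughly, the shape of that pattern (a hedged reconstruction using sqlite3 for brevity; the actual schema wasn't shared):

```python
import sqlite3

db = sqlite3.connect("pipeline.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS stage1_events (
    id       INTEGER PRIMARY KEY,              -- behaves like a log offset
    payload  TEXT NOT NULL,
    consumed_by_stage2 INTEGER NOT NULL DEFAULT 0
);
""")

def process(payload: str):
    ...  # stage 2's actual work would go here

def consume_batch(limit: int = 100):
    # The next stage reads forward in id order and marks its progress:
    # essentially a hand-rolled consumer offset.
    rows = db.execute(
        "SELECT id, payload FROM stage1_events "
        "WHERE consumed_by_stage2 = 0 ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    for row_id, payload in rows:
        process(payload)
        db.execute("UPDATE stage1_events SET consumed_by_stage2 = 1 "
                   "WHERE id = ?", (row_id,))
    db.commit()
```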
> Queues like Kafka can be great for a certain class of solution, but I see them overused.
> The justification is usually a sort of handwave about scaling, but I tend to be skeptical of that argument.
Actually, this sort of solution is a great fit for GPGPU, where each thread has some kind of a "window" and will perform N aligned accesses in a stepwise fashion to compute its particular data. Aligned/coalesced accesses and "re-computation" are very cheap in GPGPU, and that often means you lean on re-computation rather than notarization or other "stateful" approaches.
I think something like this would be a great fit to AVX-512 probably.