RabbitMQ vs. Kafka – An Architect’s Dilemma (Part 1) (eranstiller.com)
292 points by gslin 12 months ago | 166 comments



I've seen Tibco Rendezvous used in manufacturing. ~300 megabytes per hour of raw log generated 24/7/365 by tools and control systems in a factory setting. Probably on the order of 10k+ participants in the pub/sub network.

If you are running something like a factory where thousands of independent systems need to communicate in some way, this kind of tech starts to look like the only option.

If you are orchestrating the concerns of 5-10 services, I think you are making your life harder than it needs to be with ESB-style abstractions. Direct method invocation is much more reliable than whatever any one of these vendors could ever sell you. Put all your services into one exe. If you can't be bothered to use one language/repo, there are still ways to achieve this.

The real architect's dilemma is shoving one's edifice-constructing ego into a box long enough to produce a useful shack for the business.


Our prod cluster generates that about every minute at O(1M) qps.

We JUST turned on remote logs because until now Kafka didn't have the capacity.


TIBCO Rendezvous is tech from 1998/99; millions of messages per second didn't exist at the time. Only NYSE and NASDAQ were capable of producing millions of events back then (and even then not per minute, let alone per second).

TIBCO Rendezvous was one of the first successful large-scale, low-latency, near-real-time pub/sub implementations, and it had a very efficient, Avro-like, wire-level serialisation format that made messages very compact and efficient to deliver. It was very popular in finance, banking and manufacturing, and is all but legacy now.


$$$, not capability. We have ~50 hosts that generate up to a TB per day in just logs and 50k hosts that generate O(200MB/day). For the large hosts, ssh and grep work surprisingly well, but the smaller hosts are where the real benefit is.

Hard to justify a team of 7 burning several million in just logging infra costs.


My previous company's Kafka cluster was handling 20 million messages per second 5 years ago, and dozens of petabytes of data per day. Maybe your particular cluster didn't have the capacity to handle 1M qps, but Kafka easily had that capacity years ago.


I have to ask, what value is this adding business-wise to store so much?


Kafka, when used correctly, is like the nervous system for your entire company. You use it like a message bus and dump every single thing into it, and you can extract it at your leisure, but mostly in real time. It completely transforms how you do services in a company, but it also means you have to invest a lot of money and manpower into maintaining it, because it is mission critical.


Not OP but I think it isn't always about storing, but having a log of events which get routed, processed, and aggregated in many cases.


It was quota and hardware, not ability. This is a single service onboarding and they need the hardware.

And at that scale, we need to grep the logs, so the downstream needs the ability to process that volume, which it couldn't until about 2 years ago.


NATS was developed by an ex-TIBCO engineer.


It’s not apparent as a dilemma until the said architect has spent years -convinced- that the grand edifice is “good architecture”, and finally matured as a practitioner. Only after that phase passes is there an actual ego-driven dilemma, strictly speaking.


I use NATS to create a secure pipe between buildings and the cloud. I don't need speed, but I do need routing and I do need a security layer. NATS just works and took very little setup.


Or NATS


or kafka, or any other publish/subscribe system


Your use case is EASY - it's EASY!

This is a SOLVED use case.


> one is a message broker, and the other is a distributed streaming platform

I think this is an odd way of putting it. One is smart messaging; dumb clients. The other is dumb messaging; smart clients. It turns out the latter (i.e. Kafka) scales wonderfully so you can send more data, but you add complexity to your clients, who can't now just pluck messages off a queue to process, or have messages retried on the first 3 failures, as they could with RabbitMQ.

Having said that, Kafka lets you keep all your data, so you don't have to worry about losing messages to unexpected interactions between RabbitMQ rules. But having said that, now you have to store all your data.


> who can't now just pluck messages off a queue to process

The problem is you cannot mark individual messages as read; for a given consumer and partition, you can only advance the offset.

If a certain message processing takes very long, all other messages in that partition will have to wait.

Also, with Kafka, the max read concurrency is equal to the number of partitions; for something like RabbitMQ it is much higher. But you do get nice message ordering for any given partition in Kafka, which you do not get in RabbitMQ (AFAIK). You also get some really nice data locality with Kafka, because unless the consumers get the partitions re-assigned, all messages for the same key are served to the same physical consumer.


> The problem is you cannot mark individual messages as read; for a given consumer and partition, you can only advance the offset.

Hence "smart clients". If you MUST process every message at least once, you will anyway be tracking messages individually on the client (e.g. a DB or file system plus logic for idempotent message processing) and thus disable auto-offset commits back to the cluster for your consumer.

RabbitMQ says "let me track this for you", Kafka says "you already need to track this so why duplicate the data in the cluster and complicate the protocol".

If you don't have reliable persistent storage available and insist on using the Kafka cluster to track offsets, you can track processed offsets in memory and whenever your lowest processed offset moves forward, you have your consumer commit that offset manually as part of its message loop.

If your service restarts, your downstream commands need to be idempotent, of course, because you will reconsume messages you may have previously processed, but this would be the case regardless of Kafka or RabbitMQ unless you're using distributed transactions (yuck).

> If a certain message processing takes very long, all other messages in that partition will have to wait.

You can stream messages into a buffer and process them in parallel, and commit the low watermark offset whenever it changes, as described above. I've implemented this in .NET with Channels and saturate the CPUs with no problem.
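
A minimal sketch of that low-watermark pattern, assuming the confluent_kafka Python client rather than .NET Channels (single partition for brevity; process() is a placeholder for handing work to a parallel pool):

    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-group",
        "enable.auto.commit": False,   # we commit manually
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["events"])

    done = set()       # offsets processed so far
    watermark = -1     # highest contiguous processed offset

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process(msg)   # placeholder: hand off to a worker pool
        done.add(msg.offset())
        # Advance across any contiguous run of processed offsets; the
        # committed offset means "next to read", hence the +1 below.
        moved = False
        while watermark + 1 in done:
            done.discard(watermark + 1)
            watermark += 1
            moved = True
        if moved:
            consumer.commit(
                offsets=[TopicPartition("events", msg.partition(), watermark + 1)],
                asynchronous=True)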


You've made very good points about smart clients, but at some point one has to ponder whether it's worth it, or whether one should just not use Kafka in the first place.

I've seen databases used as messaging queues and if it was up to me, I'd never do that. It's usually "but we already have kafka + db, why burden ourselves with another messaging technology?", which is fair.

> You can stream messages into a buffer and process them in parallel, and commit the low watermark offset whenever it changes, as described above. I've implemented this in .NET with Channels and saturate the CPUs with no problem.

That is very nice -- certainly seems better than just batch processing of Kafka messages, but you're still just kicking the can down the road. How large do you allow the buffer to become, and what do you do when it's getting too large?

You probably use a DLQ.

Don't get me wrong, I think the buffer idea probably works most of the time.


Completely agree. Kafka was another team's decision, not mine, so I had to figure it out. RabbitMQ is very convenient in that you don't need to read a couple of books on reliable data integration patterns to get something working simply and intuitively.

I am fond of Kafka now that I understand it, but I was also an assembly language programmer in a past life so my opinion is probably in the minority.

Regarding the buffer size: you need to implement back pressure, especially if you are CPU and not IO bound; it's another thing that's easy to get wrong with Kafka.


> You can stream messages into a buffer and process them in parallel, and commit the low watermark offset whenever it changes, as described above. I've implemented this in .NET with Channels and saturate the CPUs with no problem.

And there are libraries that will manage all this for you e.g. https://github.com/line/decaton


If you have idempotent messages, why can't you use auto offset committing?


You are quite correct - you absolutely can use auto offset commits in that case. In my scenario, though, I have a lot of messages and a low recovery time objective on service restart so I find it cleaner to skip messages I know I won't need. Also reduces noise on the service logs, makes for easier debugging etc.


Worth noting that Kafka is getting queues: https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A...


And also Rabbit has streams[1]. There's a lot of overlap.

[1] https://www.rabbitmq.com/streams.html


Just my 2c but for anyone unaware, you should check out NATS.

It combines the best of both Kafka and RabbitMQ IMO.


I thought NATS didn't actually store messages, am I mistaken?

Looking at Wikipedia (https://en.wikipedia.org/wiki/NATS_Messaging) I see that I'm technically right, it's JetStream that does the storage layer - but it's part of the NATS server.

From memory, I really liked the philosophy of NATS but found the nomenclature confusing.


I think they call that part of NATS “Jetstream” if I’m not mistaken. I haven’t used it, but I believe it has some form of message persistence.

I have used it mostly for message-first services, and found subject-based messaging a breath of fresh air to decouple services. You can do the same thing with RabbitMQ topic exchanges, but it requires quite a bit more hand-waving.


Jetstream does indeed have message persistence: I can issue queries like “get messages on topic since 5 minutes ago” - I do this a ton. However, that seems to be the extent of the storage/query API that it exposes for historical messages. I’m quite a big fan, and would recommend it with the caveat that Jetstream is considerably more complex than simple nats and I get the feeling I’m barely scratching the surface with it.


NATS JetStream also implements subject-based addressing at the stream level (unlike Kafka where 1 stream = 1 topic, and you can only use the message's key for distribution, not for addressing).

So you can for example ask for [the first/the last/all] of the messages on a particular subject, or on a hierarchy of subjects by using wildcards. All the filtering is done at the server level.
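
For illustration, a hedged sketch of that server-side subject filtering with the nats-py client (the subject names are made up):

    import asyncio
    import nats

    async def main():
        nc = await nats.connect("nats://localhost:4222")
        js = nc.jetstream()
        # '*' matches exactly one token, '>' matches one or more;
        # the server does the filtering before delivery.
        sub = await js.subscribe("orders.*.shipped")
        msg = await sub.next_msg(timeout=5)
        print(msg.subject, msg.data)
        await nc.close()

    asyncio.run(main())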


I’m really interested in a Kafka-like message broker in the Go ecosystem and look forward to checking it out for whatever my next project ends up being.


It’s pretty cool. I would personally suggest Kafka or RabbitMQ depending on your needs, as JetStream has proven to require a lot of ops engineering to remain stable in production.


You're getting to the key thing:

You don't want to classify them by what they do. You want to classify them by what the clients must do/experience.


It is not odd, it is basically accurate. You are making a fetish of the S-C interaction, but the essential matter is that Kafka is designed to store & distribute logs, whereas Rabbit is designed to route & send messages. The ‘store’ bit is very much a part of Kafka’s mission statement but not of Rabbit’s.


> One is smart messaging; dumb clients. The other is dumb messaging; smart clients.

All the smartness of the messaging can be implemented in the smart clients. Then you can expose that as a smart messaging api to dumb clients.

The most obvious example is Kafka Streams, which exposes a "simple" API rather than dealing directly with Kafka, but obviously you could create a less featureful wrapper than that.


I can't help but think that this just gives you the worst of both worlds. You are now on the hook for managing that non-standard "smart" wrapper, which will quickly just become the status quo for the project. Anyone wanting to change how it works needs to understand exactly how "smart" you made it and all the side effects that will come with making a change there.

I pushed against knative in our company particularly for that reason. Like we wanna use kafka because [Insert kafka sales pitch], but we don't want our developers to utilize any of the kafka features. We're just gonna define the kafka client in some yaml format and have our clients handle an http request per message. It didn't make sense to me.


That's kind of like saying don't use any software libraries because they all use the standard lib indirectly, so you may as well just use that?

It's just an abstraction layer to make things less effort.


> That's kind of like saying don't use any software libraries because they all use the standard lib indirectly, so you may as well just use that?

This is decent advice, IMO. The cost of dependency management is often vastly understated.


That’s Not Invented Here syndrome, and it’s decidedly bad advice.

The cost of dependency management may be understated but it’s always less than the cost of reimplementing everything found in established libraries.


One HTTP connection per message (if this is what the original poster meant) is probably a bad idea whether you implement it yourself or not.

Also, let's be honest: the phalanx of developers who violently argue for importing everything and never implementing anything yourself is way bigger than the crowd who argue for the opposite; I don't think we need to worry about the latter making things worse as much as the former.

We've seen what the world turns into in both scenarios, I would argue, and at least with the first one we got software that ran decently and developers who knew how to actually do things. We have overall much safer languages now so their home grown solutions won't have the same safety issues that they've historically had.

With the importer crowd we've gotten software that feels like molasses and an industry full of people who know only the surface of everything and, more importantly, become tied to frameworks and specific APIs because they never actually go beyond any surface APIs.

As with most tradeoffs in software there is a good middle-ground, but we won't ever get there if we don't have people who argue for making things yourselves.


Who has the time to count the cost? Just keep shipping and worry about the costs later is the reality in a large portion of the tech ecosystem.


Being judicious in which dependencies you take on is not the same as Not Invented Here syndrome. Code is usually a liability, not an asset.


Yeah, don't wrap all calls to a standard lib in another homegrown or non-standard single-digit-user lib that makes changes in all sorts of subtle ways. There are plenty of C++ projects that make their own or wrap the stdlib, and they are always a big wtf.

It's one thing to have an abstraction for Kafka in your code; it's another to wrap the client in a smart client that reimplements something like RabbitMQ, and much worse a smart service.


> don't wrap all calls to a standard lib

I'm not saying to expose the same primitives - what would be the point in that? I am saying that EVERY lib you use will be using the standard lib or some abstraction of it to perform its own utility.

> It's one thing to have an abstraction for kafka in your code, it's another to wrap the client in a smart client, and much worse a smart service.

That abstraction is exactly what I am talking about. Why write 50 lines of boilerplate multiple times throughout your code when you can wrap that up in a single function call and expose THAT as your client? You know that's exactly what you will end up doing on any non-trivial project. Or you could use a lib that already does that - such as the "official" Kafka Streams lib.


Mediocre libraries sink to the bottom over time. Home-grown libraries have a different quality and don't sink easily. This is definitely YMMV for different teams.

I can imagine home-grown libraries having inconsistent APIs with wild and wonderful assumptions, and edge cases to beware of.


This would be my instinct too.


And reimplement RabbitMQ? Great idea. Let's do it in Rust too.


That's a neat way to put it!

- RabbitMQ: Smart messaging, dumb client

- Kafka: Dumb messaging, smart client

Have you heard of Fluvio? Fluvio: Smart messaging, Smart Client, Stateful Streaming

Kafka + Flink in Rust + WASM. Git repo: https://github.com/infinyon/fluvio


> All the smartness of the messaging can be implemented in the smart clients.

How do you do, for example, a queue with priorities client side without it being insanity? That's a relatively basic AMQP thing. Or managing the number of redeliveries for a message that's being repeatedly rejected.

You can absolutely try to build some of this with a look-aside shared data store that all clients have to depend on in order to emulate having the capability in the broker, but you just introduced another common point of failure in addition to the messaging infrastructure. Life is too short for this.


I totally agree that you can't do a lot of AMQP stuff. As you noted, you can build some of it by managing state via transactional producers, etc. - but you definitely can't do everything. The biggest gripe for me is actually dynamic "queue" creation, patterns for topics, etc. So I use an MQ for an MQ ;)

I'm just saying you can "dumb down" the client side on kafka by creating an abstraction layer (or one of the many higher level libs that already do that).


Those requirements are exactly the kind that would be fulfilled by smart messaging.


Every decision has a consequence. There are a lot more options depending on the use case.


If you have a throughput problem you are most likely doing it wrong. If your knee jerk reaction is to scale your messaging up you may want to reconsider. Messaging systems are usually hard to scale up and always very costly to do so compared to the amount of data they are transferring.

The simplest thing you can do is to realise WHY you are using messages. Messages are there to trigger a process. Usually, you don't need a lot of data to trigger a process, the message just needs to let the system know enough to locate all necessary information.

Also, when you are sending information at an extremely high rate, there usually is no difference if each message is processed separately or in batches.

So what can you do in practice?

1) Get the producer to batch the messages. For example, set rules like "batch up to 10,000 messages, up to 100ms, up to 100MB of data, whichever comes first".

2) Serialise the batch (for example, if it makes sense, create a compressed JSON file).

3) Upload the file to some high-throughput, scalable, cheap storage (for example S3).

4) Send a message to the queue / topic / whatever else you are using with just enough to locate and process the batch -- usually just the link to the S3 object.

This usually can be modified depending on specific project needs.

Now your messaging only ever sees a small number of very small messages and you will never have any scaling problems, at least not on the messaging side.
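
A rough sketch of steps 1-4, assuming boto3 and pika; the bucket, queue name and thresholds are illustrative, and the size threshold is omitted for brevity:

    import gzip
    import json
    import time
    import uuid

    import boto3
    import pika

    MAX_MSGS, MAX_AGE_S = 10_000, 0.1
    s3 = boto3.client("s3")
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="batch-pointers", durable=True)

    batch, started = [], time.monotonic()

    def flush():
        global batch, started
        if not batch:
            return
        key = f"batches/{uuid.uuid4()}.json.gz"
        s3.put_object(Bucket="my-ingest-bucket", Key=key,
                      Body=gzip.compress(json.dumps(batch).encode()))
        # The queue only ever sees this tiny pointer message.
        channel.basic_publish(exchange="", routing_key="batch-pointers",
                              body=f"s3://my-ingest-bucket/{key}")
        batch, started = [], time.monotonic()

    def on_record(record):
        batch.append(record)
        if len(batch) >= MAX_MSGS or time.monotonic() - started >= MAX_AGE_S:
            flush()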


If your use case can stomach the added failure modes this implies, yeah, that's what you can do.


Everything has some cons, some failure modes.

All engineering is about knowing, understanding and making tradeoffs.

In my practice, I am happy if I can get rid of hard problems (my messaging platform being unable to process X messages per second) and replace them with relatively easier problems (my persistence might sometimes fail and then I can't send a message).

I would argue that distributed persistence solutions are usually more reliable than messaging platforms, and also that what is a very large throughput for a messaging solution is usually nothing much for monsters that are engineered to take much larger volumes of data. And so, in my experience, reducing load on messaging and increasing load on persistence is a net positive for overall reliability.


1) Some designs can't tolerate the producer sending messages at such a delay.

3) S3 is not cheap storage; it is significantly higher cost than most on-prem solutions when talking about large-scale (petabyte-scale) storage.


1) If you can't accept and process the load of messages with your messaging, the discussion of whether 100ms is or is not an acceptable delay is very much pointless.

Messaging middleware is by design not well suited for architecting real-time systems. If you require real-time guarantees, you would benefit from some other communication channel.

3) S3 is orders of magnitude cheaper compared to messaging platforms like RabbitMQ or Kafka. If you take load off of your RabbitMQ or Kafka and put it on S3, you should see a significant reduction in cost.

S3 might be more expensive than other persistence solutions, true. Just choose whatever else you have. I used S3 as an example because it is extremely easy to implement and get going.

Again, all engineering is about tradeoffs. You compare two solutions; they will always have some cons. You just decide which cons you can live with. If your platform can't process messages at all and you don't know how to scale it up to do so, that's a pretty large problem in my book.


If you want features of RabbitMQ (specifically queue like behavior) but the scalability of Kafka then you probably want Apache Pulsar.

To elaborate on that a bit the main things Pulsar gives you are:

1. Still an underlying distributed stream-based architecture; this is what makes it able to do Kafka-like things.

2. Broker-side management of subscription state, which allows out-of-order acknowledgement; this means you can use it like a queue (see the sketch after this list). Subscriptions sort of act like AMQP mailboxes but without the exchange routing semantics. Vs Kafka, which can only do cumulative acknowledgement, i.e. head-of-line blocking.

3. Separated "compute" and storage. By storing data in BookKeeper, you can scale your capacity to support a lot of consumers separately from how you stash the data those consumers need to read, vs Kafka where these 2 are coupled and an imbalance between the two becomes awkward.

4. Built-in offload with transparent pass-through reads. When your data falls off the retention cliff for your standard broker cluster, the data can be archived to object storage. The broker can still transparently handle read requests for these earlier messages, just with higher startup latency to pull the archived ledgers.

5. Way more pluggability than Kafka; in fact, similar pluggability to RabbitMQ. You can implement your own authz/authn, or a different listener to support a different protocol (there is a Kafka one, MQTT, AMQP etc).

6. Much greater metadata scalability. Before the new KRaft implementation, the layout of metadata in ZK meant that you couldn't feasibly have more than about 10k topics, especially because of how long the downtime would be on controller failover. Pulsar can easily support much larger numbers of topics, which prevents needing to use a firehose design when you would prefer individual topics per tenant/customer/whatever.
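
A hedged sketch of point 2, using the pulsar-client Python lib; the topic and subscription names and handle() are illustrative:

    import pulsar

    client = pulsar.Client("pulsar://localhost:6650")
    consumer = client.subscribe(
        "jobs",      # topic
        "workers",   # subscription name
        consumer_type=pulsar.ConsumerType.Shared,  # many consumers, one subscription
    )

    while True:
        msg = consumer.receive()
        try:
            handle(msg.data())                   # placeholder for real work
            consumer.acknowledge(msg)            # ack just this message
        except Exception:
            consumer.negative_acknowledge(msg)   # redeliver just this message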


While Pulsar on paper seems a superior solution, in my experience it is still very immature and very buggy. I really want to use it over Kafka, but I would not bet my business on it.

I am not a fan of Kafka; it's kinda old, and the code is a bit messy. A lot of the once-only semantics problems 100% solved by Pulsar are sorta kinda in Kafka these days, and all the newer stuff like built-in Raft makes it competitive with anything.

But anyone who has used Kafka at scale has to say it 100% does what it says on the box. Many people are used to its idiosyncrasies at scale and can get it to scale.

Now RabbitMQ. I have been burnt so badly by it breaking at scale I'll never be touching it again. Maybe my fault, maybe not, but replacing it with Kafka solved all my flakey issues and I never looked at it again.


I have run all 3 at big scale. Kafka is still great as long as everyone using it understands it's a stream, not a queue, and that using it like a queue is going to get them burnt. I don't touch RabbitMQ with a 30ft pole anymore; too many lost days and nights to split brains and other chaos.

Pulsar has mostly replaced Kafka for me because I don't need to worry about people coming along and changing requirements after the fact and saying, actually, yeah, we do need queue semantics. Or, actually, yeah, we need data older than the ~48hrs we want to store in Kafka, and we don't want to teach our application to read from the archive.

Pulsar is definitely greener than Kafka in a lot of ways, but the underlying stuff is very solid; BookKeeper in particular is tough, so you aren't really at risk of data loss, but you might run into bugs that make brokers do silly things, and that can be annoying. Generally speaking, though, if you validate both the paths you are using and new releases, it's been fairly OK for me.

The big thing was being able to directly connect external clients into Pulsar using the WebSocket listener on the proxy and plugging in my own authz/authn logic. This eliminated a layer that would otherwise need to be implemented separately.

So far I have been happy with Pulsar, if you haven't tried it for a while you should give it another go. It will only get mature if people use it. :)


Pulsar is a very interesting architectural case study. The cost of the greater clarity and flexibility is the greater management burden. The server-side functions are nice (and remind one of JEE MBeans), but the direct challenge to Kafka is the decoupling of storage from servers via BookKeeper (which adds a lower-layer cluster management burden), addressing the rebalancing headaches of the Kafka type of solution (where server and storage are unified).


I am contemplating this exact topic for my project at this moment. It would be great if you could briefly explain what, as per your understanding, stream vs queue semantics are. I am studying it and have found somewhat confusing discussions on the internet and in in-person forums.


Ok so the ELI5 is that a stream essentially has to be consumed in order while a queue can be processed out of order.

This is a gross oversimplification as all ELI5 are but it's a decent rule regardless.

The reason for this is that streaming systems by and large function on some sort of offset mechanism. Your client when it's receiving messages is generally calling something similar to poll(fromOffset, max) to get some messages and then keeping track of the max offset it's published somewhere (Kafka has consumer groups to help you store your offsets).

The problem with this model is that you can generally only a) get messages in order on a given topic partition starting from some offset, and b) "commit" the latest message you processed.

This is fine if the chance of failure for a given message is the same for all messages, i.e. streaming database updates into a backup. Either the backup target is available or it's not; if one message fails, all would likely fail.

On the other end of the spectrum you have something like a queue of webhook jobs to execute against 3rd party/user supplied targets. The chance of any given webhook failing is entirely divorced from the rest in the queue.

So if you were to try use a stream for the webhook case you would quickly get blocked on the first bad webhook server you ran into. While with a proper queueing system you could kick that job back with a delay and process it again later without blocking work on other tasks or being able to commit which tasks have been processed.

This is generally called the head-of-line blocking problem.
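
To make the contrast concrete, a hedged sketch of the queue side with pika and RabbitMQ, where one bad webhook target doesn't block the rest (deliver_webhook() is a placeholder for the actual HTTP call):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="webhooks", durable=True)
    channel.basic_qos(prefetch_count=16)  # allow 16 jobs in flight at once

    def handle(ch, method, properties, body):
        try:
            deliver_webhook(body)  # placeholder
            ch.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            # Only this message goes back; the others keep flowing.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    channel.basic_consume(queue="webhooks", on_message_callback=handle)
    channel.start_consuming()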


Your comments seem to closely reflect mine. I'll have to take your advice and give it another go. It's everything all the others want to be. The architecture with distributed bookkeeper underneath seems so much more advanced.


We also switched to Pulsar after running some benchmarks for our use cases. We use these services primarily as worker queues for image tasks that require low latency. And Pulsar turned out to have a 20x lower latency than Kafka in our setup.


I think this is an excellent article. The only thing I'd add is that RabbitMQ is an implementation of AMQP (optionally v1.0) as a standardized broker service protocol, so it is designed to be interchangeable with other extant implementations such as Apache ActiveMQ and Qpid, whereas Kafka is one-of-a-kind software. Beyond that, RabbitMQ has standardized client libs and frameworks in Java land, if that matters to you - it did matter in the original context of message-queue middlewares and SOA, from where AMQP originated and where enterprise messaging sees major use. OTOH Kafka, with caveats, is in principle more "web scale" - though that is far from a free ride.


NATS (https://nats.io/) is another option, though I'm not sure if it's still considered a viable Kafka replacement.


It’s true FOSS, and the server is a standalone Go binary that's so small it can even be embedded. Lots of language bindings for clients. It has persistence and durability, and nicely aligns into a Raft-like cluster in a DC without a separate orchestrator.

I’m a big fan – never understood why it’s not at the top of the list in these tech reviews.


RabbitMQ is FOSS and has lots of language bindings. It has persistence and durability, and doesn't require a separate orchestrator.


I was mostly comparing against Kafka but yes I should def take a look at RabbitMQ again. I remember there was some reason it wasn’t a good fit for me but can’t recall what it was.

Are the horizontal scaling issues solved now?


Yes, if you use quorum queues


NATS is something else, but it's awesome. It has awesome throughput and latency out of the box (without Jetstream), while using little resources.

I'd recommend considering it, especially as an alternative to RabbitMQ.


I only tested NATS using JetStream, and I struggled with the throughput in Python. I probably used it wrong. But your comment may imply that JetStream is slow.


I think sometimes the client bindings are/were in need of improvement.

As an example, the C# API was originally very 'Go-like' and written for .NET Framework, and didn't take advantage of a lot of newer features... to the point that a 3rd-party client was able to get somewhere between 3-4x the throughput. This is now finally being rectified with a new C# client; however, it wouldn't surprise me if other languages have similar pains.

I haven't tested JetStream but my general understanding is that you do have to be mindful of the different options for publishing; especially in the case of JetStream, a synchronous publish call can be relatively time consuming; it's better to async publish and (ab)use the future for flow control as needed.


I didn’t mean to imply that Jetstream is slow. It’s just that I did my benchmarks without it. On a local PC, with 10 KiB messages sent (synchronously) in a loop, I could transfer 3.2 GiB over 5 seconds with 0.2 nanoseconds latency. Performing the same test with RabbitMQ, I got even better throughput out of the box, but way worse latency.

Those numbers are for server 2.9.6 and .NET client 1.0.8.


We used it for some low-latency stuff in Python. It was about 10ms to enqueue at worst. However, we were using raw NATS, and had a clear SLA that meant the queue was allowed to be offline or messages lost, so long as we notified the right services.


I’m not familiar with the Python lib but it could be waiting for streams to acknowledge each message reception/persistence before sending the next one. Some clients allow transactions to run in parallel e.g. with futures.


If I'm not buying a message bus as a service, then NATS is great as a pub/sub and/or message-passing system.

It is simple to configure, has good documentation, and has excellent integration into most languages. It guarantees uptime, and that's about it. It clusters really well, so you can swap out instances, or scale in/out as you need.


Different set of promises. NATS is great but has a different tradeoff bargain from Rabbit or Kafka.


Could you expand on this a bit more? I am curious.


NATS has a decent-ish guide here: https://docs.nats.io/nats-concepts/overview/compare-nats

A few things they get wrong mostly about Rabbit:

+ RabbitMQ does support replay, and also has a memory-only mode which will support persistence in a cluster

+ RabbitMQ isn't that sensitive to latency between cluster members (no more sensitive than NATS in some setups).

+ RabbitMQ also supports Prometheus

A good (but incomplete) rule of thumb:

+ Kafka is a distributed Append-Only-Log that looks like a message bus. It allows time travel but has very simple server semantics.

+ RabbitMQ is a true message broker with all the abilities and patterns therein. It comes with more complexity.

+ NATS is primarily a straightforward streaming/messaging platform.

Also consider Redis' message-queue mode, ZeroMQ, MQTT brokers (Eclipse Mosquitto), and the option of just not using a message broker/queue system. Even as someone who really likes the pub/sub pattern, there's a good chance you don't need it, and you may be heading toward a distributed-monolith antipattern.


> distributed monolith antipattern

Beauty is in the eye of the beholder, I guess. (I would definitely not consider a distributed monolith an antipattern)

Also wanted to add NSQ to your very good list.


Thanks for this link -- really interesting.

I like the suggestion to rethink whether you actually need to be doing asynchronous computing with a message broker/queue/stream or whether you can represent your work another way.



One is a tomato, the other is an orange. From a distance they might look alike but they really are two completely different tools. This is a pretty solid explanation of the differences with good illustrations.

Rabbit can do everything Kafka does - and much more - in a more configurable manner. Kafka is highly optimized for essentially one use case and does that well. Nothing in life is free, there are trade-offs everywhere. I am not privy to which one is theoretically faster - but once you reach that question methinks the particular workload is the deciding factor.


Rabbit is an arse to scale past one broker. It was possible, but a pain, that might have changed.

Kafka is just a pain full stop.


At a previous company, about 10 years ago, we had roughly 10 RabbitMQ instances (brokers), all isolated. The system was essentially partitioned by queue server. We had a directory-ish service that would associate clients with their assigned node. It worked well, except if a client got too large we might have to move them to another queue.


The official rabbitmq controller for kubernetes is a breeze. Scales wonderfully without almost any effort.


> Rabbit can do everything Kafka does - and much more - in a more configurable manner

Sure, if you're doing like 10s of MB/s. RMQ is fast compared to AK if you're not adding durability, persistence, etc. Try to run gigabytes per second through it, though, or stretch across regions, or meet RTO when the broker gets overloaded and crashes... Get your shovel ready! ;)

Kafka itself is dumb but scalable and resilient; it's the client ecosystem that's massive compared to RabbitMQ's. Count 10 stream-processing, connectivity, ingestion or log-harvesting platforms that use RMQ as their backend, then name 10 languages that have supported libraries for RMQ... then compare that to Kafka.


I am not an expert in either and have only worked with Kafka. At a past job I had to write a connector job to parse and sanitize some extremely dirty, unstructured data and pass it along somewhere else. RabbitMQ supports this? What is the one use case of kafka? I think you have it backwards.


> parse and sanitize some extremely dirty, unstructured data and pass it along somewhere else

Can you be more specific? That to me sounds like hello world for either of these tools. "Sanitize data" is an application-level concern that neither Rabbit nor Kafka would handle. As far as "pass along somewhere else", again, both tools can do that.


It was a Sink Connector. I don't know what it was or wasn't supposed to do, but I was asked to do it, as is often the case in tech. I could have done any number of transformations in that process, though, which I'm not sure RabbitMQ supports.


It sounds to me like you aren't really even sure what you built. I have operated both Rabbit and Kafka at scale; I definitely do not have it backwards :)


No, I’m not, because it was years ago, and I’m asking for clarification because what was said immediately sounded wrong to me (I’ve managed a lot of rabbitmq deployments), and you’ve not really given one other than an appeal to authority. Guess I have my answer. Can’t find anything that suggests rabbitmq natively supports anything like sink connectors. Thanks.


> Can’t find anything that suggests rabbitmq natively supports anything like sink connectors

Kafka doesn't natively support them either. That would be Kafka Connect. I guess you could use it as an MQ, but it wouldn't be a very good one. It's more used as a data integration platform. If you want more MQ-like functionality OOTB on top of Kafka, you would want to use something like Kafka Streams instead.


Thanks for this clarification, this is what I was after.


So let me get this straight. You've used Kafka once, RabbitMQ never. You don't really know what you did with Kafka. But you somehow know that RabbitMQ cannot do the thing which you don't really remember anymore. Doesn't make much sense, to be honest.

Nobody can really have any sources for RabbitMQ being able to do it if you don't know what it supposedly cannot do. The way you described it, you simply read data, did something with it, and passed it to somewhere else. RabbitMQ obviously can do that.


> So let me get this straight. You've used Kafka once, RabbitMQ never

Not true, and the same goes for the rest of your snotty comment. I'd have a response, but it's best not to engage trolls. Another commenter answered the question I had. Good luck.

Also, did you register solely to make this comment? Pretty sad display, really.


> I am not an expert in either and have only worked with Kafka.

> I’ve managed a lot of rabbitmq deployments

... ?


You do not need to be an expert in something's internal workings to manage/monitor a deployment. Surely this does not need to be explained further.


> I think you have it backwards

You do need to be an expert when you start telling other people they’re wrong, though.


A classic exchange on the internet.


Nice post! RabbitMQ is a battle-tested, exceptionally fast, low-resource app, capable of handling millions of transactions per second. RabbitMQ will handle the vast majority of use cases. I'm puzzled why startups, or even banks, often use Kafka solely because of hype. Kafka, on the other hand, requires massive CPU and memory, often requiring its own K8s cluster just to stay alive.


Pretty much every bank uses Kafka as the central messaging layer. What people are missing in almost every post here is that write-once-read-many, without data duplication and with different offsets, is the killer app for Kafka, beyond just the near-infinite scale, which is also super appealing. The failure modes are way, way better than Rabbit's as well. Note: I owned the streaming platform for a top-5 bank in the US.


Yeah, I'm sorry to others, but if you require the guarantees and compliance that Kafka provides, Kafka wins, especially at this kind of scale. I'd love to see RabbitMQ scaled out to handle hundreds of trillions of events per day and able to retain years worth of highly durable, immutable, and replayable event storage.

Ultimately, this comparison is apples vs oranges...


RabbitMQ sucks to scale. Clustering and partitioning were terrible for a long time; maybe they still are. Clusters dying in split-brained ways, nodes crashing terribly / unrecoverably if they exceed iops or storage limits. You couldn't pay me enough to run a high-volume RMQ cluster again.

Never mind that the persistent durable log pattern of Kafka enables a lot of replay-type use cases that are very beneficial in financial systems specifically.

It’s not solely because of hype at all, it’s objectively better for many use cases.


If you have a clean event-driven architecture, i.e. messages are completely agnostic and decoupled from one another, you don't need Kafka.


Event-driven architecture is an architectural principle, and Kafka, RabbitMQ/ActiveMQ, Pulsar, NATS and so forth are implementations that support the event-driven architectural principle. Yet, all of them range in a variety, complexity and extent of features they provide which may or may not be a good fit for a particular use case.

Traditional message brokers (RabbitMQ and similar) do support the event-driven architecture, yet the data they handle is ephemeral. Once a message has been processed, it is gone forever. Connecting a new raw data source is not an established practice and requires a technical «adapter» of sorts to be built. High concurrency is problematic for scenarios where strict message-processing ordering is required: the traditional message brokers do not handle it well in highly parallel scenarios out of the box.

Kafka and similar also support event-driven architectures, yet they allow the data to be processed multiple times – by existing consumers (i.e. a data replay) and, most importantly, by consumers new or unknown at the time (note: this is distinct from the data replay!). This allows plugging existing data source(s) into a data streaming platform (Kafka) and incrementally adding new data consumers and processors over time, with the datasets remaining available intact. This is an important distinction. Kafka and similar also improve on the strict processing-order guarantee by allowing a message source (a Kafka topic) to be explicitly partitioned and guaranteeing that the message order will be retained and enforced for a consumer group receiving messages from that partition.

To recap, traditional message brokers are a good fit for handling the ephemeral data, and data streaming platforms are a good fit for connecting data sources and allowing the data to be ingested multiple times. Both implement and support event-driven architectures in a variety of scenarios.


NATS with JetStream provides _both_ queuing like a traditional message broker and multiple data replay from offset (plus KV, and request/reply)


This is a ridiculous statement if you really build an EDA. Kafka is what enables the decoupling.


If someone is asking if they should decide between RabbitMQ vs Kafka, they should 100% use RabbitMQ. It means they have no idea what they're dealing with, the architectural differences, and the investment that the company needs in order to use Kafka.

So use RabbitMQ.


How do you create anything with RabbitMQ that a) has performance characteristics under load you can reason about and b) can handle individual node or networking failures without data loss?

Kafka is overkill in most scenarios and you should probably just see if postgres isn't enough for your needs first (especially since you will almost certainly already need a database anyway). Kafka is more pain to setup and run than it ought to be. But underlying it is a useful and sensible abstraction for building robust distributed systems.


> Apache Kafka is not an implementation of a message broker. Instead, it is a distributed streaming platform. Unlike RabbitMQ, which is based on queues and exchanges, Kafka’s storage layer is implemented using a partitioned transaction log. Kafka also...

This seems like an important passage, drawing the crucial and long-awaited distinction between RabbitMQ and Kafka, and yet without having defined a "partitioned transaction log" the author strands the reader without any help in absorbing the distinction.


Hi, this is the article author here. Thanks for the feedback! I wrote this article 3.5 years ago, and it could definitely do with a shake-up. I agree this should be cleared up a bit.


This is sadly commonplace in tech blogs: rather than taking a great opportunity to hook the reader, the writer will drop a vocab term in bold and move on.


I’m personally a fan of Kafka. I think the design of persisting messages and tracking offsets for progress, instead of message acknowledgments, is much cleaner and more versatile.

You can get all the same advantages of message acknowledgments, but now you can also replay queues, let different applications use the messages (handy for cross cutting event/notification systems) and you get better scaling properties - which doesn't hurt at the small scale, and provides further scaling when you need it.


> You can get all the same advantages of message acknowledgments, but now you can also replay queues

with rmq you can reject/nack a message and have it put back on the queue. rmq is not well suited for long term historical retention inside queues a-la kafka's logs but it is possible to do.

> let different applications use the messages (handy for cross cutting event/notification systems)

rmq also does a publish once and fanout to multiple queues to support this. data is replicated so that could be a deal breaker, but it is possible.

how often have you had to diagnose a stuck consumer or some other kind of offset glitch where a consumer is unable to resume where it left off?

not knocking kafka here but I do think it is a tool you should reach for when you need to solve a very hyper focused problem, while rabbit is a tool more suited to most cases where queuing is required. kafka is a code smell in a lot of organizations from my experience - most do not need it.


> with rmq you can reject/nack a message and have it put back on the queue

I know other systems have semi-similar mechanisms, however most of them retain the “someone is the sole owner of this message” style design, which I think is fundamentally limiting. Owning application dies, is it acked or not? Acks but never gets around to putting it back on the queue? Who takes priority if 2 separate applications wish to watch the same stream of events?

I think Kafka’s "nobody owns it, acks are consumer-group level" gives you the same advantages for the application itself, without a number of the more difficult complications.

> rmq also does a publish once and fanout to multiple queues to support this

Which is probably fine for small volume or velocity topics, but is going to cause all sorts of load issues at higher scale.


> Who takes priority if 2 separate applications wish to watch the same stream of events?

Each app would get its own queue; the messages would hit a fanout exchange that routes the same message to both queues.
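
A minimal pika sketch of that fanout setup (exchange and queue names are illustrative):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.exchange_declare(exchange="events", exchange_type="fanout")

    # Each consuming application gets its own queue bound to the same
    # exchange, so every published message is copied to both.
    for q in ("app-a", "app-b"):
        ch.queue_declare(queue=q, durable=True)
        ch.queue_bind(exchange="events", queue=q)

    ch.basic_publish(exchange="events", routing_key="", body=b"order.created")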


> kafka is a code smell in a lot of organizations from my experience - most do not need it.

Kafka is really nice if you don't care that much about latency during peak load and you don't have absurd processing times for messages.


Kafka can sustain sub-20ms at millions or even billions per second scale. Processing-time delays are a bad-consumer-code and partition-design smell; i.e., your consumer shouldn't depend on a slower resource within an ordering domain. This can also be mitigated with an async consumer.


These sound like consumer issues to me.

Kafka has been extremely reliable with latency, even under load, in my experience.

If you’ve got badly lagging consumers that are trying to read from very old points in the topic while everyone else is at the head, you'll definitely see some increased resource usage, but again, that's mostly a consumer issue, and I've never seen performance degrade that much.


If you're concerned about latency you might want to consider zeromq. Stream processing doesn't really have a time expectation to it.


> now you can also replay queues

Yeahnah, that leads to people treating queues like databases (I'm looking at you, New York Times, you know what you did wrong).

It's either a queue or a pubsub; either way it's ephemeral. Once it's gone, it should stay gone. That's what databases, object stores and filesystems are for.

Kafka is a beast, has lots of bells and whistles and grinds to a halt when you look at it funny. Yes, it can scale, but also it can just sulk.

Rabbit has its own set of problems, and frankly I'd probably not choose either anymore.


What would you choose today?


It depends on the context.

Currently I'm using DDS, specifically from eProsima. I would avoid that implementation unless you're using Java.

I really like NATS. However, I would probably use whatever is bundled with the cloud system I'm using, unless it's super critical.

MQTT is quite nice for things, as is rabbit.


> (I'm looking at you new york times, you know what you did wrong)

You're going to have to be a tiny bit more specific here. NYT is THE factory of wrongness for sure. In every dimension. Are we talking "yellow cake" wrong, or somewhere else on the severity of f'up scale...


https://www.confluent.io/blog/publishing-apache-kafka-new-yo...

^ this.

All they needed was a database, or possibly a DB that supports row signing. I mean actually they could have done it with git. They don't publish that many stories an hour.

Everything about this setup is just plain wrong, and to then boast about it, absolute madness.


They wrote a post on how they disabled the deletion and compaction of the data in Kafka and used it as the source of truth.


> You can get all the same advantages of message acknowledgments.

Maybe 95% of cases, but not all.

Long message processing time really kills Kafka in a way it doesn't kill RabbitMQ. Combine it with inherent read parallelism being limited to the number of partitions, add in high variability of message rates, and bingo: that's like 90% of the issues I've had with Kafka over the years.


Message ordering is an illusion. Unless you track/store the messages on the client and are willing to deal with stuck queues due to failures in one "poisoned" message.


There are different kinds of order. Yes, there’s no total order in a distributed system, but you can have certain partial order guarantees. It’s nice if something is added before it’s updated, for instance.


Could you expand on that? How would you achieve "partial order" guarantees?


One type of partial order would be that a producer puts all the messages that are related in the same queue, so that A always precedes B. Basically the invariant becomes:

If a consumer sees an event B, it will certainly have seen the event A before that.

Assuming business logic is correctly written, that saves you from having to write certain retry logic on the B handlers. This requires the message queue to be always available. If it goes down, the system would not make progress (like a db - in fact the MQ is a db).

Once you add more actors/nodes to the same related events, maintaining a “causal order” can be very tricky and subtle, especially if you have an MQ and a DB as multiple sources of truth. So I’m not exactly endorsing it, even though MQ-as-a-DB (aka event sourcing) is a very interesting idea.


Let's say you have events coming in that result in inserts, updates, and deletes on a table with a certain primary key. Assuming no dependencies external to this table, you only need events involving a specific key to be ordered. I.e. it doesn't really matter if row_a gets updated before or after row_b. In either case, you end up with the same thing. So if you do something like kafka partitions and you send events to certain partitions based on their primary key, then those partitions will be ordered which will be enough.

That doesn't fix your example of dealing with individual errors, but in many cases that's enough.
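
A small sketch of that keyed-partitioning idea with the confluent_kafka producer; the topic and keys are illustrative. The default partitioner hashes the key, so all events for one primary key land on one partition and stay ordered:

    from confluent_kafka import Producer

    p = Producer({"bootstrap.servers": "localhost:9092"})

    # Same key => same partition => ordered; different keys may interleave.
    p.produce("table-changes", key="row_a", value='{"op": "update"}')
    p.produce("table-changes", key="row_b", value='{"op": "delete"}')
    p.produce("table-changes", key="row_a", value='{"op": "delete"}')
    p.flush()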


Couldn't agree more; messages should be completely agnostic of one another. If you have a decent event-driven architecture, you don't need Kafka, and you can be happy with Redis or RabbitMQ.


Hi everyone, this is the article author here. I was genuinely surprised to see the article pop up here 3.5 years after I wrote it.

Thank you for all the feedback! Given the advances in the past few years in this area, this article could use a serious update. A comparison to NATS and Pulsar is warranted, along with some extra explanations for some of the technical terms in the article.

I'll carefully review the feedback here and try to make an update at some point soon.


I'm sure there remain good use cases for message buses that you have to run yourself, where there really are millions of messages that can't be batched up and need real-time and whatnot. But you can get pretty far with:

1. Write a bunch of records to an S3 object.

2. Trigger a lambda to process — infinite scale out!

and if a queue really is needed due to constrained consumers, then:

1. Write a bunch of records to an S3 object.

2. The lambda trigger puts a message into an SQS queue with the S3 object's URI.

3. An auto-scaling group gets the message off the queue and processes the data in the object.

If ordering's important, then the SQS queue can be made FIFO. That has pretty low numbers in messages/second, but since records are being batched into S3 objects you can still have fairly high throughput.

It used to be that the elegance of queues for such systems would tempt developers into the operational slough of despond that is running these types of systems. Again, I'm sure it's warranted for some applications, but S3, Lambda, and SQS as above work nicely together.


You can get rid of Step 2 if you're going to be creating a new S3 object for each batch. AWS has an inbuilt feature to trigger SQS notifications for S3 operations: https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable...


What these articles always miss is what are the real-world scenarios that use Kafka or RabbitMQ? I've never used one professionally and beyond the async processing cases I come across in web dev (mostly sending emails asynchronously), or seeing devs using one to handle 5 req/s which doesn't need it, I don't have a good feel for when they are really needed - especially for companies at sub-Uber scale.

What real-world cases would you use them for?


My comment is mostly about part 2 of this post, but wrt message ordering being a Kafka "win", I'd raise the point that in the actual use case of "a consumer fails in some way to process the message" you can still end up with out-of-order processing of the consumer's input, since you might want to dump failures into a DLQ or something. The fact that the message isn't reappended to the topic by default for processing is kind of an academic point, no?

Unrelatedly, I've been looking at Pulsar lately. Anyone have experience with Pulsar and either RMQ/Kafka want to throw out some opinions from having tried both?


Pulsar can have both MQ semantics and pub/sub semantics. In pub/sub it's sorta like "Kafka with all bits people found it necessary to build later already built in", e.g., a proxy, schema registry, connectors, replication, tiered storage, all out of the box.

It also has lightweight streaming functions built-in, but they operate per record, so good for lightweight transforms/routing, not for stream aggregations etc.

It has more moving parts also, brokers are decoupled from storage, which is handled by BookKeeper, and replication between two clusters requires a ZooKeeper that both clusters can access, in addition to the ZK used by the brokers and bookies.

And it's a reasonably new project, so last time I looked into it, some of the documentation was incorrect, especially around managing bookies.


Massive and complex platform... at a certain point, why not just run 2 different platforms that are best-of-breed for each?


You can just run Pulsar standalone which is super-simple, it is a great start and scales just fine for one machine. Once you outscale one machine, you can refactor Pulsar into a distributed setup. The advantage is that everything is still familiar.


The same MQ patterns mentioned in the article (exactly-once, consumer groups) can also be done in Kafka, contrary to what the article suggests.


The great thing about Kafka is the ability to batch operations. Collect a set of messages in memory and, when it's time to commit, submit a bulk operation. If something fails, you just rebuild the buffer from the last committed offset. Pretty neat piece of technology.
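
A hedged sketch of that pattern with the confluent_kafka consumer; bulk_write() is a placeholder for the real bulk operation:

    from confluent_kafka import Consumer

    c = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "bulk-writer",
        "enable.auto.commit": False,
    })
    c.subscribe(["events"])

    while True:
        # Collect up to 500 messages, waiting at most 1 second.
        batch = c.consume(num_messages=500, timeout=1.0)
        if not batch:
            continue
        bulk_write([m.value() for m in batch if not m.error()])  # placeholder
        # Commit only after the bulk op succeeds; on failure we don't
        # commit, so we re-read from the last committed offset.
        c.commit(asynchronous=False)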


RabbitMQ is one of the most impressive technologies I have ever used, I think. It's simple, efficient and reliable. In 10 years, I have never seen a crash or a bug, even with heavy load.


Where I work we have some queues with millions of items in them, and the thing is perfectly stable.


Wow, this is the type of trash one can expect from someone calling themselves "Architect"


Be kind! The author has shared his findings based on his understanding; if you have more to add, go ahead and create another page.


Redis isn't an option?


I'd also like someone with experience to contrast redis for these use cases.


Redis is a cache, it's not a queue.

Can you do something similar? I guess, just like you can use a highlighter to paint your house. But the semantics are not correct.


There is Redis Streams, but it's certainly not without its problems. Super obscure, not a lot of client support.


Or postgres. If you are under 1000 messages a second it functions well as a transactional queue.
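
A minimal sketch of that approach, assuming psycopg2 and an illustrative jobs table; FOR UPDATE SKIP LOCKED (Postgres 9.5+) lets concurrent workers claim rows without colliding:

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_and_process():
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            """)
            row = cur.fetchone()
            if row is None:
                return False                 # nothing claimable right now
            job_id, payload = row
            handle(payload)                  # placeholder for real work
            cur.execute("UPDATE jobs SET status = 'done' WHERE id = %s",
                        (job_id,))
        return True                          # commit happens on block exit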


Until table bloat happens.


Yes. Don't know why it's not mentioned in the article; maybe because it's more barebones.


If you use Confluent Kafka, the billing is pretty high. About 4 years ago it was much cheaper, but then they completely revamped the pricing to something ridiculous. I found that switching to Google Pub/Sub, at least if it meets your needs, is cheaper.


Yes, I can confirm that. Confluent is the most expensive part of our current infrastructure.


Can you switch away from it? Or do you need its advanced features?


Sounds about right.


I see they offer Kafka's exactly-once delivery: https://cloud.google.com/blog/products/data-analytics/cloud-...


Kafka doesn't guarantee exactly-once delivery at all, unless you're using Kafka Streams, and even then your final output topic still won't get exactly-once; the consumer group protocol doesn't allow for it.
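
For reference, what Kafka's transactions do buy you is exactly-once processing between Kafka topics, not end-to-end delivery to external systems; a hedged confluent_kafka sketch, where transform() and the names are illustrative:

    from confluent_kafka import Consumer, Producer

    c = Consumer({"bootstrap.servers": "localhost:9092",
                  "group.id": "xform",
                  "enable.auto.commit": False,
                  "isolation.level": "read_committed"})
    p = Producer({"bootstrap.servers": "localhost:9092",
                  "transactional.id": "xform-1"})
    c.subscribe(["in"])
    p.init_transactions()

    while True:
        msg = c.poll(1.0)
        if msg is None or msg.error():
            continue
        p.begin_transaction()
        p.produce("out", transform(msg.value()))  # placeholder transform
        # Consumer offsets commit atomically with the produced records.
        p.send_offsets_to_transaction(
            c.position(c.assignment()), c.consumer_group_metadata())
        p.commit_transaction()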


It's cheaper until you get to a sizeable workload, and the P90+ latency is ridiculous... the Kafka API is weak, and when you're not using the Kafka API you're limited on integration tools, unless you want to be super locked in to GCP.


The lock-in argument is a non-starter with me. I never see people move between clouds (well, it happens, but it is incredibly rare), and it isn't because of lock-in but rather because they are pretty close to equivalent. And if you wanted to go on-prem, you could replace the messaging system; it isn't one of the hardest steps of going on-prem.


Seems like there is a missing option:

- Message retention (like Kafka)

- Easy consumer scale out (like RabbitMQ)

- No particular ordering guarantees (like RabbitMQ)



