If I had to nominate a piece of software, as an SRE, that is as close to "set and forget" as possible, it'd absolutely be RabbitMQ standalone and a close runner up would be its clustered form.
I've worked in 3 places where RabbitMQ has been a fundamental cornerstone of the architecture, and while it does require a little tuning around performance occasionally (generally because it's being used inappropriately or without full consideration of its limitations/best practices), it's rock solid, easy to debug/inspect, has an active and supportive community, and is generally just all around pleasant to work with and maintain.
Kudos to RabbitMQ and its developers!
As an aside, the addition of recent features such as super streams, streams and quorum queues makes it a compelling all-in-one tool for solving a bunch of architecture-related concerns/requirements in application and infrastructure development. I've often wondered why it's not more widely used on the ops side of the world for metrics gathering and other use cases. I have also wondered how useful it'd be for log ingestion with lazy queues etc.
Does anyone out there have examples of unusual use cases for RabbitMQ where it's outshone some alternative product? Would love to hear about them!
I feel like "and while it does require a little tuning around performance occasionally" is doing a lot of heavy lifting there :)
Honestly though my only experience with RabbitMQ has been as a backend for Celery (background task processor for Python) and I think my real issues are mostly to do with how Celery uses RabbitMQ in a very poor default setup.
Message confirmations are off by default and turning them on caused our queues to grind to a halt. Queues are single-threaded, and clustering doesn't help much with that unless you use some kind of sharding plugin. It seemed like getting to a good spot required a lot of arcane knowledge that wasn't so easy to find.
"Configuring RabbitMQ to be a performant message queue for background task systems" would be an excellent blog post I would share widely!
RabbitMQ cluster mode for classic queues is not a true HA solution. If a node gets replaced, the content of your queue has to be synchronized to the new node, and while this is running you can't produce to the queue, which is effectively an outage. Unfortunately this synchronization is also unreliable if you have a significant amount of messages (a few GB) in the queue: it often crashes nodes with no way to recover the internal database. And even if everything works you still have a chance of lost messages or lost acks (to be fair, this is documented). It is so bad that it is now officially deprecated.
This also makes online upgrades extremely hard. The only acceptable way is to stand up a new cluster, switch producers and consumers, and then shovel the data from the old cluster (which is also not too reliable).
They came up with quorum queues which have less features and require to keep all messages in memory. I don't like having servers with 90% unused memory for that one event where I actually need to queue a lot of messages because a consumer is broken.
I would never pump logs through RabbitMQ. If you get into a situation where you accumulate a large amount of data in a queue, you will face trouble sooner or later. Most likely RabbitMQ will run out of memory, will "flow-control" producers, and you'll have an outage you can't recover from.
> They came up with quorum queues which have less features and require to keep all messages in memory.
I don't think this is true:
> Quorum queues store their message content on disk (per Raft requirements) and only keep a small metadata record of each message in memory. This is a change from prior versions of quorum queues where there was an option to keep the message bodies in memory as well. This never proved to be beneficial especially when the queue length was large.
You are correct, nice to see that this was improved.
I should find time to run some tests with quorum queues; this now actually looks usable. But in the end we will have to see how stable it is running production workloads.
Interesting, I have the opposite experience. When we got to high throughput in a cluster we had all sorts of crashes, partitions, and nasty failure modes that required stressful, delicate rebuilds. It has similar challenges to a clustered M-M relational DB. We moved the high-volume events to Kinesis, which was far more reliable for that use case.
At low volumes yeah sure, it just ticks along, and does what you want.
I only used it at one company, but it was the most unreliable and hard to diagnose piece of our infrastructure. We didn't even have high throughput and I wouldn't use it again without a really good reason.
What's low volume? At one point I used rmq to ingest ~20,000 messages per second of varying length, from Tweets to blog posts, all containing the full content of the activity with metadata. It was with a 3-node cluster. I wish I remembered the specs, but it was nothing crazy aside from the SSD IOPS. The one time it fell over was when the consumers were broken long enough to fill up the disks.
Around there and upwards. You've listed one of the big problems with Rabbit at volume: inevitably you are going to have consumers go down or go slow. At a high enough volume you're heading for a crash/partition quickly if you can't respond fast enough (where "fast enough" is a time window inversely proportional to the queue's volume). It's a crappy failure mode to have a sword hanging over you like that.
Log-based messaging technologies like Kinesis, Kafka, etc. do not care if a consumer goes down and are thus much safer.
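To put a rough number on that "fast enough" window, here is a back-of-the-envelope sketch. The function name, the 2 KB average message size and the 500 GB of free disk are all made-up assumptions for illustration; only the ~20,000 messages per second figure comes from the comment above.

```python
def seconds_until_full(free_bytes, publish_rate_per_s, avg_message_bytes,
                       consume_rate_per_s=0.0):
    """Estimate how long until the backlog eats the remaining disk space.

    Hypothetical helper: assumes a constant publish rate and message size.
    """
    # Net bytes per second piling up in the queue.
    backlog_growth = (publish_rate_per_s - consume_rate_per_s) * avg_message_bytes
    if backlog_growth <= 0:
        return float("inf")  # consumers are keeping up, the backlog never grows
    return free_bytes / backlog_growth


# Consumers fully down, 20,000 msg/s, assumed 2 KB messages, assumed 500 GB free.
window = seconds_until_full(500 * 10**9, 20_000, 2_000)
print(window / 3600)  # hours left to react

# Doubling the publish rate halves the window.
half_window = seconds_until_full(500 * 10**9, 40_000, 2_000)
```

With those assumed numbers you get about three and a half hours to notice and fix the consumers, and every doubling of volume halves it.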
In my case it would have been an issue regardless of what the queue/pubsub tech was (talking on-prem, not GCP Pub/Sub or AWS, which would just chug along effortlessly, not care, and take our money), since the entire consumer stack was toast and dumping unprocessed data was a no-no. The real issue there was my manager not having a spine, coupled with not allowing my team to do its job autonomously. Even with the wonky setup we had, it would have been dead simple to chain additional clusters. Stupid, but easy with the automation we had built.
However, adding another Kafka or Pulsar node would have been much easier.
I've read about RabbitMQ so much, yet I still don't understand what it actually does. I know it is a 'message broker', but I don't understand what that means. To me it sounds like a backend for messaging that's easy to integrate with any user account module?
It's messaging in the sense of message passing between processes. You put messages into it when you want them to be processed asynchronously.
They go into queues based on attributes of the messages ("routing keys") matched against rules you set up ("bindings").
Other things pick up messages off of those queues and process them.
When they're done, they acknowledge the message and it's removed from the queue. If they crash, the message doesn't get acknowledged and it goes back to the queue.
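If it helps, here is a toy in-memory model of that flow: bindings, routing keys, and ack-or-requeue. To be clear, this is not RabbitMQ or any client library; `ToyBroker`, `topic_matches` and the queue names are all invented for illustration, and the matching only imitates topic-style patterns (`*` for one word, `#` for zero or more).

```python
from collections import deque


def topic_matches(pattern, routing_key):
    """Topic-style match: '*' is exactly one word, '#' is zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb any number of leading words, including none.
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return p[0] in ("*", k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))


class ToyBroker:
    def __init__(self):
        self.bindings = []   # (pattern, queue name) rules
        self.queues = {}     # queue name -> deque of message bodies
        self.unacked = {}    # delivery tag -> (queue name, body)
        self._next_tag = 1

    def bind(self, queue, pattern):
        self.queues.setdefault(queue, deque())
        self.bindings.append((pattern, queue))

    def publish(self, routing_key, body):
        # Each matching queue gets one copy, even if several bindings match.
        targets = {q for p, q in self.bindings if topic_matches(p, routing_key)}
        for queue in targets:
            self.queues[queue].append(body)

    def get(self, queue):
        if not self.queues[queue]:
            return None
        body = self.queues[queue].popleft()
        tag, self._next_tag = self._next_tag, self._next_tag + 1
        self.unacked[tag] = (queue, body)  # held until acked
        return tag, body

    def ack(self, tag):
        del self.unacked[tag]  # done: the message is gone for good

    def consumer_died(self, tag):
        queue, body = self.unacked.pop(tag)
        self.queues[queue].appendleft(body)  # never acked: back on the queue


broker = ToyBroker()
broker.bind("emails", "user.*.signup")
broker.bind("audit", "user.#")
broker.publish("user.42.signup", "welcome 42")

tag, body = broker.get("emails")
broker.consumer_died(tag)          # crash before ack, message is requeued
tag, body = broker.get("emails")   # another worker picks it up
broker.ack(tag)                    # now it is removed
```

The real thing adds persistence, exchanges of different types, prefetch limits and so on, but the publish, route, consume, ack loop is the core idea.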
I think of it as the nervous system of our infrastructure. Our infra has all these moving parts like schedulers, storage, virtual machines, networking and so on. Rabbit is the thing that all the moving parts use to coordinate with all the other moving parts.
Absolutely. My team uses .NET, so we use the NServiceBus library on top of it, and RabbitMQ has been rock-solid. We never have to think about it, it's just been running for years.
Some great points there, I’ve been hit by all of them. I particularly like the first which is to engage an expert to validate your design. There are so many kinds of deployments and configurations and knowing the best for your application isn’t at all obvious.
A bit of a weird article.
It announces super streams, which are basically topic partitions in Kafka, acknowledges that, but then hand-waves “oh but they’re different” and then spends the rest of the article talking about all the features that are identical to how topic partitions work…
It’s a good feature, there’s nothing wrong with that, but there’s nothing wrong with saying “we’ve brought this feature to RabbitMQ too” rather than trying to pretend it’s totally-not-the-same.
Here's the full paragraph where they talk about Kafka:
> Let’s talk about the elephant in the room: how does it compare to Kafka? We can compare a super stream to a Kafka topic and a stream to a partition of a Kafka topic. A RabbitMQ stream is a first-class, individually-named object though, whereas a Kafka partition is a subordinate of a Kafka topic. This explanation leaves a lot of details out, there is no real 1-to-1 mapping, but it is accurate enough for our point in this post.
I think that's fine. This is a post about a new feature in RabbitMQ, from the maintainers of RabbitMQ. It's not intended as a RabbitMQ-Kafka comparison post.
They acknowledge that it's equivalent to Kafka, and then chose not to spend time digging into the details of how it's different - I'm willing to take their word for it that there are all sorts of interesting technical differences here, and I'm fine with them not addressing those in the post where they announce their new feature to the world.
I don't see this as them pretending that it's totally-not-the-same.
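For anyone who hasn't used partitioned logs, the analogy in the quoted paragraph boils down to something like the sketch below: one logical stream, several individually named partition streams, and a hash of the routing key picking which one a message lands in. The stream names and `partition_for` are hypothetical, and SHA-256 is only a stand-in; real clients pick their own hash function and routing strategy.

```python
import hashlib


def partition_for(routing_key, partitions):
    """Pick a partition stream for a key. Illustrative only, not a client API."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    # The same key always hashes to the same index, so per-key order is kept.
    index = int.from_bytes(digest[:8], "big") % len(partitions)
    return partitions[index]


# Hypothetical super stream "invoices" made of three individually named streams.
partitions = ["invoices-0", "invoices-1", "invoices-2"]

first = partition_for("customer-1", partitions)
second = partition_for("customer-1", partitions)
other = partition_for("customer-2", partitions)
```

Messages for the same key stay in one partition and therefore in order, while different keys spread across partitions, which is the same trade-off Kafka topic partitions make.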
I read the blog post and the Java documentation (which is more informative), and maybe I missed it, but I can't figure out how you define the # of streams/partitions. Is it defined when the superstream is created?
What's the general practice for multi-tenancy? Say we have millions of clients with thousands being added daily and, for various reasons, at least for _some_ message types, we'd like to keep them separate (for example, maybe strict ordering is super important, so we can't throw messages away, but we don't want a poison message to impact all customers).