
I've had truly terrible experiences with RabbitMQ. I believe that it should not be used in any application where message loss is not acceptable. Its two big problems are that it cannot tolerate network partitions (reason enough to never use it in production systems, see https://twitter.com/antifuchs/status/735628465924243456), and it provides no backpressure to producers when it starts running out of memory.

In my last job, we used Rabbit to move about 15k messages per sec across about 2000 queues with 200 producers (which produced to all queues) and 2000 consumers (which each read from their own queues). Any time any of the consumers would slow down or fail, Rabbit would run out of memory and crash, causing sitewide failure.

Additionally, Rabbit would invent network partitions out of thin air, which would cause it to lose messages, as when partitions are healed, all messages on an arbitrarily chosen side of the partition are discarded. (See https://aphyr.com/posts/315-jepsen-rabbitmq for more details about Rabbit's issues and some recommendations for running Rabbit, which sound worse than just using something else to me.)

We experimented with "high availability" mode, which caused the cluster to crash more frequently and lose more messages; with "durability", which caused the cluster to crash more frequently and lose more messages; and with colocating all of our Rabbit nodes on the same rack (which did not fix the constant partitions, and caused us to fail totally when that rack lost power, as you'd expect).

These are not theoretical problems. At one point, I spent an entire night fighting with this stupid thing alongside 4 other competent infrastructure engineers. The only long-term solution that we found was to completely deprecate our use of Rabbit and use Kafka instead.

To anyone considering Rabbit, please reconsider! If you're OK with losing messages, then simply making an asynchronous fire-and-forget RPC directly to the relevant consumers may be a better solution for you, since at least there isn't more infrastructure to maintain.




RabbitMQ blocks producers when it hits the memory high watermark (40% of available RAM by default) - https://www.rabbitmq.com/memory.html
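
For example, with the Python pika client (host, queue name, and payload here are placeholders), a producer can watch for the broker's connection.blocked/unblocked notifications instead of discovering the problem the hard way. A minimal sketch:

    import pika

    # The broker sends connection.blocked when it crosses the memory high
    # watermark and connection.unblocked once it recovers; a well-behaved
    # publisher should pause publishing in between rather than piling on.
    params = pika.ConnectionParameters(
        host="localhost",
        # Tear the connection down if the broker keeps us blocked longer
        # than this many seconds.
        blocked_connection_timeout=300,
    )
    connection = pika.BlockingConnection(params)

    def on_blocked(conn, frame):
        print("broker is under memory pressure, pausing publishes:", frame)

    def on_unblocked(conn, frame):
        print("broker recovered, resuming publishes")

    connection.add_on_connection_blocked_callback(on_blocked)
    connection.add_on_connection_unblocked_callback(on_unblocked)

    channel = connection.channel()
    channel.queue_declare(queue="work")
    channel.basic_publish(exchange="", routing_key="work", body=b"hello")
    connection.close()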


We used to have a pub rate of about 200k msgs/s from about 400 producers, all going to a single exchange, and had similar issues. However, we were able to mitigate this by using lazy queues.

This worked fine until things got behind, and then we couldn't keep up. We were able to work around that by using a hashed exchange that spread messages across 4 queues, hashing on a timestamp inserted by a timestamp plugin. Since all operations for a queue happen in the same event loop, any sort of backup led to pub and sub operations fighting for CPU time. By spreading this across 4 queues we wound up with 4x the CPU capacity for this particular exchange. With 2000 queues you probably didn't run into that issue very often.
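
For anyone curious, here is a rough sketch of that shape of topology using the Python pika client. The exchange and queue names are made up, and hashing on a timestamp header rather than the routing key is only hinted at in a comment, since that detail depends on the exact plugins in use:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # Consistent-hash exchange (requires the rabbitmq_consistent_hash_exchange
    # plugin). By default it hashes the routing key; the setup described above
    # hashed on a broker-inserted timestamp instead, which the plugin can do
    # via its exchange arguments, but that part is omitted here.
    channel.exchange_declare(exchange="ingest", exchange_type="x-consistent-hash")

    # Four lazy queues, each bound with an equal weight, so each gets its own
    # Erlang process/event loop and roughly a quarter of the traffic.
    for i in range(4):
        name = "ingest.%d" % i
        channel.queue_declare(
            queue=name,
            durable=True,
            arguments={"x-queue-mode": "lazy"},  # page messages to disk aggressively
        )
        # For x-consistent-hash, the binding key is a numeric weight, not a pattern.
        channel.queue_bind(queue=name, exchange="ingest", routing_key="1")

    connection.close()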


We had a similar experience where I work. We just ended up rolling our own queue system, because we really just needed to go from one point to maybe a few other points.


Kyle's analysis of RabbitMQ is almost 6 years old. Rest assured, things have changed since then.


How did the switch to Kafka solve your issue with providing backpressure to producers?


I'm glad Kafka is working for you! Rabbit's HA story has definitely been rough until recently. But I think a few of the issues you describe can be mitigated with a better understanding of what's going on.

> Any time any of the consumers would slow down or fail, Rabbit would run out of memory and crash, causing sitewide failure.

Not to be glib, but in any brokered system you have to have enough (memory and disk) buffer space to soak up the backlog when consumers slow down, within reason. Older (2.x) RabbitMQs did a very poor job of rapidly paging queue contents to disk when under memory pressure. Newer versions do better, but you can still run the broker out of memory with a high enough ingress/low enough egress, which brings me to...

It sounds like you did not set your high watermarks correctly (another commenter already pointed this out); RabbitMQ can be configured to reject incoming traffic when over a memory watermark, rather than crash.

However, a couple of things can complicate this: rejection of incoming publishes on already-established connections may not make it back to your clients if they are poorly behaved (and a lot of AMQP client libraries are poorly behaved) or are not using publisher confirms. Additionally, if your clients do notice that this is happening and respond by continually reconnecting to RabbitMQ to get around what is really memory-based backpressure, that connection churn can put a massive amount of strain on the broker, causing it to slow down or hang. In RabbitMQ's defense, connect/disconnect storms will damage many/most other databases as well.
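
As a rough illustration (not anything out of the RabbitMQ docs), the usual way to avoid that churn on the client side is a reconnect loop with jittered exponential backoff; a sketch assuming the Python pika client:

    import random
    import time

    import pika

    def connect_with_backoff(params, max_delay=60):
        """Reconnect with jittered exponential backoff instead of a tight loop."""
        delay = 1
        while True:
            try:
                return pika.BlockingConnection(params)
            except pika.exceptions.AMQPConnectionError:
                # A broker shedding load (or blocked on memory) will refuse or
                # drop connections; hammering it with reconnect attempts only
                # makes things worse.
                time.sleep(delay + random.random())
                delay = min(delay * 2, max_delay)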

> We experimented with ... "durability", which caused the cluster to crash more frequently and lose more messages

A few things to be aware of regarding durability:

Before RabbitMQ 3-point-something (I want to say 3.2), some poorly chosen Erlang IO-threadpool tunings caused durability to have higher latency than expected with large workloads. Anecdotally, the upgrade from 3.6 to 3.7 also improved performance of disk-persisted workloads.

If you have durability enabled, you should really be using publisher confirms (https://www.rabbitmq.com/confirms.html) as well. This isn't just for assurance that your messages made it; without confirms on, I've seen situations where publishers seem to get "ahead" of Rabbit's ability to persist and enqueue messages internally, causing server hiccups, hangs, and message loss. That's all anecdotal, of course, but I've seen this occur on a lot of different clusters. Pub confirms are a species of backpressure, basically--not from consumers to producers, but from RabbitMQ itself to producers.
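
For concreteness, a minimal confirm-mode publisher sketched with the Python pika client (the queue name and payload are placeholders):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="work", durable=True)

    # Put the channel into confirm mode; from here on, basic_publish blocks
    # until the broker acks (or nacks) the message, which is the
    # broker-to-producer backpressure described above.
    channel.confirm_delivery()

    try:
        channel.basic_publish(
            exchange="",
            routing_key="work",
            body=b"important payload",
            properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
            mandatory=True,
        )
    except pika.exceptions.UnroutableError:
        # The broker could not route the message to any queue.
        pass
    except pika.exceptions.NackError:
        # The broker refused the message (e.g. under resource pressure);
        # retry or fail loudly rather than assuming it was stored.
        pass

    connection.close()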

When moving a high volume of non-tiny messages (where tiny is <500b), you really need a fast disk. That means the equivalent of NVMe/a write-cache-backed RAID (if on real hardware; ask me about battery relearns killing RabbitMQ sometime ... that was a bad night like the one you described), or paying attention to max-throughput/IOPS if deploying in the cloud (for example, a small EBS gp2 volume may not bring enough throughput, and sometimes you may need to RAID-0 up a pair of sufficiently-sized gp2's to get what you need). And no burst IOPS, ever.

> We experimented with "high availability" mode

You're 100% right about this. RabbitMQ's story in this area was pretty bad until recently. Quorum queues and lots of internal improvements have made the last ~4 years of Rabbit releases behave better in HA configurations. But things can still get really dicey. Always use "pause minority" (trading away uptime to avoid message loss), as the Jepsen article you linked mentions.

For failure recovery (though it's not really "HA"), if you can get single-node durability working well and are using networked disks (e.g. NFS, EBS) or a snapshotted-and-backed-up filesystem, one of the nice things about RabbitMQ's persistence format is that, at the instant of a crash, all but the very most recent messages are recoverable from the storage layer. That doesn't solve availability, but it does mean you don't have catastrophic data loss when you lose a node (restore a snapshot or reattach the data volume to a replacement server).


Wow, that error message! Unless you are Google, network partitions are a thing. With CAP, you don’t get to choose CA.



