
I've had truly terrible experiences with RabbitMQ. I believe that it should not be used in any application where message loss is not acceptable. Its two big problems are that it cannot tolerate network partitions (reason enough to never use it in production systems, see https://twitter.com/antifuchs/status/735628465924243456), and it provides no backpressure to producers when it starts running out of memory.

In my last job, we used Rabbit to move about 15k messages per sec across about 2000 queues with 200 producers (which produced to all queues) and 2000 consumers (which each read from their own queues). Any time any of the consumers would slow down or fail, Rabbit would run out of memory and crash, causing sitewide failure.

Additionally, Rabbit would invent network partitions out of thin air, which would cause it to lose messages, as when partitions are healed, all messages on an arbitrarily chosen side of the partition are discarded. (See https://aphyr.com/posts/315-jepsen-rabbitmq for more details about Rabbit's issues and some recommendations for running Rabbit, which sound worse than just using something else to me.)

We experimented with "high availability" mode, which caused the cluster to crash more frequently and lose more messages; with "durability", which caused the cluster to crash more frequently and lose more messages; and with colocating all of our Rabbit nodes on the same rack (which did not fix the constant partitions, and caused us to fail totally when that rack lost power, as you'd expect).

These are not theoretical problems. At one point, I spent an entire night fighting with this stupid thing alongside 4 other competent infrastructure engineers. The only long-term solution that we found was to completely deprecate our use of Rabbit and use Kafka instead.

To anyone considering Rabbit, please reconsider! If you're OK with losing messages, then simply making an asynchronous fire-and-forget RPC directly to the relevant consumers may be a better solution for you, since at least there isn't more infrastructure to maintain.




RabbitMQ blocks producers when it hits the memory high watermark (40% of available RAM by default) - https://www.rabbitmq.com/memory.html
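
For example, with the Python pika client (host, queue name, and payload here are placeholders), a producer can watch for the broker's connection.blocked/unblocked notifications instead of discovering the problem the hard way. A minimal sketch:

    import pika

    # The broker sends connection.blocked when it crosses the memory high
    # watermark and connection.unblocked once it recovers; a well-behaved
    # publisher should pause publishing in between rather than piling on.
    params = pika.ConnectionParameters(
        host="localhost",
        # Tear the connection down if the broker keeps us blocked longer
        # than this many seconds.
        blocked_connection_timeout=300,
    )
    connection = pika.BlockingConnection(params)

    def on_blocked(conn, frame):
        print("broker is under memory pressure, pausing publishes:", frame)

    def on_unblocked(conn, frame):
        print("broker recovered, resuming publishes")

    connection.add_on_connection_blocked_callback(on_blocked)
    connection.add_on_connection_unblocked_callback(on_unblocked)

    channel = connection.channel()
    channel.queue_declare(queue="work")
    channel.basic_publish(exchange="", routing_key="work", body=b"hello")
    connection.close()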


We used to have a pub rate of about 200k msgs/s from about 400 producers, all going to a single exchange, and had similar issues. However, we were able to mitigate this by using lazy queues.

This worked fine until things got behind, and then we couldn't keep up. We were able to work around that by using a hashed exchange that spread messages across 4 queues, hashing on a timestamp inserted by a timestamp plugin. Since all operations for a queue happen in the same event loop, any sort of backup led to pub and sub operations fighting for CPU time. By spreading this across 4 queues we wound up with 4x the CPU capacity for this particular exchange. With 2000 queues you probably didn't run into that issue very often.
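
For anyone curious, here is a rough sketch of that shape of topology using the Python pika client. The exchange and queue names are made up, and hashing on a timestamp header rather than the routing key is only hinted at in a comment, since that detail depends on the exact plugins in use:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # Consistent-hash exchange (requires the rabbitmq_consistent_hash_exchange
    # plugin). By default it hashes the routing key; the setup described above
    # hashed on a broker-inserted timestamp instead, which the plugin can do
    # via its exchange arguments, but that part is omitted here.
    channel.exchange_declare(exchange="ingest", exchange_type="x-consistent-hash")

    # Four lazy queues, each bound with an equal weight, so each gets its own
    # Erlang process/event loop and roughly a quarter of the traffic.
    for i in range(4):
        name = "ingest.%d" % i
        channel.queue_declare(
            queue=name,
            durable=True,
            arguments={"x-queue-mode": "lazy"},  # page messages to disk aggressively
        )
        # For x-consistent-hash, the binding key is a numeric weight, not a pattern.
        channel.queue_bind(queue=name, exchange="ingest", routing_key="1")

    connection.close()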


We had a similar experience where I work. We just ended up rolling our own queue system, because we really just needed to go from one point to maybe a few other points.


Kyle's analysis of RabbitMQ is almost 6 years old. Rest assured, things have changed since then.


How did the switch to Kafka solve your issue with providing backpressure to producers?


I'm glad Kafka is working for you! Rabbit's HA story has definitely been rough until recently. But I think a few of the issues you describe can be mitigated with a better understanding of what's going on.

> Any time any of the consumers would slow down or fail, Rabbit would run out of memory and crash, causing sitewide failure.

Not to be glib, but in any brokered system you have to have enough (memory and disk) buffer space to soak up the backlog when consumers slow down, within reason. Older (2.x) RabbitMQs did a very poor job of rapidly paging queue contents to disk when under memory pressure. Newer versions do better, but you can still run the broker out of memory with a high enough ingress/low enough egress, which brings me to...

It sounds like you did not set your high watermarks correctly (another commenter already pointed this out); RabbitMQ can be configured to reject incoming traffic when over a memory watermark, rather than crash.

However, a couple of things can complicate this: rejection of incoming publishes on already-established connections may not make it back to your clients if they are poorly behaved (and a lot of AMQP client libraries are poorly behaved) or are not using publisher confirms. Additionally, if your clients do notice that this is happening and respond by continually reconnecting to RabbitMQ to get around what is really memory-based backpressure, that connection churn can put a massive amount of strain on the broker, causing it to slow down or hang. In RabbitMQ's defense, connect/disconnect storms will damage many/most other databases as well.
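
As a rough illustration (not anything out of the RabbitMQ docs), the usual way to avoid that churn on the client side is a reconnect loop with jittered exponential backoff; a sketch assuming the Python pika client:

    import random
    import time

    import pika

    def connect_with_backoff(params, max_delay=60):
        """Reconnect with jittered exponential backoff instead of a tight loop."""
        delay = 1
        while True:
            try:
                return pika.BlockingConnection(params)
            except pika.exceptions.AMQPConnectionError:
                # A broker shedding load (or blocked on memory) will refuse or
                # drop connections; hammering it with reconnect attempts only
                # makes things worse.
                time.sleep(delay + random.random())
                delay = min(delay * 2, max_delay)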

> We experimented with ... "durability", which caused the cluster to crash more frequently and lose more messages

A few things to be aware of regarding durability:

Before RabbitMQ 3-point-something (I want to say 3.2), some poorly chosen Erlang IO-threadpool tunings caused durability to have higher latency than expected with large workloads. Anecdotally, the upgrade from 3.6 to 3.7 also improved performance of disk-persisted workloads.

If you have durability enabled, you should really be using publisher confirms (https://www.rabbitmq.com/confirms.html) as well. This isn't just for assurance that your messages made it; without confirms on, I've seen situations where publishers seem to get "ahead" of Rabbit's ability to persist and enqueue messages internally, causing server hiccups, hangs, and message loss. That's all anecdotal, of course, but I've seen this occur on a lot of different clusters. Pub confirms are a species of backpressure, basically--not from consumers to producers, but from RabbitMQ itself to producers.
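
For concreteness, a minimal confirm-mode publisher sketched with the Python pika client (the queue name and payload are placeholders):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="work", durable=True)

    # Put the channel into confirm mode; from here on, basic_publish blocks
    # until the broker acks (or nacks) the message, which is the
    # broker-to-producer backpressure described above.
    channel.confirm_delivery()

    try:
        channel.basic_publish(
            exchange="",
            routing_key="work",
            body=b"important payload",
            properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
            mandatory=True,
        )
    except pika.exceptions.UnroutableError:
        # The broker could not route the message to any queue.
        pass
    except pika.exceptions.NackError:
        # The broker refused the message (e.g. under resource pressure);
        # retry or fail loudly rather than assuming it was stored.
        pass

    connection.close()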

When moving a high volume of non-tiny messages (where tiny is <500b), you really need a fast disk. That means the equivalent of NVMe/a write-cache-backed RAID (if on real hardware; ask me about battery relearns killing RabbitMQ sometime ... that was a bad night like the one you described), or paying attention to max-throughput/IOPS if deploying in the cloud (for example, a small EBS gp2 volume may not bring enough throughput, and sometimes you may need to RAID-0 up a pair of sufficiently-sized gp2's to get what you need). And no burst IOPS, ever.

> We experimented with "high availability" mode

You're 100% right about this. RabbitMQ's story in this area was pretty bad until recently. Quorum queues and lots of internal improvements have made the last ~4 years of Rabbit releases behave better in HA configurations. But things can still get really dicey. Always use "pause minority" (trading away uptime to avoid message loss), as the Jepsen article you linked mentions.

For failure recovery (though it's not really "HA"), if you can get single-node durability working well and are using networked disks (e.g. NFS, EBS) or a snapshotted-and-backed-up filesystem, one of the nice things about RabbitMQ's persistence format is that, at the instant of a crash, all but the very most recent messages are recoverable from the storage layer. That doesn't solve availability, but it does mean you don't have catastrophic data loss when you lose a node (restore a snapshot or reattach the data volume to a replacement server).


Wow, that error message! Unless you are Google, network partitions are a thing. With CAP, you don’t get to choose CA.



