This is a topic that I'm actively researching on for my startup (www.vacationlab...

atombender · on Sept 28, 2014

While it sounds like it could be a case for a queue, the fact that your workflow is mission-critical is itself a reason not to use one. In particular, a reason not to use RabbitMQ.

Some message queues are more reliable than others. RabbitMQ is designed to be clustered, and its handling of partition tolerance has been shown [1] to be very bad, something that I have experienced personally in a production system. Don't ever use it if losing messages will be a problem; and never use it with automatic acking (you'll want messages to be retried if your workers die mid-stream). RabbitMQ can be reliable if your boxes are all on a native (not cloud VM-based) LAN that is stable, and your machines don't occasionally get so overloaded that it impacts network connectivity.

One possibility is to use a message queue purely for signaling, not for state -- instead, use databases and APIs to transmit actual state. For example, let's say you want to shoot off an email every time there is a new booking. The "emailer" app listens to events published by the "booking" app. But the event doesn't contain any information about the booking; instead, the event simply says that there was a booking. When the "emailer" app receives this event, it asks the "booking" app for new bookings that it doesn't know about; it processes each booking, first recording a (booking_id, email_id) row, then firing off the email, then committing that row.
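A rough sketch of that emailer loop, assuming for brevity that a single SQLite database stands in for both apps (in the setup described above, the emailer would ask the booking app through its API instead). The table names, the `customer_email` column, the `booking-<id>` email id format, and the `send_email` callback are all made up for illustration:

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: `bookings` stands in for the booking app's data,
    # `email_log` is the emailer's record of what it has already handled.
    conn.execute(
        "CREATE TABLE bookings (id INTEGER PRIMARY KEY, customer_email TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE email_log (booking_id INTEGER PRIMARY KEY, email_id TEXT NOT NULL)"
    )
    conn.commit()


def email_new_bookings(conn, send_email):
    """Email every booking with no email_log row yet; return the ids handled."""
    pending = conn.execute(
        "SELECT b.id, b.customer_email FROM bookings AS b "
        "LEFT JOIN email_log AS l ON l.booking_id = b.id "
        "WHERE l.booking_id IS NULL ORDER BY b.id"
    ).fetchall()
    handled = []
    for booking_id, address in pending:
        try:
            with conn:  # commits if the block succeeds, rolls back if it raises
                conn.execute(
                    "INSERT INTO email_log (booking_id, email_id) VALUES (?, ?)",
                    (booking_id, f"booking-{booking_id}"),
                )
                send_email(address, booking_id)
        except Exception:
            # The row was rolled back, so the next run picks this booking up again.
            continue
        handled.append(booking_id)
    return handled
```

Calling `email_new_bookings` a second time with nothing new does nothing, and a booking whose email failed is simply retried on the next call. The event itself never has to be replayed.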

This makes every participant in the workflow idempotent, because they can run the same piece of logic many times and still produce the same result. If you ever have a problem with the queue going down, you can simply execute the exact same code: You don't need to replay any missing events. You only need to worry about multiple listeners (multiple "emailer" workers) waking up from the same notification and doing the emailing for the same bookings. This is why you must update your email log table transactionally. Database locks are the obvious tool for that; you don't strictly need them, but such a system needs some kind of locking to be absolutely atomic.
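One way the locking could look, sketched with SQLite only because it ships with Python. `BEGIN IMMEDIATE` takes SQLite's database-wide write lock up front, so a second worker waits (up to the connection's busy timeout) until the first has committed and then sees the log row. The `email_log` layout, the `booking-<id>` id format, and the helper names are assumptions for illustration; a server database would more likely rely on row locks or a unique constraint:

```python
import sqlite3


def open_worker_connection(path):
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN / COMMIT / ROLLBACK statements below control the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS email_log "
        "(booking_id INTEGER PRIMARY KEY, email_id TEXT NOT NULL)"
    )
    return conn


def email_once(conn, booking_id, send_email):
    """Return True if this worker sent the email, False if it was already logged."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before looking
    try:
        logged = conn.execute(
            "SELECT 1 FROM email_log WHERE booking_id = ?", (booking_id,)
        ).fetchone()
        if logged is None:
            conn.execute(
                "INSERT INTO email_log (booking_id, email_id) VALUES (?, ?)",
                (booking_id, f"booking-{booking_id}"),
            )
            send_email(booking_id)
        conn.execute("COMMIT")
        return logged is None
    except Exception:
        # A failed send leaves no log row behind, so any worker can retry later.
        conn.execute("ROLLBACK")
        raise
```

The trade-off in this sketch is that the lock is held while the email goes out, so slow sends serialize the workers.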

The nice thing about this workflow is that you can make it extra bulletproof by making the "check for bookings to email about" logic run, say, every ten minutes -- in addition to responding to the message queue events. If the queue isn't working, your app will still be doing the emailing, just a little slower. In other words, the queue simply becomes a trigger mechanism.
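The "queue as trigger" idea can be sketched as a loop that wakes on either a signal or a timer and runs the same check both ways. Here Python's in-process `queue.Queue` is only a stand-in for a real message queue consumer, and `check_for_bookings`, `poll_interval`, and `max_runs` are hypothetical names (`max_runs` exists just so the loop can be stopped in a demo):

```python
import queue


def wait_for_trigger(events, poll_interval):
    """Block until a signal arrives or the interval elapses; say which happened."""
    try:
        events.get(timeout=poll_interval)  # payload is ignored: it is only a nudge
        return "event"
    except queue.Empty:
        return "timer"


def run_emailer(events, check_for_bookings, poll_interval=600.0, max_runs=None):
    """Run the same check on every trigger, whether from the queue or the clock."""
    runs = 0
    while max_runs is None or runs < max_runs:
        wait_for_trigger(events, poll_interval)
        check_for_bookings()
        runs += 1
    return runs
```

With the default of 600 seconds, a silent or broken queue only delays the emailing by up to ten minutes.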

[1] http://aphyr.com/posts/315-call-me-maybe-rabbitmq