
This is a topic that I'm actively researching for my startup (www.vacationlabs.com), and I would love to know what other experienced people think.

Our Rails app has grown over time and we ended up doing everything in a monolithic way to get stuff out faster. However, now the app is so big and tightly coupled that newer members of the team find it very hard to contribute. Therefore we are thinking of breaking it down into smaller apps which talk to each other using some sort of APIs -- either JSON/HTTP or MQ.

The MQ use case being considered is this -- the core transactional part of the system handles just that - maintaining tour availability and keeping track of payments. Everything else will be hived off to a mini-app which is notified of booking/payment related changes via a pub/sub model. So the email notification system will subscribe to the new-booking, booking-cancelled, payment-received, etc events and will send out emails appropriately. The billing system will subscribe to another set of events and will bill the customer appropriately. And so on.

Is this the right use case for an MQ? Is using an MQ worth the additional dev-ops complexity that it brings along? In our case message delivery needs to be guaranteed, else extremely important business functions will not work. How do we deal with MQ unavailability? Normally, if the DB is unavailable, your system is down. Is this how the MQ should be treated as well? If not, how do you deal with the situation where the core DB transaction is complete, but for some reason you're unable to publish an event to the MQ? If the pub/sub system were built on top of the DB itself, this problem would not arise, because publishing an event would be part of the DB transaction itself.

Is there a sane way to build a pub/sub architecture on top of a DB, especially Postgres, which is what we're using? If not, any recommendations for which MQ we should be using for a guaranteed-delivery pub/sub model?
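To make the "publishing as part of the DB transaction" idea concrete, here is a minimal sketch of an events table written in the same transaction as the booking. Everything in it is an assumption for illustration: the table and column names, the `create_booking` and `events_after` helpers, and the use of Python's built-in sqlite3 as a stand-in for Postgres so the example runs on its own.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one business table plus an append-only events table.
    conn.executescript("""
        CREATE TABLE bookings (id INTEGER PRIMARY KEY, tour TEXT NOT NULL);
        CREATE TABLE events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL
        );
    """)


def create_booking(conn, tour, fail_before_commit=False):
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the booking row and its event row are stored together or not at all.
    with conn:
        cur = conn.execute("INSERT INTO bookings (tour) VALUES (?)", (tour,))
        booking_id = cur.lastrowid
        conn.execute(
            "INSERT INTO events (topic, payload) VALUES (?, ?)",
            ("new-booking", json.dumps({"booking_id": booking_id})),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
    return booking_id


def events_after(conn, last_seen_id):
    # A subscriber remembers the last event id it handled and polls for newer ones.
    return conn.execute(
        "SELECT id, topic, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
```

Each subscriber (emailer, billing, and so on) would keep its own `last_seen_id` and poll `events_after`. On Postgres, LISTEN/NOTIFY could wake subscribers sooner, since notifications sent inside a transaction are only delivered on commit, but they are not stored for listeners that are disconnected, so the table would remain the source of truth.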




While it sounds like it could be a case for a queue, the fact that your workflow is mission-critical is itself a reason not to use one. In particular, a reason not to use RabbitMQ.

Some message queues are more reliable than others. RabbitMQ is designed to be clustered, and its handling of partition tolerance has been shown [1] to be very bad, something that I have experienced personally in a production system. Don't ever use it if losing messages will be a problem; and never use it with automatic acking (you'll want messages to be retried if your workers die mid-stream). RabbitMQ can be reliable if your boxes are all on a native (not cloud VM-based) LAN that is stable, and your machines don't occasionally get so overloaded that it impacts network connectivity.

One possibility is to use a message queue purely for signaling, not for state -- instead, use databases and APIs to transmit actual state. For example, let's say you want to shoot off an email every time there is a new booking. The "emailer" app listens to events published by the "booking" app. But the event doesn't contain any information about the booking; instead, the event simply says that there was a booking. When the "emailer" app receives this event, it asks the "booking" app for new bookings that it doesn't know about; it processes each booking by first recording a (booking_id, email_id) row, then firing off the email, then committing that row.
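The emailer loop described above could look roughly like this. It is only a sketch under assumptions: both tables live in one sqlite3 database for the sake of a runnable example (in the setup described, the emailer would ask the booking app through its API), and the names `email_log`, `unemailed_bookings`, `process_new_bookings` and the `send_email` callback are made up for illustration.

```python
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE bookings (id INTEGER PRIMARY KEY, customer_email TEXT NOT NULL);
        CREATE TABLE email_log (
            booking_id INTEGER NOT NULL,
            email_id TEXT NOT NULL,
            PRIMARY KEY (booking_id, email_id)
        );
    """)


def unemailed_bookings(conn, email_id):
    # "Bookings I don't know about": those with no matching email_log row.
    return conn.execute(
        """
        SELECT b.id, b.customer_email
        FROM bookings b
        LEFT JOIN email_log l ON l.booking_id = b.id AND l.email_id = ?
        WHERE l.booking_id IS NULL
        ORDER BY b.id
        """,
        (email_id,),
    ).fetchall()


def process_new_bookings(conn, send_email, email_id="booking-confirmation"):
    # The event carries no data, so this runs the same way whether it was
    # woken by an event or called by hand.
    sent = []
    for booking_id, address in unemailed_bookings(conn, email_id):
        # Record the row, send, then commit. If send_email raises, the
        # row is rolled back and the booking is picked up on the next run.
        with conn:
            conn.execute(
                "INSERT INTO email_log (booking_id, email_id) VALUES (?, ?)",
                (booking_id, email_id),
            )
            send_email(address, booking_id)
        sent.append(booking_id)
    return sent
```

Calling `process_new_bookings` a second time with no new bookings sends nothing, which is the property that makes replaying lost events unnecessary. One caveat of this ordering: if the process dies after the email is sent but before the commit, the email goes out again on the next run.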

This makes every participant in the workflow idempotent, because they can run the same piece of logic many times and still produce the same result. If you ever have a problem with the queue going down, you can simply execute the exact same code: You don't need to replay any missing events. You only need to worry about multiple listeners (multiple "emailer" workers) waking up from the same notification and doing the emailing for the same bookings. This is why the update to your email log table must be transactional and protected by a lock. It doesn't have to be a database lock specifically, but such a system needs some kind of locking to be absolutely atomic.
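One way to get that atomicity without an explicit lock is to let a uniqueness constraint decide which worker wins. The sketch below assumes a hypothetical `email_log` table keyed on (booking_id, email_id) and a made-up `try_claim` helper; it uses sqlite3 on a single connection purely so it runs standalone.

```python
import sqlite3


def create_schema(conn):
    conn.execute("""
        CREATE TABLE email_log (
            booking_id INTEGER NOT NULL,
            email_id TEXT NOT NULL,
            worker TEXT NOT NULL,
            PRIMARY KEY (booking_id, email_id)
        )
    """)


def try_claim(conn, booking_id, email_id, worker):
    # Returns True for the one worker whose insert succeeds. Anyone else
    # hits the primary-key constraint and should skip this booking.
    try:
        with conn:
            conn.execute(
                "INSERT INTO email_log (booking_id, email_id, worker) VALUES (?, ?, ?)",
                (booking_id, email_id, worker),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A worker would call `try_claim` and only send the email if it returns True. Note the trade-off in this variant: the claim is committed before the email is sent, so a worker that dies in between leaves the booking claimed but unemailed. Keeping the transaction open while sending avoids that; on Postgres a second worker inserting the same key would then wait until the first transaction commits or rolls back.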

The nice thing about this workflow is that you can make it extra bulletproof by making the "check for bookings to email about" logic run, say, every ten minutes -- in addition to responding to the message queue events. If the queue isn't working, your app will still be doing the emailing, just a little slower. In other words, the queue simply becomes a trigger mechanism.
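As a rough sketch of the queue-as-trigger idea, the loop below runs the same check whether it was woken by a notification or by a timeout. It uses Python's in-process `queue.Queue` as a stand-in for a real message queue consumer; `wait_for_trigger`, `run_worker`, `check_for_work` and `max_iterations` are illustrative names, not part of any MQ client.

```python
import queue


def wait_for_trigger(notifications, poll_interval):
    # Block until a notification arrives or the poll interval elapses.
    # The notification's content is ignored; it is only a wake-up signal.
    try:
        notifications.get(timeout=poll_interval)
        return "event"
    except queue.Empty:
        return "timer"


def run_worker(notifications, check_for_work, poll_interval=600.0, max_iterations=None):
    # poll_interval defaults to ten minutes, matching the fallback above.
    # max_iterations exists only so the loop can be stopped in a demo or test.
    reasons = []
    while max_iterations is None or len(reasons) < max_iterations:
        reasons.append(wait_for_trigger(notifications, poll_interval))
        check_for_work()  # same idempotent logic either way
    return reasons
```

With a healthy queue the worker reacts almost immediately; with a dead queue it degrades to polling every `poll_interval` seconds, and no separate recovery path is needed.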

[1] http://aphyr.com/posts/315-call-me-maybe-rabbitmq



