Behind a Backend-as-a-Service Provider: The How and Why of Our Architecture. (spire.io)
56 points by spladow on May 3, 2012 | 32 comments



Full disclosure: I'm heading up the backend development for Zipline Games' Moai Cloud (http://getmoai.com/). We're targeting game developers, so that led us to some different use cases and language choices (like Lua).

Thanks for sharing this. On first read-through it sounds a lot like the architecture mongrel2 provides (which we use): roughly what you'd get if you swapped out the node.js dispatcher for mongrel2 and the Redis queue for ZeroMQ.

Have you run into any issues using Redis as a queue? If it were replicated across machines, I wonder whether multiple workers could dequeue the same request. And if it's on a single server, wouldn't the blocking operation on the list become a bottleneck?

Again, thanks for sharing, the offering looks great.


Regarding the architectural question:

We actually tested a few designs using m2 and 0mq during our R&D phase. 0mq's push/pull sockets provide the same "take" behavior as the Redis RPUSH/BLPOP, so we definitely could have used m2 as the HTTP end of a similar architecture to what we use now.

One of the considerations that led to the choice of Redis was the transparency of the queueing and dequeueing. The messages in the queue are easily inspected, and the Redis MONITOR command helps greatly in debugging.
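
For concreteness, here's a minimal sketch of that "take" behavior and the inspection we mean (Python with redis-py; the 'tasks' queue name is invented for the example, not our actual naming):

    import json
    import redis

    r = redis.Redis()  # assumes a local Redis server

    # Dispatcher side: enqueue a task.
    r.rpush('tasks', json.dumps({'action': 'create', 'body': 'hello'}))

    # Transparency: pending tasks are ordinary list entries, inspectable
    # at any time with LRANGE (and all traffic is visible under MONITOR).
    print(r.lrange('tasks', 0, -1))

    # Worker side: block until a task arrives. Each item goes to exactly
    # one of however many workers are blocked on the list.
    _, raw = r.blpop('tasks')
    task = json.loads(raw)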

More important from a design perspective was our desire to hide the HTTP specifics from the workers. m2 pushes a JSON or tnetstring representation of the HTTP request to its workers, but we want the task we send to the workers to be generalized, stripped of information that is only meaningful for HTTP. We also want to classify each task by resource type and requested action, which lets us use multiple queues; multiple queues, in turn, let us implement workers on an ad hoc basis.

m2 could work here if our request classification depended only on the URL, but that is a limitation we are not willing to accept. Request headers can be very useful in dispatching, especially those related to content negotiation (Accept, Content-Type, etc.).
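
To make the classification concrete, here's a rough sketch of the kind of thing a dispatcher does (the queue-naming scheme and task fields here are illustrative, not our actual wire format):

    import json
    import redis

    r = redis.Redis()

    def dispatch(method, path, headers, body):
        """Classify an HTTP request and queue a generalized task."""
        # Resource type from the URL, action from the method...
        resource = path.strip('/').split('/')[0]   # e.g. 'messages'
        action = {'POST': 'create', 'GET': 'read'}.get(method, 'other')

        # ...but headers matter too: content negotiation (Accept,
        # Content-Type) can route the same URL differently.
        task = {'action': action,
                'accept': headers.get('Accept', 'application/json'),
                'body': body}

        # One queue per resource/action pair means we can stand up
        # workers for just that task type, ad hoc.
        r.rpush('queue:%s:%s' % (resource, action), json.dumps(task))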

There is an interesting hybrid approach using mongrel2: write one or more m2 handlers that perform the same function as our node.js dispatchers. That is, m2 sends the JSON-formatted request to an m2 handler that deserializes it, removes the HTTP dressing, classifies the request according to type and action, then queues a task in the appropriate queue. A worker takes the task, does its own little thing, and sends the result to an m2 handler that knows how to re-clothe the result as an HTTP response and deliver it to the m2 cluster.

Regarding the question about queue behavior across replicated Redises:

I do not know for certain, but I certainly hope it is not possible for an item in a Redis list to be popped by more than one client, no matter how the replication is configured.

With our architecture, we could relieve at least some of the strain on the task/result messaging system by using a cluster of Redis servers for the task queues. Each queue server might have its own cadre of workers listening for tasks. The return trip (getting a result from a worker back to the HTTP front end) is a little trickier, because it matters which HTTP server is holding open the HTTP connection. You could use PUB/SUB (which I believe is how m2 currently does it), or each HTTP server could be popping results from its own result queue.
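
A sketch of that return-trip idea (again Python/redis-py; the reply_to convention and queue names are hypothetical):

    import json
    import redis

    r = redis.Redis()

    def front_end(server_id='http-1'):
        # Each HTTP front end blocks on its own result queue, so it does
        # not matter which server is holding the open HTTP connection.
        task = {'action': 'read', 'reply_to': 'results:' + server_id}
        r.rpush('queue:messages:read', json.dumps(task))
        _, raw = r.blpop('results:' + server_id)
        return json.loads(raw)

    def worker():
        # A worker pops a task, does its thing, and pushes the result to
        # whichever result queue the originating front end named.
        _, raw = r.blpop('queue:messages:read')
        task = json.loads(raw)
        r.rpush(task['reply_to'], json.dumps({'status': 200}))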

When using a single Redis server, the only hard limitation we have seen with the BLPOP operation is the number of client connections Redis can keep open. In case it's not clear: BLPOP blocks the client, not the server.


Nice post.

Do you do anything special to make the queue your workers are BLPOPing from durable?

Is there a reason you didn't use Redis pub/sub? Seems like the perfect use case.


The "queue" is merely a Redis list. Durable by default.

Redis PUB/SUB is not suited for the task queues we use, because any number of subscribers will receive the messages. We want to guarantee that only one worker will act upon each message.
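
The difference is easy to demonstrate (Python/redis-py; queue and channel names invented for the example):

    import redis

    r = redis.Redis()

    # List as queue: each pushed item wakes exactly ONE of the clients
    # blocked in BLPOP, so every task is acted on exactly once.
    r.rpush('work', 'task-1')     # one worker, and only one, will get this

    # PUB/SUB: every subscriber receives every message, and messages
    # published while nobody is subscribed are simply dropped.
    p = r.pubsub()
    p.subscribe('events')
    r.publish('events', 'hello')  # delivered to ALL current subscribers
    print(p.get_message())        # first message is the subscribe confirmation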


Err, I guess I meant something else. For instance, what happens if one of the workers goes down after popping? The message is lost, right?

I think I invented an architecture for you in my head that isn't anywhere near reality. I imagined you were using the workers to take incoming published messages and push them into queues that subscribed connections pop off of, effectively building your own fanout. In that context, I was wondering why not just use pub/sub, which handles fanout and gets rid of the entire worker model.

Thanks for the reply!


Maybe a bit off topic, but is it just me, or is the BaaS term just a marketing ploy, or at least unnecessary?

I think everything is placeable within the IaaS, PaaS, and SaaS model.


I can see how this might seem like just another way to stand out, but I think in this case BaaS really is more descriptive than the other options. Platform-as-a-Service is a broad term that could reasonably contain what we do, in that we provide a platform from which to build applications. However, the idea of a PaaS generally comes with the understanding that it involves deployment solutions and is centered on providing the network, servers, etc. What we, and the other companies using the BaaS title, do is much more about providing the backend itself, not just a platform to deploy your own stack to. It's a matter of focus, and while the title does help differentiate us from other services, I think it does so in an honest and helpful way.


They're not the first and they won't be the last; just search for '"as a service" -software -platform -infrastructure -saas -iaas -paas' and weep (or laugh). The "as a service" meme is begging to be parodied if it hasn't been done already.


I don't understand the business model at all.

I understand the attraction of implementing web-based "messaging" (chat) in javascript. But why wouldn't I just point that javascript back to myself?

Why would I route the product of JS-based chat through a third party when it could just communicate with the server it got the HTTP from in the first place?

My guess is that this is for folks that don't have any control over their back end - it's just a web serving black box, and this is just some more content to paste into it. Is that about right ?

The missing piece, though, is the revenue model - the users who would generate more than 30 million messages in a month are the same users who actually might have their own back end, and the wherewithal to use it. I would think if you need to use third party javascript snippets, you're ipso facto a smaller, lower volume user ...


"Why would I route the product of JS based chat through a third party when it could just communicate with the server it got the HTTP from in the first place ?"

I'll take a stab at it with an anecdote.

First, it can be used for more than chat. Anything where a message bus would meet the need could work on top of this.

The project I work on uses PubNub (http://www.pubnub.com/) instead of App Engine's Channel API because we wanted a reliable way to broadcast to several listeners.

At the time, the Channel API would only do point-to-point (it still does), and if you wanted broadcast you had to maintain connection state with all listeners somehow. So you would have had to invent your own keep-alive protocol (not my cup of tea), etc.

So now, when the server needs to notify all listening clients of something, a JSON message is put in a scheduled task queue, the call goes out to PubNub in a few ms, and it arrives at clients a few ms after that. It's pretty impressive.
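
Roughly, the server side looks something like this (illustrative only: the keys are placeholders and the PubNub REST URL shape is from memory of their classic HTTP interface, so check their docs before copying):

    import json
    import urllib.parse
    import urllib.request

    def notify_all(pub_key, sub_key, channel, payload):
        # Serialize the event and hand it to PubNub, which fans it
        # out to every client subscribed to the channel.
        msg = urllib.parse.quote(json.dumps(payload))
        url = ('http://pubsub.pubnub.com/publish/%s/%s/0/%s/0/%s'
               % (pub_key, sub_key, channel, msg))
        urllib.request.urlopen(url)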

Looks like spire.io provides similar services: essentially a cloud-based message bus that supports broadcast/fan-out.

PubNub is supposedly servicing 100K messages per second now, and I would guess it's not just chat. (http://techcrunch.com/2012/03/21/as-developers-seek-more-int...)


"The missing piece, though, is the revenue model - the users who would generate more than 30 million messages in a month are the same users who actually might have their own back end, and the wherewithal to use it."

This is true, but if you had a service with the potential to generate, say, 50 million messages per month, would you spend $60/month to use this, or several thousand dollars to develop your own?

(Also, note that a big market for this is mobile, not just javascript on websites)


Ok, fair enough. I'm still wrapping my head around JSAI (javascript as infrastructure) so bear with me ...


Service as a service.


It's not really a PaaS, because it only provides a single service (in this case realtime messaging).

That doesn't mean this model doesn't make a lot of sense.

If you are using IaaS, it can be a lot quicker to use something like this than to build it yourself. The same goes for things like search, or (sometimes) database services, or distribution (think CDNs).

Even if you are using a PaaS it might still make sense to use this (depending on what your PaaS supplies).


Your chat demo is broken. http://www.spire.io/examples/chat/


I know this is the least satisfying support answer possible, but the chat appears to be available for me right now. I'd like to help figure out the problem; can you give me some details, like your OS and browser?


Opera Version: 11.62 Build: 1347 Platform: Mac OS X System: 10.7.3

https://www.dropbox.com/s/nhqolr7q11vb4hj/Screen%20Shot%2020...


Seconded. Nothing happens when I click join, and there is an error "no zlib library" in the console.


Thanks. It should be alright now.


I get a 405 Method not allowed after filling in my name. Opera 11.61, Win 7.


Typo on the front page:

We run our servers on Amazon Web Services and use their elastic load balancer (which is were we terminate SSL).

which is _why_ (?)


We've been bitten by EC2 instances having issues accepting incoming connections (multiple HAProxy boxes in TCP mode in front of an STunnel cluster), and we've never had that issue in testing with ELB. ELB also beats our failover time when we lose an EC2 machine.

But ELB is in no way a permanent part of our infrastructure (nothing is permanent), especially as we move toward supporting technologies such as SPDY on spire.io or, for the right customer requirement, SSL throughout the network stack. We're also fond of Stud running on our internal servers. Still, I do think ELB is the right tool for our cloud today.


Haven't you run into performance problems terminating SSL on ELB? For me the performance is so-so: I'm using a 2048-bit key, and it seems I hit the maximum requests-per-second limit pretty fast. There are a couple of threads about this issue on the AWS forums, where a user did a really exhaustive test of ELB and even got an Amazon engineer to look into it:

https://forums.aws.amazon.com/thread.jspa?messageID=327283
https://forums.aws.amazon.com/thread.jspa?messageID=327715


We're looking pretty good in AWS West 1a. The second thread you linked shows great performance from markdcorner's second load balancer -- the https://forums.aws.amazon.com/servlet/JiveServlet/download/3... image -- vs. the original ELB, which SpencerD@AWS describes as a custom ELB with some customer-requested secret sauce (maybe a "slow start" toward some backend servers?). Indeed, we have reached out to AWS a few times in the past and had some magic done to our ELBs (at the time, ciphers and removing SSL v2).

We have seen performance issues with ELB, which is why we originally went with TCP-mode HAProxy at the edge of our stack in front of a cluster of STunnel servers; but again, reliability was an issue there, and our ELB performance at up to 10K rps looks great in benchmarks. Past 10K, we are considering separate dispatchers behind a separate ELB. But at that point I am also tempted, frankly, to switch to our own metal.

Curious: are you comparing ELB performance vs High I/O EC2 instances (say m1.xlarge) open to the world?


Well, I didn't know you could fine-tune your ELBs. In fact, reading the Developer Guide (http://awsdocs.s3.amazonaws.com/ElasticLoadBalancing/latest/..., page 46), it seems it's now possible to choose SSL protocols and ciphers via the web interface (and, I suppose, also via the API).

Regarding the comparison you suggest: we are too short-handed right now not only for that kind of comparison but even to think about managing our own load balancers ;) Thanks for the suggestion anyway.


Thanks, we're fixing it now.


I like the apologetics for node.js in the first section. A similar argument for why you're using JRuby instead of node.js for the "backend whatchamajigs" would be nice.


The short answer is: Use of synchronous Redis calls made it easier to develop and to experiment with storage patterns.

Workers can be written in any language, which was another major design consideration. We have about a dozen types of worker right now; when we want to evaluate a new language, we can port just one. Alternatively, we can write new worker types in an arbitrary language.

Thus we're not married to JRuby.


I saw a presentation on it last night and it looked pretty awesome. We're planning to use their Messaging API in our upcoming project.


Cool. What are you planning to build, if you don't mind me asking?


One of our client projects involves having customer-service staff talk with paying users to resolve payment issues. The messaging API means we have one less thing to worry about, letting us focus on handling payment resolution rather than the communication side of things.


Software architecture stories without pictures make me sad :-(



