Answering your questions about Heroku routing and web performance (heroku.com)
205 points by adamwiggins on April 3, 2013 | 160 comments



It seems to me that Heroku is still failing to understand (or at least cop to) the fact that the switch from intelligent to randomized routing was a loss of a major reason people chose Heroku in the first place.

A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.

Nobody's arguing random routing isn't easier and more stable; of course it is.

The problem is that by switching over to it, Heroku gave up a major selling point of their platform. Are they really blind enough not to know this? I have a hard time believing that.

It seems to me the real way to make people happy is to discount the "base" products which come with random routing and make intelligent routing available as a premium feature. Of course, people who thought they were getting intelligent routing should be credited.


I hear you. Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.

However, if someone chose us purely because of a routing algorithm, we probably weren't a great fit for them to begin with. We're not selling algorithms, we're selling abstraction, convenience, elasticity, and productivity.

I do see that part of the reason this topic has been very emotional (for us inside Heroku as well as our customers and the community) is that the Heroku router has historically been seen as a sort of magic black box. This matter required peeking inside that black box, which I think has created a sense of some of that magic being dispelled.


I doubt anyone chose Heroku solely because of the routing algorithm, but the sales point of intelligent routing was certainly an appealing one.

Developers don't want to pay for abstraction just for abstraction's sake, they want to pay for abstraction of difficult things. Intelligent routing is one of those difficult things. Random routing is easy, which is of course why it's also more reliable, but also why you're seeing people feeling like they didn't get what they paid for.

I should be clear: this doesn't affect me personally, but I am totally sympathetic with the customers who are bent out of shape about this. I still see a divide between your response and why people are upset, and I'm trying to help you bridge that divide.


Well said.

It's interesting — very few customers are actually bent out of shape about this. (A few are, for sure.) It's more non-customers who are watching from the sidelines that seem to be upset. I do want to try to explain ourselves to the community in general, and that's what this post was for. But my first loyalty is to serving our current customers well.


What about potential customers? I've been evaluating your platform and am just completely frustrated, with 3 days wasted trying to solve these performance issues.

I'm using gunicorn with Python, and if I use the sync worker the request queue easily hits 10 seconds and nothing works; if I switch to gevent or eventlet, New Relic tells me Redis is stuck for the same amount of time just getting a value. This is the same code that works just fine with eventlet and scales well on my current provider.

To add insult to injury, adding dynos actually degrades performance.


That sucks. It may or may not be related to how the router works, but it's definitely about performance and visibility, which is what this is all about.

Can you email me at adam at heroku dot com and I'll connect you with one of our Python experts? I can't promise we'll solve it, but I'd like to take a look.


I'd be really interested to see some public information resulting from debugging Python apps. We're holding pretty steady, but see a fairly constant stream of timeouts due, apparently, to variance in response times. To be sure, we're working on that. But, in the meantime, our experiments with gevent and Heroku have been less than inspiring.


I've connected Nathaniel (poster above) with one of our Python folks. Looking forward to seeing what they discover.

Would you be willing to pair up with one of our developers on your app's performance? If so email me (adam at heroku dot com).


I'm an existing customer using python with gunicorn. I'd be very keen to see any learnings about an optimal setup.

Fwiw, I've found the addon / db connection limits to be the primary blocker when load testing so far.


"It's interesting — very few customers are actually bent out of shape about this"

Seems to me you don't get it. Sure there are some very vocal non-customers but you also have a lot of potential customers and users (spinning up free instances) evaluating your product and hoping for a better product. I agree that your true value is the abstraction you provide. Some of these potential customers want to ensure Heroku is as good an abstraction as promised to justify the cost and commitment.


Fair enough. I think the best thing we can do for those potential future customers is be really clear about what the product does and give them good tools and guidance to evaluate whether it fits their needs.

I'd argue that we dropped the ball on that before (on web performance, at least), and are rectifying it now.


If it's only a few customers that are bent out of shape, how come you haven't quickly offered them refunds?


We did.


When did this happen? As of March 2, Rap Genius was still seeking to get money refunded.

http://venturebeat.com/2013/03/02/rap-genius-responds/


Just in the last few weeks. I won't disclose details for any particular customer, but I assume that Tom @ RG will make a public statement about it at some point.


You sound like a politician talking to someone of the opposite party, in that you say "I hear you" but then completely fail to address anyone's concerns. Selling a "magic black box" that guarantees certain properties, then changing those properties and denying the change, creates a liability for users who want to do serious work.

A major selling point of Heroku is that scaling dynos wouldn't be a risk. That guarantee is now gone, and it won't come back soon even if the routing behavior is reverted, because what users value is good communication and trust with their providers. Heroku's responses have been blithe non-acknowledgement acknowledgements of this problem.


This is really unfair. This comment:

>A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.

is being addressed by Adam in this comment:

>Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.

I think Adam is getting at a really fair point here, which is that nobody really minds which particular algorithm is used. If A-Company uses "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and shorter queue times, who are you going to choose? You're going to choose B-Company.

At the end of the day, "intelligent routing" is really nothing more than a feather in your cap. People care about performance. That's what started this whole thing - lousy performance. Better performance is what makes it go away, not "intelligent routing."


Intelligent routing and random routing have different Big O properties. For someone familiar with routing, or someone who's looked into the algorithmic properties, "intelligent routing" gives one high-level picture of what the performance will be like (good with sufficient capacity), whereas random routing gives a different one (deep queues at load factors where you wouldn't expect deep queues).

This is why it was good marketing for Heroku to advertise intelligent routing, instead of just saying 'oh it's a black box, trust us'. You need to know, at the very least, the asymptotic performance behavior of the black box.

And that's why the change had consequences. In particular, RapGenius designed their software to fit intelligent routing. For their design, the number of dynos needed to guarantee good near-worst-case performance increased with the square of the load, and my back-of-the-envelope math suggests the average case increases as O(n^1.5).

The original RapGenius post documents the numbers here: http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics

The alleged fix, "switch to a concurrent back-end", is hardly trivial and doesn't solve the underlying problem of maldistribution and underutilization. Maybe intelligent routing doesn't scale but 1) there are efficient non-deterministic algorithms that have the desired properties and 2) it appears the old "doesn't scale" algorithm actually worked better at scale, at least for RapGenius.


>If A-Company is using "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and slower queue times, who are you going to choose?

That's not the point.

As I think of it, "performance" is an observation of a specific case while intelligent/random can be used to predict performance across all cases.


Harsh. I'm going to ignore the more inflammatory parts of this (feel free to restate if you want me to engage in discussion), but one bit did grab my attention:

"A major selling point of Heroku is that scaling wouldn't be a risk"

This is interesting, especially the word "risk." Can you expand on this?


Very small customer here. I don't know much about the abstraction you provide us and I don't want to know as long as things go well.

To my point of view, the routing is "random" thus kind of unpredictable. If scaling becomes more of an issue with my business, the last thing I want is to have random scaling issues that I can not do anything about because the load balancer is queuing requests randomly to my dynos.

I want my business to be predictable and if I'm not able to have it I'm going to pack my stuff and move somewhere else.

For now, I'm happy with you except for your damn customer service. They take way too long to answer our questions!

Cheers! :)


Absolutely right, I totally agree. Random scaling issues that you can't see or control are exactly the opposite of what we want to be providing.

Can you email me (adam at heroku dot com) some links to your past support tickets? I'd like to investigate.

Thanks for running your app with us. Naturally, I expect you to move elsewhere if we can't provide you the service you need. That's the beauty of contract-free SaaS and writing your app with open standards: we have to do a good job serving you, or you'll leave. I wouldn't want it any other way.


[deleted]


Very clarifying, thanks.

Fire-fighting during the scaling phase is a problem that every fast-growing software-as-a-service business will probably have to face. I think Heroku makes it easier, perhaps way easier; but I hope our marketing materials etc have not implied a scaling panacea. Such a thing doesn't exist and is most likely impossible, in my view.


My company has been building apps for startups for years, and I can confirm that Heroku is consistently perceived as a "I never have to worry about scaling" solution.


Very useful observation. I'd love to figure out how we can better communicate that while we aim to make scaling fast and easy, "you never have to worry about scaling" is much too absolute.


Your marketing materials clearly use phrases like "forget about servers", "easily scale to millions of users", "scale effortlessly" and so forth. You're making it very easy to misunderstand you.


"the switch from intelligent to randomized routing"

As I understand it, Heroku (on the Bamboo stack) didn't up and decide "Hey, we're gonna switch from intelligent to random routing." The routers are still (individually) intelligent. It's just that there are more of them now, and they were never designed to distribute their internal queue state across the cluster. The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases.
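To make that degradation concrete, here's a toy Python simulation (my own sketch, not Heroku's code): each router picks the dyno it believes is least loaded, but it only knows about the requests it has dispatched itself, and completions are ignored entirely. As the number of routers grows, each router's view gets sparser and the worst-case dyno load drifts toward what pure random assignment would give.

    import random
    from collections import Counter

    def assign(num_dynos, num_requests, num_routers):
        """Each router balances using only the requests it has personally
        dispatched -- no shared state, and completions are ignored."""
        per_router = [Counter() for _ in range(num_routers)]
        global_load = Counter()
        for _ in range(num_requests):
            r = random.randrange(num_routers)
            # the dyno this router *believes* is least loaded (random tie-break)
            dyno = min(range(num_dynos),
                       key=lambda d: (per_router[r][d], random.random()))
            per_router[r][dyno] += 1
            global_load[dyno] += 1
        return max(global_load.values())

    random.seed(0)
    for routers in (1, 4, 16, 64):
        worst = sum(assign(100, 400, routers) for _ in range(50)) / 50
        print(f"{routers:3d} routers -> avg worst-case dyno load {worst:.1f}")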


True, although it's more confusing than this. We didn't make an explicit decision for Bamboo, and that's the big problem: at a minimum, the docs fell out of date.

For Cedar, we did make an explicit decision. People on the leading edge of web development were running concurrent backends and long-running requests. Our experimental support for Node.js in 2010 was a big driver here, but so were people who wanted to use concurrency in Ruby, with Ilya Grigorik's Goliath and Rails becoming threadsafe. These folks complained that the single-request-per-backend algorithm was actually in their way.

This plus horizontal scaling / reliability demands caused us to make an explicit product decision for Cedar.


> true

That makes sense. I was wondering how intelligent routing was implemented in the first place.


Oh, got it. How's this:

In the early days, Heroku only had a single routing node that sat out front. So it wasn't a distributed systems problem at that point. You could argue that Heroku circa 2009 was more of a prototype or a toy than a scalable piece of infrastructure. You couldn't run background workers, or large databases. We weren't even charging money yet.

Implementing a single global queue in a single node is trivial. In fact, this is what Unicorn (and other concurrent backends) do: put a queue within a single node, in this case a dyno. That's how we implemented it in the Heroku router (written in Erlang).

Later on, we scaled out to a few nodes, which meant a few queues. This was close enough to a single queue that it didn't matter much in terms of customer impact.

In late 2010 and early 2011 our growth started to really take off, and that's when we scaled out our router nodes to far more than a handful. And that's when the global queue effectively ceased to exist, even though we hadn't changed a line of code in the router.

The problem with this, of course, is that we didn't give it much attention because we had just launched a new product which made the explicit choice to leave out global queueing. It's this failure to continue full stewardship of our existing product that's the mistake that really hurt customers.

So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of. It was a single node's queue, a page of code in Erlang which is not too different from the code that you'll find in Unicorn, Puma, Gunicorn, Jetty, etc.
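For anyone curious what "a page of code" means here, a minimal sketch of the single-node idea in Python (an illustration of the shape only, not the actual Erlang router):

    import queue

    class SingleNodeRouter:
        """Toy single-process 'global queue': requests wait in one FIFO and
        are only handed to a backend with nothing in flight (the old
        one-request-per-dyno model).  Easy when all the state lives in one
        process; the hard part is keeping this behaviour once there are many
        router nodes, each with only a partial view."""

        def __init__(self, backends):
            self.pending = queue.Queue()   # the single global request queue
            self.idle = queue.Queue()      # backends that are currently free
            for backend in backends:
                self.idle.put(backend)

        def submit(self, request):
            self.pending.put(request)

        def dispatch_one(self):
            request = self.pending.get()   # blocks until a request arrives
            backend = self.idle.get()      # blocks until some backend is free
            return backend, request        # caller sends, then calls mark_idle

        def mark_idle(self, backend):
            self.idle.put(backend)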


> So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of.

This sentence is really good, and I would humbly suggest you consider hammering it home even more than you have.

I gathered early on that there were inherent scaling issues with the initial router (which makes sense intuitively if you think about Heroku's architecture for more than 10 seconds), but I feel like most of the articles I've seen the past few weeks have this "Heroku took away our shiny toys because they could!" vibe. (Alternative ending: "Heroku took away our shiny toys to expand their market to nodeJS!")

Anyway, that's my take.


> So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of.

So it was an oversimplified system that worked great but wasn't scalable and was at some point going to completely fall over under increasing load.

IMExp, this is not a wrong thing to build initially and it's not wrong to replace it either. But the replacement is going to have a hard time being as simple or predictable. :-)


Not a wrong thing to build initially, but perhaps a wrong thing to advertise a feature based on, unless you have a plan for how to continue to deliver that feature as you scale up.


Everybody who scales rapidly has some growing pains, so I'm sympathetic. But I agree that by advertising that as a feature they're specifically asking customers to outsource this hard problem to them.


I don't understand how the system behaves more intelligently when the edge routers increase. The core routers' random behavior gets worse with the larger load, and increasing the number of intelligent routers doesn't help to solve that problem in any meaningful way.

Sorry, correct me if I missed something, but I believe that as the overall volume of system transactions increases (thus necessitating more "intelligent" nodes) the volume of random dispersal from the core routers increases as well, which can create situations like what we saw with RapGenius where some requests are 100ms and others are 6500ms (the reason being that random routing is not intelligent and can assign jobs to a node that's completely saturated). Adding more and more intelligent nodes doesn't solve the crux of the issue, which is the random assignment of jobs in the core routers to the "intelligent" routers/nodes.

This whole situation boils down to this: "Intelligent routing is hard, so fuck it" and that's why everyone is pissed off. Heroku could've said "hey Intelligent routing is hard so we're not doing that anymore" but instead they just silently deprecated the service. It's a textbook example of how to be a bad service provider.


> "Intelligent routing is hard, so fuck it"

Ok, let's really dig in on this. Is this truly a case of us being lazy? We just can't be bothered to implement something that would make our customers' lives better?

The answer to these questions is no.

Single global request queues have trade-offs. One of those tradeoffs is more latency per request. Another is availability on the app. Despite the sentiment here on Hacker News, most of our customers tell us that they're not willing to trade lower availability and higher latency per request for the efficiencies of a global request queue.

Are there other routing implementations that would be a happy medium between pure random (routers have no knowledge of what dynos are up to) and perfect, single global queue (routers have complete, 100% up-to-date knowledge of what dynos are up to)? Yes. We're experimenting with those; so far none have proven to be overwhelmingly good.

In the meantime, concurrent backends give the ability to run apps at scale on Heroku today; and offer other benefits, like cost efficiencies. That's why we're leaning on this area in the near term.


most of our customers tell us that they're not willing to trade lower availability and higher latency per request

What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?

Also why would that mean "lower availability" or "higher latency"? Did you look into zookeeper?


> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?

This is how it works. Dynos register their presence into a dyno manager which publishes the results into a feed, and then all the routing nodes subscribe to that feed.

But dyno presence is not the rapidly-changing data which is subject to CAP constraints; it's dyno activity, which changes every few milliseconds (e.g. whenever a request begins or ends). Any implementation that tracks that data will be subject to CAP, and this is where you make your choice on tradeoffs.

> why would that mean "lower availability" or "higher latency"?

I'll direct you back to the same resources we've referenced before:

http://aphyr.com/posts/278-timelike-2-everything-fails-all-t...
http://ksat.me/a-plain-english-introduction-to-cap-theorem/

> Did you look into zookeeper?

This is the best question ever. Not only did we look into it, we actually invested several man-years of engineering into building our own Zookeeper-like datastore:

https://github.com/ha/doozerd

Zookeeper and Doozerd make almost the opposite trade-off as what's needed in the router: they are both really slow, in exchange for high availability and perfect consistency. Useful for many things but not tracking fast-changing data like web requests.


Hm. Until now I thought dyno-presence was your issue, but now I realize you're talking about the actual "leastconn" part, i.e. the requests queueing up on the dynos themselves?

If that's what you actually mean then I'd ask: Can't the dynos reject requests when they're busy ("back pressure")?

AFAIK that's the traditional solution to distributing the "leastconn" constraint.

In practice we've implemented this either with the iptables maxconn rule (reject if count >= worker_threads), or by having the server immediately close the connection.

What happens is that when a loadbalancer hits an overloaded dyno the connection is rejected and it immediately retries the request on a different backend.

Consequently the affected request incurs an additional roundtrip per overloaded dyno, but that is normally much less of an issue than queueing up requests on a busy backend (~20ms retry vs potentially a multi-second wait).
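A minimal sketch of that reject-when-busy idea, as a toy Python server (illustrative only; real deployments would use iptables limits or the app server's own connection caps, as described above):

    import socket
    import threading

    MAX_IN_FLIGHT = 2          # e.g. the number of worker threads on this dyno
    slots = threading.Semaphore(MAX_IN_FLIGHT)

    def serve(conn):
        try:
            conn.recv(4096)    # read the request (toy parsing)
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        finally:
            conn.close()
            slots.release()

    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8080))
    listener.listen(64)

    while True:
        conn, _ = listener.accept()
        if not slots.acquire(blocking=False):
            conn.close()       # busy: refuse immediately, so a retrying
            continue           # leastconn balancer goes to another backend
        threading.Thread(target=serve, args=(conn,)).start()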

PS: Do you seriously consider Zookeeper "really slow"?! http://zookeeper.apache.org/doc/r3.1.2/zookeeperOver.html#Pe...


Note: Just a bystander here

> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application

I suspect this is a consequence of the CAP theorem. You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue state and then updating that state atomically when routing a request. Now consider the failure modes that such a system can enter and how they affect latency. Best not to go there.

My understanding is that Apache Zookeeper is designed for slowly-changing data.


You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue

But that's not true. Only the loadbalancers concerned with a given application need to share that state amongst one another. And the number of loadbalancers per application is usually very small. I.e. the number is 1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).

Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science. I'm a little surprised by the Heroku response here, hence my question about which constraint I might have missed.

My understanding is that Apache Zookeeper is designed for slowly-changing data.

Dyno-presence per application is very slowly-changing data by zookeeper standards.


Again, I'm no expert on Heroku's architecture. Just thinking out loud here, and feel free to tell me to RTFA. :-)

> the number of loadbalancers per application is usually very small. I.e. the number is 1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).

So most Heroku sites have only a single frontend loadbalancer doing their routing, and even these cases are getting random routed with suboptimal results?

Or is the latency issue mainly with respect to exactly those popular sites that end up using a distributed array of loadbalancers?

> Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science.

To me the short history of "cloud-scale" (sorry) app proxy load balancing shows that very well-resourced and well-engineered systems often work great and scale great, that is until some weird failure mode unbalances the whole system and response time goes all hockey stick.

> Dyno-presence per application is very slowly-changing data by zookeeper standards.

OK, but instantaneous queue depth for each and every server? (within a given app)


You seem to have misread the above comment. Here's what it actually said:

The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases

The point is that this process was gradual and implicit, so there's no point at which the intelligence was explicitly "deprecated". That doesn't excuse how things ended up, but it does explain it to some degree.


Er, I wasn't trying to claim that the system behaves more intelligently. The performance degradation is just an emergent consequence of gradually adding more nodes to the routing tier. This post might be useful for some context: http://aphyr.com/posts/277-timelike-a-network-simulator


Is anybody operating at Heroku's scale offering centralized request routing queues? At what price?


Not that I know of but that's why I'm saying it would be a premium product. Likely pricing would have to scale with the number of dynos running behind the router.

But that's the service people thought they were getting and what they wanted.

If Heroku prices out the intelligent routing and says, "Ok, you can have intelligent routing with your current backend stack, but it's going to cost you $25/mo for every 10 dynos, or you can switch your stack and use randomized routing for free," then they are empowering their customers to make the choice rather than dictating to them what they should do.


If it's truly impossible to get centralized request routing queues at Heroku's scale in any other product offering, that is evidence that a demand that Heroku provide it might be unreasonable.

Aside from that, I am extremely sympathetic to Heroku's engineering point here --- it's obviously hard for HN to extract the engineering from the drama in this case! Randomized dispatch seems like an eminently sound engineering solution to the request routing problem, and the problems actually implementing it in production seem traceable almost entirely to††† the ways Rails managed to set back scalable web request dispatch by roughly a decade††††.

††† IT IS ALL LOVE WITH ME AND THIS POINT COMING UP HERE...

†††† ...it was probably worth it!


Random routing vs fully centralized request routing is a false dichotomy. Suppose you have 100 nodes, and you have a router that routes randomly to one of those 100 nodes. This works very poorly. Now suppose you have 100 nodes, and you have a router that routes intelligently to one of those 100 nodes, e.g. to the one with the smallest request queue. From a theoretical perspective this works really well, but it may be impossible to implement efficiently.

The solution is to combine the two approaches. You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently. This works really well. The probability of one of the request queues filling up is astronomically small, because for a request queue to fill up, all 10 request queues in a group have to fill up simultaneously (and as we know from math, the chance that an event with probability p occurs at n places simultaneously is exponentially small in n). Even if you route randomly to 50 groups of 2, that works a lot better than routing randomly to 100 groups of 1 (though obviously not as well as 10 groups of 10). There is a paper about this: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
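A quick balls-into-bins style simulation of that grouping idea in Python (my own toy model: requests never complete, so it only shows the static queue-depth effect):

    import random

    def avg_max_queue_depth(num_dynos, group_size, num_requests, trials=200):
        """Pick a group of dynos at random, then send the request to the
        least-loaded dyno in that group.  group_size=1 is pure random
        routing; group_size=num_dynos is one global least-loaded queue."""
        assert num_dynos % group_size == 0
        groups = [range(i, i + group_size)
                  for i in range(0, num_dynos, group_size)]
        total = 0
        for _ in range(trials):
            load = [0] * num_dynos
            for _ in range(num_requests):
                group = random.choice(groups)
                load[min(group, key=lambda d: load[d])] += 1
            total += max(load)
        return total / trials

    random.seed(1)
    for size in (1, 2, 10, 100):
        print(f"groups of {size:3d}: avg max queue depth "
              f"{avg_max_queue_depth(100, size, 300):.1f}")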

This is essentially what they are suggesting: run multiple concurrent processes on one dyno. Then the requests are routed randomly to a dyno, but within a dyno the requests are routed intelligently to the concurrent processes running on that dyno. There are two problems with this: (1) dynos have ridiculously low memory so you may not be able to run many (if any) concurrent processes on a single dyno (2) if you have contention for a shared resource on a dyno (e.g. the hard disk) you're back to the old situation. They are partially addressing point (1) by providing dynos with 2x the memory of a normal dyno, which given a Rails app's memory requirements is still very low (you probably have to look hard to find a dedicated server that doesn't have at least 20x as much memory).
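For a Python app the same "intelligent within the dyno" effect comes from the pre-fork model, e.g. gunicorn (mentioned upthread) with several workers sharing the dyno's listen socket. A minimal config sketch; the specific values are illustrative assumptions, not recommendations:

    # gunicorn.conf.py -- illustrative values only; tune to the app's memory use
    workers = 2               # pre-forked processes sharing the dyno's socket
    worker_class = "gevent"   # or "sync" if the app isn't gevent-safe
    worker_connections = 100  # per-worker concurrency (async classes only)
    timeout = 30              # recycle a worker stuck on one slow request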

They could be providing intelligent routing within groups of dynos (say groups of 10) and random routing to each group, but apparently they have decided that this is not worth the effort. Another thing is that apparently their routing is centralized for all their customers. Rapgenius did have what, 150 requests per second? Surely that can even be handled by a single intelligent router if they had a dedicated router per customer that's above a certain size (of course you still have to go to the groups of dynos model once a single customer grows beyond the size that a single intelligent router can handle).


I understand and don't disagree with everything you are saying, but the focus of my attention is on what you're talking about in your 3rd graf. When you talk about your example problems (1) and (2) with routing to concurrent systems on large number of dynos, what you're really discussing is an engineering flaw in the typical Rails stack.

There's a tradeoff between:

* a well-engineered request handler (a solved problem more than a decade ago) and

* an efficient development environment (arguably a nearly-unsolved problem before the Rails era)

And I feel like mostly the Heroku drama is a result of Rails developers not grokking which end of that tradeoff they've selected.


I'm not sure I agree. Yes, it's a Rails problem that it is using large amounts of memory (on the other hand (2) isn't Rails specific at all, it applies equally to e.g. Node). But it's a Heroku problem that it gives Dynos just 512MB of memory. It's a Heroku problem that it doesn't have a good load balancer. Heroku is in the business of providing painless app hosting, and part of that is painless request routing. These problems may not be completely trivial to solve, but they're not rocket science either. Servers these days can hold hundreds of gigabytes of memory, the 512MB limitation is completely artificial on Heroku's part. Intelligent routing in groups is also very much achievable. Sure, it requires engineering effort, but that's the business Heroku is in.

Of course Heroku is under no obligation to do anything, but its customers have to justify its cost and low performance relative to a dedicated server. And most applications run just fine on a single or at most a couple dedicated servers, which means you don't have routing problems at all, whereas to get reasonable throughput on Heroku you have to get many Dynos, plus a database server. A database server with 64GB ram costs $6400 per month. You can get a dedicated server with that much ram for $100 per month. Heroku is supposed to be worth that premium because it is convenient to deploy on and scale. Because of these routing problems which may require a lot of engineering effort in your application it's not even clear that Heroku is more convenient (e.g. making it use less memory so that you can run many concurrent request handlers on a single Dyno).


If there is another provider that seamlessly operates at Heroku's scale (ie, that can handle arbitrarily busy Rails apps) at a reasonable price that has better request dispatching, I think it's very easy to show that you're right.

I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.

As a system for efficiently handling database-backed web requests, Rails is archaic. Not just because of its memory use requirements! It is simultaneously difficult to thread and difficult to run as asynchronous deferrable state machines.

These are problems that Schmidt and the ACE team wrote textbooks about more than 10 years ago.

(Again, Rails has a lot of compensating virtues; I like Rails.)


I certainly already agreed that Rails' architecture is bad (though the reason that it has this problem is its memory usage, and not any of the other reasons you mention). Heroku's architecture is bad as well. It's the combination of these that causes the problem. But that does not mean that it's impossible, or even hard, to solve the problem at Heroku's end.

> I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.

This is not sound logic. I described above two methods for solving the problem: (1) increase the memory per Dyno (see below: they're doing this, going from 512MB to 1GB per Dyno IIRC, which although still low will be a great improvement if that means that your app can now run 2 concurrent processes per Dyno instead of 1), or (2) do intelligent routing for small groups of Dynos. Do you understand the problem with random routing, and why either of these two would solve it? If not you might find the paper I linked to previously very interesting:

"To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately log n / log log n with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of d >= 2 bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is log log n / log d + Θ(1) with high probability [ABKU99].

The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e., d = 2) yields a large reduction in the maximum load over having one choice, while each additional choice beyond two decreases the maximum load by just a constant factor."

-- http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...


I understand that one approach to dispatching requests at the load balancer is superior to the other, just as I understand that one way of absorbing requests at the app server is better than the other.

Most things are inferior to other substitutable things! :)


That's a mild way of putting it. With the current way of dispatching requests you need exponentially many servers to handle the same load at the same queuing time, if your application uses too much memory to run multiple instances concurrently on a single server.


I work at Heroku. To address your concerns about memory limitations, know that we're fast-tracking 2X dynos (this is also mentioned in the FAQ blog post). Extra memory will make it easier to get more concurrency out of each dyno.


Yes, that will be a huge improvement!


"You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently."

And here we've re-invented the airport passport checking queue - everybody hops onto the end of a big long single queue, then near the front you get to choose the shortest of the dozen or two individual counter queues

I wonder what the hybrid intelligent/random queue analogues of the in-queue intelligence gathering and decision making you can do at the airport might be? "Hmmm, a family with small children, I'll avoid their counter queue even if it's shortest"; "a group of experienced-looking business travellers, they'll probably blow through the paperwork quickly, I'll queue behind them." I wonder if it's possible/profitable to characterize requests in the queue in those kinds of ways?


$25 a month? Did you forget a few zeroes?


It was just a placeholder price. :)


Amazon ELB? It does cost significantly more than Heroku AFAIK.


My understanding is that ELBs are HAProxy, and they may be set to use the leastconn algorithm (a global request queue that is friendly to concurrent backends). However, once you get any amount of traffic they start to scale out the nodes in the ELB, which produces essentially the same results as the degradation of the Bamboo router that we've documented.

The difference, of course, is that ELBs are single-tenant. So a big app might only end up with half a dozen nodes, instead of the much larger number in Heroku's router fleet.

Offering some kind of single-tenant router is one possibility we've considered. Partitioning the router fleet, homing... all are ideas we've experimented with and continue to explore. If one of these produces conclusive evidence that it provides a better product for our customers and in keeping with the Heroku approach to product design, obviously we'll do it.


I hope you'll be able to share your findings with us, even if they're negative. As someone who has no stake in Heroku, i have the luxury of finding this problem simply interesting!

My hypothesis is that tenant-specific intelligent load balancers would be plausible; i would guess that you would never need more than a handful of HAProxy or nginx-type balancers to front even a large application. Your main challenge would then be routing requests to the right load balancer cluster. If you had your own hardware, LVS could handle that (i believe that Wikipedia in 2010 ran all page text requests through a single LVS in each datacentre), but i'm not sure what you do on EC2.

However, "hypothesis" is just a fancy way of saying "guess", which is why your findings from actual experiments would be so interesting.


ELBs have a least-conn per node routing behavior. If your ELB is present in more than 1 AZ, then you have more than one node. If you have any non-trivial amount of scale, then you probably have well more than 1 node.


Random routing will work fine so long as the operating system has the ability to intelligently schedule the work. The problem is that the dynos were set up to handle requests sequentially. Unicorn will help mitigate the problem, but the best solution is to not try and serve webpages from hardware that is about as powerful as an old mobile phone. The 2x dynos are a step in the right direction, but I have no idea why they don't offer a real app server like a 16x dyno.


It's probably easy to guess that once we have 2X dynos, 4X dynos (and up) may be on the way. We'll drive this according to demand, so if you need/want dynos of a particular size, drop us a line.


I'm sure at least one person's job depends on this working out in the direction they've been trying to go, cf. Blaine Cook (who I think got a raw deal).


[deleted]


http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics

There are screenshots of the website where it said that in multiple places.


Unless I'm entirely misunderstanding the situation, this was definitely the premise of the "intelligent routing mesh" that Heroku used for so long on the Bamboo stack.


Haters gonna hate.

I wonder how many people bitching here are actual customers who are having problems that haven't been addressed with a solution. I'm guessing that number is low.

Oh, you're a potential customer? That's why you're bitching? About a problem you may or may not have if you actually choose the product? Think about that argument for a second.

I've never seen such a transparent response and follow up as I have from Heroku on this issue. Most other companies would have gone into immediate damage mitigation mode and let the wound heal instead of re-opening it and giving feedback on how to fix the problem as Heroku has.

I applaud the Heroku team for their effort on their platform and being a kickass company.


I'm a real customer with real problems. Grep this page for latchkey's description of it.

The funny thing is, I don't have much sympathy for Rails users. Scaling problems with a single-threaded, serial request-processing architecture? No surprise there. But we have inexplicable H12 problems with Node.js. There's something broken in the system and it isn't random routing.


You're talking like Node offers real concurrency. It doesn't.


There's nothing wrong with Node's concurrency. Our app, like most webapps, is I/O bound. Any individual instance should be able to handle thousands of concurrent requests as long as they are all blocked on I/O.

Being able to process more than one concurrent request (as Node can) is "real concurrency". Java-style native threading is a step above and beyond this, and unnecessary for most web applications.


Random routing to concurrent servers works fairly well if the kind of long-running requests you need to worry about spend a lot of time blocking on some external service (e.g. a database call). Then you get a lot of benefit from cooperative or preemptive multitasking on the server, so the performance characteristics of each server, from the point of view of a new request, are roughly the same, and going to a random server is pretty good.

However, if you have long running requests because they make intensive use of server resources (CPU, RAM) then concurrent servers buys you very little. In that case, sending a new request to a server that is chugging along on a difficult problem is significantly different than sending it to one that isn't. That's where knowing the resource state of each server, and routing accordingly is of huge benefit.

While load balancing is a very difficult problem, with some counterintuitive aspects, it is an area of active research, and there are some very clever algorithms out there.

For example, this article (http://research.microsoft.com/pubs/153348/idleq.pdf) from UIUC and Microsoft introduces the Join-Idle-Queue, which is suitable for distributed load balancers, has considerably lower communication overhead than Shortest Queue (AFAICT the original 'intelligent' routing algorithm), and compares its characteristics to both SQ and random routing.
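Roughly, Join-Idle-Queue inverts the communication: idle servers push their availability to a dispatcher instead of the dispatcher polling loads. A simplified Python sketch of that idea (paraphrasing the paper, with a placeholder `handle` interface):

    import random
    from collections import deque

    class JIQDispatcher:
        """Simplified Join-Idle-Queue dispatcher: keep a queue of servers
        that have reported themselves idle; send to a known-idle server when
        one exists, otherwise to a random server.  Dispatchers never poll
        servers for their load."""

        def __init__(self, servers):
            self.servers = list(servers)
            self.idle = deque()        # the "I-queue" of idle servers

        def dispatch(self, request):
            server = (self.idle.popleft() if self.idle
                      else random.choice(self.servers))
            server.handle(request)     # `handle` is a placeholder interface

        def report_idle(self, server):
            # a server calls this on one dispatcher when it runs out of work
            self.idle.append(server)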


All of the stuff that Heroku are doing now to mitigate the routing issues for their Bamboo customers are things they should have done when it first became an issue. They are not going above and beyond in any way to make up for the time and money their customers lost.

Again, this is not about

- how advanced Heroku's technology is on an absolute level

- how challenging routing is for their scale

- what competitors offer in this space and for what prices

This is only about the delta between what Heroku sold their customers and what their customers received. They are collapsing the delta now, by being honest about what they are selling (and improving their offering, it seems), but they are doing nothing to address the long time for which the delta was significant for a subset of their customers.


Your complaints are all about the past. What are you looking for here? For Heroku to invent a time machine?


How about refunds? Isn't that the normal response when a company screws up in a way that causes the customers to pay more money?


Yes, we've given credits (or a refund, at the customer's preference) in cases where lack of visibility or inaccurate docs led to over-provisioning of dynos.

There are actually very few cases where people paid more money than they would have otherwise. Heroku is a service with your bill prorated to the second. For the most part, if people don't like the performance (which is measurable externally via benchmarks and browser inspectors), they leave the service. Many people who hit problems with visibility and performance did exactly that.

Naturally, we'll be working hard to try to recapture their business, as well as to remove any reasons that existing customers might leave as they hit performance or visibility problems scaling up in the future.


They still fail to understand that using Unicorn doesn't magically fix this issue. Like, at all. It simply means that the dyno gets tied up when n+1 requests (n is the number of Unicorn workers) get randomly routed to it instead of just 1. It's in no way comparable to a Node.js server that handles thousands of concurrent requests asynchronously. They're simply two different designs, and Ruby's traditional design is fundamentally incompatible with Heroku's router.


That's not right. In any configuration the goal is to minimize queue time. What is critical to doing this is having a request queue and a pool of concurrent "workers" (to use the generic queueing theory term) in back of it.

Unicorn uses the operating system's TCP connection queue to queue incoming requests that it is not able to immediately serve. While n+1 requests can (and will) get routed to a single dyno, this only results in 1 request being queued. It will be queued until the first of the in-process requests is served, which will take ~ the average response time for the app. Given that the other n requests did not get queued (queue time = 0), the average queue time will equal Sum(queue times) / # requests = Average Response Time / (n+1).
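Plugging toy numbers into that model (purely illustrative: 3 workers and a 200ms average response time are assumptions, not measurements):

    # Toy numbers for the model described above -- assumptions, not measurements.
    n = 3                      # Unicorn workers on the dyno
    avg_response_ms = 200      # average in-process service time

    queue_times_ms = [0] * n + [avg_response_ms]   # only the (n+1)th request waits
    avg_queue_ms = sum(queue_times_ms) / (n + 1)
    print(avg_queue_ms)        # 50.0 ms, i.e. avg_response_ms / (n + 1)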


He forgot to explain why they won't be refunding customers who were defrauded.


Because they weren't victims of fraud.


I think you're right, and defraud is the wrong word, because fraud implies malicious intent. However, that's almost beside the point - many customers paid for the product that was advertised/documented and received something completely different. Whether or not it was intentional, it should be remediated rather than painted over.

Let's completely ignore the random vs. intelligent routing question for the moment and just talk about the New Relic analytics add-on. Heroku customers were paying ~$40 per month PER DYNO for this service. One of the most compelling things about New Relic is that it not only tracks average request time, but also breaks down this data into its components so developers can see which systems are slowing down their requests. Not only did it fail to report the queueing component, it failed to account for it in the total request time, meaning both values were wrong and basically useless.

I don't understand how Heroku can admit that the tools for which their customers paid obscene amounts of money were completely broken for two years straight and yet do nothing about it besides apologize. The fact that most customers did not realize it was broken does not mean it didn't cause real, tangible harm to their businesses.

It's hard to draw an analogy here since it's rare to use a product for so long without noticing it's not what you paid for, but I'll try: imagine you lease a car for two years. You can pay $500/month for the standard model with n tons of CO2 emissions/mile, or $800/month for a hybrid model with n/5 tons of emissions. You opt for the more expensive hybrid, but after driving the car around happily for two years, you pop the hood only to find that the company gave you the standard model. When you complain to the company, they apologize and give you the standard model. Wouldn't you expect them to also retroactively refund you $300/month for every month you paid for a product that you didn't receive? Does the fact that this was a genuine mistake, rather than an attempt to defraud, change your expectation for receiving a refund? It certainly wouldn't for me.


Fair enough, and good analogy. As you said it's hard to make comparisons between physical goods and a metered software service.

The New Relic question is tricky. The free version of NR includes queue time, so that implies that the incremental value you're getting from the paid service does not include this. I'm also not sure how "this product you've gotten for free has a bug" fits into this equation. But overall, yes, NR is a product that is designed to give you visibility, and due to incorrect data being reported, some of that visibility wasn't there.

Therefore, we've given credits to people who have had substantial financial impact of the sort you describe. There aren't very many in this category and I believe we've already covered them all, but if you believe you're in this category please email me: adam at heroku dot com


Thanks for the thoughtful reply, Adam. FWIW, I think you guys provide a very valuable service & I wish you the best as you work through these issues.


Is promising one service but delivering another not fraud?

Promises from the Heroku website pre-Rap Genius posts:

    "Incoming web traffic is automatically routed to web dynos, with intelligent distribution of load instantly"

    "Intelligent routing: The routing mesh tracks the availability of each dyno and balances load accordingly. Requests are routed to a dyno only once it becomes available. If a dyno is tied up due to a long-running request, the request is routed to another dyno instead of piling up on the unavailable dyno’s backlog."

    "Intelligent routing: The routing mesh tracks the location of all dynos running web processes (web dynos) and routes HTTP traffic to them accordingly."

    "the routing mesh will never serve more than a single request to a dyno at a time"
Actual service provided: requests are routed randomly to dynos regardless of how many requests they are currently handling or their current load.


Your definition of fraud differs from most: "Wrongful or criminal deception intended to result in financial or personal gain."

They weren't intentionally and purposefully misleading people. Not having up-to-date docs on your website, or not knowing how your own underlying backend works, is not fraud.

As I've mentioned before if every AWS customer could sue Amazon for not understanding how all of the underlying tech worked, or could sue when some of the docs were out of date, there would be more lawyers working there than engineers.


Also the misleading performance metrics which hid the fraud.


Seems like false advertising at least.


Even if that's true, you're just distracting from the main point. If I buy X from a company and they don't provide it, I expect a refund.

Like if I order a laptop from Dell and, oops, it gets lost in the mail, it's not okay to just say "oh, we changed our shipping company, so that should happen less in the future."


Your example is extremely clear-cut, but doesn't match the situation with Heroku very closely. If Heroku promised an intelligent router, but then it was always down and never served you any requests, that would make your Dell example a good analogue.

A closer comparison might be: What if Dell shipped you a laptop that they advertised as having an SSD with really good performance (and they thought that was true, or at least they did when they wrote the ad for the laptop), and it turns out that for your workload the performance isn't so great? Would you expect a refund then?


I think most people would expect a refund in that situation. However, wouldn't a better analogy be like if they advertised an SSD and shipped an HDD (allegedly without knowing the performance difference), and also the diagnostic tools that they shipped just happened to not report the additional latency (allegedly without knowing that it would be a problem)? Then customers had to figure out on their own that it actually was an HDD despite their documentation and diagnostic tools saying otherwise, after wasting tons of time trying to figure out why performance sucked?


The same way Apple doesn't refund the buyers of iPhone 4, at the launch of an iPhone 5?


This is a great response. I'm curious why something like this wasn't posted within 24 hours of RapGenius going public? I'd bet that a more thorough, technical reply would have mitigated a lot of the PR issues.


Agreed, I wish we could have done it much sooner. It took a shocking amount of time to sort through all the entangled issues, emotion, and speculation to try to get to the heart of the matter, which ultimately was about web performance and visibility.

Also, we wanted to respond to our customers first and foremost, and general community discussion second. So we spent close to a month on skype/hangout/phone with hundreds of customers understanding how and at what magnitude this affected their apps.

That was hugely time-consuming, but it gave us the confidence to speak about this in a manner that would be focused on customer needs instead of purely answering community speculation and theoretical discussions.


Thanks for replying. As a paying Heroku customer (who's not affected by the routing issue), while seeing a blog post earlier would have been nice, it's great to hear that you spent so much time with affected customers.


Glad to hear you're not affected. But we always like talking to customers, feel free to drop me a line at adam at heroku dot com if you'd ever like to spend a few minutes chatting on skype or jabber.


It's pretty much what they have been saying all along; Heroku switched to random routing to support stacks other than just Rails, and they kept the routing infrastructure between Cedar and Bamboo the same, resulting in an undocumented performance degradation for large customers still on the old stack or running non-concurrent web frameworks on the new one. The last part was an oversight on Heroku's part, but the rest makes sense, especially considering Cedar supports other stacks than just Rails, including massively concurrent ones (Node.js, Play, etc.).

The rest of the whole thing was people who were unsatisfied by the change and wanted the old product offering to return and/or felt shortchanged by Heroku. Luckily, Heroku doesn't lock you into their platform like GAE does, for example; it's essentially a bunch of shell scripts to deploy binaries on EC2 workers, a hosted instance of Postgres on EC2, some routing and ops as a service. Anyone who wasn't happy about the change could've contacted Heroku to say "you said we would be sold non-random routing, but you've sold us random routing" or moved their app away to another provider or even their own servers.

Does this suck if you wanted the old service? Sure. Witness the flak that the canning of Google Reader got and it's obvious that discontinuing old services isn't exactly a popular decision. On the other hand, you can self-host (it's massively cheaper than Heroku) or switch to another provider. And if there are no other providers that specialize in Rails hosting, isn't that a business opportunity?


When admitting fault in response to a customer that spends 5 figures/month with you, wise to have a lawyer go over it first.


I would argue that the brand risk -- e.g. the risk of what ended up happening -- outweighs the legal risk.

That's not to say I wouldn't send it to a lawyer. But I'd do it with some hustle. Something like: "This is going out in 24 hours. Please comment ASAP!"


As Adam mentioned in so many words: most of the people pissed off about all of this are most likely right here on HN, and not even customers of Heroku. He mentioned most customers were not that mad about the issue overall.


I'd say there's a major flaw here - "the new world of concurrent web backends" - If now was 1995 I might agree with you, but having concurrency in web-app servers is not new. The lack of concurrency in test/demo app situations is totally understandable; in a production environment you'd have to be completely bonkers to think that's ever useful. I also agree that "having to deal with loadbalancing" is something that most people don't get and shouldn't really have to get, but when the way you do loadbalancing is so fundamentally flawed that it's worse than Round Robin DNS, you also clearly don't understand it.

To be fair, I have to say I meet people all the time who don't understand it at all and think $randomcrappything is great at loadbalancing. If "connections go to more than one box!" is your metric, then yes, that's loadbalancing. My metric is "Do you send connections to servers that are responsive and not overloaded, maybe with session affinity," and in general most HW loadbalancer products since 1998 have supported that. So if you're not better than 1998 technology, you may want to reevaluate your solution.


I'd say there's a major flaw here - "the new world of concurrent web backends" - If now was 1995 I might agree with you, but having concurrency in web-app servers is not new.

Sadly, concurrency is relatively new and unfamiliar territory to many in the Ruby on Rails community.


Is it true that you don't fully buffer the body of a POST request on the router? This limitation works against Unicorn's design and makes it hardly a "fix": http://rubyforge.org/pipermail/mongrel-unicorn/2013-April/00...



Plenty of room for improvement here, sure. Some technical discussion in this thread: https://blog.heroku.com/archives/2013/4/3/routing_and_web_pe...


So their solution to the random routing is for the customer to switch to Unicorn/Puma on JRuby. Wow.


Yes, because that is the solution. Empirically.

We've run many experiments over the past month to try other approaches to routing, including recreating the exact layout of the Bamboo routing layer (which would never scale to where we are today, but just as a point of reference). None have produced results that are anywhere near as good as using a concurrent backend. (I'd love to publish some of these results so that you don't have to take my word for it.)

That said, we're not done. There are changes to the router behavior that could have an additive effect with apps running Unicorn/Puma/etc, and we'll continue to look into those. But concurrent backends are a solution that is ready and fully-baked today.


We also adopted Unicorn pretty early on but still suffered issues as dynos simply ran out of memory. In fact, with some apps we have seen improvements in stability (though still far from acceptable) by un-adopting this method. The issues raised by scottshea below consequently concern me: what will the charge be for these as well?

To be honest, it's the fumbling around in the dark that has annoyed me. I am with you 100% on your manifesto and your points about the type of service you provide. However, the time we have spent on this (starting before you came clean about the issues), and the time spent following increasingly suspicious advice to "up dynos" or spend time "optimising your app", sours this slightly. I accept that the "magic black box" comes with its compromises and requires understanding at our end, but it also means being far more communicative and honest about it at yours -- something I can see you are now putting right.

I for one think the premise of Heroku is a great one, and you have succeeded for us in many of the things you set out to achieve. This whole situation has been a real shame; I'm sure this must have been a pretty shoddy time for you guys and I hope you come out the better for it. The quicker the better, to be honest, so you can focus on the new features we'd like to see.


Thanks for your support. Indeed, communication and transparency into how the product works as far as how it affects your app are two things we'd like to get better at.

Regarding your app: indeed, Unicorn is a huge improvement, but far from the end of the story. "Performance" is like "security" or "uptime" — it's not a one-time feature, something you check off a list and move on. It's something that requires constant work, and every time you fix one problem or bottleneck that just leads you to the next one.

Over time, though, your vigilance pays off with a service that its users deem to be fast or secure or have good uptime. Yet there's no such thing as a finish line on these.

Bringing it back to details. Kazuki from Treasure Data made this Unicorn worker killer that might help you: https://github.com/kzk/unicorn-worker-killer If you're still not happy with your app's performance, give me a shout at adam at heroku dot com and we'll see if we can help.
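
For the curious, the gem is wired up as Rack middleware in config.ru, if I remember its README correctly; roughly something like the sketch below, where the thresholds are illustrative and MyApp::Application is a stand-in for your own app constant:

    # config.ru -- rough sketch; check the gem's README for the current API
    require ::File.expand_path('../config/environment', __FILE__)
    require 'unicorn/worker_killer'

    # Recycle each worker after it has served 500-600 requests (randomized
    # so all workers don't restart at once)...
    use Unicorn::WorkerKiller::MaxRequests, 500, 600

    # ...or once its resident memory crosses roughly 200-240 MB
    use Unicorn::WorkerKiller::Oom, (200 * 1024**2), (240 * 1024**2)

    run MyApp::Application

It doesn't fix the underlying memory pressure, but it does turn a slow death into a clean worker restart.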


Please publish these results. I think a chart showing that Unicorn + Random routing is better than Thin + Intelligent routing would go a long way to ending this whole thing. That's assuming that you can make deploying a Unicorn app as easy as it was with Thin ('git push heroku')


Having been part of their efforts to test 2X and 4X dynos, and having used Unicorn long before the RapGenius issue, I can tell you that even with the added memory, Unicorn still has issues. We still see periods where queueing is above 500ms. The additional dyno capacity spreads the chance of queue issues over a larger set, but there is still the possibility of one dyno/Unicorn worker combo getting too much to handle. We use Unicorn Worker Killer to help in that case.


I should add that using Unicorn Worker Killer to end a process on thresholds is not a solution as much as it is another stopgap.


We might. But what does this actually get us? It helps clear Heroku's name, but it doesn't help our customers at all. I'd prefer to spend our time and energy making customers' lives better.

Given the choice between continuing the theoretical debate over routing algorithms vs working on real customer problems (like the H12 visibility problem mentioned elsewhere in this thread), I much prefer the latter.


I respect that mindset, I just don't think it would hurt. Maybe a middle ground would be a full-scale tutorial on how to switch from Thin on Bamboo/Cedar to Unicorn on Cedar for Rails users. It's a non-trivial process and I know I'd like some help with it. And in this same tutorial/article you could throw down the benchmarks you ran as motivation/justification.


Unless the dyno signals to the router that it is busy, isn't this just postponing the problem? Per dyno, Unicorn can handle more requests, but requests will still get queued at the dyno level if one of them is a slow one (say 2-3 seconds).


If only one request is slow and you have 7-8 Unicorn workers, only one of them will stay busy. Unicorn knows which of its workers are busy and doesn't queue requests behind an individual worker, but rather behind the master, which hands each request to the first available worker.


Precisely. As mentioned in the FAQ, putting the queueing logic closer to the process which is ultimately going to serve the request is a more horizontal-scale friendly way of tackling the queueing problem.

It works fantastically well for backends that can support 20+ concurrent connections, e.g. Node.js, Twisted, JVM threading, etc. It works less well the fewer connections you can put in each backend, which is part of why we're working on larger dynos.
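
For concreteness, the usual Unicorn-on-Heroku setup amounts to a small config file plus a Procfile entry pointing at it; a minimal sketch, with the worker count and timeout as illustrative numbers you'd tune to your dyno's memory and request profile:

    # config/unicorn.rb -- minimal sketch, numbers are illustrative
    worker_processes 3     # concurrent workers per dyno; memory is the limit
    timeout 15             # kill a stuck worker well under the router's 30s window
    preload_app true       # load the app once in the master, then fork workers

    before_fork do |server, worker|
      # Drop shared connections in the master so each forked worker opens its own
      defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
    end

    after_fork do |server, worker|
      defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
    end

    # Procfile (one line):
    #   web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb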


Something that's been bugging me about Heroku is that the dyno price has stayed the same ever since launch: $0.05 per hour. Compared to services like DigitalOcean and AWS (which have lowered prices significantly in the past few years), Heroku is starting to get very expensive.

The 2X dyno at 2x cost doesn't really make me happy; it just invites me to spend more money when it would be more cost-efficient to move.


I can't speak to the issues that people are running into when they reach large scale, but I run a small app with two dynos and we've been having issues with H12 request timeout errors for weeks now. This has been bringing down our production app for periods of about fifteen minutes almost daily.

I've been completely disappointed with Heroku's support so far. First they obviously skimmed my support request and provided a canned response that was completely off base. Their next response didn't come for four days and only after I called their sales team to see what I could do to get better support. Their only option is a $1k / mo support contract. If you're running a mission critical app, I'd think twice before choosing Heroku.


Diagnosing H12 errors is really challenging. One thing I can recommend is the http-request-id labs feature (https://devcenter.heroku.com/articles/http-request-id). With this enabled and some code in your app, you can correlate your app's request logs directly against the router logs and trace what happened with any particular H12.
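
The "some code in your app" part can be as small as a Rack middleware that copies the router's ID into your own log lines; a minimal sketch, assuming the labs feature surfaces the ID as the X-Request-ID header (the middleware name here is made up):

    # lib/log_request_id.rb -- minimal sketch
    class LogRequestId
      def initialize(app)
        @app = app
      end

      def call(env)
        # Rack exposes the router-injected header under this env key
        request_id = env['HTTP_X_REQUEST_ID'] || 'none'
        status, headers, body = @app.call(env)
        Rails.logger.info("request_id=#{request_id} status=#{status} path=#{env['PATH_INFO']}")
        [status, headers, body]
      end
    end

    # config/application.rb:
    #   config.middleware.use 'LogRequestId'

With that in place, grepping your app logs and the router logs for the same request_id shows which side a given H12 came from.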

I'd be happy to help you do this if you're game. Contact me via adam at heroku dot com.

Could you also email me some links to your support tickets so I can check out what happened there?


Heroku's message: "If you have slow clients, you are screwed." Unicorn is designed to serve only fast clients on low-latency connections.

And no, they don't do any buffering.


I think that overstates it a bit, but yes, there are problems with Unicorn and slow clients. We're investigating: https://blog.heroku.com/archives/2013/4/3/routing_and_web_pe...

If this is an immediate problem for you, it might be worth your while to make your app threadsafe, which gives you more concurrent webserver options.
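
For a Rails 3.x app, that roughly means flipping the threadsafe flag and moving to a threaded server such as Puma; a sketch under those assumptions, with an illustrative thread count:

    # config/environments/production.rb (Rails 3.x)
    config.threadsafe!   # stop Rails from wrapping every request in a global lock

    # Gemfile
    gem 'puma'

    # Procfile (one line):
    #   web: bundle exec puma -p $PORT -t 8:8

The catch is that all of your code and gems have to actually be thread-safe, which is the non-trivial part.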


I work on a fairly large Heroku app running Node on ~50-100 web dynos with another 20-50 backend dynos. Here are the problems as I see them:

We get H12s all the time. Randomly. The only suggestion we get from Heroku is to make requests process as fast as possible. So we've spent a considerable amount of time going through everything we can possibly do to make all requests respond as fast as possible. I've given up. I see this as a fundamental issue with the routing system. If you are going to use Heroku for a large production deployment, H12s (and your users getting dropped connections) will become a fact of life.

There is no autoscaling. We have no idea how many dynos we actually need, so we overdo it in order to handle peak traffic times. This must be a great money maker for Heroku. There is no incentive for them to build autoscaling into their system because that would mean they wouldn't make as much money. Yes, autoscaling is a hard problem to solve, but there should at least be a plan to start on it, and there is none that I have found.

Up until someone bitched loudly, nothing was happening to fix any of this. We have an expensive paid support contract with Heroku, and before this whole routing issue blew up in public, their only recommendation was to tune the app more and buy into New Relic for ~$8k/month. We did both and found New Relic didn't give any relevant information to help us. We did a NodeTime trial for ~$49/mo and that actually helped a lot in identifying slow spots in our app. We fixed all of them and still see an endless stream of H12s. Regardless, it shouldn't take a public bitch-slapping for a company to listen to its customers.

You log into a dyno and see a load average of 30+. Who knows if that number is accurate or how big the underlying box really is, but regardless, I can't imagine that number being good. Am I getting H12s because I'm on an overloaded box, or because the routing system is fundamentally broken? I don't know and nobody can tell me. This is not a good position to be in.

I have heard from several sources that Heroku isn't happy being on AWS and has been wanting to migrate off AWS for a while now. If your hosting provider isn't happy on their hosting provider, there must be a reason for that, and in the bigger picture, you, the customer, are getting screwed.

Given these things, I will never recommend that a company use Heroku. It is great if you know you are never going to have more than one dyno, but if you think you are going to run a large production system on it, it is far better to find something else. Which brings me to another rant... how come none of these other PaaS solutions are as easy as Heroku? The git deploy is seriously the one thing they got mostly right. I'd love to see someone build a layer on top of all the PaaS solutions so that I can just deploy my code to any one of them (or even multiple).


We're aware of the random H12s problems. Some apps are affected pretty badly, others not at all, and we're not sure why yet. Sorry that you've had such a bad experience with this. We're continuing to investigate. If we're not able to find a solution in a timely fashion, I'll completely understand if you no longer want to use our product for this app.

Knowing how many dynos you need is definitely a problem. We have implemented autoscaling in the past... but it always sucked. It's hard to find a one-size-fits-all solution. Rather than ship something sub-par, we chose not to ship anything at all.

I understand a lot of people do well with autoscaling libraries and 3rd party add-ons. Would be curious to hear your experience with any of these.

I completely agree that it shouldn't take complaining in public to get a company to listen to its customers. That was our biggest mistake in all of this, IMO — not listening.

For dyno load, have you tried log-runtime-metrics? https://devcenter.heroku.com/articles/log-runtime-metrics It provides per-dyno load averages.

I gladly accept your compliment that our git deploy remains the best on the market. :)

I'm sorry we haven't been able to serve you better. Let me know if you'd be willing to talk via skype sometime — even if you end up leaving the platform (or already have), I'd like to understand in more depth where we went wrong so we can do better in the future.


Your response only reinforces my hard-won opinion that Heroku should never be used as a production environment for any business that is trying to be successful and popular. Admitting that you have no idea why critical areas of your infrastructure are causing issues, while charging people an arm and a leg for services (we pay ~$4k/mo), feels like theft to me. I've built solutions for a large porn company that run on significantly less infrastructure than what we are running on Heroku and handle 100x more traffic. Something is wrong here with the dyno/router model, and maybe it is that you guys are just oversubscribed and not admitting to it in public.

Yes, autoscaling is hard. I have apps on Google AppEngine and see their issues as well. That said, at least they are trying. Maybe even take one of those 3rd-party libraries, harden and adapt it, and make it a real solution? I think the real problem, though, is that there aren't any good metrics for what dynos are doing, so there is no signal for when something is too busy. Yes, log-runtime-metrics puts out some numbers, but those numbers are meaningless when all I have is a slider to change the amount of money we are paying you.

I should qualify that git deploy compliment, because there are issues there as well. For example, why do you have to rebuild the npm modules from scratch each time? Why not have a directory full of pre-built modules for your dynos that are just copied into my slug? That relatively simple change would speed up deployments greatly. Never mind that deployments aren't reliable and fail randomly. At least it is easy to just try again.


Again, sorry to hear about your bad experience. Hard-to-diagnose errors, no autoscaling or other method to know proper dyno provisioning, and slow deploy times — these things suck.

Would love the chance to win back your trust and hang onto your business. Let me know your app name (in email if you prefer) and I can see if there's anything we can do for you in the near term.


With that kind of money invested in hosting and scaling, why not get a dedicated professional to handle devops on your team and go with more traditional hosting solutions? I'm interested in hearing why people still use Heroku at the scale you're describing.


That's a really good question. Heroku is offering to handle the hosting, devops and scaling issues for us so that we can focus our energy on building a killer product. When considering the costs of hiring a devops team and someone to wear a pager 24/7 when servers go down, paying the premium for using Heroku becomes a lot more attractive.

> why not get a dedicated professional to handle devops on your team

It sure is easy to write that, but the reality isn't as rosy. I've gone through the process of trying to do that at two companies, interviewing ~50-80 people, and it has been a nightmare. It is really, really difficult to find quality devops people. Again, this makes PaaS offerings like AWS, Heroku and AppEngine a lot more attractive. They are betting their entire business on being able to hire good devops people, so they tend to attract better talent.


We have been experimenting with AWS OpsWorks as an option to move forward. It is excellent. Does git.


Have you checked out Elastic Beanstalk yet? http://aws.amazon.com/elasticbeanstalk/

Any thoughts on that? It offers a Heroku-esque deploy.


Elastic Beanstalk just recently added support for Node. It's definitely on the list as the next PaaS to try out. AppFog is another one we've done a JVM deployment to and like a lot, except that it just feels very alpha quality. Their website is painfully slow and under-documented, and the 'af' command isn't nearly as cool as just doing a git push to a remote.

I should also add that one thing that Heroku did get 100% correct is the heroku logs -t command (aka: tailing the logs). Nobody else does that one quite as well.


Oh wow, never knew about that. That's very helpful thanks.


I'd like to start by acknowledging that I'm one of the "non-customers who are watching from the sidelines". I think Adam's right that this is an important distinction.

Adam, there's something that confuses me about this. I'm no expert in routing theory, nor have I done the experiments, so forgive me if my reasoning misses something.

I understand why RapGenius took you up on your original promises of "intelligent routing", and I think I understand what you're saying about scaling, and how scaling "intelligent routing" is so far unsolvable, and the motivation for your transition from Bamboo to Cedar, especially in the context of concurrent clients. What I don't understand is this:

It seems to me that if you split into two (or more) tiers, and random-load-balance in the front tier (hit first by the customer), and then at the second tier only send requests to unloaded dynos, you eliminate RapGenius's problem for customers who followed your specific recommendations for good performance on Bamboo (to go single-threaded and trust the router).

Do you have reason to believe that this doesn't one-shot RapGenius's problem? Do you have strategic/architectural reasons for rejecting this even though it would work? Did you try it and it failed? What's the story there?

Maybe I'll write a simulator to (dis)prove my naive theory. :P
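
Roughly what I have in mind is a toy Monte Carlo like the one below (all the numbers are made up, and it models dynos as single-threaded FIFO queues, not Heroku's actual router):

    # routing_sim.rb -- toy model: random per-dyno queues vs. one global queue
    DYNOS    = 10
    REQUESTS = 50_000
    RATE     = 40.0   # mean arrivals per second

    rng = Random.new(42)

    # Service times: mostly ~50 ms, with 5% slow ~2 s requests
    service = Array.new(REQUESTS) { rng.rand < 0.05 ? 2.0 : 0.05 }

    # Poisson-ish arrivals via exponential interarrival times
    now = 0.0
    arrivals = Array.new(REQUESTS) { now += -Math.log(1 - rng.rand) / RATE }

    def percentile(xs, p)
      xs.sort[(xs.size * p).floor]
    end

    # Strategy 1: random routing -- each dyno keeps its own FIFO queue
    def simulate_random(arrivals, service, dynos, rng)
      free_at = Array.new(dynos, 0.0)
      arrivals.each_with_index.map do |arr, i|
        d = rng.rand(dynos)
        start = [arr, free_at[d]].max
        free_at[d] = start + service[i]
        start - arr   # time the request sat queued
      end
    end

    # Strategy 2: one global queue -- each request goes to the first dyno to free up
    def simulate_global(arrivals, service, dynos)
      free_at = Array.new(dynos, 0.0)
      arrivals.each_with_index.map do |arr, i|
        d = (0...dynos).min_by { |j| free_at[j] }
        start = [arr, free_at[d]].max
        free_at[d] = start + service[i]
        start - arr
      end
    end

    random_waits = simulate_random(arrivals, service, DYNOS, Random.new(7))
    global_waits = simulate_global(arrivals, service, DYNOS)

    { 'random per-dyno queues' => random_waits,
      'single global queue'    => global_waits }.each do |name, waits|
      mean = waits.inject(:+) / waits.size
      printf("%-24s mean %6.0f ms   p95 %6.0f ms\n",
             name, mean * 1000, percentile(waits, 0.95) * 1000)
    end

With a fat-tailed service time like that, the random strategy's p95 queue time blows up long before the global queue's does, which is exactly RapGenius's complaint; what it doesn't capture is the cost of maintaining that global queue across many router nodes, which is the part Heroku says doesn't scale.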


> It seems to me that if you split into two (or more) tiers, and random-load-balance in the front tier (hit first by the customer), and then at the second tier only send requests to unloaded dynos [...]

I'm unclear how you'd think introducing a second tier changes things. That tier would need to track dyno availability and then you're right back to the same distributed state problem.

Perhaps you mean if the second tier was smaller, or even a single node? In that case, yes, we did try a variation of that. It had some benefits but also some downsides, one being that the extra network hop added latency overhead. We're continuing to explore this and variations of it, but so far we have no evidence that it would provide a major short-term benefit for RG or anyone else.

> Do you have reason to believe that this doesn't one-shot RapGenius's problem?

As a rule of thumb, I find it's best to avoid one-shots (or "specials"). It's appealing in the short term, but in the medium and long term it creates huge technical debt and almost always results in an upset customer. Products made for, and used by, many people have a level of polish and reliability that will never be matched by one-offs.

So if we're going to invest a bunch of energy into trying to solve one (or a handful) of customer's problems, a better investment is to get those customers onto the most recent product, and using all the best practices (e.g. concurrent backend, CDN, asset compilation at build time). That's a more sustainable and long-term solution.


Sorry, yes, I'm supposing that the second tier serves fewer dynos; sufficiently few that your solutions from 2009 (that motivated you to advertise intelligent routing in the first place) are still usable.

> As a rule of thumb, I find it's best to avoid one-shots (or "specials").

Absolutely, and I would never suggest that. However, it's not just RG that has this problem, right? If I understand correctly, isn't it every single customer who believed your advertising and followed your suggested strategy to use single-threaded Rails, and doesn't want to switch?

So it's not about short or medium term; it's about letting customers take the latency hit (as you note), in order to get the scaling properties that they already paid for.


My biggest issue with Heroku is the general slowness of the API - maybe I'm just impatient, but most simple commands, like listing releases or viewing logs, take at least a second, sometimes five, before anything happens. Pushes also take quite a while; even the git push part is much slower than pushing to GitHub. It's just a general sluggishness that gets annoying after a while.

If they could get all the API requests down under 500ms I'd be much happier.


Yeah, the developer-facing control surfaces on the platform (API calls and git push, mainly) have gotten slower over the past year or two. This is on my list of personal pet peeves, but so far has not made it onto our list of priorities.

We try to drive priorities based on what customers want, not what we want; and what we've heard in the last year or so is all about app uptime, security, and now performance and visibility.

I'm very much hoping that bringing back "fast is a feature" on the developer-facing portions of the product is something we can work on this year.


I think the most annoying thing is that they still don't answer Rap Genius's questions about being owed money after paying megabucks for New Relic. I mean, if you offer a service that provides incorrect data for two years and you don't offer any sort of framework for reimbursement, that still seems, at best, annoying and, at worst, dishonest.


We spent quite a bit of time trying to find a one-size-fits-all framework. There just wasn't one, so we've done credits on a case-by-case basis.

Sorry you find it annoying. It's what was best for our customers.


I am surprised they do not moderate the comments on their blog. There is one visible at present that is plainly offensive.


Since we're on the question of visibility into Heroku dynos, how much CPU power does each dyno have?

What about a 2X dyno?


This is a tough area. If you go look at various types of infrastructure providers (e.g. EC2, Linode, Rackspace) you'll see that they always end up making up vague units of measurement (e.g. "cores") and then showing all the resources in reference to whatever the base unit is. So there's really no good way to talk about CPU power like there is with memory.

That said, I can say that a 1X dyno is not very powerful compared to, say, any server you'd purchase for your own datacenter. Our intention is that 2X dynos will provide twice the CPU horsepower, although CPU and I/O are harder to allocate reliably in virtualized environments.


From the article:

> Q. Did the Bamboo router degrade?

> A. Yes. Our older router was built and designed during the early years of Heroku to support the Aspen and later the Bamboo stack. These stacks did not support concurrent backends, and thus the router was designed with a per-app global request queue. This worked as designed originally, but then degraded slowly over the course of the next two years.

From Adam's message on Feb 17th, 2011 (https://groups.google.com/forum/?fromgroups=#!topic/heroku/8...):

> You're correct, the routing mesh does not behave in quite the way described by the docs. We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we're on our way to a new model which we'll document fully once it's done.

It looks like random load balancing was already the expected behavior 2 years ago? The "slow degradation" part seems a bit dishonest to me.


There are two separate issues here, and it's easy to get them confused. One is the slow degradation on Bamboo without any change to the routing algorithm code, and the other was the explicit product choice for Cedar with a different code path in the router. Both are described fully here: https://blog.heroku.com/archives/2013/2/16/routing_performan...

The reason it's easy to confuse these two is also part of what confused us at the time. The slow degradation of the Bamboo routing behavior was causing it to gradually become more and more like the explicit choice we had made for our new product.

But of course it's up to you (and everyone else observing) to judge whether this was some kind of malicious intent to mislead, versus that we made a series of oversights that added up to some serious problems for our customers. And that we are now doing everything in our power to be fully transparent about, to rectify, and to make sure never happen again.


Sorry about the accusation. I read the Bamboo issue wrongly. The article from Feb 2013 seems to imply that the slow degradation happened from 2011 to 2013. It starts with "Over the past couple of years"; I guess that's what got me confused. The FAQ clarifies that the slow degradation happened from 2009 to 2011.


What data do you have to show that the random selection algo has superior performance to a round-robin algo?


We investigated round-robin. With N routing (or load balancer) nodes, and any degree of request variance, round robin effectively becomes random very quickly.


> 1k req/min

Also known as <17 requests per second... or a trickle of traffic. Hooray for using bigger numbers and a nonstandard unit to hide inadequacy!

Does Heroku use req/min throughout their service? I can't understand why they would, unless they also can't build the infrastructure to measure on a per-second basis.

> After extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections.

Does this CTO think companies like Google and Amazon route their HTTP traffic randomly? No... he knows there are scaleable routing solutions and random routing isn't the best. So he cites "simplicity and robustness." Here, this means "we can't be bothered."


(I was on the bigger engineering team at Amazon that looked into this between '04 and '08.)

After Amazon had notable issues with Cisco's hardware load balancers, there was an internal project aimed at developing scalable routing solutions.

After years of development effort, it turned out that the "better" solutions didn't work well in production, at least not for our workloads. So we went back to million $ hardware load balancers and random routing.

I don't know if things changed after I left, but I can tell you it wasn't an easy problem. So I completely buy the robustness and simplicity argument these guys are making.


Awesome info, thanks. This has been exactly our experience.

In theory, clever load distribution algorithms (of which one can imagine many variations) are very compelling. Maybe like object databases, or visual programming, or an algorithm that can detect when your program has hit an infinite loop. These are all compelling, but ultimately impractical or impossible in the real world.


Nope, DRR is still dead :)


Re: requests. RPM is the metric that New Relic reports, and it's the one most of our customers use when they talk about traffic. I try to speak in whatever terms are most familiar to our customers.

Re: Google and Amazon: I can't speak to them, and they aren't representative of the size of our customers anyway. We have talked with folks who run ops at many companies more on par with the size of our mid- and large-sized customers, and single global request queues are exceedingly rare.

The most common setup seems to be clusters of web backends (say, 4 clusters of 40 processes each) that each have their own request queue, but with random load balancing across those clusters. This is a reasonable happy medium between pure random and global request queue, and isn't too different from what you get running (say) 16 processes inside a 2X dyno and 8 web dynos.


I too was shocked at that. Also that 6 dynos is apparently the average size to handle that load.

It takes $179/mo (6 dynos) to handle 17 requests/second? That's insane.


Didn't intend to imply that. Number of dynos needed varies extremely widely, with the app's response time and language/framework being used as the main variables.

There are apps on Heroku that serve 30k–50k reqs/min on 10–20 dynos, typically written in something like Scala/Akka or Node.js and serving incredibly short (~30ms) response times with very little variation. But these are unusual.

The more common case of a website, written in non-threadsafe Rails, with median response times of ~200ms but 95th percentile at 3+ seconds, would probably use those same 10 dynos to do only a few thousand requests per minute. Whether or not you use a CDN and page caching also makes a big difference (see Urban Dictionary for an example that does it well).

But it really depends. We were trying to quantify when you should be worried. If you're running a blog that serves 600 rpm / 10 reqs/sec off of two dynos, you don't need to sweat it.
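
Back-of-the-envelope, those numbers hang together via Little's law (requests in flight = arrival rate x response time); whether 10-20 dynos is enough then mostly depends on how many concurrent requests each dyno can actually hold:

    # Little's law sketch -- inputs are the rough figures quoted above
    reqs_per_sec = 50_000 / 60.0            # ~833 req/s
    resp_time    = 0.030                    # ~30 ms
    in_flight    = reqs_per_sec * resp_time # requests in flight at any instant

    puts in_flight.round(1)                 # => 25.0

    # ~25 concurrent requests: trivial for a few dynos that each hold 20+
    # connections, but it would take ~25 dynos (plus headroom for variance
    # and slow outliers) if each dyno handles only one request at a time.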


And if you run into any sort of slowness (like, say, Mongo deciding to pull something from disk instead of RAM), it's instantly H12s all over the place and there is nothing you can do about it.


This comes back to visibility: knowing where the problem lies (especially when you're using a variety of add-on services or calling external APIs) and being able to understand what's happening, or what happened in retrospect.

Visibility is hard no matter where you run your app. But this is an area where Heroku can get a lot better, and we intend to.


Visibility is one part of the problem. 50k requests/min is only ~833/s. The reality is that a single dyno should be able to more than handle that sort of load, especially if it is a simple app. People are doing 10k connections on a single laptop; 833/s should be a piece of cake. So, yes, visibility is a big issue here, because you have no idea whether you need 10, 11, 12 or 20 dynos to serve 50k requests/min. You just guess, and when you guess wrong, it ends up in a cascading failure of H12s and other issues. Never mind that very few apps have a steady stream of traffic; most have big dips depending on the time of day and HN popularity... and now we are back to the autoscaling discussion.

Another key part of your statement is 'with very little variation'. The code pretty much can't be doing anything other than serving up static content, because anything that requires any real I/O or CPU will instantly throw the system into H12 hell. Yes, a CDN will take load off your Heroku dynos, because god forbid your dyno actually do anything itself. Except you forget that not all apps are webapps; in my case, there is no reason to add a CDN when I'm just serving requests and responses to an iPhone app.

The other part of the problem is being able to actually do something about it. I've tried anywhere between 50 and 300 dynos (yes, we got that limit increased). If we could just throw money at the problem that would be one thing, but nothing was able to resolve the H12s that we see, and our paid support contract was no help either.

"If you're running a blog that serves 600 rpm / 10 reqs/sec off of two dynos, you don't need to sweat it."

Once again, we are back at the same conclusion... don't use Heroku if you want to run a production system.


How could one single-threaded dyno serve more than 833 requests/sec?


Where do you get that dynos are single-threaded? Please read: https://devcenter.heroku.com/articles/dynos#dynos-and-reques...


500-800 req/sec over 10-20 servers with ~30ms response times; 1 req/server seemed plausible.

Thanks. I admit I'm not familiar with the platform.



