EC2 Users Should be Cautious When Booting Ubuntu 10.04 AMIs (librato.com)
89 points by mike_heffner on May 16, 2011 | hide | past | favorite | 43 comments



We ran into what we believe to be this very problem at SimpleGeo and spent a long time figuring out what was causing the JVM to lock up for tens of minutes at a time; we believe this bug (or general class of bugs) to be responsible. Upgrading to 10.10 has caused the majority of the symptoms to disappear.

My personal hunch is that there are probably startups and developers that dumped Cassandra due to "stability issues," which were in reality symptoms of this bug. There's obviously no way of confirming or denying this, so I'll reiterate that it's just a hunch.


Instead of 10.10, I would suggest switching to Debian or CentOS.

Due to its short support cycle, Maverick isn't suited for servers.


You're neglecting the fact that which distro you choose has a large influence on the kernel version you get to run. With the amount of work going into stabilizing the kernel when running in a virtualized environment, chaining yourself to a slow-moving distro will cause the exact opposite of stability when running on Xen. See: this blog post.


Recompiling a Linux kernel with a custom patch is in the core set of skills that I would expect any Linux admin to have. It's not that hard. It's not even that hard to take existing distro kernels and add a patch to them, and maintain your patch going forward as the distro continues their refinements.

When you have Linux on your server, you own it. You can do whatever you want. The distro is the beginning of your power, not the end of it. If you're running a Linux server and you currently have the attitude that you are boxed in by your distro I recommend that you immediately dig into the relevant packaging system and learn enough to put your own patch on top of any existing software package, and recreate the package in the relevant manner (new RPMs, new .deb, whatever).

(Yes, there's a cost/benefit tradeoff to each such patch you have to carry on, but there is still economic value merely in having the option.)
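A toy, self-contained sketch of the patch-carrying mechanic being described (all file and package names here are made up; for a real kernel you would run the same loop around the distro's source-package tooling, e.g. `apt-get source` and `dpkg-buildpackage`, or rpmbuild):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Pristine upstream 1.0 plus a copy containing your local fix:
mkdir pkg-1.0.orig pkg-1.0
echo 'timeout = 30' > pkg-1.0.orig/server.conf
echo 'timeout = 60' > pkg-1.0/server.conf

# Capture the fix as a patch file you maintain going forward
# (diff exits 1 when files differ, hence the || true under set -e):
diff -u pkg-1.0.orig/server.conf pkg-1.0/server.conf > my-fix.patch || true

# "Upstream" ships 1.1 with unrelated changes; re-apply your patch on top:
mkdir pkg-1.1
printf 'timeout = 30\nloglevel = info\n' > pkg-1.1/server.conf
patch pkg-1.1/server.conf < my-fix.patch

grep timeout pkg-1.1/server.conf   # prints "timeout = 60"
```

The point is that the patch, not the patched tree, is the artifact you maintain: each time the distro refreshes the package, you re-apply it and rebuild.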


It's 2011, and screwing around with kernel compilation is a waste of my time. It should 'just work'.

The only time I recompile a kernel is when I'm working on kernel code. If UNIX distributions are doing their jobs, a sysadmin should never have to touch it.


It's 2011. Everything should just work.

Alas.


I really have to agree with the other reply that fooling around with re-compiling a kernel is definitely beyond what I would consider to be a rational expectation. You may not think it's that hard, and it may not be for someone whose sole job is to do just that, but for the vast majority of admins who are wearing many hats this is going to be an unneeded time sink.


You're neglecting the fact that which distro you choose has a large influence on the kernel version you get to run

That's nonsense. Most EC2 AMIs are linked to the Amazon AKIs, which are unrelated to whatever distro the AMI contains. Most of my Debian instances run on a kernel tagged "fc8xen".

The ability to chainload a self-compiled kernel on EC2 is a relatively recent invention (mid-2010) and I have yet to see a good reason to do that for linux.

The article unfortunately does not mention which AKI(s) are affected, but it seems likely this bug was introduced because someone figured "newer is better" and went with the latest Ubuntu kernel instead of sticking to a proven Amazon AKI.
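For context, pinning a specific AKI at launch time (rather than taking whatever the AMI defaults to) looks roughly like this with the ec2-api-tools of the era; the AMI and AKI IDs below are placeholders, not real ones:

```shell
# Sketch only: launch an instance with an explicitly chosen kernel image.
ec2-run-instances ami-12345678 \
    --instance-type m1.large \
    --kernel aki-9abcdef0          # a specific, known-good Amazon AKI

# From inside a running instance, confirm which kernel you actually booted:
curl -s http://169.254.169.254/latest/meta-data/kernel-id; echo
uname -r                           # e.g. "2.6.21.7-2.fc8xen" on the old Amazon kernels
```

This requires live EC2 credentials and a running instance, so treat it as an illustration of the mechanism rather than something to paste verbatim.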


This bug affected several "supported" AMIs running 2.6.32 series kernels that we tested at SimpleGeo, including the official AMI released by Canonical. After we ran out of patience debugging this stuff we contacted Amazon and worked on the issue with a guy from their kernel team (who was really helpful, fwiw). He agreed that the behavior was bizarre and opened an upstream bug with Canonical [1].

You're sort of contradicting yourself here. You suggest that the distro you're running is independent of the kernel version you're running. But then you go on to claim that this bug was introduced by someone who was not running the default supported kernel. Are you saying that people should run the supported kernel, and be tied to whatever's supported upstream, or are you saying they should risk building their own? Clearly there are benefits and drawbacks either way.

[1] https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/708...


No, I'm saying you should run the kernels that Amazon provides and/or a time-tested one.


Amazon's official linux build, at this time, is a custom distribution that uses RPM and is different from the Ubuntu/Debian world in several significant ways (e.g., a new libc implementation). Migrating your WordPress blog to a new platform might be easy, but when you're managing hundreds of machines running thousands of packages that sort of change is not trivial.

Running "time-tested" kernels is not really the best advice either in this case. Xen is a fairly new environment, and EC2's implementation has some quirks, so there's a pretty regular stream of bug fixes and other improvements in recent kernels that are often worth picking up. If I went to Canonical with a "time-tested" kernel bug they'd tell me to upgrade before they'd give any real support.

When we talked to Amazon about switching to their AMIs they advised us that it was probably _not_ worth switching, that switching might not fix the problem, and that the AMIs we were running were widely used and supported. They made it clear that they work closely with Canonical and other providers to get high quality AMIs into their ecosystem. Long story short, the people who you admit know the most about the EC2 environment advised us that they weren't necessarily the best option, or at least not the only option, for good AMIs (sort of like how hardware manufacturers aren't the best option for an operating system).

So the answers aren't really cut-and-dried here. Every time Amazon changes their dom0 there's a chance your "time-tested" kernel will stop working. And just because Amazon runs the infrastructure doesn't mean they're the best choice for a Linux distribution.


You keep lumping the linux distribution in with the linux kernel. They are separate things. You can run Ubuntu on an Amazon AKI. I run Debian on Amazon AKIs. And if the latest Ubuntu depends on a particular kernel feature that the tested kernels don't have then it's probably not a good idea to run the latest Ubuntu.


Canonical maintains and updates their own AKIs for their official Ubuntu AMIs, which is what we're running in production. I suggest deferring to their judgement. Here's the full release history of the Ubuntu 10.04 server AMIs; note the changing AKIs, and note that none of them are the default ones provided by Amazon:

http://uec-images.ubuntu.com/query/lucid/server/released.txt


And their rationale for doing that is?

Looks like their judgement didn't work out so well this time. I'd be wary of running the latest untested kernel for no reason other than "because we can".


So Canonical, a distribution maker, is not to be trusted for their kernel suggestions, but Amazon is infallible? On what basis?

We asked the Amazon kernel team if we should try switching to one of their kernels/distros, and they said "No, just upgrade to Maverick and the accompanying kernel." It's been pointed out that Maverick has its own set of Xen bugs. I guess Amazon doesn't know everything.

The horse you're getting on about using the "proven" Amazon kernels is a bit high. Turns out this whole virtualization thing is somewhat new, and the kinks are still being worked out. Old kernel builds don't work particularly well because a lot of their assumptions are broken by virtualization; new kernels are what they are - new.

(Edit, forgot initially): Finally, we ran 10.04 - the Long Term Support release of Ubuntu from a year ago. There was no "because we can."

Frankly, I'm a bit amazed at your disdain for people sharing their findings from practical experience running into these issues in high-load production environments.


So Canonical, a distribution maker, is not to be trusted for their kernel suggestions, but Amazon is infallible?

Neither is infallible. But Amazon probably knows the intricacies of their platform better than Canonical. And they likely run some of their own stuff on these kernels for a while before releasing them to the public.

Old kernel builds don't work particularly well

Don't work as in what? This is the first time I've heard about a kernel problem on EC2.

disdain

I don't see where I voiced disdain. I merely responded to the guy who claimed your EC2 kernel is linked to the distro you run. That's simply not true.


If this is the first time you've heard about a kernel problem on EC2 you're probably not managing a very large EC2 infrastructure [1, 2]. Even in non-virtualized environments, at scale, it's common to run into linux kernel bugs, or at least peculiarities. Which is why large tech organizations invariably employ kernel dev teams.

The guy who claimed EC2 kernels are linked to the distro you run was simply claiming that, unless you want to go it on your own, you're tied to the kernel provided by a supported AMI. As you've suggested multiple times, there are benefits to running an environment that is supported and that other people have operational experience with. Honestly, I'm not even sure what you're arguing anymore... seems like you're just being antagonistic.

[1] https://bugs.launchpad.net/ubuntu/+source/linux-ec2/ [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bugs?field....


unless you want to go it on your own

There are plenty of AMIs based on stable AKIs out there. Moreover, if you manage a "very large EC2 infrastructure" then you don't rely on third-party AMIs, do you?

Finally, your links point to... Ubuntu bugs. If I missed one that was tracked back to an amazon AKI then a deeplink would be appreciated.


The patch in the bug report seems to solve the problem and allows you to stay on 10.04 LTS. It's a tiny patch and has been stable for us under production load.


Or even FreeBSD, which works on EC2 now. It uses a different technique for scheduling, so this issue will not affect it.


The importance of the short release cycle has decreased a lot for us since we moved to EC2. It's really simple to re-deploy the entire infrastructure on newer and supported images. Using puppet and cloudformation or whatever it is that you prefer you can upgrade with very little trouble.


I would be careful with Maverick (or Lenny; Karmic was unaffected): there is a serious issue with it dropping interrupts on EC2. If you are going to use Ubuntu on EC2, you /really/ want to be using Natty.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/666211


Seconded. We had our 10.04 machines regularly become totally unresponsive. Then we had even worse luck with Maverick, starting with the issue where it didn't recognize half the RAM on the 68GB instance type. Natty seems to be holding up the best so far.


This is even more serious than the one in 10.04. While in 10.04 the bug would slow the machine down for a while, in 10.10 the 2.6.35 kernel will simply hang the machine, and it requires a reboot of the instance (which can only be achieved by rebooting from the AWS console multiple times) to be "fixed".

I'm in the process of upgrading all of our instances to Natty from 10.04 or younger. It's actually weird that this issue didn't get any attention whatsoever.


librato has an amazing product and continues to have a great blog. I love how they dig into the root cause (linking to the typo in the patch that introduced the bug) and also have easy graphs to take in the impact.


Is this strictly an Ubuntu/Xen bug, or does it also affect other VM environments (VirtualBox/VMware)?


Honestly if you're using Ubuntu for a server then performance is not your priority anyway.


What is your recommendation?


FWIW, we're using the Amazon Linux AMI (from Amazon themselves) and are quite happy with it.

It is a pretty minimalist AMI, which is what we wanted. One caveat: it is in beta.


This is a hybrid between CentOS and Fedora.


CentOS if you can't use RHEL.


Do you have some benchmarks handy that indicate that CentOS or RHEL are more performant than Ubuntu?


Any performance differences should be negligible. Use what you're used to. I personally use Amazon/CentOS/RHEL because I know where everything is and how everything goes.


;)


Because Python 2.4 is still AWESOME!


Oh man, good thing you can install python yourself on RHEL. But I wouldn't expect someone using Ubuntu to know how to download and compile the source code themselves.


Because they don't have to.

(And I say that as a longtime RH fanboi)


This applies to any AMI, Ubuntu or not.

Ubuntu shouldn't be used for web servers to begin with. You should be using a distribution with a long and thorough stable release cycle with minimal packages, such as Debian.


There are two (sometimes more) versions of Ubuntu: Server and Desktop. Desktop is definitely inappropriate for running production web servers (AMI or otherwise), for the reasons you indicate.

Server, however, is perfectly appropriate, especially the LTS release (which is supported for five years). Ubuntu LTS will probably actually be supported longer than (for example) Debian Stable.


Can you clarify? I wouldn't use non-LTS on servers, but the LTS ones have a stable and long enough shelf life.

I've personally found Ubuntu more bare bones out of the box than CentOS.


The main difference between Ubuntu-LTS and Debian is that Debian's stable release cycle is longer and more thorough. This means that Ubuntu-LTS will have slightly newer software, but Debian will likely have fewer bugs.

Debian's primary concern with its release cycle is stability. Other distributions like Ubuntu or CentOS trade a little stability for newer software.

My method: Use Debian but when you must have newer versions just add an unofficial repository to your apt sources (assuming you're prepared to deal with the complexity and inconsistency this might introduce).
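As a concrete sketch of that method, the official backports repository plus an apt pin keeps the newer packages opt-in rather than letting them take over (the release name reflects the Debian stable of the time):

```
# /etc/apt/sources.list.d/backports.list
deb http://backports.debian.org/debian-backports squeeze-backports main

# /etc/apt/preferences.d/backports
Package: *
Pin: release a=squeeze-backports
Pin-Priority: 200
```

With a priority below 500, nothing is pulled from backports automatically; you install a newer version explicitly with `apt-get -t squeeze-backports install <package>`, and that package then tracks backports for upgrades.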


> The main difference between Ubuntu-LTS and Debian is that Debian's stable release cycle is longer and more thorough.

I believe that the Debian stable release cycle is currently at about 2 years, with a very short support cycle afterwards. The LTS cycle is 2 years, with 3 years of additional support afterwards.


Couldn't agree more. We recently moved from Debian to Ubuntu and like it far more.



