So... they're not running any kind of devops system at all. If they were, they could just run the upgrade on the image, test it, and deploy it. All the "months of careful planning and many, many tests" they did are basically wasted time.
I wouldn't be proud of this, quite the opposite. I would suggest critically reviewing the entire infrastructure management strategy, since losing months to a single upgrade is clearly indicative of deeper problems.
It sounds to me like they are in fact running a giant devops system, all for the purpose of not using virtual static IPs.
Instead of just provisioning fresh VMs and migrating customer data, they're doing this massive in-place upgrade on existing machines to avoid losing the assigned IPs.
I guess they decided the benefits of being cloud provider agnostic outweighed the downside of spending months of man hours automating in-place OS upgrades.
I'm not sure I'm quite as critical; the technique itself looks useful in some scenarios.
I've worked on a Windows project where we solved a similar problem by booting from VHD, so you can "just" write a new VHD, uniquify it with the per-machine config, update the boot menu, and reboot (rough sketch of the boot-menu part below). All the data lives on a separate volume, naturally.
I'm surprised they went to all this effort for only 2000 machines though.
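In case the boot-from-VHD swap above sounds abstract, here's a rough Python sketch of the Windows-side moving parts. The VHD path and entry name are made up, and it assumes native VHD boot, admin rights, and that the new, customised VHD has already been written to disk:

```python
import re
import subprocess

# Hypothetical VHD location, for illustration only (bcdedit's native-VHD-boot syntax).
NEW_VHD = r"[C:]\vhds\os-upgrade.vhd"

def bcd(*args):
    return subprocess.run(["bcdedit", *args], check=True,
                          capture_output=True, text=True).stdout

# 1. Clone the current boot entry so the old OS stays selectable as a rollback.
out = bcd("/copy", "{current}", "/d", "OS upgrade (new VHD)")
guid = re.search(r"\{[0-9a-fA-F-]{36}\}", out).group(0)

# 2. Point the cloned entry at the freshly written, per-machine-customised VHD.
bcd("/set", guid, "device", f"vhd={NEW_VHD}")
bcd("/set", guid, "osdevice", f"vhd={NEW_VHD}")

# 3. Make it the default and reboot; application data lives on a separate volume.
bcd("/default", guid)
subprocess.run(["shutdown", "/r", "/t", "0"], check=True)
```

Rolling back is just picking the old entry in the boot menu (or another `bcdedit /default`), which is most of the appeal of the approach.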
I work on a "devops" team and this is similar to what we do. We have a "reinstall the whole damn image" flag but using it by default would just cause unnecessary downtime for us, especially in our non-production environments.
FWIW, I personally love Virtual IPs (VIPs) for this: basically, an existing network interface advertises more than one IP, and that IP can be moved between servers on the fly with a gratuitous ARP announcement. The downside is that a lot of cloud providers don't support externally reachable VIPs. They do, however, offer their own nearly identical solution (such as Elastic IPs from Amazon).
Using VIPs or something similar could have avoided the need for such a risky upgrade, potentially saving millions of dollars in the process. Of course, I could simply be missing some hidden customer requirement that rules out VIPs, but that's pretty uncommon, even in the finance industry.
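For anyone who hasn't run a VIP failover by hand, the takeover described above amounts to roughly this (addresses and interface name are placeholders; assumes Linux, root, and scapy installed):

```python
import subprocess
from scapy.all import ARP, Ether, get_if_hwaddr, sendp

VIP = "203.0.113.10"   # the "virtual" service IP customers point at (placeholder)
IFACE = "eth0"

# 1. Claim the VIP on this box as a secondary address on the existing interface.
subprocess.run(["ip", "addr", "add", f"{VIP}/32", "dev", IFACE], check=True)

# 2. Gratuitous ARP: announce to the LAN that the VIP now maps to this host's MAC,
#    so switches and neighbours update their ARP caches immediately.
mac = get_if_hwaddr(IFACE)
garp = Ether(dst="ff:ff:ff:ff:ff:ff", src=mac) / ARP(
    op=2, hwsrc=mac, psrc=VIP, hwdst="ff:ff:ff:ff:ff:ff", pdst=VIP)
sendp(garp, iface=IFACE, count=3, verbose=False)
```

Traffic shifts to the new box within seconds, with no per-customer reconfiguration; the catch, as noted, is that most public clouds won't let you do this for externally routable addresses.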
That's addressed in the article: "We purposely don’t employ dynamic IPs to retain multi-cloud deployment capabilities and prevent vendor lock-in with one platform."
Why not present the machines to clients and other interfaces via a virtual IP from an application delivery controller such as F5's BIG-IP or similar, and remove the dependency on static NICs on a virtual appliance?
Seems counterintuitive to run virtual appliances on static addresses if it can be avoided.
That still makes no sense. AWS and DigitalOcean both allow for static IP addresses that can be migrated between instances. There's no reason for crazy in-place upgrades.
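For reference, that migration is a single API call on AWS: re-pointing an Elastic IP at a freshly built replacement instance. The IDs below are made up, and it assumes boto3 with credentials configured:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",  # the customer-facing static IP (placeholder)
    InstanceId="i-0fedcba9876543210",           # the new, already-upgraded VM (placeholder)
    AllowReassociation=True,                    # take it over from the old instance
)
```

The counterargument from the article is that this ties you to one provider's API, which is exactly the lock-in they say they're avoiding.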
I'd hazard a guess that the cost of moving IPs would severely outweigh the cost of switching VIP implementations, in terms of preventing a move across providers.
They're still taking downtime for this... Even if they're forced into a no-VIP, no-HA, no-LB setup (which seems insane to me), it would be much simpler to set the DNS TTL to a low value beforehand and switch the record to the new IP once the new box comes up.
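A sketch of that cutover, using Route 53 via boto3 purely as an example of a DNS API (zone ID, record name, and IPs are placeholders):

```python
import boto3

r53 = boto3.client("route53")

def upsert_a_record(name, ip, ttl):
    # Create or update a single A record in the hosted zone.
    r53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",  # placeholder zone ID
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )

# Well before the upgrade: same IP, but a 60-second TTL so caches expire quickly.
upsert_a_record("customer1.ourplatform.example.", "198.51.100.10", 60)

# Once the replacement box passes health checks: point the name at it.
upsert_a_record("customer1.ourplatform.example.", "198.51.100.20", 60)
```

The window is then bounded by the TTL rather than by however long an in-place OS upgrade takes, assuming clients actually honour the TTL.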
Yes, this is a wonderful technical hack, but terrible as a business strategy. They are admitting that they don't have a good architecture that would make upgrades easy.
They shouldn't be giving their customers a static IP, they should be giving them a 'customername.ourplatform.com' address that the customer can point a CNAME at.
This was addressed in the article: there's a tendency in that industry to whitelist API access etc. by source IP, so their customers need static IPs not primarily for inbound traffic, but so their outbound calls to the APIs they depend on come from a known address.
It's still stupid of them not to abstract that away from the individual customer servers, but this issue isn't solved with CNAMEs.
On the one hand I am very impressed they managed this, but on the other, it does seem very sledgehammer/nut-esque. Even without virtual IPs, it seems a little silly that their customers weren't running N+1 redundant instances that could be taken out, upgraded and then swapped without disrupting normal operations.
Again, very impressive as an academic exercise, especially considering the given script isn't actually that complicated, but wow, they had some serious guts running this in production!
I think the authors missed a good opportunity to move towards containers and avoid these problems in the future. While interesting academically, this approach is the wrong one for the long run.
Given how much memory some servers have these days, which for an application node is often more than the necessary hard-disk capacity, this is quite a clever approach.