Similar thing (catastrophic aircon failure due to a flood in a crap colocated DC) happened to us too before we shifted to AWS. Photos from the colo were pretty bizarre - fans balanced on random boxes, makeshift aircon ducting made of cardboard and tape, and some dude flailing an open fire door back and forth all day to get a little bit of fresh air in. Bizarre to see in 2010-ish with multi million dollar customers.
We ended up having to strategically shut servers down as well, but the question of what's critical, where is it in the racks, and what's next to it was incredibly difficult to answer. And kinda mind-bending - we'd been thinking of these things as completely virtualised resources for years, so suddenly having to consider their physical characteristics as well was a bit of a shock. Just shutting down everything non-critical wasn't enough - there were still critical, non-redundant servers sitting next to each other and overheating.
All we had to go on was an outdated racktables install, a readout of the case temperature for each, and a map of which machine was connected to which switch port which loosely related to position in the rack - none completely accurate. In the end we got the colo guys to send a photo of the rack front and back and (though not everything was well labelled) we were able to make some decisions and get things stable again.
In the end we got lucky with one critical server that we just couldn't get to run cooler - we were able to pull out the server below it and (without shutting it down) have the on site engineer drop it down enough to crack the lid open and get some cool air into it to keep it running (albeit with no redundancy and on the edge of thermal shutdown).
We came really close to a major outage that day that would have cost us dearly. I know it sounds like total shambles (and it kinda was) but I miss those days.
> have the on site engineer drop it down enough to crack the lid open
Took me four reads to find a way to read it other than "we asked some guy that doesn't even work for us to throw it on the ground repeatedly until the cover cracks open", like that Zoolander scene.
In our defence, he offered. It had hit hour 6 of both the primary and the backup aircon being down, on a very hot day - everyone was way beyond blame and the NOC staff were basically up for any creative solution they could find.
Wait, you didn't mean "he repositioned it a couple levels down on the rack to make some room above so he could unscrew the cover and crack it a bit open it like a grand piano"?
I find it’s much less stressful to rescue situations where it wasn’t your fault to begin with. Absent the ability to point fingers at a vendor, crises like that are a miserable experience for me.
> Similar thing (catastrophic aircon failure due to a flood in a crap colocated DC) happened to us too before we shifted to AWS. Photos from the colo were pretty bizarre - fans balanced on random boxes, makeshift aircon ducting made of cardboard and tape, and some dude flailing an open fire door back and forth all day to get a little bit of fresh air in. Bizarre to see in 2010-ish with multi million dollar customers.
I'd have considered calling a few friends from the fire brigade or the civil protection / disaster relief service there.
Granted, it's not an emergency. However, if you want a situation for your trainees to figure out how to ventilate a building with the force of a thousand gasoline-driven fans, with nobody complaining and no danger to anyone... well, be my guest, because I can't hear you anymore. Those really big fans are loud AF, seriously.
And, on a more serious note, you could show those blokes how a DC works: where the power goes, what the components do, how to handle an uncontrolled fire in the different areas. It would be a major benefit to the local fire fighters.
Good plan! I think this is a relatively common practice within some corners of the telecom world.
At university (Western WA in B'ham), I worked for our campus resnet, which had extensive involvement with other networking groups on campus. They ran layers 3 and below on the resnets, we took DNS+DHCP, plus egress, and everything through to layers 8 and 9.
The core network gear was co-located in a few musty basements along with the telephone switches. DC and backup power was available, but severely limited under certain failure scenarios.
All of the racked networking gear in the primary space was labeled with red and green dots. Green was first to go in any load-shedding scenario. Think: redundant LAN switches, switches carrying local ports, network monitoring servers, other +1 redundant components, etc.
I'm not sure if the scheme was ever required in real life, but I do know it was based on hard-earned experiences like the author's here.
Used to run data centers for ISPs and such around NoVA.
This was built into the building plan by room, with most rooms going down first and the Meet-Me-Rooms plus the rooms immediately adjacent (where the big iron routers were) being the last to fail. It's been a while, but IIRC there weren't any specific by-rack or by-system protocols.
Asset management is definitely a thing. Tag your environments, tag your apps, and provide your apps criticality ratings based on how important they are to running the business. Then it's a matter of a query to know which servers can be shut, and which absolutely must remain.
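A minimal sketch of what that query can look like, assuming the inventory is exported as a list of records and using made-up tag names (env, app, criticality, with tier 1 as most critical):

```python
# Minimal sketch: decide shutdown order from a tagged inventory.
# Assumes a CMDB/asset export as a list of dicts; the tag names
# ("env", "app", "criticality") are hypothetical and site-specific.

inventory = [
    {"host": "db-prod-01", "env": "prod", "app": "billing", "criticality": 1},
    {"host": "ci-runner-07", "env": "dev", "app": "ci", "criticality": 4},
    {"host": "mail-01", "env": "prod", "app": "imap", "criticality": 2},
    {"host": "scratch-12", "env": "lab", "app": "batch", "criticality": 5},
]

def shutdown_candidates(inventory, keep_tier_up_to=2):
    """Keep anything in tier 1..keep_tier_up_to; everything else is a
    shutdown candidate, least critical (highest tier number) first."""
    expendable = [h for h in inventory if h["criticality"] > keep_tier_up_to]
    return sorted(expendable, key=lambda h: h["criticality"], reverse=True)

for host in shutdown_candidates(inventory):
    print(f"shut down {host['host']} ({host['env']}/{host['app']})")
```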
That seems like a poorly run company. Idk. Maybe we’ve worked in very different environments, but devs have almost always been aware of the criticality of the app, so convincing people wasn’t hard. In most places, the answer hinges on “is it customer facing?” and/or “does it break a core part of our business?” If the answer is no to both, it’s not critical, and everyone understands that. There’s always some weird outlier, “well, this runs process B to report on process A, and sends a custom report to the CEO …”, but hopefully those exceptions are rare.
>devs have almost always been aware of the criticality of the app
I'm sure that developers are aware of how important their stuff is to their immediate customer, but they're almost never aware of the relative criticality vis-a-vis stuff they don't own or have any idea about.
I maintain a couple of apps that are pretty much free to break or to revert back to an older build without much consequence, except for one day a week, when half the team uses them instead of just me.
Any other day I can use them to test new base images, new coding or deployment techniques, etc. I just have to put things back by the end of our cycle.
Welcome to University IT, where organizational structures are basically feudal (by law!). Imagine an organization where your president can't order a VP to do something, and you have academia :)
agree. ethbr1 is 100% right about this being a problem; if politics is driving your criticality rating, it's probably being done wrong. it should be as simple as your statement, being mindful of some of those downstream systems that aren't always obviously critical (until they are unavailable for $time)
edit: whoops, maybe I read the meaning backward, but both issues exist!
Some kind of cost shedding to the application owner (in many enterprises this is not the infra owner) is definitely needed otherwise everything becomes critical.
"Everything is critical" should sound a million alarm bells in the minds of "enterprise architects" but most I've discussed this with are blissfully unaware.
In moments of crisis, immediate measures like physical tagging can be crucial. But there's a broader issue: our dependency on air conditioning. In a climate like Toronto's, defaulting to a universal AC solution instead of designing buildings that work with the winter is a missed opportunity, and it underscores the need for asset management tailored to the specific environment.
Toronto's climate and winters are changing dramatically; the universal AC solution is almost mandatory because this area isn't as cold as it once was.
Average temp probably isn’t what you need here - peak temperature and length of high temperature conditions would be more important when figuring out if you need to have artificial cooling available.
* Feb 1 2014 MAX 5.5 °C HR MEAN -7.8 °C MEAN M/M -8.3 °C MIN -21.3 °C
* Feb 1 2024 MAX 15.7 °C HR MEAN 1.3 °C MEAN M/M 1.3 °C MIN -8.2 °C
* Jan 1 2014 MAX 7.5 °C HR MEAN -8.3 °C MEAN M/M -8.6 °C MIN -24.2 °C
* Jan 1 2024 MAX 6.5 °C HR MEAN -2.3 °C MEAN M/M -2.1 °C MIN -15.5 °C
* Dec 1 2013 MAX 15.6 °C HR MEAN -4.0 °C MEAN M/M -4.2 °C MIN -17.8 °C
* Dec 1 2023 MAX 13.1 °C HR MEAN 2.7 °C MEAN M/M 2.7 °C MIN -4.9 °C
Looking at the MEANs shows the real story - and the MINs are getting nowhere close to what they used to be.
Also, I live here: snow volumes have been low enough that I no longer need a snowblower. I used to build a backyard rink and haven't really been able to the last couple of years because the weather is too mild and I can't get enough solid time that it'll stay frozen. Public outdoor rinks (that aren't artificially chilled) suffer the same fate and are rarely if ever available.
Even in Ottawa, where a decade ago people would have been skating on the Canal for weeks by now - it's still not frozen over and open to the public.
Several data centres in Toronto (including the massive facilities at 151 Front Street West where most of the internet for the province passes through) make use of the deep lake cooling loop that takes water pumped in from Lake Ontario to cool equipment before moving on to other uses. Water is pumped in from a sufficient depth such that the temperature is fairly constant year round.
I think the system just has an isolated loop that heat exchanges with the incoming municipal water supply. Unsure if the whole system cools the loop glycol further or not, but ultimately there’s still a compressor-based aircon system sitting somewhere, probably at each building, that they’re depending on. They’re just not rejecting heat to the air (as much?).
Would love to know if a data centre could get paid for rejecting its heat to the system during what is heating time for other users.
> IP-Only, Interxion and Advania Data Centers are building data centers on the Kista site, which is connected to Stockholm's district heating system so tenants get paid for their waste heat, which is used to warm local homes and businesses
I upvoted, but I agree so much, I had to comment, too. I wonder how long it’d take to recoup the loss of retrofitting such a system. Despite this story today, this type of problem must be rare. I imagine most of the savings would be found in the electric bill, and it’d probably take a lot of years to recoup the cost.
I vaguely remember some other whole building DC designs that used a central vent which opened externally based on external climate for some additional free cooling. Can't find the reference now though. But geothermal is pretty common for sure.
You may be thinking about Yahoo’s approach from 2010?
> The Yahoo! approach is to avoid the capital cost and power consumption of chillers entirely by allowing the cold aisle temperatures to rise to 85F to 90F when they are unable to hold the temperature lower. They calculate they will only do this 34 hours a year which is less than 0.4% of the year.
No, what I was remembering was a building design for datacenters, but I can't find a reference. Maybe it was only conceptual. The design was to pull in cold exterior air, pass thru the dehumidifiers to bring some of the moisture levels down, and vent heat from a high rise shaft out the top. All controlled to ensure humidity didn't get wrecked.
I know someone who did that in the Yukon during the winter, just monitor temperatures and crack a window when it got too hot. Seems like a great solution except that they were in a different building so they had to trudge through the snow to close the window if it got too cold.
Having an application, process and hardware inventory is a must if you are going to have any hope of disaster recovery. Along with regular failovers to make sure you haven’t missed anything.
I’m absolutely loving the term Unix herder and will probably adopt it :)
I’m generally with you and the wider industry on the cattle-not-pets thing but there are a few things to keep in mind in the context of a university IT department that are different than what we regularly talk about here:
- budgets often work differently. You have a capex budget and your institution will exist long enough to fully depreciate the hardware they’ve bought you. They won’t be as happy to dramatically increase your opex.
- storage is the ultimate pet. In a university IT department you’re going to have people who need access to tons and tons of speedy storage both short-term and long-term.
I’m smiling a little bit thinking about a job 10 years ago who adopted the cattle-not-pets mentality. The IT department decided they were done with their pets, moved everything to a big vSphere cluster, and backed it by a giant RAID-5 array. There was a disk failure, but that’s ok, RAID-5 can handle that. And then the next day there was a second disk failure. Boom. Every single VM in the engineering department is gone including all of the data. It was all backed up to tape and slowly got restored but the blast radius was enormous.
At the risk of no true Scotsman, that doesn’t sound like “cattle not pets“; when the cattle are sent to the slaughterhouse there isn’t any blast radius, there’s just more cattle taking over. You explicitly don’t have to replace them with exact clones of the original cattle from tape very slowly, you spin up a herd of more cattle in moments.
If you are using clustered storage (Ceph, for example) instead of a single RAID5 array, ideally the loss of one node or one rack or one site doesn't lose your data it only loses some of the replicas. When you spin up new storage nodes, the data replicates from the other nodes in the cluster. If you need 'the university storage server' that's a pet. Google aren't keeping pet webservers and pet mailbox servers for GMail - whichever loadbalanced webserver you get connected to will work with any storage cluster node it talks to. Microsoft aren't keeping pet webservers and mailbox servers for Office365, either. If they lose one storage array, or rack, or one DC, your data isn't gone.
'Cattle' is the idea that if you need more storage, you spin up more identikit storage servers and they merge in seamlessly and provide more replicated redundant storage space. If some break, you replace them with identikit ones which seamlessly take over. If you need data, any of them will provide it.
'Pets' is the idea that you need the email storage server, which is that HP box in the corner with the big RAID5 array. If you need more storage, it needs to be expansion shelves compatible with that RAID controller and its specific firmware versions which needs space and power in the same rack, and that's different from your newer Engineering storage server, and different to your Backup storage server. If the HP fails, the service is down until you get parts for that specific HP server, or restore that specific server's data to one new pet.
And yes, it's a model not a reality. It's easier to think about scaling your services if you have "two large storage clusters" than if you have a dozen different specialist storage servers each with individual quirks and individual support contracts which can only be worked on by individual engineers who know what's weird and unique about them. And if you can reorganise from pets to cattle, it can free up time, attention, make things more scalable, more flexible, make trade offs of maintenance time and effort.
It's stored in another system that is probably treated as a pet, hopefully by somebody way better at it with a huge staff (like AWS). Even if it's a local NetApp cluster or something, you can leave the state to the storage admins rather than random engineers that may or may not even be with the company any more.
Very differently. Instead of a system you continually iterate on for its entire lifetime, if you're in a more regulated research area you might build it once, get it approved, and then it's only critical updates for the next five (or more!) years while data is collected.
Not many of the IT principles developed for web app startups apply in the research domain. They're less like cattle or pets and more like satellites which have very limited ability to be changed after launch.
From my past life in academia, you're totally right. But that kind of reinforces the point... you do occasionally get some budget for servers and then have to make them last as long as possible :). Those one-time expenses are generally more palatable than recurring cloud storage costs though.
> moved everything to a big vSphere cluster, and backed it by a giant RAID-5 array
I'm with the sibling commenter: if said IT department genuinely thought that the core point of "cattle-not-pets" was met by their single SuperMegaCow, then they missed the point entirely.
>>The IT department decided they were done with their pets, moved everything to a big vSphere cluster, and backed it by a giant RAID-5 array. There was a disk failure, but that’s ok, RAID-5 can handle that.
Precisely why, when I was charged with setting up a 100 TB array for a law firm client at a previous job, I went for RAID-6, even though it came with a tremendous write speed hit. It was mostly archived data that needed retention for a long period of time, so it wasn't bad for daily usage, and read speeds were great. Had the budget been greater, RAID 10 would've been my choice. (Requisite reminder: RAID is not backup.)
Not related, but they were later hit with a million dollar ransomware attack (as in: the hacker group demanded a million dollar payment), so that write speed limitation was not the bottleneck when restoring - the internet connection was. Ahhh... what a shitshow. The FBI got involved, and I never worked for them again. I did warn them though: zero updates (disabled) and a disabled firewall on the host data server (Windows) was a recipe for disaster. Within 3 days they got hit, and the boss had the temerity to imply I had something to do with it. Glad I'm not there anymore, but what a screwy opsec situation I thankfully no longer have to support.
> the boss had the temerity to imply I had something to do with it.
What was your response? I feel like mine would be "you are now accusing me of a severe crime, all further correspondence will be through my lawyer, good luck".
> even though it came with a tremendous write speed hit
Only on writes smaller than a stripe. If your writes are bigger, then you can get way more speed than RAID10 on the same set, limited only by the RAID controller CPU.
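A back-of-the-envelope illustration of that trade-off, using made-up per-disk numbers (not measurements from any real array):

```python
# Illustration of the claim above, assuming an N-disk set of identical
# drives with per-disk streaming throughput d. Numbers are invented.

N = 12          # disks in the set
d = 200         # MB/s per disk, made-up figure

raid6_full_stripe = (N - 2) * d   # full-stripe writes stream to N-2 data disks
raid10_write      = (N // 2) * d  # mirrored writes land on half the spindles

print(f"RAID-6 full-stripe writes: ~{raid6_full_stripe} MB/s")
print(f"RAID-10 writes:            ~{raid10_write} MB/s")
# Sub-stripe RAID-6 writes pay a read-modify-write penalty instead,
# which is where the "tremendous write speed hit" comes from.
```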
Due to network limitations and contract budgeting, I never got the chance to upgrade them to 10 Gb, but can confirm I could hit 1000 Mbps (100+ MB/s) on certain files on RAID-6. It sadly averaged out to about 55-60 MB/s writes (HDD array, Buffalo), which again, for this use case was acceptable, but below expectations. I didn't buy the unit, I didn't design the architecture it was going into, merely a cog in the support machinery.
The devops meme referenced is "cattle not pets", probably popularized by the book "The Phoenix Project".
The real point is that pets are job security for the Unix herder. If you end up with one neckbeard running the joint that's even worse as a single point of failure.
> Author: proudly declares himself Unix herder, wants to keep track of which systems are important.
Because not all environments are webapps with dozens or hundreds of systems configured in a cookie cutter manner. Plenty of IT environments have pets because plenty of IT environments are not Web Scale™. And plenty of IT environments have staff churn and legacy systems where knowledge can become murky (see the reference about "archaeology").
An IMAP server is different than a web server is different than an NFS server, and there may also be inter-dependencies between them.
I work with one system where the main entity is a “sale” which is processed beginning to end in some fraction of a second.
A different system I work with, the main “entity” is, conceptually, more like a murder investigation. Less than 200 of the main entity are created in a year. Many different pieces of information are painstakingly gathered and tracked over a long period of time, with input from many people and oversight from legal experts and strong auditing requirements.
Trying to apply the best lessons and principles from one system to the other is rarely a good idea.
These kind of characteristics of different systems make a lot of difference to their care and feeding.
So much this. Not everything fits cattle/chicken/etc models. Even in cases where those models could fit, they are not necessarily the right choice, given staffing, expertise, budgets, and other factors.
I think you're missing some aspects of cattle. You're still supposed to keep track of what happens and where. You still want to understand why and how each of the servers in the autoscaling group (or similar) behaves. The cattle part just means they're unified and quickly replaceable. But they still need to be well tagged, accounted for in planning, removed when they don't fulfil their purpose anymore, identified for billing, etc.
And also importantly: you want to make sure you have a good enough description for them that you can say "terraform/cloudformation/ansible: make sure those are running" - without having to find them on the list and do it manually.
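As a hedged sketch of that tag-driven approach on AWS (the tag key and values here are assumptions, not any standard convention), you could shed tagged-expendable instances with a few boto3 calls:

```python
# Sketch: find running instances tagged as expendable and stop them.
# The "criticality" = "expendable" tag scheme is an assumption; use
# whatever tagging convention your provisioning tooling applies.

import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_instances(Filters=[
    {"Name": "tag:criticality", "Values": ["expendable"]},
    {"Name": "instance-state-name", "Values": ["running"]},
])

instance_ids = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
```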
When you are responsible for the full infrastructure, sequencing power down and power on in coordination with your UPS is a common solution. Network gear needs a few minutes to light up ports, core services like DNS and identity services might need to light up next, then storage, then hypervisors and container hosts, then you can actually start working on app dependencies.
This sort of sequencing lends itself naturally to having a plan for limited-capacity "keep the lights on" workload shedding when facing a situation like the OP.
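A rough sketch of such a staged power-on plan; the stage contents and host names are purely illustrative, and the power/health functions are placeholders for whatever IPMI/PDU tooling a site actually uses:

```python
# Sketch of a staged power-on plan, per the ordering described above.
# Stage contents are illustrative; a real plan would come from your
# inventory, and power_on/healthy would wrap IPMI/Redfish/PDU calls.

import time

STAGES = [
    ("network",     ["core-sw-1", "core-sw-2"]),           # ports need minutes to light up
    ("core",        ["dns-1", "dns-2", "idp-1"]),           # DNS + identity services
    ("storage",     ["san-ctrl-a", "san-ctrl-b", "nfs-1"]),
    ("hypervisors", ["esx-01", "esx-02", "k8s-node-01"]),
    ("apps",        ["erp-app-1", "mail-1", "web-1"]),      # then app dependencies
]

def power_on(host):
    # Placeholder: replace with an out-of-band power-on call for your site.
    print(f"powering on {host}")

def healthy(host):
    # Placeholder health check (ping, port probe, service check).
    return True

for stage, hosts in STAGES:
    print(f"--- stage: {stage} ---")
    for host in hosts:
        power_on(host)
    # Wait until the whole stage is healthy before moving on.
    while not all(healthy(h) for h in hosts):
        time.sleep(10)
```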
Not everyone has elected to pay Bezos double the price for things they can handle themselves, and this is part of handling it.
If you’re running a couple ec2 instances in one AZ then yeah it’s closer to 100x, but if you wanted to replicate the durability of S3, it would cost you a lot in terms of redundancy (usually “invisible” to the customer) and ongoing R&D and support headcount.
Yes, even when you add it all up, Amazon still charges a premium even over that all-in cost. That’s sweat equity.
Even if you're running "cattle", you still need to keep track of which systems are important, because to surprise of many, the full infrastructure is more like the ranch, and cattle is just part of it.
(and here I remind myself again to write the screed against "cattle" metaphor...)
The industry has this narrative because it suits their desire to sell higher-margined cloud services. However in the real world, especially in academia as cks is, the reality is that many workloads are still not suitable for the cloud.
The cattle metaphor really is a bad one. Anyone raising cattle should do the same thing: know which animals are the priority in case of drought, disease, etc.
Hopefully one never has to face that scenario, but it's much easier to pick up the pieces when you know where the priorities are, whether you're having to power down servers or thin a herd.
Cattle are often interchangeable. You cull any that catch a disease (in some cases the USDA will cull the entire herd if just one catches something - biosecurity is a big deal). In the case of drought you pick a bunch to get rid of, based on market prices (if everyone else is selling, you will try to keep yours because the market is collapsing - but this means managing feed, and thus may mean culling more of the herd later).
Some cattle we can measure. Milk cows are carefully managed as to output - the farmer knows how much each one's milk is worth and so they can cull the low producers. However milk is so much more valuable than meat that they never cull based on drought - milk can always outbid meat for feed. If milk demand goes down the farmer might cull some - but often the farmer is under contract for X amount of milk and so they cannot manage prices.
But if you are milking you don't have bulls. Maybe you have one (though almost everyone uses artificial insemination these days). Worrying about milking bulls is like worrying about the NetWare server - once common but has been obsolete since before many reading this were even born.
Of course the pigs, cows, and chickens are not interchangeable. Nor are corn, hay, soybeans.
I think it's a way of thinking about things, rather than a true/false description. e.g. VMware virtual hosts make good cattle - in some setups I have worked on the hosts are interchangeable, move virtual machines between them without downtime. In others the hosts have different storage access, different connectivity and it matters which combination of hosts are online/offline together, and which VMs need the special connectivity.
The regular setups are easier to understand, nicer to work on. The irregular ones are a trip hazard, they need careful setup, more careful maintenance, more detailed documentation, more aware monitoring. But there's probably ways they could be made regular, if the unique connectivity was moved out to a separate 'module' e.g. at the switch layer, or if the storage had been planned differently, sometimes with more cost, sometimes just with different design up-front.
Along these lines, yes DNS is not NTP, but you could have a 'cattle' template Linux server which can run your DNS or NTP or SMTP relay and can be script-deployed, and then standard DNS/NTP/SMTP containers deployed on top. Or you could build a new Linux server by hand and deploy a new service layer by hand, every time, each one slightly different depending on how rushed you are, what version of the installers is conveniently available, and whether the same person does the work following the latest runbook, an outdated one, or going from memory. You could deploy a template OpnSense VM which can front DNS or NTP or SMTP instead of having to manually log in to a GUI firewall interface and add rules for the new service by hand.
'Cattle not pets' is a call to standardise, regularise, modularise, template, script, automate; to move towards those ways of doing things. Servers are now software which can be copy-pasted in a way they weren't 10-30 years ago, at least in my non-FAANG world. To me it doesn't mean every server has to mean nothing to you, or that every server is interchangeable; it means consider whether thinking that way can help.
It might have been the original idea (though, taking into account the time period and context, I suspect it was overly focused on deployment via AWS ASGs and a smallish set of services).
What grinds my gears is that over the years I've found it a thought-limiting meme - it swings the metaphor too hard in one direction, and some of the early responses under the original article IMO present the issue quite well. It's not that people are stupid - but metaphors like this exist as shortcuts for thinking and discussion, and for the last few years I've seen it short-circuit the discussion too hard, making people either stop thinking about certain interdependencies, or stop noticing that there are still systems they treat like "pets", just named differently and scoped differently, while mentally pushing out how fragile those can be.
The issue here is not so much the hardware, but the services that run on top of it.
I guess that many companies that use "current practices" have plenty of services that they don't even know about running on their clusters.
The main difference is that instead of the kind of issues that the link talks about, you have those services running year after year, using resources, for the joy of the cloud companies.
This happens even at Google [1]:
"There are several remarkable aspects to this story. One is that running a Bigtable was so inconsequential to Google’s scale that it took 2 years before anyone even noticed it, and even then, only because the version was old. "
I would argue even the "wider industry" still administers systems that must be treated as pets because they were not designed to be treated as cattle.
I'm pretty sure anyone in the industry that draws this distinction between cattle and pets has never worked with cattle and only knows of general ideas about the industrial cattle business.
Google tells me beef cattle are slaughtered after 2 years. Split 2 years among 44,000 cattle and you get to spend at most 24 minutes with each one, if you dedicate 2 years of your life to nothing else but that, not even sleep, travel, eating. If you let them live their natural life expectancy of 20 years, you get 240 minutes with each cow - two hours in its 20 year life.
"I care about my cattle", yes, I don't think "cattle" is supposed to mean "stop caring about the things you work on". "I know each and every one as well as the family dog I've had for ten years", no. That's not possible. You raise them industrially and kill them for profit/food, that's a different dynamic than with Spot.
I believe this is just a great example of why cattle shouldn't be raised in such high volume industrial processes.
Have you ever been around cattle? Or helped them calve? Or slaughtered one for meat?
I know every one of my animals and understand the herd dynamics, from who the lead cow is to who is the asshole that is the one often starting fights and annoying the others.
We shouldn't be throwing so many animals into such a controlled and confined system that they are reduced to numbers on a spreadsheet. We shouldn't raise an animal for slaughter after dedicating at most 24 minutes to them.
"cattle not pets" is about computer servers, it's not about ethical treatment of living creatures. Whether or not living animals should be numbers on a spreadsheet, servers can be without ethical concerns[1]. "I treat my cattle like pets" is respectable, but not relevant - unless you are also saying "therefore you should treat your servers like pets", which you would need to expand on.
> "I know every one of my animals"
And you are still dodging the part where there is a point at which you could not do that if you had more and more animals. You can choose not to have more animals, but a company cannot avoid having more servers if they are to digitise more workflows, serve more customers, offer more complex services, have higher reliability failovers, DR systems, test systems, developer systems, monitoring and reporting and logging and analysis of all of the above - and again the analogy is not saying "the way companies treat cows is the right way to treat cows", it's saying "the ruthless, ROI-focused, commodity way companies actually do treat cows is a more scalable and profitable way to think about the computer servers/services that you currently think about like family pets".
[1] at least, first order ones in a pre-AI age. Energy use, pollution, habitat damage, etc. are another matter.
Ranchers do eat their pets. They generally do love the cattle, but they also know at the end of a few years they get replaced - it is the cycle of life.
HPC admin here (and possibly managing a similar system topology with their room).
In heterogeneous system rooms, you can't stuff everything into a virtualization cluster with shared storage and migrate things on the fly, thinking that every (hardware) server is cattle and you can just herd your VMs from host to host.
A SLURM cluster is easy. Shutdown all the nodes, controller will say "welp, no servers to run the workloads, will wait until servers come back", but storage systems are not that easy (ordering, controller dependencies, volume dependencies, service dependencies, etc.).
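For the SLURM side, a rough sketch of that load-shed (the node list and reason string are placeholders; it just wraps the standard scontrol/sinfo CLIs):

```python
# Rough sketch: drain compute nodes so no new jobs land, then power
# them off out-of-band once idle. Node list and reason are placeholders.

import subprocess

NODES = "node[001-064]"   # hostlist expression for the nodes to shed

def drain(nodes, reason):
    subprocess.run(
        ["scontrol", "update", f"NodeName={nodes}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

def idle_drained_nodes():
    # Nodes whose jobs have finished and are now drained are safe to power off.
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-n", NODES, "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in out.splitlines()
            if line.split()[1].startswith("drained")]

drain(NODES, "cooling_failure_load_shed")
for node in idle_drained_nodes():
    # Placeholder: power off via IPMI/BMC once the node is drained.
    print(f"ok to power off {node}")
```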
Also there are servers which can't be virtualized because they're hardware dependent, latency dependent, or just filling the server they are in, resource wise.
We also have some pet servers, and some cattle. We "pfft" to some servers and scramble for others due to various reasons. We know what server runs which service by the hostname, and never install pet servers without team's knowledge. So if something important goes down everyone at least can attend the OS or the hardware it's running on.
Even in a cloud environment, you can't move a VSwitch VM as you want, because you can't have the root of a fat SDN tree on every node. Even the most flexible infrastructure has firm parts to support that flexibility. It's impossible otherwise.
Lastly, not knowing which servers are important is a big no-no. We had "glycol everywhere" incidents and serious heatwaves, and all we say is, "we can't cool room down, scale down". Everybody shuts the servers they know they can, even if somebody from the team is on vacation.
> wants to keep track of which systems are important.
I mean obviously? Industry does the same. Probably with automation, tools and tagging during provisioning.
The pet mentality is when you create a beautiful handcrafted spreadsheet showing which services run on the server named "Mjolnir" and which services run on the server named "Valinor". The cattle mentality is when you have the same in a distributed key-value database with UUIDs instead of fanciful server names.
Or the pet mentality is when you prefer not to shut down "Mjolnir" because it has more than 2 years of uptime or some other silly reason like that (as opposed to not shutting it down because you know that you would lose more money that way than by risking it overheating and having to buy a new one).
Pets make sense sometimes. I also think there are still plenty of companies, large ones, with important services and data, that just don't operate in a way that allows the data center teams to do this either. I have some experience with both health insurance and life insurance companies, for example, where "this critical thing #8 that we would go out of business without" still lives solely on "this server right here". In university settings you have systems that are owned by a wide array of teams. These organizations aren't ready or even looking to implement a platform model where the underlying hardware can be generic.
Not sure if the goal was just to make an amusing comparison, but these are actually two completely different concerns.
Building your systems so that they don't depend on permanent infrastructure and snowflake configurations is an orthogonal concern from understanding how to shed load in a business-continuity crisis.
It's generally a good idea to have some documentation that states what a machine is used for, the "service", and how important said service is relative to others.
At my company we kind of enforce this by not operating machines that are not explicitly assigned to any service.
However you have to anticipate that the quality of documentation still varies immensely which might result in you shutting down a service that is actually more important than stated.
Fortunately, documentation improves after every outage, because service owners revisit their part of the documentation when their service gets shut down as "it appeared unimportant".
Years ago during a week-long power outage, a telephone central office where we had some equipment suffered a generator failure. The telephone company had a backup plan (they kept one generator on a trailer in the city for such a contingency,) and they had battery capacity[0] for their critical equipment to last until the generator was hooked up.
They did have to load shed, though: they just turned off the AC inverters. They figured anything critical in a central office was on DC power, and if you had something on AC, you were just going to have to wait until the backup-backup generator was installed.
0 - at the time, at least, CO battery backup was usually sized for 24 hours of runtime.
> had some equipment suffered a generator failure. The telephone company had a backup plan (they kept one generator on a trailer in the city for such a contingency
Well that’s more forward thinking than AT&T’s Nashville CO when it got bombed. They just depended on natgas grid for their generators if grid electric went out. They underestimated the correlation between energy grids.
When natgas got cutoff and the UPSs died, they had no ability to hook in roll-up generators and had to hastily install connection points (or hardline them). And no standby contracts for them.
To be fair, the CO I described[0] is a local switching center. The Nashville “CO” was the area long lines/tandem/long distance site. Now, I’m not defending the lack of a backup plan, but tandems tend to be far larger than a typical local/class 5 CO.
0 - which like the Nashville long lines building also happens to be a pre-divestiture AT&T site, although at the time it was owned by BellSouth. It is also in Tennessee, although in Memphis, not Nashville.
One place I worked did a backup power test, and when they came back from the diesels to the grid the entire datacentre lost power for about 10 seconds due to a software bug. It caused a massive outage.
The problem was that a lot of the machines pulled their OS image from central storage servers and there was nowhere near enough IO to load everything, so they had to prioritise what to bring up first to lighten the load and stop everything thrashing. It was a complete nightmare even though the front end that took sales was well isolated from the backend. Working out what was most important across an entire corporation took as long as the problem resolving slowly by just bringing things up randomly.
Nowadays you would just run multiple datacentres or cloud HA, and we have SSDs, but I just can't see such an architectural understanding being possible for any reasonably large company. The cost of keeping it and the dependencies up to date would be huge and it would always be out of date. More documentation isn't the solution; it's to have multiple sites.
That brings back memories of a similar setup with hundreds of Windows servers booting over the network. We had regular “brownouts” even during the day just because the image streaming servers couldn’t handle the IOPS. Basic maintenance would slow down the servers for ten thousand users and generate support tickets.
I jumped up and down and convinced management to buy one of the first enterprise SSDs on the market. It was a PCIe card form factor and cost five digits for a tiny amount of storage.
We squeezed in the images using block-level deduplication and clever copy scripts that would run the compaction routine after each file was copied.
The difference was staggering. Just two of those cards made hundreds of other servers run like greased lightning. Boot times dropped to single digit seconds instead of minutes. Maintenance changes could be done at any time with zero impact on users. The whole cluster could be rebooted all at once with only a slight slowdown. Fun times.
Writing down and graphing out these relationships is a good way to identify and normalize them.
I once had a system with layers of functionality; lvl 0 services were the most critical; lvl 3+ was "user shit" that could be sloughed off at need.
Had some stub servers at lvl 0 and 1 that did things like providing a file share of the same name as the lower level services, but not populated; so that inadvertent domain crossing dependencies weren't severe problems.
There was a "DB server" stub that only returned "no results." The actual DB server for those queries was on the monster big rack SPARC with the 3dz disks that took 10min to spin up fully. When it came up it took over.
I'm really glad we realized that before disaster struck. We have a project in progress to do exactly this. It'd be even better if SWE wrote ADRs (or whatever) that document all this stuff up front, but ... well, there are only so many battles anyone can fight, right?
If your machines are all hypervisors you could migrate important VMs to a couple hosts and turn off the rest. You could also possibly throttle the vcpus, which would slow down the VMs but allow you to run the machines cooler, or more VMs per machine. Finally the ones with long running jobs could just be snapshotted and powered down and restored later, resuming their computation.
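A rough sketch of that consolidation with the libvirt Python bindings, assuming shared storage so live migration works; the URIs, host names and VM names are placeholders:

```python
# Rough sketch: live-migrate the VMs we want to keep onto a surviving
# host, gracefully shut down the rest, then power off the emptied
# hypervisor to shed heat. Assumes shared storage between hosts.

import libvirt

KEEP = {"ldap-01", "mail-01", "dns-01"}                 # VMs that must stay up
SRC_URI = "qemu+ssh://hypervisor-03/system"             # host being evacuated
DST_URI = "qemu+ssh://hypervisor-01/system"             # host that stays on

src = libvirt.open(SRC_URI)
dst = libvirt.open(DST_URI)

for dom in src.listAllDomains():
    if not dom.isActive():
        continue
    if dom.name() in KEEP:
        # Live migration keeps the VM running while it moves hosts.
        dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
    else:
        dom.shutdown()   # graceful ACPI shutdown for everything else

# Once hypervisor-03 is empty it can be powered off out-of-band.
```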
There's a reason us old fogies were so excited when virtual machines got increasingly robust. We could use them to solve problems quickly that used to be nearly impossible.
Cooling system prices seem to scale fairly linearly with the cooling power above a few kW, so instead of one 100 kW system you could buy four 25 kW systems so a single failure won't be a disaster.
Can you provide a cost centre or credit card they can bill this to? In case you didn't notice the domain, it is UToronto: academic departments aren't generally flush with cash.
Further, you have to have physical space to fit the extra cooling equipment and pipes: not always easy or possible to do in old university buildings.
If you designed the system like this from the start, or when replacing it anyway, N+1 redundancy might not be much more expensive than one big cooling unit. The systems can mostly share their ductwork and just have redundancy in the active components, mostly the chillers.
Of course these systems only get replaced every couple decades, if ever, so they are pretty much stuck with the setup they have.
University department IT is not designed, it grows over decades.
At some point some benefactor may pay for a new building and the department will move, so that could be a chance to actually design. But modern architecture and architects don't really go well with hosting lots of servers in what is ostensibly office space.
I've been involved in the build-out of buildings/office space on three occasions in my career, and trying to get a decent IT space pencilled in has always been like pulling teeth.
> Of course these systems only get replaced every couple decades, if ever
This is despite the massive energy savings they could get if they replaced those older systems. Universities are often full of old buildings with terrible insulation, heated/cooled by very old and inefficient systems. In 20 years they would be money ahead by tearing down most buildings on campus and rebuilding to modern standards - assuming energy costs don't go up (which seems unlikely). But they consider all those old buildings historic and so won't.
> In 20 years they would be money ahead by tearing down most buildings on campus and rebuilding to modern standards - assuming energy costs don't go up (which seems unlikely). But they consider all those old buildings historic and so won't.
It has nothing to do with considering those building historic.
The problem is, unless someone wants to donate $50-100M, new buildings don't happen. And big donors want to donate to massive causes ("Build a new building to cure cancer!"), not "This building is kind of crappy, let's replace it with a better one".
It doesn't matter that over 50 years something could be cheaper if there's no money to fix it now.
This kind of thing is like insurance. Maybe IT failed to state the consequences of not having redundancy, maybe people in control of the money failed to understand.. or maybe the risks were understood and accepted.
Either way, by not paying for the insurance (redundant systems) up front the organization is explicitly taking on the risk.
Whether the cost now is higher is impossible to say as an outsider, but there's a lot of expenses: paying a premium for emergency repairs/replacement; paying salaries to a bunch of staff who are unable to work at full capacity (or maybe at all); a bunch of IT projects delayed because staff is dealing with an outage; and maybe downstream ripple effects, like classes cancelled or research projects in jeopardy.
I've never worked in academics, but I know people that do and understand the budget nonsense they go through. It doesn't change the reality, though, which is systems fail and if you don't plan for that you'll pay dearly.
With that model, you’d probably want 5 instead of 4 (N+1), but the other thing to consider is if you can duct the cold air to where it needs to go when one or more of the units has failed.
Maybe, but costs are not linear, and the nonlinearity goes different ways for different parts of the install. The installed cost of several smaller systems could be less than the uninstalled price of the one large system, if the smaller systems are standard parts.
It means 5 times the number of unit failures, but you intentionally put in an extra unit so that one can be taken offline at any time for maintenance (which itself keeps the whole system more reliable), and if one fails the rest keep up. The cost is only slightly more to do this when there are 5 smaller units. Those smaller units could be standard off-the-shelf units as well, so it could be cheaper than a large unit that isn't made in as large a quantity (this is a consideration that needs to be made case by case).
Even if you cheap out and only install 4 units, odds are your failure doesn't happen on the hottest day of the year, so 3 can keep up just fine. It's only when you're unlucky that you need to shut anything down.
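The arithmetic behind the 4-vs-5 unit argument, using the 100 kW / 25 kW figures from upthread (illustrative only):

```python
# Quick capacity check: a 100 kW IT load served by identical 25 kW units,
# with one unit failed. Figures are the illustrative ones from upthread.

load_kw = 100
unit_kw = 25

for units in (4, 5):                 # N vs N+1
    remaining = (units - 1) * unit_kw
    status = "ok" if remaining >= load_kw else f"short by {load_kw - remaining} kW"
    print(f"{units} units, one down: {remaining} kW of cooling -> {status}")
```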
Cloud environments have an elegant solution in the form of "spot" or "pre-emptible" instances: if your workload can tolerate interruptions because it's not time sensitive or not terribly important, you can get a pretty steep discount.
Important servers should also go lower in the rack, but not so low that they flood. Learned that from an aircon failure where the top server hit 180F+ and shut down. Thankfully we had temp logs in Grafana to figure that out.
People like to rag on Kubernetes for its complexity, but this is the exact sort of scenario where k8s really does shine.
The answer to "which machines are important" is "only enough to provide the resources to run everything". You can kill whatever nodes you like just so long as there is enough of them to keep the cluster health and k8s will simply migrate workloads where they need to be.
That being said, storage is still an issue. Perhaps NAS is the one place where you might mark "these are the important machines".
And what if you only have 33% of your nominal cluster capacity available because the AC goes out in your server room?
Now which containers should your cluster stop running?
You haven't actually solved anything with this, you've just changed the abstraction layer at which you need to make decisions. Probably an improvement, but it doesn't obviate the need for some kind of load shedding heuristic.
But that's not what this was about. It was about "hey, which of these boxes had the ldap running on it? We need to make sure that gets shifted somewhere else!"
K8S lets you say "ok, we don't have enough capacity to run everything, so let's shut down the Bitcoin deployment to free up capacity".
There's no leg work or bookkeeping to figure out what was running where, instead it's "what do I need to run and what can I shut down or tune down". All from the comfort of the room with AC.
And if you're really clever, you went ahead and gave system-critical pods elevated priority. [1]
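A hedged sketch of both halves with the official Kubernetes Python client; the PriorityClass and deployment names are placeholders, not anything from the thread:

```python
# Sketch: give critical workloads a PriorityClass so the scheduler
# preempts low-priority pods first when capacity shrinks, then scale a
# non-critical deployment to zero to shed load. Names are placeholders.

from kubernetes import client, config

config.load_kube_config()

# A high-priority class for pods that must keep running on a shrunken cluster.
client.SchedulingV1Api().create_priority_class(
    client.V1PriorityClass(
        api_version="scheduling.k8s.io/v1",
        kind="PriorityClass",
        metadata=client.V1ObjectMeta(name="critical-services"),
        value=1_000_000,
        global_default=False,
        description="Workloads that must survive capacity loss",
    )
)

# Shed load "from the comfort of the room with AC": scale the low-priority
# deployment down to zero replicas.
client.AppsV1Api().patch_namespaced_deployment_scale(
    name="bitcoin-miner",
    namespace="default",
    body={"spec": {"replicas": 0}},
)
```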
I see. I guess there's two components - prioritization for loadshedding, and knowing how to actually turn things off to shed the load. You're saying k8s makes the latter very easy, and it was actually about that latter exercise of mapping workload to infra, not enumeration and prioritization of services.
Isn't this the point of decoupling your compute and datastores using CSI with disaggregated storage in Kubernetes? So long as you keep your datastores available, whatever compute you can manage to attach to them from Kubernetes can run whatever you truly need, at whatever capacity that level of hardware can handle. Similarly, you could scale down the workloads on all the machines so they generated less heat without turning anything off, at the expense of performance.
This is where the cloud kicks ass. Run multiple nodes with geo redundancy (where exactly depends on various concerns: cost, contracts, legal), and make sure nodes span data centres. Maybe if one city gets nuked (literally, or just a fire/power outage) you still have uptime. Use Kubernetes, maybe.
Turning off servers seems like the wrong call instead of transitioning servers into a lower powered state which can be exited once the power budget is available again.
The right answer is turn them all off - anything important is in a redundant data center. But odds are they don't have that.
If a redundant data center isn't an option, then you should put more into ensuring the system is resilient - fireproof room (if a server catches on fire it can't spread to the next - there are a lot of considerations here that I don't know about that you need to figure out), plenty of backup power, redundant HVAC, redundant connections to the internet - and you should brainstorm other things that I didn't think of.
I just did a due diligence with a company that only had servers in one data center. They were supremely confident that there was no way a whole DC could be impacted by an event.