I’m absolutely loving the term Unix herder and will probably adopt it :)
I’m generally with you and the wider industry on the cattle-not-pets thing but there are a few things to keep in mind in the context of a university IT department that are different than what we regularly talk about here:
- budgets often work differently. You have a capex budget and your institution will exist long enough to fully depreciate the hardware they’ve bought you. They won’t be as happy to dramatically increase your opex.
- storage is the ultimate pet. In a university IT department you’re going to have people who need access to tons and tons of speedy storage both short-term and long-term.
I’m smiling a little bit thinking about a job 10 years ago that adopted the cattle-not-pets mentality. The IT department decided they were done with their pets, moved everything to a big vSphere cluster, and backed it by a giant RAID-5 array. There was a disk failure, but that’s ok, RAID-5 can handle that. And then the next day there was a second disk failure. Boom. Every single VM in the engineering department was gone, including all of the data. It was all backed up to tape and slowly got restored, but the blast radius was enormous.
At the risk of no true Scotsman, that doesn’t sound like “cattle not pets”; when the cattle are sent to the slaughterhouse there isn’t any blast radius, there’s just more cattle taking over. You explicitly don’t have to replace them with exact clones of the original cattle slowly restored from tape; you spin up a herd of more cattle in moments.
If you are using clustered storage (Ceph, for example) instead of a single RAID-5 array, ideally the loss of one node or one rack or one site doesn't lose your data, it only loses some of the replicas. When you spin up new storage nodes, the data replicates from the other nodes in the cluster. If you need 'the university storage server', that's a pet. Google aren't keeping pet webservers and pet mailbox servers for Gmail - whichever load-balanced webserver you get connected to will work with any storage cluster node it talks to. Microsoft aren't keeping pet webservers and mailbox servers for Office365, either. If they lose one storage array, or one rack, or one DC, your data isn't gone.
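To make the replica point concrete, here's a toy sketch in Python (nothing like CRUSH or any real placement algorithm, just the idea) of why losing one node only costs you copies, assuming 3-way replication across distinct nodes:

    # Toy 3-way replication across failure domains; illustrative only,
    # not how Ceph/CRUSH actually places data.
    NODES = ["node-a", "node-b", "node-c", "node-d"]
    REPLICAS = 3

    def place(obj_id):
        # pick REPLICAS distinct nodes, starting from a hash of the object id
        start = hash(obj_id) % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

    placement = {f"obj-{i}": place(f"obj-{i}") for i in range(8)}

    failed = "node-b"
    for obj, nodes in placement.items():
        survivors = [n for n in nodes if n != failed]
        assert survivors  # losing data requires every replica's node to fail at once
        print(obj, "still readable from", survivors)

Lose a node and you re-replicate onto whatever takes its place; nothing about the failed node was special.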
'Cattle' is the idea that if you need more storage, you spin up more identikit storage servers and they merge in seamlessly and provide more replicated redundant storage space. If some break, you replace them with identikit ones which seamlessly take over. If you need data, any of them will provide it.
'Pets' is the idea that you need the email storage server, which is that HP box in the corner with the big RAID-5 array. If you need more storage, it needs to be expansion shelves compatible with that RAID controller and its specific firmware versions, which need space and power in the same rack, and that's different from your newer Engineering storage server, and different to your Backup storage server. If the HP fails, the service is down until you get parts for that specific HP server, or restore that specific server's data to one new pet.
And yes, it's a model not a reality. It's easier to think about scaling your services if you have "two large storage clusters" than if you have a dozen different specialist storage servers, each with individual quirks and individual support contracts, which can only be worked on by individual engineers who know what's weird and unique about them. And if you can reorganise from pets to cattle, it can free up time and attention, make things more scalable and flexible, and change the trade-offs of maintenance time and effort.
It's stored in another system that is probably treated as a pet, hopefully by somebody way better at it with a huge staff (like AWS). Even if it's a local NetApp cluster or something, you can leave the state to the storage admins rather than random engineers that may or may not even be with the company any more.
Very differently. Instead of a system you continually iterate on for its entire lifetime, if you're in a more regulated research area you might build it once, get it approved, and then it's only critical updates for the next five (or more!) years while data is collected.
Not many of the IT principles developed for web app startups apply in the research domain. They're less like cattle or pets and more like satellites which have very limited ability to be changed after launch.
From my past life in academia, you're totally right. But that kind of reinforces the point... you do occasionally get some budget for servers and then have to make them last as long as possible :). Those one-time expenses are generally more palatable than recurring cloud storage costs though.
> moved everything to a big vSphere cluster, and backed it by a giant RAID-5 array
I’m with sibling commenter, if said IT department genuinely thought that the core point in “cattle-not-pets” was met by their single SuperMegaCow, then they missed the point entirely.
>>The IT department decided they were done with their pets, moved everything to a big vSphere cluster, and backed it by a giant RAID-5 array. There was a disk failure, but that’s ok, RAID-5 can handle that.
Precisely why, when I was charged with setting up a 100 TB array for a law firm client at a previous job, I went for RAID-6, even though it came with a tremendous write speed hit. It was mostly archived data that needed retention for a long period of time, so the write hit wasn't bad for daily usage, and read speeds were great. Had the budget been greater, RAID-10 would've been my choice. (requisite reminder: RAID is not backup)
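For a sense of the trade-off, here's rough Python arithmetic with a made-up 12 x 10 TB layout (the actual array isn't described above); the per-random-write I/O penalties are the classic textbook ones:

    # Hypothetical 12 x 10 TB array; back-end I/Os per small random write:
    # RAID-5 = 4, RAID-6 = 6, RAID-10 = 2.
    disks, size_tb = 12, 10
    layouts = {
        "RAID-5":  ((disks - 1) * size_tb, 4, "any 1 disk failure"),
        "RAID-6":  ((disks - 2) * size_tb, 6, "any 2 disk failures"),
        "RAID-10": ((disks // 2) * size_tb, 2, "1 per mirror pair"),
    }
    for name, (usable_tb, penalty, survives) in layouts.items():
        print(f"{name}: {usable_tb} TB usable, {penalty} I/Os per random write, survives {survives}")

RAID-6 is the only one of the three that is guaranteed to survive the "second disk dies during the rebuild" scenario from the story above.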
Not related, but they were hit with a million dollar ransomware attack (as in: the hacker group requested a million dollar payment), so that write speed limitation was not the bottleneck when restoring, considering internet speed. Ahhh.... what a shitshow, the FBI got involved, and I never worked for them again. I did warn them though: updates disabled entirely, plus a disabled firewall on the main data server (Windows), was a recipe for disaster. Within 3 days they got hit, and the boss had the temerity to imply I had something to do with it. Glad I'm not there anymore, but what a screwy opsec situation I thankfully no longer have to support.
> the boss had the temerity to imply I had something to do with it.
What was your response? I feel like mine would be "you are now accusing me of a severe crime, all further correspondence will be through my lawyer, good luck".
> even though it came with a tremendous write speed hit
Only on writes smaller than a stripe. If your writes are bigger, you can get way more speed than RAID-10 on the same set, limited only by the RAID controller's CPU.
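Rough numbers for a hypothetical 12-disk set of ~150 MB/s drives (invented for illustration, and assuming the controller keeps up):

    # Full-stripe sequential writes scale with the number of data disks:
    # RAID-6 has N-2 data disks per stripe, RAID-10 writes every byte twice.
    n_disks, per_disk_mb_s = 12, 150
    raid6_full_stripe = (n_disks - 2) * per_disk_mb_s   # parity computed once per stripe
    raid10_sequential = (n_disks // 2) * per_disk_mb_s
    print(f"RAID-6 full-stripe writes: ~{raid6_full_stripe} MB/s")
    print(f"RAID-10 sequential writes: ~{raid10_sequential} MB/s")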
Due to network limitations and contract budgeting, I never got the chance to upgrade them to 10 Gb, but can confirm I could hit 1000 Mbps (100+ MB/s) on certain files on RAID-6. It sadly averaged out to about 55-60 MB/s writes (HDD array, Buffalo), which again, for this use case was acceptable, but below expectations. I didn't buy the unit, I didn't design the architecture it was going into, merely a cog in the support machinery.
The devops meme being referenced, "cattle not pets", was probably popularized by a book called "The Phoenix Project".
The real point is that pets are job security for the Unix herder. If you end up with one neckbeard running the joint that's even worse as a single point of failure.
> Author: proudly declares himself Unix herder, wants to keep track of which systems are important.
Because not all environments are webapps with dozens or hundreds of systems configured in a cookie-cutter manner. Plenty of IT environments have pets because plenty of IT environments are not Web Scale™. And plenty of IT environments have staff churn and legacy systems where knowledge can become murky (see the reference about "archaeology").
An IMAP server is different than a web server is different than an NFS server, and there may also be inter-dependencies between them.
I work with one system where the main entity is a “sale” which is processed beginning to end in some fraction of a second.
A different system I work with, the main “entity” is, conceptually, more like a murder investigation. Less than 200 of the main entity are created in a year. Many different pieces of information are painstakingly gathered and tracked over a long period of time, with input from many people and oversight from legal experts and strong auditing requirements.
Trying to apply the best lessons and principles from one system to the other is rarely a good idea.
These kind of characteristics of different systems make a lot of difference to their care and feeding.
So much this. Not everything fits cattle/chicken/etc models. Even in cases where those models could fit, they are not necessarily the right choice, given staffing, expertise, budgets, and other factors.
I think you're missing some aspects of cattle. You're still supposed to keep track of what happens and where. You still want to understand why and how each of the servers in the autoscaling group (or similar) behaves. The cattle part just means they're unified and quickly replaceable. But they still need to be well tagged, accounted for in planning, removed when they don't fulfil their purpose anymore, identified for billing, etc.
And also importantly: you want to make sure you have a good enough description for them that you can say "terraform/cloudformation/ansible: make sure those are running" - without having to find them on the list and do it manually.
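Something like this toy reconcile loop is the mental model (roles, counts and tags are hypothetical, and it's not any real tool's API):

    # Desired state is data; the tool's job is to make reality match it.
    desired = {
        "web":        {"count": 4, "tags": {"team": "platform", "billing": "frontend"}},
        "smtp-relay": {"count": 2, "tags": {"team": "infra", "billing": "shared"}},
    }
    running = {"web": 3, "smtp-relay": 2, "mystery-box": 1}   # what's actually up right now

    def reconcile(desired, running):
        for role, spec in desired.items():
            delta = spec["count"] - running.get(role, 0)
            if delta > 0:
                print(f"launch {delta} x {role}, tagged {spec['tags']}")
            elif delta < 0:
                print(f"terminate {-delta} x {role}")
        for role in running.keys() - desired.keys():
            print(f"flag {role}: running but not declared anywhere")

    reconcile(desired, running)

The point is that the list of what should exist lives in version control, not in someone's head.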
When you are responsible for the full infrastructure, sequencing power down and power on in coordination with your UPS is a common solution. Network gear needs a few minutes to light up ports, core services like DNS and identity services might need to light up next, then storage, then hypervisors and container hosts, then you can actually start working on app dependencies.
This sort of sequencing lends itself naturally to having a plan for limited-capacity “keep the lights on” workload shedding when facing a situation like the one in the OP.
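Once the ordering is written down as data, the same graph can drive power-on, shutdown in reverse, and the "keep the lights on" shed list. A minimal Python sketch with made-up service names:

    # Bring-up dependencies as a graph: power on in topological order,
    # shed anything not marked critical first. Names are illustrative.
    from graphlib import TopologicalSorter

    depends_on = {
        "dns":         {"network"},
        "identity":    {"network", "dns"},
        "storage":     {"network", "dns"},
        "hypervisors": {"identity", "storage"},
        "apps":        {"hypervisors"},
    }
    critical = {"network", "dns", "identity", "storage"}

    power_on = list(TopologicalSorter(depends_on).static_order())
    print("power on:  ", power_on)
    print("shed first:", [s for s in reversed(power_on) if s not in critical])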
Not everyone has elected to pay Bezos double the price for things they can handle themselves, and this is part of handling it.
If you’re running a couple ec2 instances in one AZ then yeah it’s closer to 100x, but if you wanted to replicate the durability of S3, it would cost you a lot in terms of redundancy (usually “invisible” to the customer) and ongoing R&D and support headcount.
Yes, even when you add it all up, Amazon still charges a premium even over that all-in cost. That’s sweat equity.
Even if you're running "cattle", you still need to keep track of which systems are important, because, to the surprise of many, the full infrastructure is more like the ranch, and the cattle are just part of it.
(and here I remind myself again to write the screed against "cattle" metaphor...)
The industry has this narrative because it suits its desire to sell higher-margin cloud services. However, in the real world, and especially in academia where cks is, many workloads are still not suitable for the cloud.
The cattle metaphor really is a bad one. Anyone raising cattle should do the same thing, knowing which animals are the priority in case of drought, disease, etc.
Hopefully one never has to face that scenario, but it's much easier to pick up the pieces when you know where the priorities are, whether you're having to power down servers or thin a herd.
Cattle are often interchangeable. You cull any that catch a disease (in some cases the USDA will cull the entire herd if just one catches something - biosecurity is a big deal). In the case of drought you pick a bunch to get rid of, based on market prices (if everyone else is culling, you will try to keep yours because the market is collapsing - but this means managing feed and thus may mean culling more of the herd later).
Some cattle we can measure. Milk cows are carefully managed as to output - the farmer knows how much the milk from each one is worth, and so they can cull the low producers. However, milk is so much more valuable than meat that they never cull based on drought - milk can always outbid meat for feed. If milk demand goes down the farmer might cull some - but often the farmer is under contract for X amount of milk, and so they cannot manage prices.
But if you are milking you don't have bulls. Maybe you have one (though almost everyone uses artificial insemination these days). Worrying about milking bulls is like worrying about the NetWare server - once common, but obsolete since before many reading this were even born.
Of course the pigs, cows, and chickens are not interchangeable. Nor are corn, hay, soybeans.
I think it's a way of thinking about things, rather than a true/false description. e.g. VMware virtual hosts make good cattle - in some setups I have worked on, the hosts are interchangeable and you can move virtual machines between them without downtime. In others the hosts have different storage access and different connectivity, and it matters which combination of hosts are online/offline together, and which VMs need the special connectivity.
The regular setups are easier to understand, nicer to work on. The irregular ones are a trip hazard, they need careful setup, more careful maintenance, more detailed documentation, more aware monitoring. But there's probably ways they could be made regular, if the unique connectivity was moved out to a separate 'module' e.g. at the switch layer, or if the storage had been planned differently, sometimes with more cost, sometimes just with different design up-front.
Along these lines, yes, DNS is not NTP, but you could have a 'cattle' template Linux server which can run your DNS or NTP or SMTP relay and can be script-deployed, with standard DNS/NTP/SMTP containers deployed on top. Or you could build a new Linux server by hand and deploy a new service layer by hand, every time, each one slightly different depending on how rushed you are, what version of the installers is conveniently available, and whether the same person does the work following the latest runbook, an outdated one, or memory. You could deploy a template OPNsense VM which can front DNS or NTP or SMTP instead of having to manually log in to a GUI firewall interface and add rules for the new service by hand.
'Cattle not pets' is a call to standardise, regularise, modularise, template, script, automate; to move towards those ways of doing things. Servers are now software which can be copy-pasted in a way they weren't 10-30 years ago, at least in my non-FAANG world. To me it doesn't mean every server has to mean nothing to you, or that every server is interchangeable; it means consider whether thinking that way can help.
It might have been the original idea (though taking the time period and context into account, I suspect we're missing a possible overfocus on deployment via AWS ASGs and a smallish set of services).
What grinds my gears is that over the years I've found it a thought-limiting meme - it swings the metaphor too hard in one direction, and some early responses under the original article IMO present the issue quite well. It's not that people are stupid - but metaphors like this exist as shortcuts for thinking and discussion, and for the last few years I've seen it short-circuit the discussion too hard, making people either stop thinking about certain interdependencies, or stop noticing that there are still systems they treat like "pets", just named differently and in a different scope, while mentally pushing away how fragile they can be.
The issue here is not so much the hardware, but the services that run on top of it.
I guess that many companies that use "current practices" have plenty of services that they don't even know about running on their clusters.
The main difference is that instead of the kind of issues that the link talks about, you have those services running year after year, using resources, for the joy of the cloud companies.
This happens even at Google [1]:
"There are several remarkable aspects to this story. One is that running a Bigtable was so inconsequential to Google’s scale that it took 2 years before anyone even noticed it, and even then, only because the version was old. "
I would argue even the "wider industry" still administrates systems that must be treated as pets because they were not designed to be treated as cattle.
I'm pretty sure anyone in the industry that draws this distinction between cattle and pets has never worked with cattle and only knows of general ideas about the industrial cattle business.
Google tells me beef cattle are slaughtered after 2 years. Split 2 years among 44,000 cattle and you get to spend at most 24 minutes with each one, if you dedicate 2 years of your life to nothing else but that - not even sleep, travel, or eating. If you let them live their natural life expectancy of 20 years, you get 240 minutes with each cow - four hours in its 20-year life.
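The arithmetic, for anyone who wants to check it:

    minutes_per_year = 365 * 24 * 60
    head = 44_000
    print(2 * minutes_per_year / head)    # ~24 minutes each over a 2-year life
    print(20 * minutes_per_year / head)   # ~240 minutes (~4 hours) each over 20 years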
"I care about my cattle", yes, I don't think "cattle" is supposed to mean "stop caring about the things you work on". "I know each and every one as well as the family dog I've had for ten years", no. That's not possible. You raise them industrially and kill them for profit/food, that's a different dynamic than with Spot.
I believe this is just a great example of why cattle shouldn't be raised in such high volume industrial processes.
Have you ever been around cattle? Or helped them calve? Or slaughtered one for meat?
I know every one of my animals and understand the herd dynamics, from who the lead cow is to who is the asshole that is the one often starting fights and annoying the others.
We shouldn't be throwing so many animals into such a controlled and confined system that they are reduced to numbers on a spreadsheet. We shouldn't raise an animal for slaughter after dedicating at most 24 minutes to them.
"cattle not pets" is about computer servers, it's not about ethical treatment of living creatures. Whether or not living animals should be numbers on a spreadsheet, servers can be without ethical concerns[1]. "I treat my cattle like pets" is respectable, but not relevant - unless you are also saying "therefore you should treat your servers like pets", which you would need to expand on.
> "I know every one of my animals"
And you are still dodging the part where there is a point at which you could not do that if you had more and more animals. You can choose not to have more animals, but a company cannot avoid having more servers if they are to digitise more workflows, serve more customers, offer more complex services, have higher reliability failovers, DR systems, test systems, developer systems, monitoring and reporting and logging and analysis of all of the above - and again the analogy is not saying "the way companies treat cows is the right way to treat cows", it's saying "the ruthless, ROI-focused, commodity way companies actually do treat cows is a more scalable and profitable way to think about the computer servers/services that you currently think about like family pets".
[1] at least, first order ones in a pre-AI age. Energy use, pollution, habitat damage, etc. are another matter.
Ranchers do eat their pets. They generally do love the cattle, but they also know at the end of a few years they get replaced - it is the cycle of life.
HPC admin here (and possibly managing a machine room with a system topology similar to theirs).
In heterogeneous system rooms, you can't stuff everything into a virtualization cluster with shared storage and migrate things on the fly, thinking that every (hardware) server is cattle and you can just herd your VMs from host to host.
A SLURM cluster is easy. Shut down all the nodes and the controller will say "welp, no servers to run the workloads, will wait until servers come back", but storage systems are not that easy (ordering, controller dependencies, volume dependencies, service dependencies, etc.).
Also there are servers which can't be virtualized because they're hardware-dependent, latency-sensitive, or just fill the server they are in, resource-wise.
We also have some pet servers, and some cattle. We "pfft" at some servers and scramble for others, for various reasons. We know which server runs which service by its hostname, and we never install pet servers without the team's knowledge. So if something important goes down, everyone can at least attend to the OS or the hardware it's running on.
Even in a cloud environment, you can't move a VSwitch VM as you want, because you can't have the root of a fat SDN tree on every node. Even the most flexible infrastructure has firm parts to support that flexibility. It's impossible otherwise.
Lastly, not knowing which servers are important is a big no-no. We had "glycol everywhere" incidents and serious heatwaves, and all we have to say is, "we can't cool the room down, scale down". Everybody shuts down the servers they know they can, even if somebody from the team is on vacation.
> wants to keep track of which systems are important.
I mean obviously? Industry does the same. Probably with automation, tools and tagging during provisioning.
The pet mentality is when you create a beautiful handcrafted spreadsheet showing which services run on the server named "Mjolnir" and which services run on the server named "Valinor". The cattle mentality is when you have the same information in a distributed key-value database with UUIDs instead of fanciful server names.
Or the pet mentality is when you prefer not to shut down "Mjolnir" because it has more than 2 years of uptime or some other silly reason like that (as opposed to not shutting it down because you know you would lose more money that way than by risking it overheating and having to buy a new one).
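The cattle-style version of that spreadsheet is just records keyed by something boring, queried by tag instead of remembered by name. A toy sketch (field names made up):

    # Toy inventory: opaque UUID keys plus tags you can query.
    import uuid

    inventory = {}

    def register(role, services, owner):
        key = str(uuid.uuid4())
        inventory[key] = {"role": role, "services": services, "owner": owner}
        return key

    register("web",  ["nginx"], "platform")
    register("mail", ["postfix", "dovecot"], "infra")

    mail_hosts = {k: v for k, v in inventory.items() if "postfix" in v["services"]}
    print(mail_hosts)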
Pets make sense sometimes. I also think there are still plenty of companies, large ones, with important services and data, that just don't operate in a way that allows the data center teams to do this either. I have some experience with both health insurance and life insurance companies, for example, where "this critical thing #8 that we would go out of business without" still lives solely on "this server right here". In university settings you have systems that are owned by a wide array of teams. These organizations aren't ready or even looking to implement a platform model where the underlying hardware can be generic.
Not sure if the goal was just to make an amusing comparison, but these are actually two completely different concerns.
Building your systems so that they don't depend on permanent infrastructure and snowflake configurations is an orthogonal concern from understanding how to shed load in a business-continuity crisis.
Industry: don’t treat your systems like pets.
Author: proudly declares himself Unix herder, wants to keep track of which systems are important.