"Immutable infrastructure" what a laugh. In a large deployment, configuration somewhere is always changing - preferably without restarting tasks because they're constantly loaded. We have (most) configuration under source control, and during the west-coast work day it is practically impossible to commit a change without hitting conflicts and having to rebase. Then there are machines not running production workloads, such as development machines or employees' laptops, which still need to have their configuration managed. Are you going to "immutable infrastructure" everyone's laptops?
(Context: my team manages dozens of clusters, each with a score of services across thousands of physical hosts. Every minute of every day, multiple things are being scaled up or down, tuned, rearranged to deal with hardware faults or upgrades, new features rolled out, etc. Far from being immutable, this infrastructure is remarkably fluid because that's the only way to run things at such scale.)
Beware of Chesterton's Fence. Just because you haven't learned the reasons for something doesn't mean it's wrong, and the new shiny often re-introduces problems that were already solved (along with some of its own) because of that attitude.
Are you sure you two are talking about the same thing?
My understanding of immutable infrastructure is the same as immutable data structures: once you create something, you don't mess with it. If you need a different something, you create a new one and destroy the old one.
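To make the analogy concrete, here's a toy Python sketch of the same pattern (the ServerSpec type and field names are made up, just to show the shape of it):

```python
from dataclasses import dataclass, replace

# Hypothetical "spec" for a server, frozen like an immutable data structure.
@dataclass(frozen=True)
class ServerSpec:
    image: str
    instance_type: str

old = ServerSpec(image="base-2024-05-01", instance_type="m5.large")

# You don't mutate the existing spec; you derive its replacement,
# stand up the new thing, and destroy the old one.
new = replace(old, image="base-2024-06-01")
```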
That doesn't mean that the whole picture isn't changing all the time. Indeed, I think immutability makes systems overall more fluid, because it's easier to reason about changes. Mutability adds a lot of complexity, and when mutable things interact, the number of corner cases grows very quickly. In those circumstances, people can easily learn to fear change, which drastically reduces fluidity.
Yup. We do this. When our servers need a change, we change the AMI for example, and then re-deployment just replaces everything. Most servers survive a day, or a few hours.
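On AWS that flow can be as small as registering a new launch template version and kicking off an instance refresh, roughly like this (template/ASG names and the AMI id are placeholders, and the rollout knobs will vary by setup):

```python
import boto3

ec2 = boto3.client("ec2")
asg = boto3.client("autoscaling")

# Point a new launch template version at the freshly baked AMI (placeholder id).
version = ec2.create_launch_template_version(
    LaunchTemplateName="web-fleet",  # hypothetical template name
    SourceVersion="$Latest",
    LaunchTemplateData={"ImageId": "ami-0123456789abcdef0"},
)["LaunchTemplateVersion"]["VersionNumber"]

ec2.modify_launch_template(LaunchTemplateName="web-fleet", DefaultVersion=str(version))

# Let the auto scaling group roll every instance onto the new image.
asg.start_instance_refresh(
    AutoScalingGroupName="web-fleet-asg",  # hypothetical ASG name
    Preferences={"MinHealthyPercentage": 90},
)
```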
Makes sense to me. I was talking with a group of CTOs a couple years back. One of them mentioned that they had things set up so that any machine more than 30 days old was automatically murdered, and others chimed in with similar takes.
It seemed like a fine idea to me. The best way to be sure that everything can be rebuilt is to regularly rebuild everything. It also solves some security problems, simplifies maintenance, and allows people to be braver around updates.
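The 30-day reaper itself is only a few lines if you're on AWS. A sketch, not a drop-in tool - a real one would honor exclusion tags, drain hosts first, and spread the terminations out:

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

# Collect running instances launched before the cutoff.
stale = [
    inst["InstanceId"]
    for res in ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for inst in res["Instances"]
    if inst["LaunchTime"] < cutoff
]

# Terminate them and let the autoscaler rebuild from the current image.
if stale:
    ec2.terminate_instances(InstanceIds=stale)
```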
Probably the most insightful comment in this entire thread. Thank you. In many cases, an "image" is just a snapshot of what configuration management (perhaps not called such but still) gives you. As with compiled programming languages, though, doing it at build time makes future change significantly slower and more expensive. Supposedly this is for the sake of consistency and reproducibility, but since those are achievable by other means it's a false tradeoff. In real deployments, this just turns configuration drift into container sprawl.
So, once you create a multi-thousand-node storage cluster, if you need to change some configuration, replace the whole thing? Even if you replace onto the same machines - because that's where the data is - that's an unacceptable loss of availability. Maybe that works for a "stateless" service, but for those who actually solve persistence instead of passing the buck it just won't fly.
Could you say more about why your particular service can't tolerate rolling replacement of nodes? You're going to have to rebuild nodes eventually, so it seems to me that you might as well get good at it.
And just to be clear, I'm very willing to believe that your particular legacy setup isn't a good match for cattle-not-pets practices. But I think that's different than saying it's impossible for anybody to bring an immutable approach to things like storage.
The person you're replying to didn't say "replace every node," they said "replace the whole thing."
To give a really silly example, adding a node to a cluster is a configuration change. It wouldn't make sense to destroy the cluster and recreate it to add a new node. There are lots of examples like this where if you took the idea of immutable infrastructure to the extreme it would result in really large wastes of effort.
Could you please point me at prominent advocates of immutable infrastructure who propose destroying whole clusters to add a node? Because from what I've seen, that's a total misunderstanding.
As I said, it's a silly example just to highlight an extreme. In between there are more fluid examples. I don't think it's that ridiculous to propose destroying and recreating the cluster in its entirety when you're deploying a new node image. However, as you say, I'm not sure anyone would advocate that except in specific circumstances.
On the other hand, while my suggestion of doing it to add a node sounds ridiculous, I'm sure there are circumstances in which it's not only understandable but necessary, due to some aspect of the system.
I'm saying it's not even an extreme, in that I don't believe what people are calling "immutable infrastructure" includes that.
If your biggest objection to an idea is that you can make up a silly thing that sounds like it might be related, I'm not understanding why we need to have this discussion. I'd like to focus on real issues, thanks.
I'm not objecting categorically to anything. I think that immutable infrastructure is a spectrum, and depending on your needs you may have just about everything immutably configured, or almost nothing. I just don't think it's so black and white as "you should always use immutable infrastructure."
I also think it's a cool idea to destroy the entire cluster just to add a node; it sounds ridiculous, but there are also some circumstances where it makes perfect sense.
Again, do you have a citation for the notion that it's a spectrum? The original post that coined the term doesn't talk about it that way, and neither do the other resources I found in a quick search. As I see it, it's binary: when you need to change something on a server, you either tinker with the existing server or you replace the server with a fresh-built one that conforms to the new desire.
Wow, look at those goalposts go! If you make enough exceptions to allow incremental change, then "immutable" gets watered down to total meaninglessness. That's not an interesting conversation. This conversation is about configuration management, which is still needed in a "weakly immutable" world.
Again, could you please point me at notable advocates of immutable infrastructure proposing the approach you take such exception to? And note that I'm not proposing any exceptions.
Interesting to say you've "solve[d] persistence" when you seem to be limited by it here. Is there a particular reason your services can't be architected in a less stateful, more 12-factor way?
Kick the persistence can down the road some more? Sure, why not? But sooner or later, somebody has to write something to disk (or flash or whatever that doesn't disappear when the power's off). A system that stores data is inherently stateful. Yes, you can restart services that provide access or auxiliary services (e.g. repair) but the entire purpose of the service as a whole is to retain state. It's the foundation on top of which all the slackers get to be stateless themselves.
The vast majority of people simply redefine the terms to fit whatever they are selling.
If your systems are immutable, they can run read-only. In the nineties, Tripwire, the integrity checker, popularized this - you could run it off a CD-ROM. Today, immutable infrastructure is VMs/containers that can be run off a SAN or a read-only pass-through file system. It means snapshots are completely and immediately replicable. When you need to deploy, you take a base image/container, install code onto it, run tests to ensure it is not broken, and replicate it as many times as you need, in a read-only state. This approach also has an interesting property: because the system is read-only (as in exported to the instance read-only/mounted by the instance read-only), it is extremely difficult to do nasty things to it after a break-in - if it is difficult to create files, it is difficult to stage exploits.
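With the Docker SDK, for example, running the tested artifact read-only looks roughly like this (the image tag is a placeholder):

```python
import docker

client = docker.from_env()

# Run the baked, pre-tested image with a read-only root filesystem;
# the only writable space is an explicit tmpfs, which makes it much
# harder to stage files on the box after a break-in.
container = client.containers.run(
    "myapp:2024-06-01",          # hypothetical image tag
    read_only=True,
    tmpfs={"/tmp": "size=64m"},
    detach=True,
)
```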
That's the only kind of infrastructure where configuration management on the instances themselves is not needed.
The hosts are managed via chef, the jobs/tasks running on those hosts by something roughly equivalent to k8s.
As for the conflicts, I have to say I loathe the way the more dynamic part of configuration works. It might be the most ill-conceived and poorly implemented system I've seen in 30+ years of working in the industry. Granted, it does basically work, but at the cost of wasting thousands of engineers' time every day. The conflicts occur because (a) it abuses source control as its underlying mechanism and (b) it generates the actual configs (what gets shipped to the affected machines) from the user-provided versions in a non-deterministic way, which causes spurious differences. All of its goals - auditability, validation, canaries, caching, etc. - could be achieved without such aggravation if the initial design hadn't been so mind-bogglingly stupid.
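To give a flavor of how cheap determinism is: just rendering the generated configs canonically (sorted keys, fixed formatting) would eliminate most of the spurious diffs. An illustrative sketch, not what our system actually does:

```python
import json

def render_config(settings: dict) -> str:
    # Canonical output: sorted keys, fixed separators, trailing newline.
    # Two runs over the same inputs are byte-identical, so the only diffs
    # that show up (and force rebases) are real changes.
    return json.dumps(settings, sort_keys=True, indent=2, separators=(",", ": ")) + "\n"
```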
But I digress. Sorry not sorry. ;) To answer your question, my personal solution is to take advantage of the fact that I'm on the US east coast and commit most of my changes before everybody else gets active.
Sometimes you have to work with what you're given in a brownfield env, and a config management tool is useful in that case, but it's possible that you are working with a less-than-ideal architecture and less-than-ideal time/money to make changes.
State is always the enemy in technology.
I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.
> can't even imagine managing hundreds of servers whose state is unpredictable at any moment
Be careful not to conflate immutability with predictability. The state of these servers is predictable. All of the information necessary to reconstruct them is on a single continuous timeline in source control. But that doesn't mean they're immutable because the head of that timeline is moving very rapidly.
> can't be terminated and replaced with a fresh instance for fear of losing something.
No, there's (almost) no danger of losing any data because everything's erasure-coded at a level of redundancy that most people find surprising until they learn the reasons (e.g. large-scale electrical outages). But there's definitely a danger of losing availability. You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming. Rolling changes are an absolute requirement. Some take minutes. Some take hours. Some take days. Many of these services have run continuously for years, barely resembling the code or config they had when they first started, and their users wouldn't have it any other way. It might be hard to imagine, but it's an every-day reality for my team.
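Schematically, every one of those changes goes through some variant of this loop. A bare-bones sketch; the real orchestration adds canaries, pacing, and abort logic:

```python
import time

def rolling_change(hosts, drain, apply_change, healthy, max_unavailable=1):
    """Apply a change across a fleet a few hosts at a time,
    never dropping below the availability floor."""
    for i in range(0, len(hosts), max_unavailable):
        batch = hosts[i:i + max_unavailable]
        for h in batch:
            drain(h)         # stop routing new work to the host
            apply_change(h)  # push the new config/binary and restart
        # Don't touch the next batch until this one is serving again.
        while not all(healthy(h) for h in batch):
            time.sleep(10)
```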
> Be careful not to conflate immutability with predictability.
I don't trust predictability. Drift is always a nightmare. Nothing is ever as predictable as you would like it to be.
> State is always the enemy in technology.
Except that state and its manipulation is usually the primary value in technology.
> I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.
Yes, that sounds awful. That's why we have backups and, if necessary, redundancy and high availability.