I think we can establish that the database is the biggest culprit in making this difficult.
As an independent developer, I have seen several teams that either back sync the prod db into the staging db OR capture known edge cases through diligent use of fixtures.
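Something like this is what I mean by capturing edge cases as fixtures, sketched with pytest (the model, the field, and the bug here are made-up stand-ins for whatever prod actually surfaced):

```python
# Minimal sketch of the "capture prod edge cases as fixtures" approach.
# User, plan_renewed_at, and should_renew are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional
import pytest


@dataclass
class User:
    email: str
    plan: str
    plan_renewed_at: Optional[str]  # ISO date string, or None for pre-migration rows


def should_renew(user: User) -> bool:
    # The behaviour the fixture pins down: prod had rows with no renewal
    # date that the original job choked on.
    return user.plan != "legacy-unlimited" and user.plan_renewed_at is not None


@pytest.fixture
def legacy_user() -> User:
    """An edge case found in prod: an account from before the billing
    migration, with no renewal date on record."""
    return User(email="legacy@example.com", plan="legacy-unlimited",
                plan_renewed_at=None)


def test_renewal_skips_legacy_accounts(legacy_user):
    assert should_renew(legacy_user) is False
```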
I am not trying to counter your point necessarily, but just trying to understand your POV. Very possible that, in my limited experience, I haven't come across all the problems around this domain.
The variety of requests and load in preprod never matches production, nor does the messiness and jitter you get from requests coming from across the planet rather than just from your own LAN. And you'll probably never build it out to the same scale as production with half your capex dedicated to it, so you'll miss issues that depend on your own internal scaling factors.
There's a certain amount of "best practices" effort you can go through to make your preprod environments sufficiently prod-like but scaled down: real data in their databases, running all the correct services, a load testing environment where you hit one front end with a replay of real load taken from prod logs to look for perf regressions, etc. But ultimately time is better spent using feature flags and one-box tests in prod rather than going down the rabbit hole of trying to simulate packet-level network failures in your preprod environment to make it look as prod-like as possible (although if you're writing your own distributed database you should probably be doing that kind of fault injection, but then you probably work somewhere FAANG scale, or you've made a potentially fatal NIH/DIY mistake).
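To make the load-replay part concrete, even a crude script like this will catch gross perf regressions against a single front end (the host is hypothetical, and it assumes you've already extracted "METHOD PATH" pairs from your prod access logs):

```python
# Rough sketch: replay captured prod requests against one test frontend
# and print status + latency per request. Host and file name are made up.
import time
import urllib.request

STAGING_HOST = "https://staging-frontend.internal"  # hypothetical


def replay(log_path: str) -> None:
    # Expects one "METHOD PATH" pair per line, pre-extracted from prod logs.
    with open(log_path) as log:
        for line in log:
            parts = line.split()
            if len(parts) < 2 or parts[0] != "GET":  # keep the sketch read-only
                continue
            url = STAGING_HOST + parts[1]
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    status = str(resp.status)
            except Exception as exc:  # record failures but keep replaying
                status = f"error: {exc}"
            print(f"{status}  {time.monotonic() - start:.3f}s  {parts[1]}")


if __name__ == "__main__":
    replay("prod_requests.log")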
The article doesn't talk about any of that though. The article says staging diffs prod because of:
> different hardware, configurations, and software versions
The hardware might be hard or expensive to get an exact match for in staging (but also, your stack shouldn't be hyper-fragile to hardware changes). The latter two are totally solvable problems.
With modern cloud computing and containerization, it feels like it has never been easier to get this right. Start up exactly the same container/config you use for production on the same cloud service. It should behave acceptably close to the real thing. The real problem is the lack of users/usage.
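Once the containers match, what's left is mostly config/version skew, which is cheap to check mechanically. A rough sketch, assuming each environment can dump its resolved config (image tag, env vars, service versions) as flat JSON; the file names and keys are made up:

```python
# Small drift check between staging and prod config dumps.
import json

IGNORED = {"ENVIRONMENT_NAME", "LOG_LEVEL"}  # keys that are allowed to differ


def diff_configs(staging_path: str, prod_path: str) -> dict:
    with open(staging_path) as f:
        staging = json.load(f)
    with open(prod_path) as f:
        prod = json.load(f)
    drift = {}
    for key in (staging.keys() | prod.keys()) - IGNORED:
        if staging.get(key) != prod.get(key):
            drift[key] = (staging.get(key), prod.get(key))
    return drift


if __name__ == "__main__":
    for key, (stg, prd) in sorted(diff_configs("staging.json", "prod.json").items()):
        print(f"{key}: staging={stg!r} prod={prd!r}")
```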
I was responding to the other commenters, not really the linked article.
The stuff you cite there is pretty simple to deal with: configuration management is basically a solved problem, and IDK why you couldn't just fix the hardware mismatch.
The more universal problem, making preprod look just like prod so that you have 100% confidence in a rollout without any of the testing-in-prod patterns (feature flags, smoke tests, odd/even rollouts, etc.), is not very solvable though.
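The gating side of those patterns is the easy bit, for what it's worth: a stable hash bucket covers odd/even and percentage rollouts (the names and numbers here are just illustrative):

```python
# Stable hash-based bucketing for gradual feature rollouts.
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    # sha256 is stable across processes and restarts, unlike Python's salted hash().
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % buckets


def flag_enabled(user_id: str, rollout_percent: int) -> bool:
    return bucket(user_id) < rollout_percent


# Start at 5% of users, watch the metrics, then ramp.
print(flag_enabled("user-123", rollout_percent=5))
```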
A lot of things seem like they shouldn't be hard, until you've debugged a weird kernel bug or driver issue that causes the kind of one-off flakiness that turns into a huge problem at scale.
IME, when you are not webscale, the issues you will miss by not testing in staging are bigger than the other way round. But that doesn't mean you should skip all the extra effort the "test in prod only" scenario demands just because you do have a staging env.