I think we can establish that the database is the biggest culprit in making this difficult.
As an independent developer, I have seen several teams that either back sync the prod db into the staging db OR capture known edge cases through diligent use of fixtures.
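Something like this is what I mean by capturing edge cases as fixtures, sketched with pytest (the model, the field, and the bug here are made-up stand-ins for whatever prod actually surfaced):

```python
# Minimal sketch of the "capture prod edge cases as fixtures" approach.
# User, plan_renewed_at, and should_renew are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional
import pytest


@dataclass
class User:
    email: str
    plan: str
    plan_renewed_at: Optional[str]  # ISO date string, or None for pre-migration rows


def should_renew(user: User) -> bool:
    # The behaviour the fixture pins down: prod had rows with no renewal
    # date that the original job choked on.
    return user.plan != "legacy-unlimited" and user.plan_renewed_at is not None


@pytest.fixture
def legacy_user() -> User:
    """An edge case found in prod: an account from before the billing
    migration, with no renewal date on record."""
    return User(email="legacy@example.com", plan="legacy-unlimited",
                plan_renewed_at=None)


def test_renewal_skips_legacy_accounts(legacy_user):
    assert should_renew(legacy_user) is False
```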
I am not trying to counter your point necessarily, but just trying to understand your POV. Very possible that, in my limited experience, I haven't come across all the problems around this domain.
The variety of requests and load in preprod never matches production, nor does the messiness and jitter you get from requests coming from across the planet rather than just from your own LAN. And you'll probably never build it out to the same scale as production with half your capex dedicated to it, so you'll miss issues that depend on your own internal scaling factors.
There's a certain amount of "best practices" effort you can go through to make your preprod environments sufficiently prod-like but scaled down: real data in their databases, running all the correct services, a load testing environment where you hit one front end with a replay of real load taken from prod logs to look for perf regressions, etc. But ultimately time is better spent using feature flags and one-box tests in prod rather than going down the rabbit hole of trying to simulate packet-level network failures in your preprod environment to make it look as prod-like as possible (although if you're writing your own distributed database you should probably be doing that kind of fault injection, but then you probably work somewhere FAANG scale, or you've made a potentially fatal NIH/DIY mistake).
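To make the load-replay part concrete, even a crude script like this will catch gross perf regressions against a single front end (the host is hypothetical, and it assumes you've already extracted "METHOD PATH" pairs from your prod access logs):

```python
# Rough sketch: replay captured prod requests against one test frontend
# and print status + latency per request. Host and file name are made up.
import time
import urllib.request

STAGING_HOST = "https://staging-frontend.internal"  # hypothetical


def replay(log_path: str) -> None:
    # Expects one "METHOD PATH" pair per line, pre-extracted from prod logs.
    with open(log_path) as log:
        for line in log:
            parts = line.split()
            if len(parts) < 2 or parts[0] != "GET":  # keep the sketch read-only
                continue
            url = STAGING_HOST + parts[1]
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    status = str(resp.status)
            except Exception as exc:  # record failures but keep replaying
                status = f"error: {exc}"
            print(f"{status}  {time.monotonic() - start:.3f}s  {parts[1]}")


if __name__ == "__main__":
    replay("prod_requests.log")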
The article doesn't talk about any of that though. The article says staging diffs prod because of:
> different hardware, configurations, and software versions
The hardware might be hard or expensive to get an exact match for in staging (but also, your stack shouldn't be hyper-fragile to hardware changes). The latter two are totally solvable problems.
With modern cloud computing and containerization, it feels like it has never been easier to get this right. Start up exactly the same container/config you use for production on the same cloud service. It should behave acceptably close to the real thing. The real problem is the lack of users/usage.
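Once the containers match, what's left is mostly config/version skew, which is cheap to check mechanically. A rough sketch, assuming each environment can dump its resolved config (image tag, env vars, service versions) as flat JSON; the file names and keys are made up:

```python
# Small drift check between staging and prod config dumps.
import json

IGNORED = {"ENVIRONMENT_NAME", "LOG_LEVEL"}  # keys that are allowed to differ


def diff_configs(staging_path: str, prod_path: str) -> dict:
    with open(staging_path) as f:
        staging = json.load(f)
    with open(prod_path) as f:
        prod = json.load(f)
    drift = {}
    for key in (staging.keys() | prod.keys()) - IGNORED:
        if staging.get(key) != prod.get(key):
            drift[key] = (staging.get(key), prod.get(key))
    return drift


if __name__ == "__main__":
    for key, (stg, prd) in sorted(diff_configs("staging.json", "prod.json").items()):
        print(f"{key}: staging={stg!r} prod={prd!r}")
```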
I was responding to the other commenters, not really the linked article.
The stuff you cite there is pretty simple to deal with: configuration management is basically a solved problem, and IDK why you couldn't just fix the hardware mismatch.
The more universal problem, making preprod look just like prod so that you have 100% confidence in a rollout without any of the testing-in-prod patterns (feature flags, smoke tests, odd/even rollouts, etc.), is not very solvable though.
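The gating side of those patterns is the easy bit, for what it's worth: a stable hash bucket covers odd/even and percentage rollouts (the names and numbers here are just illustrative):

```python
# Stable hash-based bucketing for gradual feature rollouts.
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    # sha256 is stable across processes and restarts, unlike Python's salted hash().
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % buckets


def flag_enabled(user_id: str, rollout_percent: int) -> bool:
    return bucket(user_id) < rollout_percent


# Start at 5% of users, watch the metrics, then ramp.
print(flag_enabled("user-123", rollout_percent=5))
```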
A lot of things seem like they shouldn't be hard, until you've debugged a weird kernel bug or driver issue that causes the kind of one-off flakiness that turns into a huge problem at scale.
IME, when you are not webscale, the issues you will miss by not testing in staging are bigger than the other way round. But that doesn't mean you should skip all the extra effort the "test in prod only" scenario demands just because you do have a staging env.