I test in prod (increment.com)
301 points by rsie_above on Nov 23, 2021 | 123 comments



After reading a few smug comments, I've concluded that some of the folks in this thread have never worked on an application where the production scale is many, many orders of magnitude greater than preproduction environments. There's no substitute for "testing in prod." If you're not "testing in prod" with your really-big application, you're not testing enough.


> After reading a few smug comments...I've concluded that some of the folks in this thread have never worked on an application where the production scale is many, many orders of magnitude greater than preproduction environments.

I think your comment is smug as well. The author of the article has considerably overloaded the term "testing", which understandably gives a lot of people a knee-jerk reaction of "Don't do that". I get the impression that this is by design, and the article is intended to provoke discussion via flamewar.

I think the best way to combat this is to simply avoid too much discussion on such articles until they are rewritten to be more clear and less liable to cause flamewars. I have some views on the thesis of the article, and practical experience backing those views – but I simply won't express them because I don't want to get embroiled in the numerous minor arguments caused by the confusing terminology, with little to no actionable information. An example of a pattern that you'll see repeating across all the comments in this thread:

Person A: "We take our time and test the application thoroughly under all sorts of load in a preproduction environment. We run end-to-end tests on every build. We value work/life balance, and try to minimize testing-in-production."

Person B: "When did the article say not to do that? It never says you shouldn't test outside prod, it says you should also test in prod and that's a superpower! You're totally misunderstanding the article!"


> The author of the article has considerably overloaded the term "testing",

I would disagree. There is zero overloading of the term "testing" as that is an already extremely broad term of art that would seem to clearly apply to every example of production testing provided in the article.

> with little to no actionable information.

The article absolutely provides some actionable points and breaks them up under "Technical", "Cultural", and "Managerial".

> An example of a pattern that you'll see repeating across all the comments in this thread:

The article repeatedly covers the exact points mentioned in your example exchange, so the people you see having that exchange are those who at best only skimmed the article. I find that the light readers can sometimes dominate comment threads early, but those comments eventually do become outnumbered by the more interesting discussion. People who take the time to read carefully and think respond more slowly and thus tend to be back-loaded.

> I have some views on the thesis of the article, and practical experience backing those views – but I simply won't express them because I don't want to get embroiled in the numerous minor arguments

That is unfortunate. The only way the discussion improves is when people do take the time to state their views, even when they don't have the time to follow up on replies. The only way to combat vapid discussion is to plant the seeds of better conversation.


The article also mentions a few things that are usually not tested pre-production (e.g. timeouts, race conditions) but certainly could be with better integration test tooling.

The thing that bugged me was that these were not treated as a cost/benefit trade off but simply as a fait accompli.

Over time I've come to believe that an appreciation of nuance and cost/benefit tradeoffs are at the heart of effective testing, but culturally the practice is steeped in dogmatism and absolutism. This exhibits all of that - e.g. "control freak managers", "only one represents reality" and "saying not today to the gods of downtime".


I think your reply to my smug comment is also smug! It’s smugness all the way down!


It's possible, I did feel a bit self-satisfied after making it!

Maybe it applies upwards as well? The article is kind of all smug about "I test in prod" too :)


Full-stack smugness


Same. I'm processing fitness data 24/7 and I can easily repull and overwrite the normalization. There's no actual down side to testing in prod with this style of data syncing. In fact the dev versions of our oauth2 app on fitbit have only one test account so there's no real data for me to access. 99% of the edge case issues happen with real data only and you will never catch them without it. I only use the dev version for setting up new features not yet in prod but once I've confirmed basic workability the rest of the dev is on my admin or test accounts but with production data.


Exactly. Here we crunch loads of data, and by a lot, I mean our production databases run on a cluster with a node count in the three digits holding a four-digit number of terabytes of data. Good luck replicating that scale in preproduction environments.

We have loads of redundancy, and dedicated "test in production" machines/datacenters to test actual production loads on actual production-scale sets of machines.

Now, tests in production usually involve one pair of hands and two other people looking over their shoulder (dev + ops + DBA), and require a well-defined rollback procedure and a post-mortem. We still have an absurdly high SLA (99.999%).


Our production scale is just copied stage environments because our clients are trivially shardable and they only use our service from 9-5 m-f, but hoo boy are there a lot of them


Yeah, it's nice to be trivially shardable! Ours is not: stock market connections to 13 countries, with clients having global trading controls across these countries - so we can shard some parts and not others, and each country has its own specificities. Guaranteeing that a cross-market feature works for a client, if it impacts controls, means it is tested in prod: either after market close on mock exchanges provided by the various countries, or during trading hours on small orders.

No amount of testing in the bank has ever been able to spot the weirdest issues, so we continue while really trying hard to make the prod "pilots" (we try not to call these tests) as routine as possible. But we still find crazy issues, notably those only the client notices (i.e. wrong specs they couldn't prevalidate by plugging their monster system into our monster system in QA).


why can't you shard by clients (not regions)? Is there some reason why clients must know about other clients?


Let's imagine we build all the trading robots with one server per group of clients. That would definitely work - we probably can't pay for one set of servers (for HA) per client since we have hundreds - but then all these per-client servers have to queue up for the exchange line access, which has per-country controls (can't move the market more than x%, can't trade x or y stock at z time, whatever). So yes, in a way, at the end of the line the clients must "know about each other" so we can respect the 13 different law codes.

We traditionally split by exchange because it made the most sense when we started 30 years ago in Asia, but... we're more and more splitting clients into groups where we can and allocating them CPU power, indeed.

I envy the stock market people because while they must handle even higher volume, they can shard per stock itself and have just one jurisdiction.


cool, thanks for satisfying my curiosity. This sounds like a very interesting problem! Coming from an Elixir/Erlang background, I would probably architect it with clusters of "rate-limiting" backend agents sharded by exchange or jurisdictions and a cluster of client-sharded groups (in its own VPS, even) for client information. But yeah, it would be tough to migrate to such a setup from something more brownfield.


Some places you simply can't. Medical devices, airplanes, etc.

But non-critical services? Sure


They do something called acceptance testing, which is nothing but testing in prod.


Not quite. Nobody acceptance tests an airliner on a passenger flight. If we acceptance test a medical diagnostics device, we don't run the test the doctor later relies on with the DUT.


This is a common misconception about testing in prod. It’s not about the logic. To quote the article: “Once you deploy, you aren’t testing code anymore, you’re testing systems.”


I don't quite get the objection. For me, "testing in prod" is "we observe the actual running system, with production traffic and users directly interacting with it and its results being live". That's not quite what "acceptance testing" in the mentioned-above domains is. If you have an acceptance testing stage, prod follows it.


I think the argument breaks down when you have systems as opposed to objects.

You can't acceptance test "Google's search", because at a minimum doing so would require some kind of reverse proxy that is itself part of the system and can't be acceptance tested without...and turtles etc.

Another way of putting this might be that there isn't a good way to acceptance test "airplane manufacturing companies". There isn't a set of acceptance tests you can run to ensure that Boeing is performing to spec before having it build real airplanes.


Maybe. I don't feel particularly strongly either way about whether it can be applied to the kind of large services you are thinking of. But that also wasn't really the point of my comment; I merely rejected the claim that acceptance testing, in fields where it is done, is the same as testing in prod.


Perhaps the closest analogy would be deploying to a limited subset of users on closely monitored boxes.

It's clearly not a perfect analogy though: except perhaps if the test aircraft turned itself into a crater on the runway, there is very little one aircraft can do to affect the function of all the others.

In software it's hard to get this kind of guarantee: a new DELETE with a missing WHERE clause in your canary environment is going to take some time to clean up after (assuming you have backups, etc.).


Acceptance tests of aircraft are done by pilots sent by the owner, not the factory. They do some flying and follow a checklist which is, again, unique to the company. It's like sending out your software for evaluation: it is in production and it's evaluated by clients. You don't send testing/staging software for evaluation by clients.


You absolutely can and should test in prod in these cases.


We call that FDA testing


FDA does not test anything.


I'm not sure how this is a useful observation. Units don't test anything, but it's still called Unit Testing.


They test my patience!


Even with small applications that don’t have fatal consequences, you aren’t testing if you aren’t testing in production.

Especially not in a world where it is so easy to control traffic to your application so that 95% of users land on the stable version and 5% land on your staging version.
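
Concretely, one way to get that split is deterministic bucketing on a stable user id, so the same person always sees the same version. A minimal sketch (purely illustrative, not any particular tool's API):

    import hashlib

    CANARY_PERCENT = 5  # share of traffic sent to the staging/new version

    def bucket(user_id: str) -> str:
        # Hash a stable identifier so a given user always lands in the
        # same bucket instead of flip-flopping between versions.
        digest = hashlib.sha256(user_id.encode()).digest()
        slot = int.from_bytes(digest[:2], "big") % 100
        return "staging" if slot < CANARY_PERCENT else "stable"

    # bucket("user-42") returns "stable" for roughly 95% of ids.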

Hell, this is why you have LTS and non-LTS versions of the dev tools you use yourself.


As the old saying goes: Every team has a testing environment; some teams also have a completely separate production environment.


There is more than one way to define 'testing'. That's not responsible for all of the misunderstandings, but it's responsible for quite a few. For some people, feature toggles fall under testing in production. For other people, they don't.

It's not whether production is orders of magnitude greater than preprod. It's whether production is orders of magnitude more diverse. If you've designed your product so that every user is a snowflake, that's on you, or at least the sales team. That functionality surface area is all undocumented features. Bragging about not being able to test all of the features you've promised users is like the opposite of a humble brag. What would you call that, a Dunning-Kruger brag?

I've worked at a few places that didn't understand this. A couple eventually got it (begrudgingly, and with a hint of resentment). The other never did, and hemorrhaged money and good people.


Well, this is why it is a superpower. In applications at that scale, a bug is simply lost in the system.


That's why I usually develop in _prod_ (well _clients_)


"some of the"? or the one you already replied to?


The one I replied to, plus others. Hence, "some."


But do you edit code on prod? I did for years at our two-person high-traffic startup. 1 billion application requests per month for one of them, and orders of magnitude higher on the more recent biz. And I'd do it again. When you're a one-person dev and ops team with a smart partner to talk it through with, and where the cost of not making the rapid change is far higher than making a wrong change and rolling it back, do it. Every time. Fire up vim on the live box, edit and save. Different process if you're horizontally scaled, but similar idea.

I don’t do that anymore but I am working to get our dev team more in touch with operations now and to treat it less like the holy of holies and instead like a living breathing application that produces a ton of useful telemetry, even when it’s failing, that we can use to improve product and processes.


> But do you edit code on prod?

For a very small startup that seems like a good approach. I think the right approach differs greatly depending on the size of the company and the culture.

Having been employed mostly in medium-sized companies, I only do so for absolutely time- and business-critical issues. It gets management off my back so I can focus and fix it properly on my own time later.

For everything else. Nope. Everything needs to follow the proper processes. Nothing gets merged without code review. No, I can't directly deploy on prod. We will test on stage first.

You need to be firm or you end up with the slippery slope of "Can you just fix this small thing quickly?". No, sorry, please write a ticket we can discuss it next sprint. You can't let management know how easy deployments are these days, they must feel like it is a huge deal or they will mess up your whole sprint.

So the right approach has more to do with company politics than with what is technically best, and that is what is missing from this discussion. One of the most important priorities for an employed developer is to keep their peace of mind. At my current job I don't even have access to prod, so even those time-critical fixes are not possible. It absolutely slows me down sometimes in discovering and fixing certain bugs, but I sure as hell won't ask for access.


We all do it for urgent fixes; the thing is to try to make it less holy, to avoid "we can't fix this crushing bug before the next release window" stupidity, but still holy enough to avoid "I'll fix it now in prod and forget about that fix, which I'll overwrite next release window" :D


We don't all do it for urgent fixes... where I work, I couldn't do it even if I wanted to. I don't have access. I suppose there's someone somewhere who does have access to the containers running my code but I'm not sure who to ask and it would be easier to just merge a PR and then click the release button in the devops UI than to change code in prod.


I work at a small startup right now, but our motto is that if it's a one-line fix we can just commit to master and deploy to prod. Anything that requires adding new logic we run by another engineer, just to have another set of eyes look at it.


I work in a regulated bank, and yesterday night I had to have an auditor watch us deploy a new version for one user, trigger one order, and revert back. He'll then spend the next 3 days studying the changes (that we didn't test in front of him) to tell us whether or not we can deploy.

And god forbid we find a bug in prod; to roll back I suspect we'll need to retrigger an audit :D Next time you loudly ask your representatives to regulate banks even more, think of us poor assholes spending our time filling out mindless paperwork :D


We did at my first web dev job. It was a dozen e-commerce sites all tied to the same backend and we would regularly turn Adsense down for a month while we worked on a new layout for one of the sites.

The owner was semi-tech savvy but not enough to be efficient and when I tried to explain I could just copy the site to a new server and clone the database he’d say, “we don’t have time for that.” Was like working at the McDonalds of web development (I think it was 2010 or 2011) and I don’t recommend it.


To add to that: When the platform is down, editing in production is the fastest way to bring it back up.



The problem is not whether to test in prod or not, the problem is who gets prod access. When I started as a Linux sysadmin almost 20 years ago I did not get prod root access for a full year. The care and respect for production was drilled into me army style. Root access was considered a privilege one earns.

Oldish man rant: IT has been McDonaldised. People don't spend significantly more than one year at the same job, and companies expect that after a weeks-long induction new employees should ship code. Developers don't understand TCP/IP; they don't understand DNS, HTTP, databases, IO, OS process and memory management, etc. Of course you can't allow production access for "I just import the library and start coding". I even suspect K8s of being nothing more than a standardized way to arrange the ice box, potato fryer, and grill so anybody can start flipping burgers on day one.


I understand all those things? Everyone I work with does except for maybe memory management?

If you said assembly or Fortran or Cobol, sure. But most good developers know these concepts.


Your comment is a tautology.


I mean, not really?

There's an implicit frequency hint with the word "good" that implies it's not so rare as to be absent from a non-trivial portion of the population. If someone said to me "most good doctors can cure metastatic diffuse stomach cancer" and then later said "too bad only 1 in a million doctors are good", I'd roll my eyes pretty hard at them. But if the ratio is closer to 5%, then I'd say it's a pretty fair ratio.

Anyway, back to the original point. I think anyone that's dealt with even high level languages like Python or Ruby has managed their way through almost all of those things. They're not so rare.


I am saying that the logical proposition in your comment is inarguably true. You’ve classified good engineers as those that know the things that make good engineers good and then stated that most all good engineers know these things.

It sounds like you are now also saying you only work with good engineers and also that good acceptably applies to the set of engineers in the top 95th percentile. I don’t think my tautology comment contests anything you’ve said.

I would say your experience seems to be biased. In my experience it's usually a coin toss or two (top 50-25th percentile) whether an engineer just knows syntax or actually understands the machine they're programming. But that's just anecdotal, and it doesn't seem super apropos to quibble over where that line is drawn at a population level in this thread.


I tested in prod all the time. Except

a) I know how to fix 99% of issues I would cause

b) I explicitly call out that this is happening, indicate possible issues, and what to tell angry customers if they call

c) Take ownership for mistakes

This trust breaks down when people don't do any of these and just yeet whatever without knowing the consequences


Also, rules only matter when problems arise. If you don't care about processes and the error rate is low, you'll never have any issue.

But just because you can fix issues quickly doesn't mean you'll survive having one every day, so testing before prod in this case starts saving you money and time.


Everyone tests in prod, whether intentionally or not. Some teams acknowledge it and plan for it, and they are in a much better place than those who have supposedly complex testing procedures and environments but are not prepared for when their code meets the real world.


The question is not whether you test in prod, but whether you also take testing seriously elsewhere. Sure, sometimes problems will only happen when the rubber meets the road, or even (still metaphorically speaking) 1000 miles into the trip. That's life, but it doesn't excuse inflicting problems on users that could quite reasonably have been caught by other means.


Our dev and prod environments are literally the same code other than some operational stuff. When you need to test something before it’s continuously integrated on merge, you use the ephemeral feature environment that’s automatically created for you when you open a PR. This forces features to be done done when they’re merged unless they’re gated in some way.

And it makes us less spooked about deploying to prod because it happens all the time. It raises the bar for PR reviews because if you approve something broken, unless an automated test catches it, it’s going straight to prod. So most reviewers take the time to actually verify changes work like we’re all supposed to do but usually laze out on.

Since dev and prod are always the same, and ephemeral envs use dev resources (DBs), you know exactly what to expect and don’t have the cognitive overhead of keeping track of which versions of things are deployed where. If someone experiences an issue in prod it has always been instantly reproducible in dev. In those ways, we test in prod.


A major point of the article is that it's impossible to replicate a prod environment. A staging environment won't capture all the issues.


That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data. I’d argue more often than not it’s overwhelmingly difficult to maintain the discipline required to not have a crummy staging that doesn't in any way resemble prod, so I’m sympathetic. We deal with that by taking away the notion that you get to stage your changes anywhere persistent before they land in prod. Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. But it is a hell of a lot better than the type of staging environments I imagine drove the author to take such a stance. Like I said, we’ve yet to experience a prod-only bug that didn't reproduce in our dev env. So in that sense, anecdotally, the point does not hold.

Just to be a little more clear: I agree with the author that issues happen in prod that are unique to prod that you simply won’t catch pre-prod. And I agree with the hot take mantra that “testing in prod” is okay and not to be as frowned upon as people seem to think is trendy. But I’m also suggesting that instead of viewing the ability to test in prod as a badge of honor, it’s also possible to apply this mantra towards traditional notions of a staging environment. You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit the frequency and distances that staging systems diverge from prod and I wager you’d get much much further than the comfortable status quo of merging to staging, manual and automated integrating, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.


> That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data.

That's fair, and a decent way to make a staging environment, though as echoed elsewhere, the data itself can exercise your code in ways that uncover bugs. I also think this is more feasible on, say, a monolith setup vs. a sharded multi-cluster service that's integrated with manifold 3rd-party systems - but yes, if you can, you probably should have this kind of prod-replica staging as an adjunct to incremental canary rollouts, prod-safe testing suites, etc. And the article was explicitly suggesting in-prod testing should be an adjunct to non-prod testing.


You and others also have a point. I’m now thinking of ways we could seed our dev data to be maximally similar to prod. It’s all encrypted blobs though so it would mostly be about scale in our case. But your point is still taken.


> “dev” is a direct copy of prod but with different backing data

Then it's not a direct copy of prod. Many times it's the data that makes bugs appear.


Let me quote myself:

> Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. … Like I said, we’ve yet to experience a prod-only bug that didn't reproduce in our dev env. …

> You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit the frequency and distances that staging systems diverge from prod and I wager you’d get much much further than the comfortable status quo of merging to staging, manual and automated integrating, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.

I’ve seen plenty of staging envs that look nothing like prod, and that's what I'm calling the real sham.


That’s true but depends on what you’re building. We don’t have a million users or anything yet but I clone the prod db every month or so, change the passwords, and use that for testing. Before we had a staging db and a prod db but they’d diverge and staging would have almost no data while prod would be full of it.


How do you manage sensitive data with this workflow (i.e. do you do it manually every time, do you automate it, what scripts, etc.)?

I get changing passwords, but say that data leaks (whether through a vulnerability in the clone environment, or a dev gone rogue): how do you mitigate possible damage done to real users (since you did clone from prod)?

I ask not because I question your actions, but because I've been wanting to do something similar in staging env to allow practical testing, but I haven't had the chance to research how to do it "properly".


Not the parent post, but working in finance, multiple products had a "scrambling" feature which replaced many fields (names, addresses, etc) with random text, and that was used upon restoring any non-production environments. It's not proper anonymization since there are all kinds of IDs that still are linkable (account numbers, reference numbers) to identities but can't be changed without breaking all the processes that are needed even in testing, but it's a simple action that does reduce some risks.
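
For anyone rolling their own version of that, the core of such a scrambler is small. A rough sketch (table and column names are made up), leaving the linkable IDs alone exactly as described above:

    import random
    import string

    SCRAMBLE = {"name", "address", "email"}               # PII text, safe to randomize
    KEEP_AS_IS = {"account_number", "reference_number"}   # still linkable, but processes need them

    def scramble_text(value: str) -> str:
        # Same length as the original so layouts and validations still behave.
        return "".join(random.choice(string.ascii_letters) for _ in value)

    def scramble_row(row: dict) -> dict:
        return {
            col: scramble_text(val) if col in SCRAMBLE and isinstance(val, str) else val
            for col, val in row.items()
        }

    # Applied to every row as part of the restore into a non-production environment.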


> literally the same code

There’s a ton of stuff on this list that you bloody well can test in preproduction and you’re a damn fool (or you work for them) if you don’t/can’t.

    - A specific network stack with specific tunables, firmware, and NICs
    - Services loosely coupled over networks
    - Specific CPUs and their bugs; multiprocessors
    - Specific hardware RAM and memory bugs
    - Specific distro, kernel, and OS versions
    - Specific library versions for all dependencies
    - Build environment
    - Deployment code and process
    - Specific containers or VMs and their bugs
    - Specific schedulers and their quirks
That’s 40% of that list, or 5/8ths of the surface area of 2 problem interactions. CI/CD, Twelve Factor… you can fill an entire bookcase with books on this topic. Some of those books are almost old enough to drink. Someone whose by-line is “been on call half of their life” has had time to read some of them.


All of those are the same for us. Num CPUs and amount of RAM is the only difference.


To be fair, I've had to argue with a lot of managers prior to The Cloud about how the QA team was given shit hardware instead of identical hardware. The IT manager even had a concrete use case for identical hardware that I thought was for sure going to win me that argument but it didn't.

If you don't have enough identical hardware for pre-prod, then you probably don't have spare servers for production either. If you get flash traffic due to a news article, or one of your machines develops a hardware fault, then you have to order replacements. At best you might be able to pull off an overnight FedEx, but only if the problem happens in the morning.

If, however, you have identical QA hardware, you can order the new hardware and cannibalize QA. Re-image the machine and plop it into production. QA will be degraded for a couple of days but that's better than prod having an issue.

With the Cloud, the hardware is somewhat fungible, so you can generally pick identical hardware for preprod and prepare an apology if anyone even notices you've done it. If the nascent private cloud computing vendors manage to take off, they'll have to address that phenomenon or lose a lot of potential supporters at customer sites.


I'm sure there are clueless companies/managers that don't quite get it in infra land (and that are still great places/people to work for and products to work on) and if you find yourself in one of those situations, it's pretty rational to need prod if it's the only instance of your problem because of large divergences in the things you and the article mention. You're not wrong. But something that I've been a stickler on since our company's beginnings is that dev is really, as much as is feasible and useful, an exact copy of prod. And it's working so far. We have yet to scale to massive heights, I'll admit that. But it's a principle that I've seen more than a few companies simply neglect.


All teams have a test and a prod environment.

Some teams are lucky enough for those environments to be different.


I had a campus job where I was making a tiny update to the URL of a spreadsheet on one page. The sysadmin showed me how to merge my code, test in staging and then promote to live; he didn't show me how to roll back, ssh into the host, or restart it…

So I go in and I make my little spreadsheet update to seed the SQL table with my new data, and then I add my one line of code to the git repo to point to the new spreadsheet. Perfect, I see it's working fine in dev; I merge into mainline and promote to staging. I sit there and wait for the deployment - perfect, all my new data is there, the site is working perfectly, this is great, my Saturday sure is going well!

I promote from staging to prod, I go to the site - hey, this looks great, it's working... but wait, the data doesn't seem to be updated. Wait a minute, why is the page getting slower? Oh god, the whole website is down… yeah, that was a great Saturday for the sysadmin, who had specifically told me not to merge to prod on a Saturday, but I thought it was just a simple config update…

Since no code is completely bug free, we are always testing in prod to some degree, it’s an important lesson to learn and be prepared for.


> Some teams are lucky enough for those environments to be different

Many teams are lucky enough to have them separate, but unlucky because they are different.

Effectively, they only really test when they go to prod.


Oooh have my upvote. So true, especially in a global context.


Nice!


Here's another take from 2010 that shaped my thinking on this: https://imwrightshardcode.com/2010/12/theres-no-place-like-p...

(I. M. Wright is the pen name of a once-internal blogger at Microsoft, whose writings are now mostly public)

Our code has a pile of unit tests that run on every commit, but also a few dozen end-to-end/scenario/whatever tests that run against prod every five minutes or every hour, depending on the impact. I don't know how I'd ever sleep at night without this.


That article is great! Thanks for sharing.


Reminds me of what we ended up doing at a previous job.

We built in-house software for analyzing complex engineering systems. We had some unit tests and developer tests for basic sanity, but it's tough to cover really working with the app with those. We tried to set up a User Testing environment and have the normal engineer users test there, but we never could get them to spend enough time doing elaborate enough things to really find everything. After a while, we decided it was a waste of time and did all releases directly to Production, but only for a single client out of a dozen. It worked pretty well - we found the bugs in Production like before, but only for that one client, and were able to fix them before updating everyone, and we no longer wasted time doing pointless User Testing.

There's no general lesson IMO except to not get too tied down to any one process like it's a religion. Embrace things that add value, and abandon processes that don't add enough value in practice. If Production is the only place you can get real tests, then figure out how to make it reasonably safe to do that and embrace it.


I'm not really sure I like the term "test" for what you do in production, because ideally, tests should have well defined starting and ending states.

I feel like chaos engineering [1] is a better definition for what's being discussed here - constantly run experiments in prod, but do not consider them "tests". Making sure you _can_ run _useful_ experiments is the key.

1: https://principlesofchaos.org/


The most absolute simple form of production testing is simply trying to use the application to make sure all the parts seem to work based on what you'd expect from a glance. This seems quite distinct from randomly terminating a service or injecting random bytes to see what happens.


I think there's a gap between running tests without "defined starting and ending states" and chaos engineering. This would be closer to exploratory testing for me; however, rather than having a QA team run it, you're outsourcing it to the users at the cost of added risk. Everyone does this (it's called a release), and you can deal with it in two ways: have tracing and monitoring in place to collect the logs from your unpaid testers, or ignore it. Add techniques like blue-green or canary deployments and you somewhat reduce the risk cost. However, this cannot be a replacement for decent prior software testing practices.


In a multi-tenant SaaS application, it should be possible to have a predictable begin and end state for a test case. Once you get this, it's not too hard to have some canary tests periodically check the most important corners of the system.
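
A sketch of what such a canary can look like (the client and method names here are hypothetical): create a throwaway tenant, exercise the path you care about, and tear it down so the begin and end states match:

    import uuid

    def run_canary(api):
        # 'api' is a hypothetical client for your own multi-tenant service.
        tenant = api.create_tenant(name=f"canary-{uuid.uuid4()}")
        try:
            doc_id = api.create_document(tenant_id=tenant.id, title="canary doc")
            results = api.search(tenant_id=tenant.id, query="canary doc")
            assert any(r.id == doc_id for r in results), "write not visible to search"
        finally:
            # Restore the begin state: the canary tenant disappears entirely.
            api.delete_tenant(tenant.id)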


This should be renamed to "I let my users do the testing".

And what a great philosophy. Just don't apply for jobs in aviation, or anywhere critical.


Except there is a saying in aviation, "All the flight regulations and rules are written in blood."

Meaning that, over the past century, there was actually a ton of "testing in prod" in aviation. The industry learned many, many valuable lessons only after the billions and billions of hours of flight time exposed issues that weren't caught in preflight testing.


Your comment begins with "except", but the content agrees with the GP.

GP would like aviation users ("passengers") to not be the ones doing the testing. You state aviation users historically tested -- it's why flight regulations and rules are written in blood. I suspect the blood of passengers is exactly what the GP would like to avoid.

I can't tell if you didn't make the connection, or if your commenting style is just to naturally start negatively. My suspicion is the former, given the throwaway account.


there's no way to write regulations without some user production testing in aviation.

the more "failed tests" in production, the safer aviation becomes.

we give thanks to those who have fallen (literally, out of the sky).


Those aren’t tests in any meaningful sense. That’s just telemetry from an already finished product that failed.


Nope, that’s not the same thing at all. Those regulations are for required test coverage before things make it to prod. And they only happened in prod before the regulation because it was an unforeseen possibility.

You’re deluding yourself if you think the aviation industry just sells directly to customers new aircraft without testing them.

“Testing in prod” means, “I’m gonna try this thing and I’m not sure if it will work and my test environments don’t cover that so I’ll go straight to prod and see if it breaks.”

The software industry is the only place where that works because shipping to prod is literally faster than test suites in lots of cases and the stakes are so low on failure.

Unexpected catastrophic failures in prod != testing in prod.


It's quite clear the author of the article doesn't at all think that "testing in prod" means how you define it here. I'm not making any argument over what "testing in prod" should mean, but given how the author goes to great lengths explaining that he is not using it to mean "just throw stuff out in prod and let my users deal with it", I don't think it's fair to then say "Oh, keep this guy away from aviation because he tests in prod."


I think the snarkiness of this comment (and the replies, to be fair - I'm not just picking on you) glosses over an actually important point that you're making: testing strategies and infrastructure involve a hidden cost-benefit analysis that we could all benefit from being a little more cognizant of. Writing some internal tool, or need to ship tomorrow? Maybe you can skip that test. Building an airplane? Maybe not.


Quite the opposite. That philosophy is even more important in critical industries.

Because the article very clearly says that you should still test before going to prod, but that it's just as important to treat going to prod as another testing stage where you will find problems you couldn't find before - and that you need to be prepared for that so you can find them early.

Doing so is very much an important best practice in aviation - you monitor planes in production, and you try to make sure that problems which almost lead to disaster are reported and investigated, rather than hidden.


Even aviation does test in prod. They just do it after a large amount of testing beforehand. The point of the article is not that you shouldn't test before exposing new changes to production, but you should also test, monitor, and validate in production too. Understanding how things are performing in production is critical to identifying when things misbehave, and there's no getting around that for any large-scale system.


We are all testing in prod whether we observe the results or not.

That is, once you’ve shipped code. It’s live and getting used whether by you or someone else.

Of course you should do a ton of testing before you get to prod. But once there, you may as well be your own worst user before someone else can while you're asleep.


> We are all testing in prod whether we observe the results or not.

Production use is the gamma testing phase, or delta if you give clients a UAT release first. As with any test environment, you should observe closely.


dogma is dumb.

in some environments, the challenges are those of shipping velocity, scale and uptime. in others it's all about ensuring the highest levels of security/privacy for customer data or correctness of the system when running in production- sometimes with massive financial or even criminal penalties if something goes wrong.

neither set of problems is more important, more challenging, or makes you more special because you worked on them. they're different requirements and they require different approaches, if you're any good at all you'll recognize this and advocate for the correct tooling and approach for the job- rather than dogmatically pushing for some principle that may not apply.


I never "test" in prod. I "verify" in prod.


A couple of things you can do to sleep at night and get away with testing in production:

- Aim to catch most of the preventable stuff before it gets anywhere near production. That means integration tests, unit tests, static code analysis, code reviews and all the rest. Use whatever you can get your hands on; anything it catches is preventable. Not catching preventable issues is inexcusable. Life is hard enough without these issues spoiling your day.

- Keep your deltas small. That way there is a lot less that can go wrong. Like exponentially less. If you are sitting on several weeks/months of changes, you don't need to test it to find out it is broken. I can guarantee you it is. It's statistically extremely unlikely to not be broken at that point. So, avoid pushing big changes like that and ship smaller deltas in between.

- Push all the time. Practice makes perfect. This should be a routine action and it should not hurt you. Push with confidence rather than perpetual fear. Iterate. You should be updating production multiple times per day.

- Use defensive coding. Assume errors will happen and have some proper tools in place to diagnose why they happened when they happen. Like log aggregation and usable logging in your code. Implement mitigations for these errors too so you can do some damage control when they inevitably happen. The worst is having errors happen and not knowing that they are happening. With a large enough production system, even the most unlikely combination of things that would cause issues will have a high probability of eventually happening. So plan for that.

- Use feature flags and other means to isolate experimental code. That way you can test it in production without putting it on the critical path to your business (a minimal sketch follows this list).

- Automate your CI/CD. It's stupidly easy with stuff like Github Actions these days. Manual processes are the type of things that people can do wrong. So, the fewer you have of that the better.

- Keep your deployment process fast. When you inevitably break production, the time to recovery is that process. The worst is having to wait 30 minutes for a fix to go live while your users and managers are getting more angry by the minute. Much better if you can get the fix out before they even notice something was broken.

- When stuff breaks, reflect on why it broke and try to prevent further breakage similar to that. A simple test that reproduces the problem can go a long way.
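
On the feature-flag point above: the mechanism doesn't have to be a product to start with. A minimal sketch, with the flag source (an environment variable) and all names purely illustrative:

    import os

    def flag_enabled(name: str) -> bool:
        # Simplest possible source of truth: a comma-separated env var,
        # e.g. FEATURE_FLAGS="new_pricing,fast_search".
        return name in os.environ.get("FEATURE_FLAGS", "").split(",")

    def legacy_pricing(cart):
        return round(sum(cart), 2)         # stable path

    def new_pricing(cart):
        return round(sum(cart) * 0.95, 2)  # experimental path, behind the flag

    def price_quote(cart):
        # Turning the flag off instantly takes the experiment out of the
        # critical path, with no deploy required.
        return new_pricing(cart) if flag_enabled("new_pricing") else legacy_pricing(cart)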


Do you have any recommendation for all of those practices: books, talks, external vendors? Buy vs. build?


Ok, I'll add my anecdotal comment to the stack...

I work on large global identity management systems. These systems tend to be connected to many other systems in many datacentres and offices. These all have firewalls on the network connections and/or boxes. The systems need to exchange data using a myriad of protocols and authentication mechanisms.

No business is prepared to replicate this in Dev, or even Test. The best you can do is 'best efforts': test outside of Prod the stuff you can, and by heavily documenting the Prod topology, go into testing in Prod with good knowledge of where it might break, and have a break-fix window during deployment where you can hopefully resolve any blockers you hit. If you can't fix the release during the window, the whole system gets rolled back and you try again in a few more weeks, having performed a retrospective wrap-up on the failed deployment.


This is what we do at FastComments! Our entire e2e test suite runs on prod post-deploy, and we have tests that periodically hit different components to make sure they're working and live. A lot of these tests even create and teardown their own tenants.

For example, uptime robot hits a URL that will, on the server, connect to the websocket URI, subscribe to a channel, send a message, and wait for the response, and respond with OK to indicate that the pubsub system is up. Lots of things like this make sure we don't break things.
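
A stripped-down version of that kind of round-trip check might look like the sketch below (the message format and channel name are made up, not FastComments' actual code); the HTTP handler answers OK only when the live pubsub path echoes the message back in time:

    import asyncio
    import json
    import websockets  # third-party: pip install websockets

    async def pubsub_alive(uri: str) -> bool:
        try:
            async with websockets.connect(uri) as ws:
                await ws.send(json.dumps({"action": "subscribe", "channel": "health"}))
                await ws.send(json.dumps({"channel": "health", "msg": "ping"}))
                echoed = await asyncio.wait_for(ws.recv(), timeout=5)
                return "ping" in str(echoed)
        except Exception:
            return False  # any failure means the uptime checker sees a non-OK response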

Hell, our e2e test suite will even install the latest version of the WordPress plugin and go through the whole setup of linking the WP account with your FastComments account and doing the data import - on prod.

We test outside prod, too... but prod is the most important.


We have a bunch of our integration tests marked as prod-safe, and they automatically run after every deployment.


I read through half of the article thinking "This is like the Honeycomb[1] way of thinking" without realizing it was written by Charity Majors[2].

So, keep in mind that the article is biased towards their business - they are in the observability realm - and the long-term goal is making you think that you have to achieve super-deployment speed, which involves deploying to prod often and having a good system (Honeycomb) to know at any moment what's going on in production with real data. And if it breaks, you just have to roll back quickly.

Aaaand, I buy that. I mean, I think she totally has a point here: having a safe and quick way to production that makes you confident about rolling things out quickly, because you have a safe and fast path back, is a good thing for any software engineering team/organization. It's not easy to get there, and you still obviously need tests while developing/integrating, you need canary tests, you need to program in a defensive manner, etc., but not being scared of sending code to prod and "testing" how it really holds up under production pressure is something I'm personally trying to achieve at $DAYJOB.

[1] https://www.honeycomb.io/ [2] https://nitter.net/mipsytipsy


Pet Peeve #324: test environment that's needlessly different from prod environment because that's "simpler", making half of the important tests impossible.

Like, why waste time setting up rolling upgrades in a test environment? It's for test only. Just nuke whatever version was running and start the new version from git master. Much faster!

Thanks, I love the forward thinking, it's really faster. Now pray tell me how I test rolling upgrade?


Some things are hard to test when you are not in prod. In my experience you cannot effectively emulate the internet.


Please don’t tell me to TDD

I do “maybe test after code”, MTAC

-$stdout


After 30-some years in the field, here are my observations:

If it works absolutely perfectly in pre-prod, there's a 50% chance it'll fail in prod. Half the time there's a variable unaccounted for.

If it works well in pre-prod but I have a doubt, there's a 90% chance it'll fail in prod. If it doesn't fail right away, it'll fail when you least expect it.

Similarly, if it works, but I don't understand why it does, it'll break when I least expect it. Never leave working code alone if you don't understand it.

If it works OK in pre-prod, but performance could be better, it'll fail miserably in prod. Optimize in pre-prod, test the optimizations in prod.

A minor non-breaking bug in pre-prod becomes a major breaking bug in prod.

If it works in prod and not in pre-prod, sync.

If it works in pre-prod and not in prod, worry.

Edited for formatting


I feel like testing in prod builds the urgency needed to get things working faster. What would have taken a more winding path through regular unit tests gets to production quality faster via the more important error cases that surface in prod.

Too many times, the issues we predict are going to happen are not the most common ones, nor the ones we really should have foreseen (hah, kinda like the process of finding product-market fit). The best way to know what's wrong is to simply put it out there faster, and then apply persistent error correction to make it better (there's never a best).


One example I recently ran into: the integration of our new credit card payment processor failed after deploying to prod, because the test environment integrated with the payment processor's test environment - naturally, you don't want to pay real money every time you test that functionality.

But in their test environment they only ever loaded pages from their own domain into the iframe, whereas in prod it was pages from the card-issuing banks. And our CSP was not written to accommodate that, because you want it to be restrictive and not use wildcards.
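
Concretely, the trap is something like the first policy below, which passes against the processor's sandbox but blocks the issuer-hosted 3-D Secure pages in prod (the processor domain is hypothetical):

    # Fine in the test integration: only the processor's own checkout gets framed.
    CSP_TEST_SAFE = "frame-src https://checkout.example-psp.com"

    # Prod reality: the 3-D Secure step frames pages served by the card-issuing
    # banks, whose domains you can't enumerate, so some loosening is unavoidable.
    CSP_PROD = "frame-src https://checkout.example-psp.com https:"

    # e.g. response.headers["Content-Security-Policy"] = CSP_PROD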


When I worked for a bookmaker the test accounts were running on the real environment, so we gambled with real money during development. I liked it very much, because any amount of faked data wouldn't hit half the issues seen in the real world. Sadly we couldn't withdraw from those accounts.


Related, but there is also no replacement for querying over prod data to figure out the shape of your data. When you work with old systems, it is important to validate your assumptions by looking at what your prod data actually is. Don't just assume all data in some column is all lowercase, write a query looking for data with uppercase characters (or number of rows with uppercase characters if said column is PII).
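
A tiny (hypothetical) example of that kind of assumption check, counting offending rows rather than selecting them since the column may contain PII:

    # Read-only check against prod: does the "always lowercase" assumption hold?
    QUERY = """
        SELECT COUNT(*) AS uppercase_rows
        FROM customers                   -- table/column names are made up
        WHERE email <> LOWER(email)      -- needs a case-sensitive collation;
                                         -- use a regex or BINARY compare otherwise
    """
    # Zero rows: the assumption holds. Anything else: handle mixed case before shipping.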


When I started there were only production servers. I'm not that guy - though really, I am. I guess you could say that testing on dev is like getting halfway up a mountain: it feels like you are there, but really you are only halfway up, and nobody remembers the names of those who failed to get to the tippity top.


I think the important distinction is that some of us test locally, test on staging then test on production, and others ONLY test on production. No matter how hard you try, no test environment can truly replicate production, but that doesn't mean you should do away with staging completely.


>I test in prod.

I'm jealous unless management is looking ... "Huh! Prod? that's crazy."

Jokes aside our dev and beta envs are decent enough that prod surprises are fewish if it works pre-prod.



Technically any reasonable roll out / deployment strategy (canary, b/g, ...) is effectively a testing strategy as well.


GitLab team does it all the time.


Looking at the mass of crappy software all around us, I'm not surprised.


Remember, they're not users. They're gamma testers.


What's wrong with testing in both?


Why is this something to be proud of?


Nowhere in the article did the author say they were "proud." Did you read the full text? It's entirely reasonable.


Author, second line from headline:

> 'Testing in production is a superpower'

I wouldn't be surprised if most people stopped reading after that point.

> Did you read the full text?

https://news.ycombinator.com/newsguidelines.html

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."


And what in the article do you disagree with?


It's not, just like people who are proud of how little sleep they get, or how they lie and get away with it.


Good on you!




