I test in prod (increment.com)
301 points by rsie_above on Nov 23, 2021 | 123 comments



After reading a few smug comments, I've concluded that some of the folks in this thread have never worked on an application where the production scale is many, many orders of magnitude greater than preproduction environments. There's no substitute for "testing in prod." If you're not "testing in prod" with your really-big application, you're not testing enough.


> After reading a few smug comments...I've concluded that some of the folks in this thread have never worked on an application where the production scale is many, many orders of magnitude greater than preproduction environments.

I think your comment is smug as well. The author of the article has considerably overloaded the term "testing", which understandably gives a lot of people a knee-jerk reaction of "Don't do that". I get the impression that this is by design, and the article is intended to provoke discussion via flamewar.

I think the best way to combat this is to simply avoid too much discussion on such articles until they are rewritten to be more clear and less liable to cause flamewars. I have some views on the thesis of the article, and practical experience backing those views – but I simply won't express them because I don't want to get embroiled in the numerous minor arguments caused by the confusing terminology, with little to no actionable information. An example of a pattern that you'll see repeating across all the comments in this thread:

Person A: "We take our time and test the application thoroughly under all sorts of load in a preproduction environment. We run end-to-end tests on every build. We value work/life balance, and try to minimize testing-in-production."

Person B: "When did the article say not to do that? It never says you shouldn't test outside prod, it says you should also test in prod and that's a superpower! You're totally misunderstanding the article!"


> The author of the article has considerably overloaded the term "testing",

I would disagree. There is zero overloading of the term "testing" as that is an already extremely broad term of art that would seem to clearly apply to every example of production testing provided in the article.

> with little to no actionable information.

The article absolutely provides some actionable points and breaks them up under "Technical", "Cultural", and "Managerial".

> An example of a pattern that you'll see repeating across all the comments in this thread:

The article repeatedly covers the exact points mentioned in your example exchange, so the people you see having that exchange are those who at best only skimmed the article. I find that the light readers can sometimes dominate comment threads early, but those comments eventually do become outnumbered by the more interesting discussion. People who take the time to read carefully and think respond more slowly and thus tend to be back-loaded.

> I have some views on the thesis of the article, and practical experience backing those views – but I simply won't express them because I don't want to get embroiled in the numerous minor arguments

That is unfortunate. The only way the discussion improves is when people do take the time to state their views, even when they don't have the time to follow up on replies. The only way to combat vapid discussion is to plant the seeds of better conversation.


The article also mentions a few things that are usually not tested pre-production (e.g. timeouts, race conditions) but certainly could be with better integration test tooling.

The thing that bugged me was that these were not treated as a cost/benefit trade off but simply as a fait accompli.

Over time I've come to believe that an appreciation of nuance and cost/benefit tradeoffs are at the heart of effective testing, but culturally the practice is steeped in dogmatism and absolutism. This exhibits all of that - e.g. "control freak managers", "only one represents reality" and "saying not today to the gods of downtime".


I think your reply to my smug comment is also smug! It’s smugness all the way down!


It's possible, I did feel a bit self-satisfied after making it!

Maybe it applies upwards as well? The article is kind of all smug about "I test in prod" too :)


Full-stack smugness


Same. I'm processing fitness data 24/7 and I can easily repull and overwrite the normalization. There's no actual down side to testing in prod with this style of data syncing. In fact the dev versions of our oauth2 app on fitbit have only one test account so there's no real data for me to access. 99% of the edge case issues happen with real data only and you will never catch them without it. I only use the dev version for setting up new features not yet in prod but once I've confirmed basic workability the rest of the dev is on my admin or test accounts but with production data.


Exactly. Here we crunch loads of data, and by a lot, I mean our production databases run on a cluster with a node count in the three digits holding a four-digit number of terabytes of data. Good luck replicating that scale in preproduction environments.

We have loads of redundancy, and dedicated "test in production" machines/datacenters to test actual production loads on actual production-scale sets of machines.

Now, tests in production usually involve one pair of hands and two other people looking over their shoulder (dev + ops + DBA), and require a well-defined rollback procedure and a post-mortem. We still have an absurdly high SLA (99.999%).


Our production scale is just copied stage environments because our clients are trivially shardable and they only use our service from 9-5 m-f, but hoo boy are there a lot of them


Yeah, it's nice to be trivially shardable! Ours is not: stock market connections to 13 countries, with clients having global trading controls across these countries - so we can shard some parts and not others, and each country has its own specificities. Guaranteeing that a cross-market feature works for a client, if it impacts controls, means it is tested in prod: either after market close on mock exchanges provided by the various countries, or during trading hours on small orders.

No amount of testing in the bank has ever been able to spot the weirdest issues, so we continue while really trying hard to make the prod "pilots" (we try not to call these tests) as routine as possible. But we still find crazy issues, notably those only the client notices (i.e. wrong specs they couldn't prevalidate by plugging their monster system into our monster system in QA).


why can't you shard by clients (not regions)? Is there some reason why clients must know about other clients?


Let's imagine we build all the trading robots with one server per group of clients. That would definitely work - we probably can't pay for one set of servers (for HA) per client since we have hundreds - but then all these per-client servers have to queue up for the exchange line access, which has per-country controls (can't move the market more than x%, can't trade x or y stock at z time, whatever). So yes, in a way, at the end of the line the clients must "know about each other" so we can respect the 13 different law codes.

We traditionally split by exchange because it made the most sense when we started 30 years ago in Asia, but... we're more and more splitting clients into groups where we can and allocating them CPU power, indeed.

I envy the stock market people because while they must handle even higher volume, they can shard per stock itself and have just one jurisdiction.


cool, thanks for satisfying my curiosity. This sounds like a very interesting problem! Coming from an Elixir/Erlang background, I would probably architect it with clusters of "rate-limiting" backend agents sharded by exchange or jurisdictions and a cluster of client-sharded groups (in its own VPS, even) for client information. But yeah, it would be tough to migrate to such a setup from something more brownfield.


Some places you simply can't. Medical devices, airplanes, etc.

But non-critical services? Sure


They do something called acceptance testing, which is nothing but testing in prod.


Not quite. Nobody acceptance tests an airliner on a passenger flight. If we acceptance test a medical diagnostics device, we don't run the test the doctor later relies on with the DUT.


This is a common misconception about testing in prod. It’s not about the logic. To quote the article: “Once you deploy, you aren’t testing code anymore, you’re testing systems.”


I don't quite get the objection. For me, "testing in prod" is "we observe the actual running system, with production traffic and users directly interacting with it and its results being live". That's not quite what "acceptance testing" in the mentioned-above domains is. If you have an acceptance testing stage, prod follows it.


I think the argument breaks down when you have systems as opposed to objects.

You can't acceptance test "Google's search", because at a minimum doing so would require some kind of reverse proxy that is itself part of the system and can't be acceptance tested without...and turtles etc.

Another way of putting this might be that there isn't a good way to acceptance test "airplane manufacturing companies". There isn't a set of acceptance tests you can run to ensure that Boeing is performing to spec before having it build real airplanes.


Maybe. I don't feel particularly strongly either way about whether it can be applied to the kind of large services you are thinking of. But that also wasn't really the point of my comment; I merely rejected the claim that acceptance testing, in fields where it is done, is the same as testing in prod.


Perhaps the closest analogy would be deploying to a limited subset of users on closely monitored boxes.

It's clearly not a perfect analogy though: except perhaps if the test aircraft turned itself into a crater on the runway, there is very little one aircraft can do to affect the function of all the others.

In software it's hard to get this kind of guarantee: a new DELETE with a missing WHERE clause in your canary environment is going to take some time to clean up after (assuming you have backups, etc.).


Acceptance tests of aircraft are done by pilots sent by the owner, not the factory. They do some flying and follow a checklist which is, again, unique to the company. It's like sending out your software for evaluation: it is in production and it's evaluated by clients. You don't send testing/staging software for evaluation by clients.


You absolutely can and should test in prod in these cases.


We call that FDA testing


FDA does not test anything.


I'm not sure how this is a useful observation. Units don't test anything, but it's still called Unit Testing.


They test my patience!


Even with small applications that don’t have fatal consequences, you aren’t testing if you aren’t testing in production.

Especially not in a world where it is so easy to control traffic to your application so that 95% of users land on the stable version and 5% land on your staging version.
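
Concretely, one way to get that split is deterministic bucketing on a stable user id, so the same person always sees the same version. A minimal sketch (purely illustrative, not any particular tool's API):

    import hashlib

    CANARY_PERCENT = 5  # share of traffic sent to the staging/new version

    def bucket(user_id: str) -> str:
        # Hash a stable identifier so a given user always lands in the
        # same bucket instead of flip-flopping between versions.
        digest = hashlib.sha256(user_id.encode()).digest()
        slot = int.from_bytes(digest[:2], "big") % 100
        return "staging" if slot < CANARY_PERCENT else "stable"

    # bucket("user-42") returns "stable" for roughly 95% of ids.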

Hell, this is why you have LTS and non-LTS versions of the dev tools you use yourself.


As the old saying goes: Every team has a testing environment; some teams also have a completely separate production environment.


There is more than one way to define 'testing'. That's not responsible for all of the misunderstandings, but it's responsible for quite a few. For some people, feature toggles fall under testing in production. For other people, they don't.

It's not whether production is orders of magnitude greater than preprod. It's whether production is orders of magnitude more diverse. If you've designed your product so that every user is a snowflake, that's on you, or at least the sales team. That functionality surface area is all undocumented features. Bragging about not being able to test all of the features you've promised users is like the opposite of a humble brag. What would you call that, a Dunning-Kruger brag?

I've worked at a few places that didn't understand this. A couple eventually got it (begrudgingly, and with a hint of resentment). The other never did, and hemorrhaged money and good people.


Well, this is why it is a superpower. In applications at that scale, a bug is simply lost in the system.


That's why I usually develop in _prod_ (well _clients_)


"some of the"? or the one you already replied to?


The one I replied to, plus others. Hence, "some."


But do you edit code on prod? I did for years at our two-person high-traffic startup. 1 billion application requests per month for one of them, and orders of magnitude higher on the more recent biz. And I'd do it again. When you're a one-person dev and ops team with a smart partner to talk it through with, and where the cost of not making the rapid change is far higher than making a wrong change and rolling it back, do it. Every time. Fire up vim on the live box, edit and save. Different process if you're horizontally scaled, but similar idea.

I don’t do that anymore but I am working to get our dev team more in touch with operations now and to treat it less like the holy of holies and instead like a living breathing application that produces a ton of useful telemetry, even when it’s failing, that we can use to improve product and processes.


> But do you edit code on prod?

For a very small startup that seems like a good approach. I think the right approach differs greatly depending on the size of the company and the culture.

Having been employed mostly in medium-sized companies, I only do so for absolutely time- and business-critical issues. It gets management off my back so I can focus and fix it properly on my own time later.

For everything else. Nope. Everything needs to follow the proper processes. Nothing gets merged without code review. No, I can't directly deploy on prod. We will test on stage first.

You need to be firm or you end up with the slippery slope of "Can you just fix this small thing quickly?". No, sorry, please write a ticket we can discuss it next sprint. You can't let management know how easy deployments are these days, they must feel like it is a huge deal or they will mess up your whole sprint.

So the right approach has more to do with company politics than with what is technically best, and that is what is missing from this discussion. One of the most important priorities for an employed developer is to keep their peace of mind. At my current job I don't even have access to prod, so even those time-critical fixes are not possible. It absolutely slows me down sometimes in discovering and fixing certain bugs, but I sure as hell won't ask for access.


We all do it for urgent fixes; the thing is to try to make it less holy, to avoid "we can't fix this crushing bug before the next release window" stupidity, but still holy enough to avoid "I'll fix it now in prod and forget about that fix, which I'll overwrite next release window" :D


We don't all do it for urgent fixes... where I work, I couldn't do it even if I wanted to. I don't have access. I suppose there's someone somewhere who does have access to the containers running my code but I'm not sure who to ask and it would be easier to just merge a PR and then click the release button in the devops UI than to change code in prod.


I work at a small startup right now, but our motto is that if it's a one-line fix we can just commit to master and deploy to prod. Anything that requires adding new logic we run by another engineer, just to have another set of eyes look at it.


I work in a regulated bank, and yesterday night I had to have an auditor watch us deploy a new version for one user, trigger one order, and revert back. He'll then spend the next 3 days studying the changes (that we didn't test in front of him) to tell us whether or not we can deploy.

And god forbid we find a bug in prod; to roll back I suspect we'll need to retrigger an audit :D Next time you loudly ask your representatives to regulate banks even more, think of us poor assholes spending our time filling out mindless paperwork :D


We did at my first web dev job. It was a dozen e-commerce sites all tied to the same backend and we would regularly turn Adsense down for a month while we worked on a new layout for one of the sites.

The owner was semi-tech savvy but not enough to be efficient and when I tried to explain I could just copy the site to a new server and clone the database he’d say, “we don’t have time for that.” Was like working at the McDonalds of web development (I think it was 2010 or 2011) and I don’t recommend it.


To add to that: When the platform is down, editing in production is the fastest way to bring it back up.



The problem is not whether to test in prod or not, the problem is who gets prod access. When I started as a Linux sysadmin almost 20 years ago I did not get prod root access for a full year. The care and respect for production was drilled into me army style. Root access was considered a privilege one earns.

Oldish man rant: IT has been McDonaldised. People don't spend significantly more than one year at the same job, and companies expect that after a weeks-long induction new employees should ship code. Developers don't understand TCP/IP; they don't understand DNS, HTTP, databases, IO, OS process and memory management, etc. Of course you can't allow production access for "I just import the library and start coding". I even suspect K8s of being nothing more than a standardized way to arrange the ice box, potato fryer, and grill so anybody can start flipping burgers on day one.


I understand all those things? Everyone I work with does except for maybe memory management?

If you said assembly or Fortran or Cobol, sure. But most good developers know these concepts.


Your comment is a tautology.


I mean, not really?

There's an implicit frequency hint with the word "good" that implies it's not so rare as to be absent from a non-trivial portion of the population. If someone said to me "most good doctors can cure metastatic diffuse stomach cancer" and then later said "too bad only 1 in a million doctors are good", I'd roll my eyes pretty hard at them. But if the ratio is closer to 5%, then I'd say it's a pretty fair ratio.

Anyway, back to the original point. I think anyone that's dealt with even high level languages like Python or Ruby has managed their way through almost all of those things. They're not so rare.


I am saying that the logical proposition in your comment is inarguably true. You’ve classified good engineers as those that know the things that make good engineers good and then stated that most all good engineers know these things.

It sounds like you are now also saying you only work with good engineers and also that good acceptably applies to the set of engineers in the top 95th percentile. I don’t think my tautology comment contests anything you’ve said.

I would say your experience seems to be biased. In my experience it's usually a coin toss or two (top 50-25th percentile) whether an engineer just knows syntax or actually understands the machine they're programming. But that's just anecdotal, and it doesn't seem super apropos to quibble over where that line is drawn at a population level in this thread.


I tested in prod all the time. Except

a) I know how to fix 99% of issues I would cause

b) I explicitly call out that this is happening, indicate possible issues, and what to tell angry customers if they call

c) Take ownership for mistakes

This trust breaks down when people don't do any of these and just yeet whatever without knowing the consequences


Also, rules only matter when problems arise. If you don't care about processes and the error rate is low, you'll never have any issue.

But just because you can fix issues quickly doesn't mean you'll survive having one every day, so testing before prod in this case starts saving you money and time.


Everyone tests in prod, whether intentionally or not. Some teams acknowledge it and plan for it, and they are in a much better place than those who have supposedly complex testing procedures and environments but are not prepared for when their code meets the real world.


The question is not whether you test in prod, but whether you also take testing seriously elsewhere. Sure, sometimes problems will only happen when the rubber meets the road, or even (still metaphorically speaking) 1000 miles into the trip. That's life, but it doesn't excuse inflicting problems on users that could quite reasonably have been caught by other means.


Our dev and prod environments are literally the same code other than some operational stuff. When you need to test something before it’s continuously integrated on merge, you use the ephemeral feature environment that’s automatically created for you when you open a PR. This forces features to be done done when they’re merged unless they’re gated in some way.

And it makes us less spooked about deploying to prod because it happens all the time. It raises the bar for PR reviews because if you approve something broken, unless an automated test catches it, it’s going straight to prod. So most reviewers take the time to actually verify changes work like we’re all supposed to do but usually laze out on.

Since dev and prod are always the same, and ephemeral envs use dev resources (DBs), you know exactly what to expect and don’t have the cognitive overhead of keeping track of which versions of things are deployed where. If someone experiences an issue in prod it has always been instantly reproducible in dev. In those ways, we test in prod.


A major point of the article is that it's impossible to replicate a prod environment. A staging environment won't capture all the issues.


That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data. I’d argue more often than not it’s overwhelmingly difficult to maintain the discipline required to not have a crummy staging that doesn't in any way resemble prod, so I’m sympathetic. We deal with that by taking away the notion that you get to stage your changes anywhere persistent before they land in prod. Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. But it is a hell of a lot better than the type of staging environments I imagine drove the author to take such a stance. Like I said, we’ve yet to experience a prod-only bug that didn't reproduce in our dev env. So in that sense, anecdotally, the point does not hold.

Just to be a little more clear: I agree with the author that issues happen in prod that are unique to prod that you simply won’t catch pre-prod. And I agree with the hot take mantra that “testing in prod” is okay and not to be as frowned upon as people seem to think is trendy. But I’m also suggesting that instead of viewing the ability to test in prod as a badge of honor, it’s also possible to apply this mantra towards traditional notions of a staging environment. You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit the frequency and distances that staging systems diverge from prod and I wager you’d get much much further than the comfortable status quo of merging to staging, manual and automated integrating, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.


> That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data.

That's fair, and a decent way to make a staging environment, though as echoed elsewhere, the data itself can exercise your code in ways that uncover bugs. I also think this is more feasible on, say, a monolith setup vs. a sharded multi-cluster service that's integrated with manifold 3rd-party systems - but yes, if you can, you probably should have this kind of prod-replica staging as an adjunct to incremental canary rollouts, prod-safe testing suites, etc. And the article was explicitly suggesting in-prod testing should be an adjunct to non-prod testing.


You and others also have a point. I’m now thinking of ways we could seed our dev data to be maximally similar to prod. It’s all encrypted blobs though so it would mostly be about scale in our case. But your point is still taken.


> “dev” is a direct copy of prod but with different backing data

Then it's not a direct copy of prod. Many times it's the data that makes bugs appear.


Let me quote myself:

> Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. … Like I said, we’ve yet to experience a prod-only bug that didn't reproduce in our dev env. …

> You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit the frequency and distances that staging systems diverge from prod and I wager you’d get much much further than the comfortable status quo of merging to staging, manual and automated integrating, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.

I’ve seen plenty of staging envs that look nothing like prod, and that's what I'm calling the real sham.


That’s true but depends on what you’re building. We don’t have a million users or anything yet but I clone the prod db every month or so, change the passwords, and use that for testing. Before we had a staging db and a prod db but they’d diverge and staging would have almost no data while prod would be full of it.


How do you manage sensitive data with this workflow (i.e. do you do it manually every time, do you automate it, what scripts, etc.)?

I get changing passwords, but say that data leaks (whether through a vulnerability in the clone environment, or a dev gone rogue): how do you mitigate possible damage done to real users (since you did clone from prod)?

I ask not because I question your actions, but because I've been wanting to do something similar in staging env to allow practical testing, but I haven't had the chance to research how to do it "properly".


Not the parent post, but working in finance, multiple products had a "scrambling" feature which replaced many fields (names, addresses, etc) with random text, and that was used upon restoring any non-production environments. It's not proper anonymization since there are all kinds of IDs that still are linkable (account numbers, reference numbers) to identities but can't be changed without breaking all the processes that are needed even in testing, but it's a simple action that does reduce some risks.
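
For anyone rolling their own version of that, the core of such a scrambler is small. A rough sketch (table and column names are made up), leaving the linkable IDs alone exactly as described above:

    import random
    import string

    SCRAMBLE = {"name", "address", "email"}               # PII text, safe to randomize
    KEEP_AS_IS = {"account_number", "reference_number"}   # still linkable, but processes need them

    def scramble_text(value: str) -> str:
        # Same length as the original so layouts and validations still behave.
        return "".join(random.choice(string.ascii_letters) for _ in value)

    def scramble_row(row: dict) -> dict:
        return {
            col: scramble_text(val) if col in SCRAMBLE and isinstance(val, str) else val
            for col, val in row.items()
        }

    # Applied to every row as part of the restore into a non-production environment.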


> literally the same code

There’s a ton of stuff on this list that you bloody well can test in preproduction and you’re a damn fool (or you work for them) if you don’t/can’t.

    - A specific network stack with specific tunables, firmware, and NICs
    - Services loosely coupled over networks
    - Specific CPUs and their bugs; multiprocessors
    - Specific hardware RAM and memory bugs
    - Specific distro, kernel, and OS versions
    - Specific library versions for all dependencies
    - Build environment
    - Deployment code and process
    - Specific containers or VMs and their bugs
    - Specific schedulers and their quirks
That’s 40% of that list, or 5/8ths of the surface area of 2 problem interactions. CI/CD, Twelve Factor… you can fill an entire bookcase with books on this topic. Some of those books are almost old enough to drink. Someone whose by-line is “been on call half of their life” has had time to read some of them.


All of those are the same for us. Num CPUs and amount of RAM is the only difference.


To be fair, I've had to argue with a lot of managers prior to The Cloud about how the QA team was given shit hardware instead of identical hardware. The IT manager even had a concrete use case for identical hardware that I thought was for sure going to win me that argument but it didn't.

If you don't have enough identical hardware for pre-prod, then you probably don't have spare servers for production either. If you get flash traffic due to a news article, or one of your machines develops a hardware fault, then you have to order replacements. At best you might be able to pull off an overnight FedEx, but only if the problem happens in the morning.

If, however, you have identical QA hardware, you can order the new hardware and cannibalize QA. Re-image the machine and plop it into production. QA will be degraded for a couple of days but that's better than prod having an issue.

With the Cloud, the hardware is somewhat fungible, so you can generally pick identical hardware for preprod and prepare an apology if anyone even notices you've done it. If the nascent private cloud computing vendors manage to take off, they'll have to address that phenomenon or lose a lot of potential supporters at customer sites.


I'm sure there are clueless companies/managers that don't quite get it in infra land (and that are still great places/people to work for and products to work on) and if you find yourself in one of those situations, it's pretty rational to need prod if it's the only instance of your problem because of large divergences in the things you and the article mention. You're not wrong. But something that I've been a stickler on since our company's beginnings is that dev is really, as much as is feasible and useful, an exact copy of prod. And it's working so far. We have yet to scale to massive heights, I'll admit that. But it's a principle that I've seen more than a few companies simply neglect.


All teams have a test and a prod environment.

Some teams are lucky enough for those environments to be different.


I had a campus job where I was making a tiny update to the URL of a spreadsheet on one page. The sysadmin showed me how to merge my code, test in staging and then promote to live; he didn't show me how to roll back, ssh into the host, or restart it…

So I go in and I make my little spreadsheet update to seed the SQL table with my new data, and then I add my one line of code to the git repo to point to the new spreadsheet. Perfect, I see it's working fine in dev; I merge into mainline and promote to staging. I sit there and wait for the deployment - perfect, all my new data is there, the site is working perfectly, this is great, my Saturday sure is going well!

I promote from staging to prod, I go to the site - hey, this looks great, it's working... but wait, the data doesn't seem to be updated. Wait a minute, why is the page getting slower? Oh god, the whole website is down… yeah, that was a great Saturday for the sysadmin, who had specifically told me not to merge to prod on a Saturday, but I thought it was just a simple config update…

Since no code is completely bug free, we are always testing in prod to some degree, it’s an important lesson to learn and be prepared for.


> Some teams are lucky enough for those environments to be different

Many teams are lucky enough to have them separate, but unlucky because they are different.

Effectively, they only really test when they go to prod.


Oooh have my upvote. So true, especially in a global context.


Nice!


Here's another take from 2010 that shaped my thinking on this: https://imwrightshardcode.com/2010/12/theres-no-place-like-p...

(I. M. Wright is the pen name of a once-internal blogger at Microsoft, whose writings are now mostly public)

Our code has a pile of unit tests that run on every commit, but also a few dozen end-to-end/scenario/whatever tests that run against prod every five minutes or every hour, depending on the impact. I don't know how I'd ever sleep at night without this.


That article is great! Thanks for sharing.


Reminds me of what we ended up doing at a previous job.

We built in-house software for analyzing complex engineering systems. We had some unit tests and developer tests for basic sanity, but it's tough to cover really working with the app with those. We tried to set up a User Testing environment and have the normal engineer users test there, but we never could get them to spend enough time doing elaborate enough things to really find everything. After a while, we decided it was a waste of time and did all releases directly to Production, but only for a single client out of a dozen. It worked pretty well - we found the bugs in Production like before, but only for that one client, and were able to fix them before updating everyone, and we no longer wasted time doing pointless User Testing.

There's no general lesson IMO except to not get too tied down to any one process like it's a religion. Embrace things that add value, and abandon processes that don't add enough value in practice. If Production is the only place you can get real tests, then figure out how to make it reasonably safe to do that and embrace it.


I'm not really sure I like the term "test" for what you do in production, because ideally, tests should have well defined starting and ending states.

I feel like chaos engineering [1] is a better definition for what's being discussed here - constantly run experiments in prod, but do not consider them "tests". Making sure you _can_ run _useful_ experiments is the key.

1: https://principlesofchaos.org/


The most absolute simple form of production testing is simply trying to use the application to make sure all the parts seem to work based on what you'd expect from a glance. This seems quite distinct from randomly terminating a service or injecting random bytes to see what happens.


I think there's a gap between running tests without "defined starting and ending states" and chaos engineering. This would be closer to exploratory testing for me; however, rather than having a QA team run it, you're outsourcing it to the users at the cost of added risk. Everyone does this (it's called a release), and you can deal with it in two ways: have tracing and monitoring in place to collect the logs from your unpaid testers, or ignore it. Add techniques like blue-green or canary deployments and you somewhat reduce the risk cost. However, this cannot be a replacement for decent prior software testing practices.


In a multi-tenant SaaS application, it should be possible to have a predictable begin and end state for a test case. Once you get this, it's not too hard to have some canary tests periodically check the most important corners of the system.
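
A sketch of what such a canary can look like (the client and method names here are hypothetical): create a throwaway tenant, exercise the path you care about, and tear it down so the begin and end states match:

    import uuid

    def run_canary(api):
        # 'api' is a hypothetical client for your own multi-tenant service.
        tenant = api.create_tenant(name=f"canary-{uuid.uuid4()}")
        try:
            doc_id = api.create_document(tenant_id=tenant.id, title="canary doc")
            results = api.search(tenant_id=tenant.id, query="canary doc")
            assert any(r.id == doc_id for r in results), "write not visible to search"
        finally:
            # Restore the begin state: the canary tenant disappears entirely.
            api.delete_tenant(tenant.id)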


This should be renamed to "I let my users do the testing".

And what a great philosophy. Just don't apply for jobs in aviation, or anywhere critical.


Except there is a saying in aviation, "All the flight regulations and rules are written in blood."

Meaning that, over the past century, there was actually a ton of "testing in prod" in aviation. The industry learned many, many valuable lessons only after the billions and billions of hours of flight time exposed issues that weren't caught in preflight testing.


Your comment begins with "except", but the content agrees with the GP.

GP would like aviation users ("passengers") to not be the ones doing the testing. You state aviation users historically tested -- it's why flight regulations and rules are written in blood. I suspect the blood of passengers is exactly what the GP would like to avoid.

I can't tell if you didn't make the connection, or if your commenting style is just to naturally start negatively. My suspicion is the former, given the throwaway account.


there's no way to write regulations without some user production testing in aviation.

the more "failed tests" in production, the safer aviation becomes.

we give thanks to those who have fallen (literally, out of the sky).


Those aren’t tests in any meaningful sense. That’s just telemetry from an already finished product that failed.


Nope, that’s not the same thing at all. Those regulations are for required test coverage before things make it to prod. And they only happened in prod before the regulation because it was an unforeseen possibility.

You’re deluding yourself if you think the aviation industry just sells directly to customers new aircraft without testing them.

“Testing in prod” means, “I’m gonna try this thing and I’m not sure if it will work and my test environments don’t cover that so I’ll go straight to prod and see if it breaks.”

The software industry is the only place where that works because shipping to prod is literally faster than test suites in lots of cases and the stakes are so low on failure.

Unexpected catastrophic failures in prod != testing in prod.


It's quite clear the author of the article doesn't at all think that "testing in prod" means how you define it here. I'm not making any argument over what "testing in prod" should mean, but given how the author goes to great lengths explaining that he is not using it to mean "just throw stuff out in prod and let my users deal with it", I don't think it's fair to then say "Oh, keep this guy away from aviation because he tests in prod."


I think the snarkiness of this comment (and the replies, to be fair - I'm not just picking on you) glosses over an actually important point that you're making: testing strategies and infrastructure involve a hidden cost-benefit analysis that we could all benefit from being a little more cognizant of. Writing some internal tool, or need to ship tomorrow? Maybe you can skip that test. Building an airplane? Maybe not.


Quite the opposite. That philosophy is even more important in critical industries.

Because the article very clearly says that you should still test before going to prod, but that it's just as important to treat going to prod as another testing stage where you will find problems you couldn't find before - and that you need to be prepared for that so you can find them early.

Doing so is very much an important best practice in aviation - you monitor planes in production, and you try to make sure that problems which almost lead to disaster are reported and investigated, rather than hidden.


Even aviation does test in prod. They just do it after a large amount of testing beforehand. The point of the article is not that you shouldn't test before exposing new changes to production, but you should also test, monitor, and validate in production too. Understanding how things are performing in production is critical to identifying when things misbehave, and there's no getting around that for any large-scale system.


We are all testing in prod whether we observe the results or not.

That is, once you’ve shipped code. It’s live and getting used whether by you or someone else.

Of course you should do a ton of testing before you get to prod. But once there, you may as well be your own worst user before someone else can while you're asleep.


> We are all testing in prod whether we observe the results or not.

Production use is the gamma testing phase, or delta if you give clients a UAT release first. As with any test environment, you should observe closely.


dogma is dumb.

in some environments, the challenges are those of shipping velocity, scale and uptime. in others it's all about ensuring the highest levels of security/privacy for customer data or correctness of the system when running in production- sometimes with massive financial or even criminal penalties if something goes wrong.

neither set of problems is more important, more challenging, or makes you more special because you worked on them. they're different requirements and they require different approaches, if you're any good at all you'll recognize this and advocate for the correct tooling and approach for the job- rather than dogmatically pushing for some principle that may not apply.


I never "test" in prod. I "verify" in prod.


A couple of things you can do to sleep at night and get away with testing in production:

- Aim to catch most of the preventable stuff before it gets anywhere near production. That means integration tests, unit tests, static code analysis, code reviews and all the rest. Use whatever you can get your hands on; anything it catches is preventable. Not catching preventable issues is inexcusable. Life is hard enough without these issues spoiling your day.

- Keep your deltas small. That way there is a lot less that can go wrong. Like exponentially less. If you are sitting on several weeks/months of changes, you don't need to test it to find out it is broken. I can guarantee you it is. It's statistically extremely unlikely to not be broken at that point. So, avoid pushing big changes like that and ship smaller deltas in between.

- Push all the time. Practice makes perfect. This should be a routine action and it should not hurt you. Push with confidence rather than perpetual fear. Iterate. You should be updating production multiple times per day.

- Use defensive coding. Assume errors will happen and have some proper tools in place to diagnose why they happened when they happen. Like log aggregation and usable logging in your code. Implement mitigations for these errors too so you can do some damage control when they inevitably happen. The worst is having errors happen and not knowing that they are happening. With a large enough production system, even the most unlikely combination of things that would cause issues will have a high probability of eventually happening. So plan for that.

- Use feature flags and other means to isolate experimental code. That way you can test it in production without putting it on the critical path to your business (a minimal sketch follows this list).

- Automate your CI/CD. It's stupidly easy with stuff like Github Actions these days. Manual processes are the type of things that people can do wrong. So, the fewer you have of that the better.

- Keep your deployment process fast. When you inevitably break production, the time to recovery is that process. The worst is having to wait 30 minutes for a fix to go live while your users and managers are getting more angry by the minute. Much better if you can get the fix out before they even notice something was broken.

- When stuff breaks, reflect on why it broke and try to prevent further breakage similar to that. A simple test that reproduces the problem can go a long way.
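
On the feature-flag point above: the mechanism doesn't have to be a product to start with. A minimal sketch, with the flag source (an environment variable) and all names purely illustrative:

    import os

    def flag_enabled(name: str) -> bool:
        # Simplest possible source of truth: a comma-separated env var,
        # e.g. FEATURE_FLAGS="new_pricing,fast_search".
        return name in os.environ.get("FEATURE_FLAGS", "").split(",")

    def legacy_pricing(cart):
        return round(sum(cart), 2)         # stable path

    def new_pricing(cart):
        return round(sum(cart) * 0.95, 2)  # experimental path, behind the flag

    def price_quote(cart):
        # Turning the flag off instantly takes the experiment out of the
        # critical path, with no deploy required.
        return new_pricing(cart) if flag_enabled("new_pricing") else legacy_pricing(cart)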


Do you have any recommendation for all of those practices: books, talks, external vendors? Buy vs. build?


Ok, I'll add my anecdotal comment to the stack...

I work on large global identity management systems. These systems tend to be connected to many other systems in many datacentres and offices. These all have firewalls on the network connections and/or boxes. The systems need to exchange data using a myriad of protocols and authentication mechanisms.

No business is prepared to replicate this in Dev, or even Test. The best you can do is 'best efforts': test outside of Prod the stuff you can, and by heavily documenting the Prod topology, go into testing in Prod with good knowledge of where it might break, and have a break-fix window during deployment where you can hopefully resolve any blockers you hit. If you can't fix the release during the window, the whole system gets rolled back and you try again in a few more weeks, having performed a retrospective wrap-up on the failed deployment.


This is what we do at FastComments! Our entire e2e test suite runs on prod post-deploy, and we have tests that periodically hit different components to make sure they're working and live. A lot of these tests even create and teardown their own tenants.

For example, uptime robot hits a URL that will, on the server, connect to the websocket URI, subscribe to a channel, send a message, and wait for the response, and respond with OK to indicate that the pubsub system is up. Lots of things like this make sure we don't break things.
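
A stripped-down version of that kind of round-trip check might look like the sketch below (the message format and channel name are made up, not FastComments' actual code); the HTTP handler answers OK only when the live pubsub path echoes the message back in time:

    import asyncio
    import json
    import websockets  # third-party: pip install websockets

    async def pubsub_alive(uri: str) -> bool:
        try:
            async with websockets.connect(uri) as ws:
                await ws.send(json.dumps({"action": "subscribe", "channel": "health"}))
                await ws.send(json.dumps({"channel": "health", "msg": "ping"}))
                echoed = await asyncio.wait_for(ws.recv(), timeout=5)
                return "ping" in str(echoed)
        except Exception:
            return False  # any failure means the uptime checker sees a non-OK response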

Hell, our e2e test suite will even install the latest version of the WordPress plugin and go through the whole setup of linking the WP account with your FastComments account and doing the data import - on prod.

We test outside prod, too... but prod is the most important.


We have a bunch of our integration tests marked as prod-safe, and they automatically run after every deployment.


I read through half of the article thinking "This is like the Honeycomb[1] way of thinking" without realizing it was written by Charity Majors[2].

So, keep in mind that the article is biased towards their business - they are in the observability realm - and the long-term goal is making you think that you have to achieve super-deployment speed, which involves deploying to prod often and having a good system (Honeycomb) to know at any moment what's going on in production with real data. And if it breaks, you just have to roll back quickly.

Aaaand, I buy that. I mean, I think she totally has a point here: having a safe and quick way to production that makes you confident about rolling things out quickly, because you have a safe and fast path back, is a good thing for any software engineering team/organization. It's not easy to get there, and you still obviously need tests while developing/integrating, you need canary tests, you need to program in a defensive manner, etc., but not being scared of sending code to prod and "testing" how it really holds up under production pressure is something I'm personally trying to achieve at $DAYJOB.

[1] https://www.honeycomb.io/ [2] https://nitter.net/mipsytipsy


Pet Peeve #324: test environment that's needlessly different from prod environment because that's "simpler", making half of the important tests impossible.

Like, why waste time setting up rolling upgrades in a test environment? It's for test only. Just nuke whatever version was running and start the new version from git master. Much faster!

Thanks, I love the forward thinking, it's really faster. Now pray tell me how I test rolling upgrade?


Some things are hard to test when you are not in prod. In my experience you cannot effectively emulate the internet.


Please don’t tell me to TDD

I do “maybe test after code”, MTAC

-$stdout


After 30-some years in the field, here are my observations:

If it works absolutely perfectly in pre-prod, there's a 50% chance it'll fail in prod. Half the time there's a variable unaccounted for.

If it works well in pre-prod but I have a doubt, there's a 90% chance it'll fail in prod. If it doesn't fail right away, it'll fail when you least expect it.

Similarly, if it works, but I don't understand why it does, it'll break when I least expect it. Never leave working code alone if you don't understand it.

If it works OK in pre-prod, but performance could be better, it'll fail miserably in prod. Optimize in pre-prod, test the optimizations in prod.

A minor non-breaking bug in pre-prod becomes a major breaking bug in prod.

If it works in prod and not in pre-prod, sync.

If it works in pre-prod and not in prod, worry.

Edited for formatting


I feel like testing in prod builds the urgency needed to get things working faster. What would have taken a more winding path through regular unit tests gets to production quality faster via the more important error cases that surface in prod.

Too many times, the issues we predict are going to happen are not the most common ones, nor the ones we really should have foreseen (hah, kinda like the process of finding product-market fit). The best way to know what's wrong is to simply put it out there faster, and then apply persistent error correction to make it better (there's never a best).


One example I recently ran into: the integration of our new credit card payment processor failed after deploying to prod, because the test environment integrated with the payment processor's test environment - naturally, you don't want to pay real money every time you test that functionality.

But in their test environment they only ever loaded pages from their own domain into the iframe, whereas in prod it was pages from the card-issuing banks. And our CSP was not written to accommodate that, because you want it to be restrictive and not use wildcards.
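
Concretely, the trap is something like the first policy below, which passes against the processor's sandbox but blocks the issuer-hosted 3-D Secure pages in prod (the processor domain is hypothetical):

    # Fine in the test integration: only the processor's own checkout gets framed.
    CSP_TEST_SAFE = "frame-src https://checkout.example-psp.com"

    # Prod reality: the 3-D Secure step frames pages served by the card-issuing
    # banks, whose domains you can't enumerate, so some loosening is unavoidable.
    CSP_PROD = "frame-src https://checkout.example-psp.com https:"

    # e.g. response.headers["Content-Security-Policy"] = CSP_PROD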


When I worked for a bookmaker the test accounts were running on the real environment, so we gambled with real money during development. I liked it very much, because any amount of faked data wouldn't hit half the issues seen in the real world. Sadly we couldn't withdraw from those accounts.


Related, but there is also no replacement for querying over prod data to figure out the shape of your data. When you work with old systems, it is important to validate your assumptions by looking at what your prod data actually is. Don't just assume all data in some column is all lowercase, write a query looking for data with uppercase characters (or number of rows with uppercase characters if said column is PII).
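
A tiny (hypothetical) example of that kind of assumption check, counting offending rows rather than selecting them since the column may contain PII:

    # Read-only check against prod: does the "always lowercase" assumption hold?
    QUERY = """
        SELECT COUNT(*) AS uppercase_rows
        FROM customers                   -- table/column names are made up
        WHERE email <> LOWER(email)      -- needs a case-sensitive collation;
                                         -- use a regex or BINARY compare otherwise
    """
    # Zero rows: the assumption holds. Anything else: handle mixed case before shipping.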


When I started there were only production servers. I'm not that guy - though really, I am. I guess you could say that testing on dev is like getting halfway up a mountain: it feels like you are there, but really you are only halfway up, and nobody remembers the names of those who failed to get to the tippity top.


I think the important distinction is that some of us test locally, test on staging then test on production, and others ONLY test on production. No matter how hard you try, no test environment can truly replicate production, but that doesn't mean you should do away with staging completely.


>I test in prod.

I'm jealous unless management is looking ... "Huh! Prod? that's crazy."

Jokes aside our dev and beta envs are decent enough that prod surprises are fewish if it works pre-prod.



Technically any reasonable roll out / deployment strategy (canary, b/g, ...) is effectively a testing strategy as well.


GitLab team does it all the time.


Looking at the mass of crappy software all around us, I'm not surprised.


Remember, they're not users. They're gamma testers.


What's wrong with testing in both?


Why is this something to be proud of?


Nowhere in the article did the author say they were "proud." Did you read the full text? It's entirely reasonable.


Author, second line from headline:

> 'Testing in production is a superpower'

I wouldn't be surprised if most people stopped reading after that point.

> Did you read the full text?

https://news.ycombinator.com/newsguidelines.html

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."


And what in the article do you disagree with?


It's not, just like people who are proud of how little sleep they get, or how they lie and get away with it.


Good on you!




