Production-Oriented Development (medium.com/paulosman)
115 points by jgrodziski on Feb 23, 2020 | 62 comments



Right. Commenting on a very specific part of this with an anecdote. I did some integration with a ‘neo bank’ a few years ago where the CTO said testing and staging envs are a lie. I vehemently disagree(d), but they were paying, so we only tested on production. You can guess what happened (and not because of us; we spent 10k of our own money to build a simulator of their env, as I have some kind of pride). Testing was extremely expensive (it is a bank, so testing in production and having bugs actually loses money), and because of that they could not really test millions of transactions, so there were bugs, many bugs, in their system. They violated rules, and the company died and got bought for scraps.

I understand the sentiment, and I agree with points 2, 3 and 6 of this article, but the rest is, imho, actually dangerous in many non-startup cases.

Example: simple is always better IF you can apply it, but a lot of the companies and people you work with do not do simple. A lot of companies still have SOAP, CORBA or in-house protocols, and you will have to work with them. So you can shout from the rafters that simple wins; you will not get the project. That can be a deliberate decision, but I do not see many people who finally got into a bank/insurer/manufacturer/... go ‘well, your tech is not simple by my definition, so I will look elsewhere’.

It is a nice utopia and maybe it will happen when all legacy gets phased out in 50-100 years.


Thanks for your comment! But I don't think all legacy will ever get phased out.

Today's code will become tomorrow's legacy.


I agree; I meant 50-100 yrs with a bit of a wink, as obviously no one can reason about that period of time in software.


This article makes a lot of assumptions that only hold true in a very specific set of circumstances:

- that it’s possible for the team developing the product to deploy or monitor it (example cases where it isn’t: most things that aren’t web based such as desktop, most things embedded into hardware that might not yet exist etc.)

- that if you can deliver continuously, customers actually accept that you do. Customers may want big-bang releases every two years and reject the idea of the software changing in the slightest in between.

- not validating a deployment for a long time before it reaches customers is also only OK if the impact of a mistake is merely that you deploy a fix. If the next release window is a year away and/or if people are harmed by a faulty product then you probably want it painstakingly manually tested.

My point is: if you are a team developing and operating a product that is a web site/app/service and you are free to choose if and when to deploy, then most of the article is indeed good advice. But then you are also facing the simplest edge case among software deployment scenarios.


> that it’s possible for the team developing the product to deploy or monitor it (example cases where it isn’t: most things that aren’t web based such as desktop, most things embedded into hardware that might not yet exist etc.)

In these cases, you can have a pre-production embedded (in the sense of "embedded journalism") field test, where the developers come out to the production line and/or testing field to iterate on the software together with other departments + the final customers.

IIRC this is done often in military weapons testing—you'll often find the software engineer of a new UAV autonavigation system at the testing range for that system, doing live pair-debugging with the field operator.


> This article makes a lot of assumptions that only hold true in a very specific set of circumstances:

Yes. The assumption that you are working on a web based service is so core to this piece that it doesn't seem any more necessary to say "this doesn't work for desktop" than it would be to say "this doesn't work without internet".

Given that you are delivering software on the web, your customers are going to get changes to it and like it, because their other option is to run systems on the internet with known exploits. Customers who don't want changes host their own instance.

And if your next release is a year away and you have no way to roll back the release, but you have no manual validation - then you aren't following this advice to begin with, and you have an appallingly broken process.


Absolutely agree; my complaint wasn't that the advice was bad but that it lacks the specifier "what follows is good advice for this small subset of scenarios". I really dislike the phenomenon that "software development" without a qualifier has begun to imply web app/service development.


I generally agree with your points, with one exception:

> If the next release window is a year away and/or if people are harmed by a faulty product then you probably want it painstakingly manually tested.

No, your manual testing should only be for the things which are difficult to test automatically. But I think you should _always_ strive for extensive automatic testing. Even with hardware which doesn't yet exist, mocks are perfect for that.
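
As a rough illustration (a minimal sketch; the sensor interface, read_rate() and the threshold are made up, not from this thread), the not-yet-existing hardware can be stood in for with a mock so the surrounding logic still gets automated coverage:

    # Minimal sketch: testing logic for hardware that doesn't exist yet.
    # The sensor interface, read_rate() and the 5.0 threshold are hypothetical.
    from unittest import mock
    import unittest

    def alarm_needed(sensor) -> bool:
        # Business rule under test: alarm when the flow rate drops below 5.0.
        return sensor.read_rate() < 5.0

    class AlarmLogicTest(unittest.TestCase):
        def test_low_flow_triggers_alarm(self):
            fake_sensor = mock.Mock()
            fake_sensor.read_rate.return_value = 2.5  # simulate failing hardware
            self.assertTrue(alarm_needed(fake_sensor))

        def test_normal_flow_is_quiet(self):
            fake_sensor = mock.Mock()
            fake_sensor.read_rate.return_value = 7.0
            self.assertFalse(alarm_needed(fake_sensor))

    if __name__ == "__main__":
        unittest.main()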


Agreed. Whether something is manually or automatically tested is really an implementation detail, but it’s economically insane to not have computers do the testing to the largest extent possible.

It’s also possible to extend the argument to the relation between “compiler verification” and “test verification”. That is: don’t spend time writing tests for things a compiler could catch.
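
A small sketch of that last point (the Status enum and ship() function are invented for illustration): encode the valid states in a type and a type checker such as mypy flags a bad call, so no unit test for it is needed.

    # Sketch: let the type checker cover what a unit test otherwise would.
    # Status and ship() are illustrative names, not from the article.
    from enum import Enum

    class Status(Enum):
        DRAFT = "draft"
        REVIEWED = "reviewed"
        RELEASED = "released"

    def ship(status: Status) -> bool:
        # A test asserting that ship("relesed") fails is redundant here:
        # a type checker rejects the string argument before runtime.
        return status is Status.RELEASED

    print(ship(Status.RELEASED))  # True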


Cool article; I enjoyed the summary of relevant knowledge that's been passed around various circles.

I do disagree with:

> Environments like staging or pre-prod are a fucking lie.

You need an environment that runs on production settings but isn't production. Setting up an environment that ideally has read-only access to production data has saved a huge number of bugs from reaching customers, at least IME.

There's just so many classes of bugs that are easily caught by some sort of pre-prod environment, including stupid things like "I marked this dependency as development-only but actually it's needed in production as well".

Development environments are frequently so far removed from production environments that some sort of intermediary between the two is almost always helpful enough to me that the extra work involved in maintaining that staging environment is well worth it.

It's not the same as production obviously, but it's a LOT closer than development.


> You need an environment that runs on production settings but isn't production.

Why?

> Setting up an environment that ideally has read-only access to production data has saved a huge number of bugs from reaching customers, at least IME.

That's an anecdote, not a reason. Also, just because you've done it that way doesn't mean it has to be done that way, like you asserted.

> There's just so many classes of bugs that are easily caught by some sort of pre-prod environment

Also does not support the claim that you need a pre-prod env.

> Development environments

Whoa, there! You're sneaking yet another kind of environment into the conversation? Maybe not. This is unclear, given the many different ways that people do work.

> not the same as production obviously, but it's a LOT closer

You seem to want something like production. There is nothing more like production than production.

If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.
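
For example (a hedged sketch; the 1% split and bucket names are assumptions, not the commenter's setup), a canary can be as simple as deterministically routing a small, stable slice of users to the new build and watching it:

    # Rough sketch of contained testing in production via a canary split.
    import hashlib

    def variant_for(user_id: str, canary_percent: int = 1) -> str:
        # Hash the user id so the same user always lands in the same bucket.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return "canary" if bucket < canary_percent else "stable"

    for uid in ["alice", "bob", "carol"]:
        print(uid, variant_for(uid))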


> If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.

You seem to be assuming 1. some sort of large-horizontal-scale production system with multiple customers, where the impact of a failure can be minimized by minimizing the number of users exposed to new features, and where 2. there's no type of bug in the code that would potentially take down production as a whole.

What if your production system is, say, a bank's ACH reconciliation logic? A medical device? A car? The live server for a popular MMORPG? A telephone backbone switch? A television or radio broadcast station?

In these cases, your software isn't a service with multiple distinct customers that each make requests to it, where you can test your new code on one customer in a thousand; your software is just running and doing something—one, unified something per instance of the system (though that process may track multiple customers)—and if the code is wrong, then the whole system the software operates will fail.

How do you test software for such systems?

Usually by having a "production simulation" whose failure won't kill people or cost a million dollars in lost revenue.


Thank you for contrasting life building the latest social media platform with what many of the rest of us do.

Currently I work on systems to prepare and validate birth and death certificates for the state, counties, hospitals, et al, and this whole “throw it against the wall and see what sticks” methodology doesn’t fly. Nor would it have worked when preparing and presenting investment account information 5 years ago, nor the job 10 years ago processing lawsuit and insurance claim cases and legal bills. Nor any place that I at least have ever worked in the last 30 years.


Agree, and I work in the latest 'social media platform' type end. We have many customers. I can assure the author of the post that when those customers pay for enterprise licensing and their system is broken by an obvious bug, 'we didn't do any testing beforehand because staging is a lie' doesn't actually fare well as a valid excuse for anything. In fact, you just look like an unprepared and immature muppet.


Well in that case you're not talking about the original scenario where we were talking about whether or not to use a pre-production setup that mimicked production as closely as possible, are you?


> If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.

Basically you're outsourcing QA to your customer. Some systems may afford this, others not.


> Basically you're outsourcing QA to your customer.

You've just described any software ever used.


Not in any meaningful way is that true, no.


>> You need an environment that runs on production settings but isn't production.

> Why?

The unstated assumption of "staging" fans is that bad test coverage is a universal, naturally-occurring condition. At the companies where I've worked that had good tests, they did not have staging. At companies where I thought the test coverage was poor, they did have it.


The obvious answer is so you can test infrastructure changes and data migrations without impacting users.


In ye olde times that was the intent of "staging." Deploys were more expensive then. That's back when we talked about having some number of 9s instead of MTTR.


Some good points but some controversial ones.

I think a manual QA team is very valuable. Sure, the tests pass, but what if the UI is confusing or disorienting? QA can be user advocates in a way a unit test can't be. I work in games, so maybe it's just a squishier design philosophy, but you can't unit test fun.

I also don't understand the worry about other environments. If you're automating deployments how is another environment added work? Shouldn't it be just as easy to deploy to?


I have never worked in the game industry but I love your comment “you can’t unit test fun”.

There is definitely value in having both automated testing for repetitive stuff, AND, humans touching stuff to spot unspecified insanity.


I think the valuable purpose you are describing of QA is better achieved by having a UX team earlier in the pipeline.


We have both, and I'm really glad we do. A good QA team tests more than just "functionality" and "usability" -- they test the product in its entirety, and notice all sorts of things that non-QA people would miss. Our QA people also often poke the database and look at REST request/responses too. I think going without this kind of full-spectrum testing is just shooting yourself in the foot. You can totally do continuous delivery but still use a QA team.


Shameless plug: I'm working on a tool that creates on-demand staging environments for the purpose of getting your UX team to give feedback earlier in the dev cycle. https://featurepeek.com


I always liked Basecamp's approach on this, which was that every team working on a widget (or part of a widget), had to be composed of 2 developers for 1 designer, so the designer was continually involved in whatever they were working on to give a UX perspective.

https://basecamp.com/shapeup/2.2-chapter-08#team-and-project...


I do support constant deployments to the QA environment (also a no-no apparently). That can keep the QA team involved at all times. I wouldn't suggest waiting on large changes before having QA do a pass.


Literally constant? As in whilst attempting to replicate a bug the software could change out from underneath the tester? Would that complicate things or am I misunderstanding something about the process?


Hmm you're right. We CI to a dev environment that QA pulls in at their discretion.


Do you mean you CD to a dev environment?


yes


I think I disagree with this about 100%. Sure, production is what it is all about in the end. But how do you know the letters you just typed are going to be any good in production? They might just crash and burn there. That is why we need all those quality gates. The sooner and the farther removed from production that you discover a problem the easier it is to fix.

Regarding the ‘buy vs build’ point, I think buying software is one of the most risky things that you can do. Since it costs money, you cannot then say ‘o well, i guess it just did not pan out, let us just not use it’. Now you are kind of married to the software. And some of the worst software out there is paid for. E.g., jira vs. redmine. This is actually a bit ironic considering the fact that I actually am writing software in my job that is sold.... O well, it actually is sold as a part of a piece of hardware, so it is not really software as such.....

Regarding the last point, failure can be made uncommon if a relatively safe route to production is available, starting with a language that verifies the correct use of types, automated tests that verify the correctness of code, a testing environment that one attempts to keep close to what production is like, and so on. Getting a call that production is not working is the event that I am trying to prevent by all means possible, and I think research would be able to show that people who get fewer calls, not just about production failing but fewer calls on whatever subject in general, live longer and happier lives.


> Regarding the ‘buy vs build’ point, I think buying software is one of the most risky things that you can do. Since it costs money, you cannot then say ‘o well, i guess it just did not pan out, let us just not use it’. Now you are kind of married to the software.

It is usually way more costly and risky to develop your own. It's many hours spent on what is a separate product from your actual product, and you're way more married to it: you've just spent money, time and energy developing a custom homegrown solution. What are the chances you'll go "o well, i guess it just did not pan out, let us just not use it"? Very, very low.

So you end up spending more money and a significant amount of time/energy for a product that's probably subpar because there's no reason you'd do better than companies that are focused on this product.

I think buying software is one of the least risky things you can do: you know exactly how much money you have at risk, and you usually know pretty well what you're buying. You don't know how much money/time/energy it will take to make your own solution, and you don't know what result you'll get.


Regarding your last point: you weigh buying software over building it when you know how much it costs to buy and maintain, and have a strong grasp on how much time, money and energy it costs to build it yourself. That is how you make an informed judgement. Sure there is risk, but if you're burning 15K a year on a build server and you can build it yourself for 5k and run it for 1k a year, then the math doesn't lie about what choice you should make.


Often it's not you who's deciding on buy vs. build. The choice can be: Build, or be forced to use whatever trash some PHB was sold.


I think you're missing a couple things here.

One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.

I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term even in preventing bugs. For many reasons, but big among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs that you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.

The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]

To me, the attitude in, "Getting a call that production is not working is the event that I am trying to prevent by all means possible," sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices. And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.

[1] For those unfamiliar, I recommend Dekker's "Field Guide to Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...


One can talk about MTBF and MTTR, but not all failures are created equal, so maybe not all attempts to do statistics about them make sense. The main class of failures that I am worrying about regarding the MTTR is the very same observable problem that you solved last week occurring again due to a lack of quality gates. To the customer this looks like last week's problem was not solved at all despite promises to the contrary. If the customer is calculating MTTR he would say that the TTR for this event is at least a week. And I could not blame the customer for saying that. Since getting the same bug twice is worse than getting two different ones, it actually is quite great that quality gates defend against known bugs.

The blame vs reward issue to me sounds rather orthogonal to the one we are discussing here. If the house crumbles, one can choose to blame or not blame the one who built it, but independently of that issue, in that situation it is quite clear that it is not the time to attach pretty pictures to the walls. I.e., it certainly is not the time to do any improvement, let alone reward anyone for it. First the walls have to be reliable, and then we can attach pictures to them. The question of what percentage of my time I am busy repairing failures vs. what percentage I can spend writing new stuff seems to me more important than MTBF vs. MTTR.

I have to grant you that underneath what I write there is some fear going on, but it is not the fear of blame. It is the fear of finding myself in a situation that I do not want to find myself in, namely, the thing is not working in production and I have no idea what caused it, no way to reproduce it and I will just have to make an educated guess how to fix it. Note that all of the stuff that was written to provide quality gates is often also very helpful to reproduce customer issues in the lab. This way the quality gates can decrease MTTR by a very large amount.


> The main class of failures that I am worrying about regarding the MTTR is the very same observable problem that you solved last week occurring again due to a lack of quality gates. To the customer this looks like last week's problem was not solved at all despite promises to the contrary.

I think the quality gates mentioned in the article are the ones where you have a human approving a deployment. If you have an issue in production and you solve it you should definitely add an automated test to make sure the same issue doesn’t reappear. That automated test should then work as a gate preventing deployment if the test fails.
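
For instance (a sketch with a made-up parse_amount() standing in for whatever actually broke), the fix for last week's incident ships together with a test that encodes it, and CI refuses to deploy while that test is red:

    # Sketch of a regression test acting as an automated quality gate.
    # parse_amount() is a stand-in for the code behind the incident.
    import unittest

    def parse_amount(text: str) -> int:
        # Parse a money amount into cents; the old bug choked on "1,000.50".
        return int(round(float(text.replace(",", "")) * 100))

    class Incident1234Regression(unittest.TestCase):
        def test_thousands_separator(self):
            # The exact input that broke production; keep it green forever.
            self.assertEqual(parse_amount("1,000.50"), 100050)

    if __name__ == "__main__":
        unittest.main()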


I can't say I read every letter of the article, but it says so many strange and wrong things that I would not give it any benefit of the doubt of the sort 'but it cannot really actually say that, right?'


Problems don't occur due to a lack of quality gates. Quality gates are one way to fix problems, but are far from the only way. And, IMHO, far from the best way.

And I think the issue of blame is very much related to what you say drives this: fear. Fear is the wrong mindset with which to approach quality. Much more effective are things like bravery, curiosity, and resolve. I think if you dig in on why you experience fear, you'll find it relates to blame and experiences related to blame culture. That's how it was for me.

If you really want to know why bugs occur in production and how to keep them from happening again, the solution isn't to create a bunch of non-production environments that you hope will catch the kinds of bugs you expect. The solution is a better foundation (unit tests, acceptance tests, load tests), better monitoring (so you catch bugs sooner), and better operating of the app (including observability and replayability).


I am sorry but what you are saying really does not make much sense to me. You say quality gates are bad and instead we should have unit tests, acceptance tests and so on. Actually, unit tests and acceptance tests are examples of quality gates. And do note that the original article is down even on unit tests because they are not the production environment.

Then you say that e.g., bravery is better than fear. Well, there is fear right there inside bravery. I would be inclined to make up the equation bravery = fear + resolve.

And why are you pitting replayability against what I am saying? Replayability is a very good example of what I was talking about the whole time. I have written an application in the past that could replay its own log file. That worked very well to reproduce issues. I would do that again if the situation arose. Many of these replayed logs would afterwards become automated tests. The author of the original article would be against it, though. The replaying is not done in the production environment, so it is bad, apparently.
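
Something like this, roughly (a sketch only; the JSON-lines log format and the handle_request() entry point are assumptions about an app like the one described):

    # Sketch of replaying a copied production log to reproduce an issue locally.
    import json

    def handle_request(event: dict) -> dict:
        # Stand-in for the real application entry point.
        return {"status": "ok", "echo": event.get("path")}

    def replay(log_path: str) -> None:
        with open(log_path) as log:
            for line in log:
                event = json.loads(line)
                result = handle_request(event)
                print(event.get("path"), "->", result["status"])

    # Usage: replay("requests-2020-02-23.jsonl") against a copy of the prod log;
    # interesting replays can later be frozen into automated tests.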


I don't believe the original article is down on unit tests. He's very clearly down on manual tests and tests that are part of a human-controlled QA step. But he also says, "If you have manual tests, automate them and build them into your CI pipeline (if they do deliver value)." So he is in favor of automated tests being part of a CI pipeline.

And I'm saying that the things I listed are good ways to get quality while not having QA environments and QA steps in the process.

I also don't know where you get the notion that all debugging has to be done in production. If one can do it there, great. But if not, developers still have machines. He's pretty clearly against things like QA and pre-prod environments, not developers running the code they're working on.

So it seems to me you're mainly upset at things that I don't see in his article.


Buying non-open-source software is quite risky. Buying non-open-source software or using third-party services is only valuable if, when it fails, you can replace it easily.

Once you start relying on it a bit too much, it can hurt really badly if you are not able to fix issues by yourself, or if they decide to change the price later on. The worst is when a company whose software you were paying for goes out of business. You just have to start again.


He lost me at “non production environments are bullshit”.

In dev you can break almost anything, no biggie. In stage, if you break something, great, just don’t deploy it to prod. If you break something in prod, well ... you may end up going below SLA and may legit lose money and your customers’ trust.

Don’t YOLO into prod. Build reliable shit.


Non-production environments are useful for more than testing application code. Changing underlying infrastructure (Upgrading a database, networking shenanigans, messing around with ELB or Nginx settings) requires testing too. Having the same traffic / data shape in pre-prod is not as important.


Radical! Some counter-points, though.

- Infrastructure as code and schemas as code make it easier to keep environmental parity, because everything can be rolled back/forwards/reset with easy source control and CD operations. Visual environment diffing and drift detection can make this even easier.

- Make your stage and prod into a blue-green situation, where if stage is ready to go, you flip users onto it (there's a sketch of the flip after this list). I can guarantee your stage and prod will both be respected as prod then. Failing that, just add load/stress tests to stage to make it more prod-like.

- Non-prod environments and attention are not necessarily debt, but they are expensive insurance premiums. You should only pay those premiums if you need the insurance. It's about risk management.

- As time passes, the people who wrote a specific part of a system don't know it anymore, so having them babysit 'their' code in production has diminishing returns. On the other hand, having a systems quality team who have a broad mandate to bugfix, put in preventative measures, reduce technical debt, improve observability and establish good patterns for developers to do these things can enable these things to actually happen, when just telling devs who are busy making features that they should happen often doesn't make them happen. Also, there are devs who enjoy creating new things, and others who love trouble-shooting and metrics.
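
On the blue-green point above (a minimal sketch; the router dict merely stands in for a load balancer or DNS alias), the flip is just moving the live pointer, which is why both environments have to be kept production-grade:

    # Minimal sketch of a blue-green flip: "stage" becomes prod by moving
    # the live pointer, so neither environment can be second-class.
    router = {"live": "blue", "idle": "green"}

    def flip() -> None:
        # Promote the idle (staged) environment, demote the old live one.
        router["live"], router["idle"] = router["idle"], router["live"]

    print(router)  # {'live': 'blue', 'idle': 'green'}
    flip()         # the new release was deployed to green; users now hit it
    print(router)  # {'live': 'green', 'idle': 'blue'}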


What exactly would be the disadvantage of running something in a staging environment before running it "for real" in production? I'm assuming the staging environment is an exact clone of production (except reduced size: fewer app servers + smaller DB instance)?

I understand the deploy-often-and-rollback-if-there-is-a-problem strategy, but certain things like DB migrations and config changes are difficult to rollback, so doing a dry run in a staging environment seems like a good thing...
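
A rough sketch of that dry-run idea (the table, column and sqlite3 stand-in are assumptions; a real setup would run against a staging copy of the production schema): execute the migration inside a transaction and roll it back, just to prove the statements run cleanly.

    # Sketch of a migration dry run against a staging copy of the database.
    # Table and column names are made up; sqlite3 stands in for the real DB.
    import sqlite3

    def dry_run_migration(db_path: str) -> None:
        conn = sqlite3.connect(db_path, isolation_level=None)  # manual txn control
        conn.execute("BEGIN")
        try:
            conn.execute("ALTER TABLE accounts ADD COLUMN closed_at TEXT")
            conn.execute("UPDATE accounts SET closed_at = NULL")
            print("migration statements executed cleanly")
        finally:
            conn.execute("ROLLBACK")  # never commit: this is only a rehearsal
            conn.close()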


It's funny, in my career I've observed similar development styles. But I always just thought of this as great/good developers versus average/mediocre developers. The A+ coders would always make their code very easy to access, deploy from a user standpoint, debug, read, etc. The mediocre guys would wait for someone else to hit a landmine before fixing something that was obviously wrong.


While I do not agree with everything presented in this article (especially item #2), I definitely share the overall sentiment.

For some of our customers, we operate 2 environments which are both effectively production. The only real difference between these is the users who have access. Normal production allows all expected users. "Pre" production allows only 2-3 specific users who understand the intent of this environment and the potential damage they might cause. In these ideal cases, we go: local development -> internal QA -> pre production -> production actual. These customers do not actually have a dedicated testing or staging environment. Everyone loves this process who has seen it in action. The level of confidence in an update going from pre production to production is pretty much absolute at this point.

The amount of frustration this has eliminated is staggering. At least in cases where our customers allowed us to practice it. For many there is still that ancient fear that if we haven’t tested for a few hours in staging, the world will end. For others, weeks of bullshit ceremony can be summarily dismissed in favor of actually meeting the business needs directly and with courage. Hiding in staging is ultimately cowardice. You don’t want to deal with the bugs you know will be found in production, so you keep it there as long as possible. And then, when it does finally go to production, it’s inevitably a complete shitshow because you’ve been making months’ worth of changes built upon layers of assumptions that have never been validated against reality.

This all said, there are definitely specific ecosystems in which the traditional model of test/staging/prod works out well, but I find these to be rare in practice. Most of the time, production is hooked up to real-world consequences that can never be fully replicated in a staging or test environment. We've built some incredibly elaborate simulators and still cannot 100% prove that code passing on these will succeed in production against the real deal.


I've worked with a customer who also had a post-production environment. They used it for the sole purpose of being able to replicate problems and do root-cause analysis in case things went horribly wrong. Then they took a snapshot of prod, synced it to post-prod, hotfixed prod as fast as possible, and then did their detailed analysis in post-prod.

This wasn't cheap; they paid Oracle somewhere between 50k€ and 200k€ a year just for the database for this environment, but they considered it worth it. (They were also in a pretty tightly regulated vertical.)

My main takeaway is that I don't think there is a one-size-fits-all answer to the question of how many and what environments you need. IME having at least one "buffer" between dev and prod is a good thing, but I'm not sure to what extent my experience generalizes.


I agree with most of the points, but I have serious caveats on the first two.

1. No, the engineers should not by default be on call; the owners of the product are the first call line. If they're not engineers or if they're engineers but don't have enough time to deal with all incidents–in short, if they need to delegate–they better be willing to pay very generously for the extra hours of on call duty.

2. No, hosted is not better than open source, both for philosophical and operational reasons: mostly, you become subject to the whims of the provider. A good compromise is hosted open source solutions, which at least takes you half way to a migration, if the need for one comes up.

That aside, I very much agree on everything else.


I agree with most of this, but the point about QA gating deploys should be amended. A 5-minute integration test on a pre-flight box in the production environment by the deploying engineer is a form of QA, and can catch a lot of issues. It shouldn’t be considered an anti-pattern. Manually verifying critical paths in production before putting them live is about the best thing you can do to ensure no push results in catastrophic breakage.

Without such a preflight box, or automated incremental rollouts, you are kind of doing a Hail Mary, since you are exposing all users immediately to a system that has not been verified in production before going live.
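
For what it's worth, that kind of pre-flight check can be largely scripted too (a sketch; the host name and endpoint list are invented): hit the critical paths on the pre-flight box and abort the rollout on any failure.

    # Sketch of a quick pre-flight smoke check before exposing users to a deploy.
    # The endpoints and base URL are illustrative assumptions.
    import urllib.request

    CRITICAL_PATHS = ["/healthz", "/login", "/api/orders"]

    def smoke_check(base_url: str) -> bool:
        ok = True
        for path in CRITICAL_PATHS:
            try:
                with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                    status = resp.status
            except Exception as exc:
                print(f"{path}: FAILED ({exc})")
                ok = False
                continue
            print(f"{path}: HTTP {status}")
            ok = ok and status == 200
        return ok

    # Usage: abort the rollout if smoke_check("https://preflight.internal") is False.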


I agree with most everything said in the article but with a big condition. If I as an engineer am responsible for everything the author says I should be responsible for then I want total control of the tech stack and runtime environment.


Why would you ever share the Medium version with all the added crud when the author has a version of the post on his personal blog?


I'm sorry, but I'm tired of the Medium paywall to the point that I don't want to read anything there.



I couldn’t agree more.


This is just another Agile or DevOps.


> 2. Buy Almost Always Beats Build

Strongly disagree with that. Maybe it is a good idea when you are overfunded by VCs, where the cost of money is zero and you don’t want to master what you are working on, but in all other cases this is wrong. You shouldn’t rebuild everything from scratch, but creating a company is not the same as playing with LEGO.

And this is the same argument as saying you should have everything in AWS because if you self-host you will have to hire a devops engineer.


Could you expand more on why you disagree with this? Do you believe the opposite - that "Build Almost Always Beats Buy"?

I've made the build-vs-buy decision many times in my career. I don't necessarily regret /all/ of those times, but the general lesson I've learned time and time again is that you're going to end up investing WAY too much time maintaining your special version of X when you should have spent that time solving problems unique to your business model.


If you're building something that already reasonably exists, you better be sure you can do it 3x better (for some economic metric of better, e.g. cheaper, bringing in more revenue).

If not, you're wasting your money in a different way, by not focusing on the things that really bring in revenue or by paying salary to people to maintain it.




