Yep, that's the exact bullet point I was writing a response on. Security and abu...

blr246 · on July 12, 2019

Your response highlights a good idea to mitigate the risk I was trying to point out in mine.

They want to have a rapid response path (little to no delay in staging envs) to respond to emergencies. The old SOP allowed all releases to use the emergency path. If the new SOP no longer exercises it, I'd be concerned that it would break silently from some other refactor or change.

Your notion is to maintain the emergency rollout as a relaxation of the new SOP such that the time in staging is reduced to almost nothing. That sounds like a good idea since it avoids maintaining two processes and having greater risk of breakage. So, same logic but using different thresholds versus two independent processes.
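A rough sketch of what "same logic, different thresholds" could look like. The names (`RolloutPolicy`, `plan_rollout`) and all the numbers are made up for illustration; nothing here comes from the actual SOP.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RolloutPolicy:
    """Thresholds for one rollout flavor (hypothetical fields)."""
    name: str
    staging_soak_minutes: int
    canary_percent: int


# Assumed values, purely illustrative.
NORMAL = RolloutPolicy("normal", staging_soak_minutes=24 * 60, canary_percent=1)
EMERGENCY = RolloutPolicy("emergency", staging_soak_minutes=1, canary_percent=25)


def plan_rollout(policy: RolloutPolicy) -> List[str]:
    """Return the ordered rollout steps.

    Every policy goes through the same steps; only the numbers differ,
    so the emergency path is exercised by every normal release.
    """
    return [
        "deploy to staging",
        f"soak in staging for {policy.staging_soak_minutes} min",
        f"canary to {policy.canary_percent}% of production",
        "deploy to 100% of production",
    ]


if __name__ == "__main__":
    for policy in (NORMAL, EMERGENCY):
        print(policy.name, plan_rollout(policy))
```

Since the emergency rollout is just another `RolloutPolicy` value, a refactor that breaks it would most likely break the normal path too and get noticed right away.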

jsnell · on July 12, 2019

Right. The emergency path is either something you end up using always, or something you use so rarely that it gets eaten by bit-rot before it ever gets used[0]. So I think we're in full agreement on your original point. This was just an attempt to parse a working policy out of that bullet point.

[0] My favorite example of this had somebody accidentally trigger an ancient emergency config push procedure. It worked: it made a (pre-canned) global configuration change that broke everything. Since the change was made via this non-standard and obsolete method, rolling it back took ages. In theory it should have been trivial. But in practice, in the years since the functionality had been written (and never used), somehow all humans had lost the rights to override the emergency system.

jacques_chester · on July 13, 2019

My personal rule is that any code which doesn't get exercised at least weekly is untrustworthy. I once inherited a codebase with a heavy, custom blue-green deploy system (it made sense for the original authors). While we deployed about once a week, we set up CI to test the deployment every day.

Cold code is dead code.

solatic · on July 13, 2019

> Security and abuse are of course special little snowflakes, with configs that need to be pushed very fast, contrary to all best practices for safe deployments of globally distributed systems.

Once upon a time, I worked on a system where many values which would be statically defined in similar systems were instead put into a database table. This particular system didn't have a proper testing and deployment pipeline set up, so whereas a normal system would just change the static value at some hard-coded point in the code and quickly roll it out, this system needed to keep it in the database so that it would be changeable in between manual deployments (months or even years apart). The ability to change the user-facing value by changing the value in the database inflated the time it took to test a release, thus lengthening the time it took to release a new version, but well, it worked.

My point is that if security and abuse rules need to be rolled out quickly, then the system needs security and abuse systems where the entire range of security and abuse configurations (i.e. their types) is a testable part of the original pipeline. Then the configurations can safely be changed on the fly, so long as the changes type-check.

It's easy to understand why it's never been built though - you'd need both a security background and a Haskell-ish/type-theory kind of background. Best of luck finding people like that.
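To make the "changes must type-check" idea a bit more concrete, here is a minimal sketch. It uses runtime validation in Python as a stand-in for a real static type check, and everything in it (`Action`, `AbuseRule`, `parse_rule`, the 1..10,000 limit) is a hypothetical example, not a description of any real system.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The closed set of actions the pipeline has actually tested."""
    ALLOW = "allow"
    THROTTLE = "throttle"
    BLOCK = "block"


@dataclass(frozen=True)
class AbuseRule:
    action: Action
    max_requests_per_minute: int


def parse_rule(raw: dict) -> AbuseRule:
    """Accept a pushed config only if it falls inside the tested range.

    Raises ValueError for an unknown action or an out-of-range limit.
    """
    action = Action(raw["action"])  # ValueError if not a known action
    limit = raw["max_requests_per_minute"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("max_requests_per_minute must be an integer")
    if not 1 <= limit <= 10_000:  # assumed bound, for illustration only
        raise ValueError("max_requests_per_minute out of tested range")
    return AbuseRule(action, limit)


if __name__ == "__main__":
    print(parse_rule({"action": "block", "max_requests_per_minute": 10}))
```

The point of the design is that the release pipeline tests the system against every value `AbuseRule` can take, so a fast config push only has to pass `parse_rule` rather than a full staged rollout.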