
> Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages, and nothing else even seems close.

The funny thing about this conclusion is that configuration and code are not dramatically different concepts when you think about it. One of them is "data" and the other "code", but both affect the global behavior of the system. Config variables are often played up as being simpler to manage, but they're actually more complicated from an engineering standpoint, since code is still required to support each configuration option.

The process is what's dramatically different. "Write a story with acceptance criteria, get it estimated by engineers, get it prioritized by management, wait two weeks for the sprint to be over, wait for QA acceptance, deploy in the middle of the night," vs. "Just change this field located right here in the YAML file..."




Also, I can't speak for all companies, but where I work, configuration is how we define the differences between our test and production environments.

If your config files are intentionally different, because in test you should use authentication server testauth.example.com and in production you should use auth.example.com, then how can you avoid violating test-what-you-fly-and-fly-what-you-test?

Obviously, you could add an extra layer of abstraction (make the DNS config different between test and production and both environments could use auth.example.com) but that's just moving the configuration problem somewhere else :)
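
To make the "intentionally different" part concrete, here is a minimal sketch, assuming a shared base config plus per-environment overrides. The names `BASE`, `OVERRIDES`, `resolve`, `untested_keys`, and the `timeout_seconds` key are hypothetical; only the two hostnames come from the comment above. It lists exactly which keys production runs with values that were never exercised in test:

```python
# Hypothetical layout: one shared base config, plus small per-environment overrides.
BASE = {
    "auth_server": "auth.example.com",
    "timeout_seconds": "30",  # illustrative key, not from the thread
}
OVERRIDES = {
    "test": {"auth_server": "testauth.example.com"},
    "production": {},
}


def resolve(env):
    """Return the effective config for an environment (overrides win over base)."""
    return {**BASE, **OVERRIDES[env]}


def untested_keys(test_env="test", prod_env="production"):
    """Keys whose production value differs from what the test environment used."""
    test_cfg, prod_cfg = resolve(test_env), resolve(prod_env)
    all_keys = test_cfg.keys() | prod_cfg.keys()
    return sorted(k for k in all_keys if test_cfg.get(k) != prod_cfg.get(k))


if __name__ == "__main__":
    print(untested_keys())
```

This does not remove the gap, it only makes it visible: every key it reports is a value that flies without having been tested.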


This is how we do it: a) Have a regression test suite running continuously, and have alerts pop up when tests fail. Include a minimal set of config values in the regression suite and fire off alerts when they fail. b) Set up monitoring for your components and trigger alerts based on thresholds. c) With (a) and (b) set up, roll out your bits to a canary environment and, if all looks good, trigger a rolling deployment to your prod environment.
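
Steps (a) through (c) can be sketched roughly as follows. This is only an illustration under assumptions: the callables passed to `rollout` stand in for whatever regression runner, monitoring system, and deploy tooling you actually have, and the metric names are made up.

```python
from typing import Callable, Dict


def canary_healthy(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """True only if every monitored metric is reported and at or below its limit."""
    return all(
        name in metrics and metrics[name] <= limit
        for name, limit in thresholds.items()
    )


def rollout(
    run_regression: Callable[[], bool],
    deploy_canary: Callable[[], None],
    read_canary_metrics: Callable[[], Dict[str, float]],
    deploy_prod: Callable[[], None],
    thresholds: Dict[str, float],
) -> str:
    """Gate a production rollout on the regression suite and canary health."""
    # (a) the regression suite, including its config checks, must pass first
    if not run_regression():
        return "blocked: regression suite failed"
    # (c) ship to the canary environment before touching prod
    deploy_canary()
    # (b) compare canary monitoring data against the alert thresholds
    if not canary_healthy(read_canary_metrics(), thresholds):
        return "blocked: canary unhealthy"
    deploy_prod()
    return "deployed"
```

A missing metric counts as unhealthy here, on the assumption that a canary that stops reporting should block the rollout rather than pass by default.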


You automate the deployment, and that automation runs checks. If things don't work out, it refuses to deploy.
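
A minimal sketch of what "refuses to deploy" could look like, assuming the config is a flat dict of strings. The check functions and the `deploy` wrapper are hypothetical, and the rule that test hosts start with "test" is just an assumption borrowed from the testauth.example.com example elsewhere in the thread.

```python
from typing import Callable, Dict, List, Optional

# A check returns an error message, or None if the config passes.
Check = Callable[[Dict[str, str]], Optional[str]]


def require_keys(*keys: str) -> Check:
    """Build a check that fails when any of the given keys is missing or empty."""
    def check(config: Dict[str, str]) -> Optional[str]:
        missing = [k for k in keys if not config.get(k)]
        return f"missing keys: {', '.join(missing)}" if missing else None
    return check


def forbid_test_hosts(config: Dict[str, str]) -> Optional[str]:
    """Assumption: test hosts are recognizable by a 'test' prefix."""
    bad = [k for k, v in config.items() if v.startswith("test")]
    return f"test hosts in production config: {', '.join(bad)}" if bad else None


def run_checks(config: Dict[str, str], checks: List[Check]) -> List[str]:
    """Run every check and collect the error messages of the ones that fail."""
    return [err for err in (check(config) for check in checks) if err is not None]


def deploy(config: Dict[str, str], checks: List[Check],
           do_deploy: Callable[[Dict[str, str]], None]) -> bool:
    """Call do_deploy only if every check passes; otherwise report and refuse."""
    errors = run_checks(config, checks)
    if errors:
        for err in errors:
            print("refusing to deploy:", err)
        return False
    do_deploy(config)
    return True
```

The point of the design is that the person changing the YAML file never gets to skip the checks, because the only path to production goes through `deploy`.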



