
A couple of things you can do to sleep at night and get away with testing in production:

- Aim to catch most of the preventable stuff before it gets anywhere near production. That means integration tests, unit tests, static code analysis, code reviews and all the rest. Use whatever you can get your hands on; anything it catches is preventable. Not catching preventable issues is inexcusable. Life is hard enough without these issues spoiling your day.

- Keep your deltas small. That way there is a lot less that can go wrong, because the number of ways changes can interact grows much faster than the number of changes. If you are sitting on several weeks or months of changes, you don't need to test it to find out it is broken. It's statistically extremely unlikely to not be broken at that point. So avoid pushing big changes like that and ship smaller deltas in between.

- Push all the time. Practice makes perfect. This should be a routine action and it should not hurt you. Push with confidence rather than perpetual fear. Iterate. You should be updating production multiple times per day.

- Use defensive coding. Assume errors will happen and have proper tools in place to diagnose why they happened, like log aggregation and usable logging in your code. Implement mitigations for these errors too, so you can do some damage control when they inevitably happen. The worst is having errors happen and not knowing about them. With a large enough production system, even the most unlikely combination of things that would cause issues has a high probability of eventually happening. So plan for that.

- Use feature flags and other means to isolate experimental code. That way you can test it in production without putting it on the critical path to your business.

- Automate your CI/CD. It's stupidly easy with stuff like GitHub Actions these days. Manual steps are where people make mistakes, so the fewer of them you have, the better.

- Keep your deployment process fast. When you inevitably break production, your time to recovery is at least as long as that process takes. The worst is having to wait 30 minutes for a fix to go live while your users and managers are getting angrier by the minute. Much better if you can get the fix out before they even notice something was broken.

- When stuff breaks, reflect on why it broke and try to prevent similar breakage in the future. A simple test that reproduces the problem can go a long way.
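
To make the feature flag and defensive coding points a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `FLAGS` table, the `new_pricing` flag, and the pricing functions are hypothetical, not any vendor's API. It assumes a percentage rollout keyed on a hash of the user id, and it falls back to the old code path (and logs the failure) if the experimental one blows up.

```python
import hashlib
import logging

logger = logging.getLogger("pricing")

# Hypothetical flag table: flag name -> percentage of users on the new path.
# In a real system this would come from config or a flag service.
FLAGS = {"new_pricing": 10}


def is_enabled(flag: str, user_id: str) -> bool:
    """Bucket the user into 0..99 deterministically and compare with the rollout."""
    rollout = FLAGS.get(flag, 0)  # unknown flags are treated as off
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout


def old_price_cents(amount_cents: int) -> int:
    # Known-good path that stays on the critical path.
    return amount_cents * 120 // 100


def new_price_cents(amount_cents: int) -> int:
    # Experimental path; stricter about its input than the old one.
    if amount_cents < 0:
        raise ValueError("negative amount")
    return amount_cents * 118 // 100


def price_cents(amount_cents: int, user_id: str) -> int:
    """Try the experimental path for flagged users, fall back on any error."""
    if is_enabled("new_pricing", user_id):
        try:
            return new_price_cents(amount_cents)
        except Exception:
            # Log with a stack trace so the failure shows up in log aggregation.
            logger.exception("new_pricing failed for user=%s, falling back", user_id)
    return old_price_cents(amount_cents)
```

Setting the rollout to 0 acts as a kill switch without a deploy, and hashing the user id keeps each user on the same side of the flag between requests.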




Do you have any recommendations for these practices: books, talks, external vendors? Buy vs. build?



