Do the best you can to "find compute room" (laptop, desktop, spare servers on the rack that aren't being used, ... cloud), and make a Stage.
Make changes to Stage after doing a "Change Management" process (effectively, document every change you plan to make, so that an average person typing them out would succeed). Test these changes. It's nicer if you have a test suite, but you won't at first.
Once testing is done and considered good, make the changes on prod in accordance with the CM. Make sure everything has a back-out procedure, even if it is "drive to get the backups, and restore". But most of these should be "copy the config to /root/configs/$service/$date, then proceed to edit the live config"; backing out then entails restoring the backed-up config.
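A minimal sketch of that copy-then-edit workflow (the service name and config path are just examples, not a prescription):

    #!/usr/bin/env bash
    # Snapshot the live config before touching it, so the back-out
    # procedure is simply copying the saved file back into place.
    set -euo pipefail

    SERVICE="nginx"                                  # example service
    CONF="/etc/nginx/nginx.conf"                     # live config being changed
    BACKUP_DIR="/root/configs/$SERVICE/$(date +%F)"

    mkdir -p "$BACKUP_DIR"
    cp -a "$CONF" "$BACKUP_DIR/"                     # back up first

    "${EDITOR:-vi}" "$CONF"                          # then make the planned change

    # Back-out procedure, if the change goes sideways:
    #   cp -a "$BACKUP_DIR/nginx.conf" "$CONF" && systemctl reload nginx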
________________________
Edit: As an addendum, many shops that small usually have insufficient, non-existent, or Schrödinger backups. Having a healthy, living stage environment does 2 things:
1. You can stage changes so you don't get caught with your pants down on prod, and
2. It is a hot spare for prod in case prod catches fire.
In all likelihood, all of prod won't DIAF, but maybe the machine that houses the DB has a PSU problem that fries the motherboard. You at least have a hot machine, even if it's running stale data from yesterday's imported snapshot.
You missed one of the really nice points of having a stage there: you use it to test your backups by restoring from live every night/week. By doing that, you discourage developing on staging and you know for sure you have working backups!
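A rough sketch of that nightly restore-into-stage, assuming Postgres and hypothetical host/database names:

    #!/usr/bin/env bash
    # Nightly: rebuild the stage DB from last night's prod dump.
    # If this fails, the backup is broken and you find out now,
    # not in the middle of a disaster. Names and paths are hypothetical.
    set -euo pipefail

    BACKUP="/backups/prod/app_$(date +%F).dump"   # produced by pg_dump -Fc on prod

    dropdb   --host stage-db --if-exists app
    createdb --host stage-db app
    pg_restore --host stage-db --dbname app "$BACKUP"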
Indeed. But if it's just one guy who's the dev, I was trying to go for something that is rigorous, yet still very maintainable.
Ideally, you want test -> stage -> prod, with Puppet and Ansible running on a VM fabric. Changes are made to the stage and prod areas of Puppet, with the configuration management backed by Git or SVN or whatever for version control. Puppet manifests can be written and submitted to version control, with the guarantee that if you can code it, you know what you're doing. Ansible is there to run one-off commands, like kicking off a Puppet run (the default is check-ins every 30 min).
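For example, a one-off Puppet run pushed from Ansible might look like this ("stage" is a hypothetical inventory group; adjust to your own inventory):

    # Force an immediate Puppet run on the staging hosts instead of
    # waiting for the next 30-minute check-in.
    ansible stage -b -m command -a "puppet agent --test"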
And to be even safer, you have hot backups in prod. Anything that runs in a critical environment can have hot backups, or otherwise sit behind HAProxy. For small instances, even things like DRBD can be a great help. MySQL, Postgres, Mongo and friends all support master/slave replication or sharding.
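As a small illustration, assuming MySQL master/slave replication (and credentials in ~/.my.cnf), a cron-friendly check that the slave is actually keeping up could be:

    #!/usr/bin/env bash
    # Alert if the MySQL slave is broken or falls too far behind the
    # master. The threshold and mail address are hypothetical examples.
    set -euo pipefail

    LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')

    if [ -z "$LAG" ] || [ "$LAG" = "NULL" ] || [ "$LAG" -gt 300 ]; then
        echo "Replication broken or lagging: ${LAG:-unknown}s" \
            | mail -s "MySQL slave lag" ops@example.com
    fi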
Generally, get more machines running the production dataset and tools, so if shit goes down you have: machines ready to handle load, backup machines able to handle some load, full data backups onsite, and full data backups offsite. And the real ideal is that the data is available on 2 or 3 cloud compute platforms, so when the absolute worst-case scenario hits you can spin up VMs on AWS or GCE or Azure.
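The offsite copy doesn't have to be fancy; even just syncing last night's backups to object storage works (the bucket name and backup directory here are hypothetical):

    # Push the local backup directory to an offsite bucket as well.
    aws s3 sync /backups/ s3://example-offsite-backups/ --storage-class STANDARD_IA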
Our solution for Mongo is ridiculous, but the best for backing up. The Mongo backup util doesn't guarantee any sort of consistency, so either you lock the whole DB (!) or you have the DB change underneath you while you back it up... So we do LVM snapshots at the filesystem layer and back those up. It's ridiculous that Mongo doesn't have this kind of transactional backup apparatus. But we needed time-series data storage, and MongoDB was pretty much it.
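Roughly, that snapshot-based backup looks like this (VG/LV names, mount point, and backup path are hypothetical; it assumes the journal and data files live on the same volume):

    #!/usr/bin/env bash
    # Take a point-in-time copy of the Mongo data volume via an LVM
    # snapshot, then archive from the snapshot while the live DB
    # keeps running.
    set -euo pipefail

    # 1. Snapshot the data volume (needs free extents in the VG).
    lvcreate --snapshot --size 10G --name mongo-snap /dev/vg0/mongo-data

    # 2. Mount the snapshot read-only and archive it.
    #    (XFS needs "-o ro,nouuid" instead.)
    mkdir -p /mnt/mongo-snap
    mount -o ro /dev/vg0/mongo-snap /mnt/mongo-snap
    tar -czf "/backups/mongo_$(date +%F).tar.gz" -C /mnt/mongo-snap .

    # 3. Clean up; the snapshot only needs to exist for the copy.
    umount /mnt/mongo-snap
    lvremove -f /dev/vg0/mongo-snap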