If you go that route, I also recommend using something like Ansible. Unfortunately, it’s more things to learn and more to set up, but it’s worth it long term.
The problem is that you will set up your first VPS by spending hours installing packages and tweaking configs by hand. But eventually you will need to upgrade the OS, or switch to bigger hardware, or change VPS providers. And then it’ll be a nightmare to recreate all that snowflake config elsewhere without forgetting some detail.
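The fix is to capture those hours of hand-tweaking in a playbook you can re-run anywhere. A minimal sketch of what that looks like (package names, file paths, and the `vps` host group are all illustrative, not from any real setup):

```yaml
# site.yml -- illustrative sketch, not a drop-in config
- hosts: vps
  become: true
  tasks:
    - name: Install the packages you'd otherwise apt-get by hand
      ansible.builtin.apt:
        name: [nginx, ufw, fail2ban]
        state: present
        update_cache: true

    - name: Deploy the nginx config from the repo, not from memory
      ansible.builtin.copy:
        src: files/nginx.conf
        dest: /etc/nginx/nginx.conf
      notify: reload nginx

  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

Rebuilding on new hardware then becomes `ansible-playbook -i inventory site.yml` instead of a day of archaeology.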
The upside is that it’s cheap and fast, especially compared to anything “serverless” that charges per request.
We used to manage 500+ servers with Ansible for almost 10 years. It was a nightmare.
With so many servers, the Ansible script would occasionally fail on some of them (weird bugs, network issues, ...). Since the operations weren't always atomic, we couldn't just re-run the script; it required fixing things manually.
Because of this, plus emergency patches/fixes made on individual servers, we ended up with slightly different setups across the servers. This made debugging and upgrading a nightmare: can this bug happen on all the servers, or just this one because it has a different minor version of package 'x'?
We switched to NixOS. It had a steep learning curve for us, with lots of doubts about whether it was the right decision. Converting all the servers to NixOS was a huge 2-year task.
Having all the servers run the same configuration, committed to GitHub, fully reproducible and tested in CI, on top of automatic server updates done with GitHub Actions, was worth all the trouble we had learning NixOS.
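For anyone who hasn't seen it: a NixOS server is described in a single declarative file, and every machine built from that file gets the same package closure. A minimal sketch (hostname, services, and ports here are illustrative, not our actual config):

```nix
# configuration.nix -- illustrative sketch
{ config, pkgs, ... }:
{
  networking.hostName = "web-01";

  services.nginx.enable = true;
  services.openssh.enable = true;
  services.openssh.settings.PasswordAuthentication = false;

  networking.firewall.allowedTCPPorts = [ 80 443 ];

  # Machines built from this file get the same packages at the same
  # versions, so "different minor version of package x" can't drift in.
  system.stateVersion = "24.05";
}
```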
I realize Ansible is kinda slow and can be flaky, and wouldn't use it for 500 servers. However, for one beginner VPS I think it's fine.
The fact that it's not hermetic and perfectly reproducible is a major problem for a fleet, but for a single user it's a benefit. It offers a graceful migration path from a snowflake server to a managed server, and it still works even if you can't manage to automate 100% of the config.
I don’t think Ansible is a good solution. It’s kind of a meh middle ground between a reproducible system (which is hard to actually achieve with Ansible) and simply using backups/snapshots. Ansible introduces a lot of overhead, but does not provide particularly strong reproducibility guarantees. There’s no real advantage to treating config differently from data unless you want to use revision control (and simple backups are not enough) or have some tricky config generation to do.
In a lot of cases, especially if you are going to be moving between machines rather than scaling horizontally, your machines don’t need to be reproducible. You can just restore a backup and tweak the new bits (network configuration, etc.). This gets rid of an entire class of complexity and lets you just do things once (rather than figuring something out and then encoding it into a config-management DSL).
If you actually need reproducibility, then it makes sense to go all the way. Something like NixOS provides stronger guarantees than Ansible. There is definitely a larger learning curve than Ansible though.
Quick question on the best method for doing a total manual backup/clone of a VM and moving it to another (whether to another provider, downsizing, upsizing, etc.)?
A link or a couple of words would be great -- might save hours.
I’d strongly recommend taking frequent (automated) full backups if not using config management. With config management, you can maybe get away with only backing up application data. The best way to do it is to use something like rsync/borg in combination with ZFS/Btrfs. The idea is to take a snapshot of the filesystem and then back up the snapshot, which handles any consistency problems.
For low load systems, just using rsync/Borg by itself may be enough. The advantage of Borg is incremental, deduplicated backups.
rsnapshot is not what I meant by snapshots. I meant filesystem level snapshots that are atomic. It is an alternative to rsync/Borg, but it does not handle consistency properly (similar to rsync/Borg). It does not solve the snapshotting part of the problem.
So should I just use dd or something? Or can I actually just rsync from one root to another, and that will capture all of the dependencies, nginx config, literally everything? GPT-4 tells me I can just shut off everything that might change the filesystem on the source (nginx, sshd, etc.) and then literally rsync from root to root with
sudo rsync -aAXv --exclude-from=exclude-list.txt --delete --numeric-ids --rsync-path="sudo rsync" / user
(the exclude list is just /dev /proc /sys /tmp /run /mnt /media /lost+found).
At a certain point you start to wonder if saving the 50 bucks a month is even worth it....
Yes just rsync should work if you are stopping everything on the first system. The second system should be completely switched off. If you can just mount the drive, that’s the way to go.
No, don't think so -- I would be going from one digital ocean VM to another. So the target machine will be in operation. I honestly might just leave it and think of some other use for the server to justify keeping it running in current state.
Thanks for explaining this though - really appreciate it.
Even well-written scripts will do! Just know - if you go with writing your own, you're doing something someone probably already did as a module.
With proficiency Ansible is like writing declarative pseudocode. Say what you want and a huge library of Python will 'make it so'.
Eventually you'll have a library of roles that don't care about the operating system, maybe even the provider.
A peer comment mentions docker-compose. That's fine, but I prefer Ansible -- the DSL can be made very similar with a role... and it's far more capable. Like preparing the runtime.
Can I interest you in some Docker? If you're not using docker, and still have an extensive pile of Ansible scripts (you still need some, just not as much as before), it would really be worth your time to investigate how those scripts can be simplified using newer technology.
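The usual shape of that simplification is moving per-app setup out of Ansible and into a compose file. A hypothetical sketch (image names, ports, and the placeholder password are all illustrative):

```yaml
# docker-compose.yml -- illustrative sketch
services:
  web:
    image: nginx:stable
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    restart: unless-stopped

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: change-me   # use a secret in real life
    volumes:
      - db-data:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  db-data:
```

The app's runtime, dependencies, and wiring live in one versioned file, and Ansible shrinks to host-level concerns.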
Ansible is useful for stuff like installing openssh on the server and setting up a secure sshd config file. Or setting up ufw.
You can do more, of course. I don't think Docker is suitable for that kind of stuff.
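To make the sshd/ufw case concrete, the tasks might look like this (a fragment, not a complete playbook; the exact hardening options are a matter of taste, and `community.general.ufw` must be installed separately):

```yaml
# Illustrative task fragment -- restart sshd after changing its config
- name: Disable password logins over SSH
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?PasswordAuthentication'
    line: 'PasswordAuthentication no'

- name: Allow SSH through ufw
  community.general.ufw:
    rule: allow
    port: "22"
    proto: tcp

- name: Enable ufw with a default-deny policy
  community.general.ufw:
    state: enabled
    policy: deny
```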