
Shouldn't one design their systems so that a single machine rebooting is not fatal?

Generally I make my servers reboot at least once every 24 hours.
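Roughly speaking, something like this is all it takes (a rough sketch only, assuming Linux's /proc/uptime; the 24-hour threshold and shutdown invocation are illustrative, not a description of any particular setup):

    # Rough sketch: reboot if the box has been up longer than 24 hours.
    # Assumes Linux (/proc/uptime) and enough privilege to call shutdown;
    # the threshold and the wall message are illustrative.
    import subprocess

    MAX_UPTIME_SECONDS = 24 * 60 * 60  # hypothetical 24-hour limit

    def uptime_seconds() -> float:
        # First field of /proc/uptime is seconds since boot.
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])

    if uptime_seconds() > MAX_UPTIME_SECONDS:
        subprocess.run(["shutdown", "-r", "+5", "Scheduled daily reboot"], check=True)

Run it from a scheduler of your choice; the point is just that the policy is a few lines, not that this exact script is the right way to do it.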




Depends on the environment. If you are a big enough business, then yes this would be a good idea. Otherwise, if you are small and don't have much money for infrastructure then perhaps not.

Though I have to ask: why do you reboot your servers every 24 hours?


I was gonna say 24 hours is short. I think a good interval is about a month. I know HP-UX suggests 2 weeks, and that OS has a history of 10-15 year uptimes; it was the stupid Y2K bug that killed those server uptimes.


> "Shouldn't one design their systems so that a single machine rebooting is not fatal?"

In an ideal world yes. Not everybody has that opportunity though.

> "Generally I make my servers reboot at least once every 24 hours."

That can't be true. There are Belkin routers out there with greater uptimes. Christ, I think even my old Windows ME PC had longer uptimes than that!

No sane sysadmin would reboot a server multiple times a day. If nothing else, it's just a needless waste of electricity and any other resources it might drain during the power cycle (e.g. network I/O if it's a SAN-hosted VM).

I'm amazed at just how much some people will exaggerate figures to make a point....


What if reproducibility is more important than uptime?

How does memory fragmentation affect performance after more than 24 hours?

What happens when your server has to restart from a cold cache scenario?

What happens when your server is down?

What about that setting a sysadmin manually applied to the server 6 months ago to fix some issue, which isn't saved in the server config?

By forcing a condition people try to avoid you get good at dealing with those situations.

How often do your database servers actually fail over?

How confident are you in your code and systems to actually fail over properly?

How do you prime the caches when your memcached servers reboot?
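On that last point, cache priming doesn't have to be mysterious. Here's a rough warm-up sketch, assuming Python with the pymemcache client; the host, hot_keys() and load_from_db() are hypothetical stand-ins for however you track and fetch hot data:

    # Rough warm-up sketch: after a memcached restart, repopulate the
    # hottest keys from the source of truth before taking traffic.
    # Assumes the pymemcache client; hot_keys() and load_from_db() are
    # illustrative placeholders, not real infrastructure.
    from pymemcache.client.base import Client

    cache = Client(("memcached.internal", 11211))  # illustrative host

    def hot_keys():
        # e.g. the top-N keys the app logged before the restart
        return ["user:1001", "user:1002", "frontpage:html"]

    def load_from_db(key):
        # stand-in for the real lookup against the source of truth
        return f"value-for-{key}"

    for key in hot_keys():
        if cache.get(key) is None:  # only fill cold entries
            cache.set(key, load_from_db(key), expire=3600)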


> What if reproducibility is more important than uptime?

Reboots aren't something you need to reproduce multiple times a day.

> How does memory fragmentation affect performance after more than 24 hours?

If this is a massive issue then your server daemon is piss poor. However, I think you're just clutching at straws here.

> What happens when your server has to restart from a cold cache scenario?

That doesn't mean you have to reboot your server several times a day.

> What happens when your server is down?

It would only be down because you're rebooting the bloody thing :p

> What about that setting a sysadmin manually applied to the server 6 months ago to fix some issue, which isn't saved in the server config?

You don't need to reboot a live server several times a day to apply and test server config.

> By forcing a condition people try to avoid you get good at dealing with those situations.

Advocating repeatedly breaking live servers as practice to know what to do when they break accidentally is the dumbest thing I've read in a while. I don't need to repeatedly walk in front of a bus to learn not to walk in front of a bus. If you need to practice test scenarios then do so on a test system - that's what they're there for!

> How often do your database servers actually fail over?

Through my own negligence? They haven't yet. Through dodgy code rushed live by our developers: more often than I'd like to admit.

> How confident are you in your code and systems to actually fail over properly?

Code: not very. But I don't manage that. Systems: very confident. But I do sane load and disaster testing on test systems, plus monitoring and logging on all systems to highlight potential issues before they completely snap.

> How do you prime the caches when your memcached servers reboot?

We're not stupid enough to reboot all our live infrastructure (and its redundancies) while it's in use, let alone do so multiple times a day.

--------------------------------------------

You're not convincing me that you need to reboot your servers multiple times a day; if anything, you're just convincing me that you don't have a proper dev and test infrastructure in place. And that's far more dangerous than any of the other issues you or I have raised thus far.



