On rebooting: the unreasonable effectiveness of turning computers off and on (keunwoo.com)
250 points by todsacerdoti on May 24, 2022 | 198 comments



I remember attending a talk by Walter Bright maybe three decades ago, back when he was known for his C++ compiler for Windows, and before that his compilers for DOS. He talked about his very paranoid approach to assertions; in addition to a huge number of asserts for preconditions, he would start out with asserts essentially at every basic block; he'd take them out as he produced tests that reached this point. This was the DOS days and he didn't have a coverage tool, so the presence of an unconditional assert meant that this line hadn't been reached yet in testing.

He said he developed that style because the lack of protection in DOS meant that any error (like a buffer overrun) could trash everything, right down to damaging the machine.

He said that early on those asserts were enabled in the shipping code, making it appear that his compiler was less reliable than competitors when he felt it was the opposite, but I think he wound up having to modify his approach.


he actively hangs out here, btw :)

https://news.ycombinator.com/user?id=WalterBright


A nice fella too.


I'm always delighted when Walter responds to something I post. HN isn't Twitter, which makes it feel all the more special.


Last Advent of Code I took to putting asserts for every invariant in my code. Most of my time was spent debugging, not writing the first draft or running the code. Adding so many asserts provided a number of benefits:

  * asserts are a compact form of documenting invariants both expected and produced, making code easier to reason about
  * each assert serves as a tiny unit test with exactly the necessary cases
  * most importantly, asserts catch logic bugs when code doesn't produce the intended results
P.S. that has me thinking, there should be a tool to transform asserts into testing. The tool would have another keyword "expect" in addition to "assert", for preconditions. With both preconditions and postconditions, individual functions can be automatically verified with a prover or a fuzzer! The tool could also check that preconditions are met at each call site, without having to write and rewrite unit tests for each call site!
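For illustration, here's a minimal C sketch of the idea; the expect() macro is just a hypothetical name for the proposed precondition keyword, implemented here as a plain assert:

  /* Minimal sketch: pre- and postconditions as plain asserts.
   * expect() is a hypothetical name for the proposed precondition keyword. */
  #include <assert.h>
  #include <stddef.h>

  #define expect(cond) assert(cond)

  /* Returns the index of the largest element of a non-empty array. */
  static size_t index_of_max(const int *xs, size_t n)
  {
      expect(xs != NULL);            /* preconditions */
      expect(n > 0);

      size_t best = 0;
      for (size_t i = 1; i < n; i++)
          if (xs[i] > xs[best])
              best = i;

      assert(best < n);              /* postconditions */
      for (size_t i = 0; i < n; i++)
          assert(xs[i] <= xs[best]);
      return best;
  }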


Even with these approaches, you will inevitably run into Heisenbugs that creep up as runtime errors. Even the most powerful compiler in the world can't catch those. It can be mitigated by having a system that is traceable from start to finish.


It’s been known for more than 35 years that most bugs found in production are “Heisenbugs”:

https://www.cs.columbia.edu/~junfeng/08fa-e6998/sched/readin...

Gray’s paper provides a logical reason why: If a bug can be reproduced reliably it’s much more likely to be found during testing and fixed, so appearing seemingly at random is a kind of Darwinian survival adaptation.


Heisenbugs are the worst.

Just last week I was on the verge of literally ripping my hair out trying to figure out a frustrating bug that never occurred when running in the debugger. After a lot of frustration I had an idea: rather than starting the application in the debugger, I’d attach later. When doing this, I began to see a huge amount of interesting first-chance exceptions (this was managed C# code) that clued me into the root cause of the issue and I was finally able to solve the problem.

I came to learn that an underlying library I was using had some code that ran on startup that actually detected if a debugger was attached and would turn off the problematic code path in that case! It turned out this was an intentional optimization, because the initialization of this code caused a bunch of noise during regular debugging (lots of first-chance exceptions that were ignorable). Unfortunately, this also meant that the interesting failure case I was running into was now completely hidden.


Production code that checks if a debugger is attached should be illegal.

Ideally there would be a way to get the debugger to lie to the process and report that no debugger is attached. Sure, maybe there's a reason today why you actually need to know a debugger isn't attached, but at least give future developers an escape hatch if they need it.


On Linux, you can "lie" to the process by running it with the LD_PRELOAD environment variable set to a library that hijacks ptrace(PTRACE_TRACEME, ...) to return 0. If the process is smart enough to check for LD_PRELOAD as well, you might be able to use something like seccomp to hijack the return value at the OS level.


You don't need LD_PRELOAD; the ptrace mechanism by itself can intercept system calls and modify them on the fly.


It's mostly an attempt at stopping someone attaching a debugger to an application you've shipped


Yep, and good luck using a debugger against race conditions (or any other time sensitive bugs).


The worst is when adding log lines changes timings and hides the race!


That can be where a debugger comes in handy! Instead of breakpoints it's (sometimes) faster to have the debugger log & continue. This doesn't always work, but often does with microcontrollers that have some dedicated debug/trace hardware.


Sounds like an easy bugfix in that case, just leave the log in!


Heisenbug is a very specific concept that refers to bugs that don’t reproduce because you are attempting to debug them. E.g. your debug printf statement changes the compiler output enough such that the bug no longer occurs with the same inputs. Or running the program in the debugger changes the thread scheduling, so the race condition is no longer hit. General difficulty reproducing is not what makes a heisenbug. What’s hard about heisenbugs is investigating their cause, not necessarily discovering and reproducing them.


One such instance is when debug mode causes memory allocations to be zeroed, while release mode doesn't. Always a recipe for a good time.


Or your printf changes:

- the timing of execution so nothing works => hello USB driver!

- the scheduling of your threads

- the memory layout => a memory overflow won't crash your program the same way or not at all

I feel lucky I have enough hair left to pull out over this kind of bug. Fun times indeed.


> If a bug can be reproduced reliably it’s much more likely to be found during testing and fixed

which means in order to find as many bugs as possible during development, the dev environment must match the prod environment as exactly as possible.

Bugs don't "evolve" like a life form - unless you code is self-modifying! It is caused by a permutation of all possible inputs (where the environment is also an "input").

Limiting possible inputs (as functional programming does) means you limit the possible differences between dev and prod - so functional programming ought to produce fewer bugs!


Bugs don't evolve but the bug generators do. We build code with more and more complexity until we reach an equilibrium with our ability to debug the systems.

If there is no competition in the evolution then systems can become arbitrarily baroque and fragile, which leads to rules like 'no software updates/patches' and 'all config changes via the CCB and pass mandatory tests' or 'your problem is handled manually', the wire wrap of systems development.

Rebooting (the process or the computer) is trimming the state space. Model checking driven design really helps tame the tendency to let program state evolve without bound. FP tends to follow a bottom up design which makes this easier too.


> Bugs don't "evolve" like a life form - unless you code is self-modifying!

Bugs live in the codebase, and the codebase evolves (through human modifications).

Every change in the codebase has a probability of introducing a new bug. If the introduced bug is deterministic and easy to catch, the probability of its survival is lower. If the introduced bug is nondeterministic and hard to catch, the probability of its survival is higher. So eventually, as time approaches infinity, the probability of a random bug in the codebase being nondeterministic approaches 100%.


Or, if it is a system based on many components that can be independently updated, some bugs may be introduced because a mismatch corrupted one variable, and that value just keeps being propagated to newer versions until it triggers a bug, because you assumed it could never take that specific value.

Which, it actually can't in the current state, but only because the system itself has changed, or evolved, over time.


Or go to the opposite extreme and use fuzzing.


This is why I only test in production.


You test? How modern. I develop in prod, editing the code direct on the server. Users are the best testers anyway, they’ll phone if there’s a problem.


The best is when you push your changes when your users are most likely to be invested in the correct behavior of the system.


The code is the correct behavior of the system. Users are deluded in their investments, sort of like crypto bros.


Also, designing something "crash-first", even if you don't call it that, leads to many different approaches and possible improvements.

For example, let's imagine some embedded device in the ISP network; it is not very accessible, so high reliability is required. You can overengineer it to be a super reliable Voyager-class computer, but a) that will cost much more money than it should, and b) you will still fail to achieve the target.

Or you can take the crash-first approach. Many things can be simplified then; for example, there is no need to have stateful config, no need to write any config management which saves it, checks it, etc. You just rely on receiving a new config every boot and correctly writing it to the controllers. Less complexity.

Then, if you are crash-first, you expect to reboot more often. You then optimize boot time, which would otherwise be a much lower-priority task. And suddenly you have x times better (lower) downtime per device.

You can also save effort on some of the hard stuff - e.g. any and all 3rd-party controllers with 3rd-party code blobs and weird APIs. Instead of writing a lot of health checks for each and every failure mode of the stuff you can't really influence, you write a bare minimum and then use a watchdog to reboot the whole unit and hope it recovers. And this works very well in practice.

The list goes on. Instead of a very complicated all-in-one device, you have lightweight code with a good, predictable recovery mechanism. It is cheaper, and eventually even more reliable than the overengineered alternative. Another example - network failure. The overengineered device will do a lot of smart things trying to recover the network, re-initialize stuff, retry with different waits (and there will be a lot of retries), and may eventually get stuck without access. The lightweight device has a short, simple wait with a simple retry, or several, and then reboots. Statistically this is better than running some super complicated code, if the device is engineered to reboot from the start.


I remember a visitor center large-scale multitouch game we made for Sea World. It was being installed for long-term heavy usage. We built it in Flash (I know) — it was lovely but we just couldn’t stop the memory leaks.

We made slight adjustments (outros and intros) so that it seemed natural to have a 10 second break for the program to restart. And, we built in a longer cycle of computer resets. It was an unreasonably stable system for years!


Great story. I like the Erlang/distributed-systems view of the world: who needs costly resilience or recovery when you can simply die and be reborn again? And if you can't do that, well... make it so you can. Erlang and distributed systems in general have no choice, because the kind of computing they do is so wickedly complex that there is no other way to fail, but your and GP's comments illustrate that even when you do have other options, this way of failure is simply easier and more effective.

Can you elaborate on

>We built it in Flash [...] we just couldn’t stop the memory leaks

I thought Flash games were written in a high-level JS-like language? Did it grant you enough access to raw memory that you could leak? Or did you mean a high-level equivalent of memory leaks?


We never figured it out. But when the program would run for a long time the RAM would fill up and the game would slow to a snail’s pace. We turned to this restart solution in desperation.


thank you for playing Wing Commander!


You can see this approach even in spacecraft. The Apollo Guidance Computer was an example of a crash-first system (https://www.ibiblio.org/apollo/hrst/archive/1033.pdf).


Mandatory

Light Years Ahead | The 1969 Apollo Guidance Computer

https://www.youtube.com/watch?v=B1J2RMorJXM

34C3 - The Ultimate Apollo Guidance Computer Talk

https://www.youtube.com/watch?v=xx7Lfh5SKUQ


The second one is interesting for using hexadecimal in its new syntax format, even though octal is a natural fit for a 15-bit word and was used in the display. The reverse was common back in the 70s and 80s, so I guess it's always been about which is understood rather than which is correct.


Well explained.

Erlang seems to also follow this kind of philosophy, although on a more granular level. The point seems to be the separation of "worker" code and "supervisor" code - where "worker" represents a well-behaved function without any (unexpected-)error checks, and "supervisor" represents error-handling code that will catch and resolve any errors that happen in the worker code, expected or not.

Joe Armstrong's "Making reliable distributed systems in the presence of software errors" contains more information on the topic.


I am utterly in awe every time there's a breakthrough in durable random access memory, where someone (often the author) is so very excited about how this will mean we will never have to power off our computers again. Flip a switch and it's instantly on!

Have you lost your ever-loving mind? Of course you're going to reboot. Bit rot improvements have only ever been incremental. To a first order approximation we've expanded from a day or two to a couple of weeks/months for general purpose computers (I argue that servers are not GP, and so their spectacular uptimes don't translate).

In thirty years, that's barely more than an order of magnitude improvement. You're gonna need a couple more orders than that before taking away the reboot option sounds anything like sanity. If you can keep a machine from shitting the bed for 2-3 times longer than the expected life of the hardware, then we can talk, but I'm not making any promises. Until then I'm gonna reboot this thing sometimes even if it's just placebo.


Quite a bit of our infrastructure runs for very long periods (long in IT terms) - years between faults or reboots.

Sure lots of domestic modems & routers have to be reset often, but it's common to find infrastructure routers that are only reset when they have to be moved! Says a lot about the software in domestic equipment.

I've worked with a mainframe that ran for a decade, and was only rebooted when a machine room fire required a full powerdown "right now"! But people are impressed when their Windows stays up for a few days %#%$#@#$!!

And remember Voyager, been ticking along since 1977.

But all these impressive systems are built to be impressive. Probably error correcting RAM, etc. Shows we can build reliable hardware. And if a substantial fraction of users wanted it, things like ECC RAM etc would be only fractionally more expensive than our error prone alternatives.

Of course,


No disagreement in general, but to your point of infrastructure routers (assuming you refer to ISP and Internet backbone infrastructure):

Having worked in ISP security, IMHO a years-long uptime of such critical components is nothing to be proud of (anymore). Quite the contrary: those are complex components, so if you care about security you have to patch them regularly, including occasionally required reboots. Just look at the list of security advisories from the relevant vendors (Cisco, Juniper, Nokia/Alcatel-Lucent, etc.) and you'll find scary vulnerabilities! Granted, "rebooting" a core router is more nuanced than rebooting a regular PC (you can e.g. reboot one management engine of a pair, or just a line card, etc.), so it does not always mean that all traffic stops because of it.

Oh and btw. your network design should be able to cope with such necessary reboots, otherwise you have a single point of failure.

Regards


> And remember Voyager, been ticking along since 1977.

This is running a single unthreaded process.

My Mac, which isn't doing anything fancy, has over 500 processes running on it. In fact, I just checked to see if anything bad was going on, and I recognize everything I look at - almost 100 processes from Chrome alone, for example.

How sure am I that all of these processes are running correct code? Chrome is running "101.0.4951.54 (Official Build) (x86_64)", which gives a hint of the disposable nature of that binary.


how much of that hardware can run a random executable from the web? and does so pretty often?


This is what I meant by that parenthetical about not being general purpose.

A computer used for a single task is a bit like a 4WD truck that only stays inside the city limits. It could do those things, sure, but it never does, so it hasn't really proven anything.


> A computer used for a single task is a bit like a 4WD truck that only stays inside the city limits. It could do those things, sure, but it never does, so it hasn't really proven anything.

Not really. There's a big difference between very application-specific hardware+software being used for exactly its intended purpose and very general-purpose hardware+software being used for all sorts of things.

It's more like using a F1 car to race vs taking your average sporty car to a race track.

Sure that F1 car will race better, but at the end of the day you can't drive home in it or move kids/groceries around in it.


That’s a fair point but kind of seems like a straw man compared to the Voyager example


My argument was towards the network/IT equipment, but Voyager is even more special since it's a very application specific system.

Generalization itself is hard, it gets MUCH harder when you have to care about back compat and random executables that can alter system state because previous versions allowed that behavior and it needs to be supported for the common cases moving forwards.


All DDR5 DRAM ICs have on-die ECC. This is new for DDR5.


And the highest data rates to date, with the signal integrity requirements that accompany them. Got a piece of dust pushed down by your CPU cooler straddling two DIMM pins? Get ready for your machine to shred your data. And that's just a common, simple scenario. I'd be surprised if real-world error rates in nominal scenarios won't be higher than with DDR4.


You may be right; however, you could have said the same thing for DDR4 vs DDR3, with its crazy new high data rates!


And again still likely be right. 10 years ago consumer electronics marketing never included signal integrity stuff like eye diagrams but now pretty much every nvidia announcement with a new memory standard does. We're really pushing ever closer to channel bandwidth limits and corners that could be cut in the past can no longer be cut. ECC is more important than ever.


That's not the type of ECC the parent was talking about. That's because the densities and clock rates are so high for DDR5 that it needs ECC to function properly, but like most standards the minimal implementation is really quite watered down. It doesn't correct the entire range of bitflips that a server with ECC RAM does.


Disagree. Parent was discussing the need to reboot after a system has been on for a large number of hours. The failure mode, assuming it's related to the DRAM, would be an accumulation of bit-flips in the DRAM. Every memory has some FIT/Megabit rate. The on-die ECC added in the DDR5 spec will be highly effective in addressing this failure mode.

Channel ECC is the ECC type most directly relevant for high clock rates and signal integrity aspects. I agree with you that Channel ECC becomes a practical requirement to meet the interface transaction rates of DDR5. It is also true that channel ECC is not mandatory in DDR5 and is not implemented by mainstream CPU platforms (like previous DDR generations).


If the on-die ECC reduces the error rate but the lack of standard channel ECC increases the error rate, because of the much more demanding signals, then it's not clear at all that the overall rate of error will lower.

In fact it could very well be higher depending on how the physical module is designed.

I imagine some portion of bit-flip induced reboots are due to the actual DRAM chips, but also some portion will be due to everything else that can bit flip both on the memory module itself and in the interconnect.

I haven't seen anything yet to say that DRAM chip bit flips will be in the majority.


It's never been clear to me if the ECC is necessary for DDR5 to operate, or just a nice feature. Do you or any other readers happen to know the answer?


> And if a substantial fraction of users wanted it, things like ECC RAM etc would be only fractionally more expensive than our error prone alternatives.

I started in the PC industry in 1988.

Then, they did. All IBM PC kit used 9-bit RAM, with a parity bit.

It was discarded during 1990s cost-cutting.


I don’t see any relation…? What’s stopping you from rebooting a system with NVRAM? What’s stopping you from shutting it down? Hibernating it? Nothing.

However, NVRAM enables power-efficient sleep. My old ThinkPad used almost all of its battery charge overnight in S3 standby. On the unreasonable efficiency of instant-on computers


That seems very poorly designed. My Acer, a 5 year old laptop, uses a few percent at best overnight.


My “old ThinkPad” is now 15 years old, so yeah. Things have improved a lot since then.


It was only after people got used to Microsoft products, for which the company categorically refused to refund anybody, that they stopped considering the need to turn things off and on a reason to send them back for a refund.

Probably the FTC should have stepped up and ordered Microsoft to issue refunds. But they didn't, and here we are.


You are not wrong. Bill Gates is the hero of his own story, but the villain in millions of others. I mean this as a joke (or do I?) but the service Larry Ellison provides by the barrel and Steve Ballmer provides by the glass is keeping Bill Gates from being the biggest bastard currently living (I gather that the Carnegies or Rockefellers were not great people half the time either). As a certain defrocked comedian used to say, they're old and they're trying to buy their way into heaven.


Even with perfect hardware you will still have to reboot due to software instability. Lots of software isn't even tested to run for weeks/months/years.


The only reason I reboot my Linux laptop, or desktop, is when a kernel update appears. This can even be a year.

I run xorg, and requiring X11 to restart is equally rare. Thus, I often go for months or longer, without a restart at the gui level.

If I get a browser update, I restart that, and so on.

Microsoft has conditioned the world to accept absurdity. Just the lost productivity alone, due to reboots...


That is just the Microsoft effect. Before Microsoft, anything that didn't run through the entire warranty period was returned to the vendor for your money back. And, you got it.


I'm also speaking of e.g. the junkware running on automotive headunits. These systems are expected to be shut down at least once a day, so overall focus on (not only long-term) stability is really low.


Those came in post-Microsoft. Peak Microsoft was the US Navy ship dead in the water, towed back to port, because they had stupidly put MSWindows in charge of it.


Before complex GUIs. Ftfy


The Therac-25 (https://en.wikipedia.org/wiki/Therac-25) didn't have a complex GUI. However, I hope that those affected by that bug did get their money back (and then some). Not that money can really help you if you get exposed to fatal levels of radiation, but...


GUIs do nothing more complex than many other things that have to work right, and do. It is only tolerance for crappy software that allows it to be foisted on us.


"There seem to be certain analogies here between computing systems and biological ones. Your body is composed of trillions of compartmentalized cells, most of which are programmed to die after a while, partly because this prevents their DNA from accumulating enough mutations to start misbehaving in serious ways. Our body even sends its own agents to destroy misbehaving cells that have neglected to destroy themselves; sometimes you just gotta kill dash nine."

I think that's also reminiscent of Alan Kay's philosophy behind OO in its most original form, and probably most closely realized in Erlang:

"I thought of objects being like biological cells and/or individual computers on a network, only able to communicate with messages (so messaging came at the very beginning -- it took a while to see how to do messaging in a programming language efficiently enough to be useful). - I wanted to get rid of data. The B5000 almost did this via its almost unbelievable HW architecture. I realized that the cell/whole-computer metaphor would get rid of data[..]"

I wonder why so many of the most popular programming languages went into the opposite, very state and lock based direction given the strong theoretical foundation that computing had for systems that looked much more robust.

http://userpage.fu-berlin.de/~ram/pub/pub_jf47ht81Ht/doc_kay...


Alan Kay has been for decades revising his description of how he thought. Suffice to say, Smalltalk-72 was not an OO language. That was bolted on later.

(Which is not to claim that OO has, as Kay insists, any universal merit.)


I'm looking at the Smalltalk-72 manual right now and I see e.g. the following:

> In Smalltalk, every entity is called an object; every object belongs to a class (which is also an object). Objects can remember things about themselves and can communicate with each other by sending and receiving messages. The class handles this communication for every object which belongs to it; it receives messages and possibly produces a reply, typically a message to send to another object.

So far as I can tell, this means that Smalltalk-72 was what-Kay-considers-object-oriented.

I'm not sure whether this refutes the claim you're actually making, because apparently you're saying that Kay lies about what he used to think, and maybe e.g. you're saying that what Kay now says about object orientation is not what he used to say, and that in the 1970s he wouldn't have considered Smalltalk object-oriented. Or maybe you're saying that the right way to think about object-orientation is something different from Kay's, and that Smalltalk-72 was not what-you-consider-object-oriented. Or something.

Could you maybe be more explicit?

Could you give some examples of things that Kay now says he used to think, and explain why you believe he didn't actually think them? Could you explain in what sense you reckon object-orientation was absent in Smalltalk-72 but bolted on later, and why you think that indicates intellectual dishonesty on Kay's part?


His definition of object-oriented has evolved continuously; after the 80s, mainly to try to exclude C++, judging by appearance. Smalltalk-72 does not qualify as what he today calls OO, because that requires a constrained sort of variable run-time binding that -72 lacked. Some people call what -72 did "object-based".


Where can we find him saying what you describe as "what he today calls OO"?

Where can we find him saying "requires a constrained sort of variable run-time binding"?


Sorry, that would be work.


So, baseless criticism.


So, baseless criticism.


Back in the day, when I was cutting my teeth on embedded systems, I read an Intel Application Note - probably for the 8051. The section about managing watchdog timers stated that it is often good practice to deliberately let the watchdog time out periodically, at a convenient time. Then the system would be reset to a predictable state regularly, and failure states leading to unplanned watchdog timeouts would occur less frequently.

To this day, all my systems which are intended for prolonged unattended operation reset themselves at least once a day.
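In outline it looks something like this (wdt_kick(), uptime_seconds(), system_is_idle() and do_work() are made-up platform hooks, not from any particular vendor SDK):

  /* Sketch: feed the watchdog normally, but once a day, at a convenient
   * idle moment, deliberately stop feeding it and let the hardware reset
   * the system to a known state. All platform hooks here are hypothetical. */
  #include <stdbool.h>
  #include <stdint.h>

  extern void     do_work(void);         /* application work */
  extern void     wdt_kick(void);        /* feed the hardware watchdog */
  extern uint32_t uptime_seconds(void);
  extern bool     system_is_idle(void);

  #define PLANNED_RESET_PERIOD_S (24u * 60u * 60u)   /* once a day */

  void main_loop(void)
  {
      for (;;) {
          do_work();

          bool planned_reset_due =
              uptime_seconds() > PLANNED_RESET_PERIOD_S && system_is_idle();

          if (!planned_reset_due)
              wdt_kick();
          /* else: stop kicking and let the watchdog time out and reboot us */
      }
  }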


The Payment Card Industry (PCI) standards that all credit card readers conform to mandate that they actually reboot at least every 24 hours :)

[edit] More specifically there are a number of things that have to happen every 24 hours - some memory has to be zeroed, firmware integrity verified, etc. Most vendors like Verifone [1] implement this by rebooting the reader at least every 24 hours on a timer.

[1] https://developer.verifone.com/docs/verifone-documentation/e...


I know speculating on the purpose of biological sleep is a common cliche on internet forums and in pop science, but what if this is why we sleep? Is there any evidence something in our body "resets" when we do?


State is a well known source of error after all.


> at least every 24 hours on a timer.

Shouldn't that be 'at most' or 'at least once'?

(My English skills are lacking so that was quite the mental stumble for me)


I understand how the way the original comment phrased things sounded, to your ears, like "at least 24 hours should pass between reboots," which is not what was intended. And your suggestion, "at least once every 24 hours," would be perfectly acceptable and idiomatic English.

However, "at least every 24 hours" is also perfectly acceptable and idiomatic English. It is very common in English to use this construction, "at least every X time period," to mean "X often or more often." If you say, "I make sure to call my distant relatives at least every year," you mean with that frequency or more. If you say, "I make sure to stand up and stretch at least every hour," you mean with that frequency or more. If you say, "these machines are required to reboot at least every 24 hours," you mean with that frequency or more.

In other words, "at least" does not qualify the number of hours (such that more hours would also be acceptable), but the frequency (such that greater frequency would also be acceptable).

Hopefully this response helps you to understand why English, fickle language that it is, works this way in this case.


It doesn't explain why that became an idiomatic way to say it. But at least it explains the what.


English is a highly irregular and challenging language unless you grew up with it. You have my sympathies.


'at least once every 24 hours' would be valid.

'at most every 24 hours' would imply reboots_per_day < 1 is bad and reboots_per_day >= 1 is good; i.e. that the standard was focused on preventing unnecessary reboots, but didn't care if the reader didn't reboot regularly, which I think is contrary to the comment's meaning.

'at least every 24 hours' is correct.


> ... would imply reboots_per_day < 1 is bad and reboots_per_day >= 1 is good;

I understood as that being exactly the point of the GP, rather than contrary. Because of the 'most' referring to the 24 hrs, not amount of reboots.

But I guess this is the point where my lack of English skill comes into play.


...I've just realised that I got the < and >= backwards while trying to explain that they were backwards. facepalm

> ... would imply reboots_per_day > 1 is bad and reboots_per_day <= 1 is good;

is what I meant to say!

But now I've taken a look at your phrasing ... I think I read your post as 'at most once every 24 hours', not 'at most every 24 hours' which is actually pretty ambiguous, and not as clear-cut as I was trying to say.


Yeah, my writing was quite ambiguous as well. I should've written out the full sentences.


You are correct


I remember a story from a couple of years ago about how 747s couldn't have more than like a week of uptime without running into catastrophic problems, and proggit and hn were in disbelief about the poor quality of the code and wondered how often it became an issue in flight. Turns out never, because the computers were restarted regularly for maintenance, and besides that were designed to safely restart in the air just in case anyway


"FAA: Reboot 787 power once in a while" (2016)

https://news.ycombinator.com/item?id=13094600


This also avoids bugs that are hard to find - like the Windows uptime bug that only happened after an integer number of ticks overflowed or something.


It was amusing that Windows was so unstable that it took decades for that bug to be discovered.


https://www.cnet.com/culture/windows-may-crash-after-49-7-da... (woah, cnet hosts articles from 20 years ago)


Alas, “only” 7 years to find.


The issue with Windows was not its instability; it is its design. Unless you use Run commands and the Command Prompt, every adjustment is buried layers deep behind context menus that open some control panel which takes an excruciating 10-20 seconds just to appear, and Microsoft intentionally hid all the fine controls. While literally everything is solvable without rebooting -- reproducing, tracing, researching and solving it takes too long because of the system design. If the problem isn't appearing often enough to warrant investing the time it takes to do it right, rebooting is the only rational choice. The first thing every Tier I help desk operator instructs is to reboot (because it's in their script, and because 95 times out of 100, it does the trick). "But I already tried that." "Please reboot again anyway." They're not wrong. Rebooting is a terrible solution. It's just that it usually works, and it takes a lot less time than doing things correctly.


The primary modus operandi for that time was turn on your PC, work/game, turn off your PC. As Win9x was never a server OS, nobody treated it as so.

And when people whine about 9x being unstable, they don't remember (or never even experienced it themselves) how awful the hardware it ran on was.

I "fondly" remember some combinations of hardware were a literally ticking time bombs, you never knew when it whould BSOD. Though by that time I had enough understanding what if I see CMSXXX.VXD failing it is the problem with a cheap ass sound card drivers, not Windows itself.


Well also Windows systems should be rebooting every month for security updates.


As best as I can recall from the mid-90s, security updates were significantly less common. Many people didn't even have an internet connection.


I do this with cloud VMs these days: it's particularly useful when there's third-party code in the mix. The theory is that your uptime represents a known tested amount of operation (roughly), and as such every time you go beyond that in production but not in testing, you're getting into unknown space - systems are too complicated to validate everything (see the classic Patriot missile system bug [1]).

So, you should never deploy anything to production which won't be rebooted more frequently than you reboot it in testing. In practice, you should probably reboot much more frequently - as frequently as possible - to keep the delta between "known good" and "mutated" as small as possible.

[1] https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...


man way back when we bragged about more than 2k days of uptime as evidence of resilience


Half the products people build don’t even last 2000 days nowadays, but any system you build should have a continuous uptime well over 2,000 days. My dns servers for example have been responding to dns for the last 12 years on the same IP without more than a second of downtime.

It might be acceptable for AWS to crash every few months, but it’s not acceptable for my systems to be out for the length of a reboot.


Someone told me about a car where the ECU reset to a known state every revolution of the engine. "Rebooting" that often doesn't even feel like rebooting anymore, but they said that was the only way they felt it would be reliable enough.


Opposite of the robustness principle:

https://en.wikipedia.org/wiki/Robustness_principle

It's funny, I spent some time developing tools for CPU architects. Both the concept in the OP's anecdote and the above principle don't really apply, because logic doesn't break in the same way source code breaks. You don't run a program in HDL, you synthesize a logic flow. One could conceivably test all possible combinations of that logic for errors, but it becomes 2^N combinations, where N is the number of inputs plus the number of state elements. Since this cannot be tested because the space is huge (excluding hierarchical designs and emulation), you generate targeted test patterns (and many many mutexes) to pare down the space, and perhaps randomize some of the targeted tests to verify you don't execute out of bounds. And even "out of bounds" is defined by however smart the microarchitect was when they wrote the spec, and that can be wrong too.

The only way to find and fix these bugs is to run trillions of test vectors (called "coverage") and hope you've passed all the known bugs, and not found any new ones.

There are four decades of papers written on hardware validation, so I'm barely scratching the surface, but I think it's a very different perspective compared to how programmers approach the world. I think a lot of the bugs that OP is talking about fall into the hardware logic domain. There isn't really a fallback "throw" or "return status" that you can even check for. Just fault handlers (for the most part).


The hardware world was doing fuzzing about a decade before the software world was, though they called it coverage-directed constrained random simulation: generate test vectors in a way that you cover more of the design.


>One could conceivably test all possible combinations of that logic for errors, but it becomes 2^N combinations, where N is the number of inputs plus the number of state elements. Since this cannot be tested because the space is huge (excluding hierarchical designs and emulation), you generate targeted test patterns (and many many mutexes) to pare down the space, and perhaps randomize some of the targeted tests to verify you don't execute out of bounds.

This is all true of software too. Certainly, it’s easier to test software, but I don’t think it’s because software state is combinatorially simpler than hardware state.


Actually it is the opposite: it is harder to test software.

Software fuzzing is more difficult as function parameters can be (literally) infinitely more complicated to define (e.g., variable length strings), whereas logic test vectors are 1's and 0's (a lot of them, but a finite number).

I've yet to see a fuzzing library that could handle all possible combinations of state, even with judicious mutex support.


> logic doesn't break in the same way source code breaks

Uhh, sure it does. Any non-trivial digital hardware design, be it for an ASIC or an FPGA, will contain a lot of state machines interacting with each other, sometimes asynchronously. Hardware isn't immune from reaching a system state which requires a reset of the device.


> A novice was trying to fix a broken Lisp machine by turning the power off and on.

> Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”

> Knight turned the machine off and on.

> The machine worked.

from http://www.catb.org/jargon/html/koans.html#id3141171



I don't get it.


(Tom) Knight says that fixing problems requires you to understand what's going on, and you can't just blindly turn a machine off and on again and hope it'll fix things.

In the novice's particular case, Knight understands what is actually wrong with the machine, and that in this case it will be fixed by turning it off and on again, so he does that.

The idea that in such a situation the machine wouldn't be fixed by power-cycling it when the novice does it, but would when an expert with deep understanding does it, is a joke.

The joke is mimicking the form of a Zen koan. Koans often play with contradiction, I think with the idea of shaking the reader out of simplistic black-and-white thinking into something more holistic and less dichotomizing. I don't think "you need to understand things deeply and not just make random easy changes in the hope of fixing them, but sometimes it happens that those random easy changes are what's actually required" really counts as the sort of holistic non-dichotomizing thing Zen is trying to teach, but it's kinda in the right ballpark.


I found it interesting also that the Recovery-Oriented Computing project at UC Berkeley/Stanford puts "design for restartability" as a research area. http://roc.cs.berkeley.edu/roc_overview.html



Piece featured recently here : https://news.ycombinator.com/item?id=8464573


Apparently a ‘96 Toyota ECU can be restarted at 180 kph with the only noticeable effect being a brief ignition ping. I've been told this isn't the case for newer cars though.


I’ll let you field test that one


Considering every animal sleeps on a daily basis, I would say “turning it on and off again” pertains to maintaining the stability of complex state-based systems.


An interesting analogy, but it's worth remembering that we don't actually understand sleep as a mechanism very well.


I thought the analogy of defragmentation was considered acceptable. I stopped doing all nighters a few years ago and the days have become much easier to handle, even bad days. Always getting a good night's sleep is, in the modern vernacular of philosopher-poet DJ Khaled, a "major key alert"


On a practical note, testing a system's ability to start up is usually much faster and easier than testing its ability to work properly for long periods of time. Real-world long-term behavior is unpredictable and somewhat chaotic, but there are only so many things you can do straight out of a reboot.


This is food for thought. I wonder if there’s some other ways we could architect operating systems so these problems wouldn’t manifest as much?

Like, the “startup process” seems to be framed wrong to me. Why does it take tens of billions of clock cycles (seconds) to start up? Essentially all the system is doing is filling some memory with a known-good kernel image, and initialising some hardware devices. If we think about startup as “restoring a (deterministic) boot image”, then booting is much faster. And we might be able to use that to verify the kernel’s integrity periodically and maybe restore parts of that image at runtime if the system ends up in an invalid state. Sort of a “soft / partial reboot” process.

In my testing code I make heavy use of check() methods. These methods go through all my runtime invariants and checks that they all still hold. This is fantastically useful in testing - a check method and a fuzzer do the work of 10x their weight in unit tests. I wonder if a similar method in the kernel (run every hour or something) could be used to find and track down bugs, and automatically recover when they’re found. “Oh the USB device is in an invalid state. Log and reset it!”.
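For a flavour of what such a check() looks like, here's a tiny sketch over a made-up ring buffer (not code from any real kernel):

  /* Sketch of the check() idea: a function that walks a data structure
   * and asserts every runtime invariant still holds. Call it from tests,
   * from a fuzzer after each random operation, or periodically at runtime. */
  #include <assert.h>
  #include <stddef.h>

  #define RING_CAP 64

  struct ring {
      size_t head;   /* next slot to write */
      size_t tail;   /* next slot to read  */
      size_t count;  /* number of stored elements */
      int    items[RING_CAP];
  };

  static void ring_check(const struct ring *r)
  {
      assert(r->head < RING_CAP);
      assert(r->tail < RING_CAP);
      assert(r->count <= RING_CAP);
      /* head, tail and count must agree with each other */
      assert((r->tail + r->count) % RING_CAP == r->head);
  }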


To better understand why we can't just load a snapshot of the "desired" memory state directly from a cached image, we have to understand what the OS actually does during its initialization.

One of the main responsibilities of the OS is to manage hardware resources. During init, kernel drivers and modules perform health checks and determine availability and status of hardware resources, peripherals, I/O, etc. Since lots of things could go wrong with the various hardware components, at anytime, this process should be performed during each system boot.


Your reasoning doesn't invalidate the soft reboot concept. It merely states that having soft reboot would not allow us to get rid of hard reboot, which I agree with.


Exactly. If there’s benefit in checking that everything is configured correctly, and software bugs, hardware bugs and cosmic rays cause things to enter a misconfigured state, then maybe it’s worth periodically validating the internal state at runtime? The earlier errors are detected, the less damage they can cause and the easier they are to debug. Also when errors are detected, I imagine soft-reboots of that hardware or kernel module would usually be better for the user than hard reboots of the entire system.

Mind you, a once in a blue moon reboot of the network interface might cause more problems in a web application than a reboot of the entire machine.


Will we never be done with this “unreasonable effectiveness” cliche - or am I missing something?


It's an unreasonably effective cliché.


I would go further and call it a meme at this point. And it's completely lost its effectiveness. After all, this article uses the phrase to mean exactly the opposite of what it originally meant, arguing that it is, in fact, _perfectly_ reasonable to expect that rebooting will solve your problem:

> I offer the following argument that restarting from the initial state is a deeply principled technique for repairing a stateful system [emphasis original]

Whereas, Wigner's original article on the "Unreasonable Effectiveness of Mathematics in the Natural Sciences" has this statement:

> The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve.


HN needs a blanket ban on that and "considered harmful", and "rise and fall", and "for fun and profit". People are so uncreative.

Maybe we can make a browser extension that rewrites bad titles.


Not until it's considered harmful.


I find it quite sad how pervasive the 'considered harmful' line is. It wasn't even the original title of the article. 'A Case Against the Goto Statement' was a much better title.


The winning strategy of Mike’s parable is also known as offensive programming: https://en.m.wikipedia.org/wiki/Offensive_programming


The missing link is re-initialization or re-validation of the expected state.

Offhand I remember some discussion of how old dialup friendly multiplayer games would transfer state. Differential state would be transferred. There might or might not be a checksum. There would be global state refreshes (either periodically or as bandwidth allowed).

The global state refreshes are a different form of re-initialization: the current state is discarded in favor of a canonically approved state.
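Roughly, the pattern looks like this (the message structs and fields are invented for illustration):

  /* Sketch of diff-plus-periodic-full-refresh; all names are made up. */
  struct game_state { int x, y, hp; };

  struct diff_msg { int dx, dy, dhp; };          /* small incremental update */
  struct full_msg { struct game_state state; };  /* complete authoritative state */

  void apply_diff(struct game_state *s, const struct diff_msg *d)
  {
      s->x  += d->dx;
      s->y  += d->dy;
      s->hp += d->dhp;
  }

  /* The full refresh is the re-initialization: local state is discarded
   * in favor of the canonical state, wiping out any accumulated drift. */
  void apply_full(struct game_state *s, const struct full_msg *f)
  {
      *s = f->state;
  }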


We call this station keeping or anti entropy in our designs.


Well yeah, all modern computers and software are nondeterministic state machines. The order and set of operations is nondeterministic, and the input, output and algorithm constantly mutate. As there is no function within the computer to automatically reverse its tree of operations to before it encountered a conflict, an infinite number of state changes inevitably leads to entropy, thus all running systems [with state changes] crash.

Since the computer is a state machine, and all initial state is read from disk, restarting the computer reinitializes the state machine from initial state. Rebooting resets the state machine to before state mutation entropy led to conflicts. Rebooting is state machine time travel.

We could prevent this situation by designing systems whose state and algorithms don't mutate. But that would require changing software development culture, and that's a much harder problem than rebooting.


It is 100% theoretically possible to build a complex system that never enters an invalid state. But because human beings are fallible—and attempts to correct that fallibility are not just expensive, but themselves subject to further fallibility—no such complex system will ever be built.


State machines can already deal with invalid state. They can walk a tree of operations/input until they hit a "dead end" and then go back up the tree and try other operations. But you don't need to entirely remove the possibility of invalid state. You can add mitigations.

One mitigation is regularly snapshotting state, and on error allow the user to revert to a previous state. Your browser's Back button is a form of this.
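A bare-bones sketch of that mitigation (the state fields and the validity rule are invented for the example):

  /* Sketch: keep a copy of the last known-good state and revert to it
   * whenever a validity check fails. All fields here are invented. */
  #include <stdbool.h>

  struct app_state {
      int cursor;
      int item_count;
  };

  static struct app_state current;
  static struct app_state snapshot;        /* last known-good state */

  static bool state_is_valid(const struct app_state *s)
  {
      return s->item_count >= 0 &&
             s->cursor >= 0 && s->cursor <= s->item_count;
  }

  /* Call after each state-mutating operation. */
  static void commit_or_revert(void)
  {
      if (state_is_valid(&current))
          snapshot = current;              /* accept and remember new state */
      else
          current = snapshot;              /* roll back to known-good state */
  }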

Another is programmed responses to invalid state. If the program encounters an invalid state, it can attempt to resolve the issue. One method would be to bubble up a signal to the system, and the system can have pre-programmed responses. For example, if a program raises an exception of "Error: Out of disk space", then the larger system can be triggered to perform some disk space garbage collection. If a program raises an "Out of memory" error, the Linux kernel already has an "Out of memory killer" that attempts to relieve applications of their memory.

We can come up with much more sophisticated methods if we try. But again, this would require software development culture to think outside the box, and that is highly unlikely.


Actually building such a computer is provably impossible: https://en.wikipedia.org/wiki/Rice%27s_theorem

I highly recommend this talk on the subject: https://pron.github.io/posts/correctness-and-complexity


Rice's Theorem isn't as strong as you say it is. What it says is that there exist programs that it's impossible to prove whether or not they have specific non-trivial properties. But there still exist many programs where we can prove their correctness / incorrectness. What's more, I'd argue that if you're writing code whose behaviour is literally undecidable, you're doing something wrong, unless you're doing it for research purposes or something like that.


> Rice's Theorem isn't as strong as you say it is

It is very strong: its corollary is there is no general algorithm to build Turing Machines with arbitrary non-trivial properties (if there was such an algorithm we could use it to build a TM that solves the halting problem)

So for any non-trivial property the only way to build a TM with this property is to use some ad-hoc method (as there is no general algorithm) - but how do we know the TM we've just built really holds the property we wanted it to? There is no way to verify it due to Rice's theorem.


Quoting from Wikipedia about Rice's theorem:

> for any non-trivial property of partial functions, no general and effective method can decide whether an algorithm computes a partial function with that property

This basically translates to "for each non-trivial property P, there exists a program x for which it is undecidable if x has the property P", or "∀P ∃x such that it's undecidable if x has property P".

Notice that in this statement, programs are quantified using an "exists", not a "forall". Something similar holds for what you said here:

> its corollary is there is no general algorithm to build Turing Machines with arbitrary non-trivial properties

Translated to logic, this becomes "for every program y which takes as input a property and outputs a Turing Machine, there exists a property P such that y(P) does not have property P", or "∀y ∃P y(P) does not have property P".

Because this statement uses "exists" on the property, it does not rule out the existence of a y which works for most properties P that we'd care about. It only rules out one that works for all properties.

As a simple example where it's possible to prove a non-trivial property, take the property "x is the identity function". This is a non-trivial property, because it is not true about all programs, nor is it false about all programs. And it is trivial to prove that the function "(n) => n" satisfies this property.

Now, I additionally think that most practical programs can have their correctness / incorrectness proven. My reason for this belief is that generally, I expect coders to understand why the code they're writing should work. If nobody understands why the code works, I'd generally consider that a bug, or at least a major code smell. But if the programmer actually understands why the code works, and hasn't made any mistakes, then in principle their understanding could be converted into a proof, because proofs are more or less just correct reasoning, but written out formally. Of course, it would still be very hard, but it's not "provably impossible".

Edit: minor text change


> Because this statement uses "exists" on the property, it does not rule out the existence of a y which works for most properties P that we'd care about. It only rules out one that works for all properties.

Translated: there exists a computational model weaker than TM but strong enough to implement programs we care about.

Unfortunately, we don't know of such a computational model, and we know empirically that surprisingly many problems require a TM (i.e. surprisingly many languages are accidentally Turing complete).


Making a system Turing-complete, even accidentally, does not make it proof against analysis. A system or input for the system may be built out of parts that could be used to make an undecidable system. But that does not mean you can only build undecidable systems with those parts.

It is in fact extremely common to build wholly deterministic machines out of the same parts as our unreliable computers, that are correct by construction. And they may implement a Turing-complete language and still be wholly deterministic, just by limiting the input to what can be proven deterministic in a strictly limited time. Again, this is normally by construction, where the proof is always trivial.

Such a proven-correct system can model literally any space-constrained algorithm.


> It is in fact extremely common to build wholly deterministic machines out of the same parts as our unreliable computers, that are correct by construction.

It is neither common nor easy but actually provably impossible. Please read this piece carefully: https://pron.github.io/posts/correctness-and-complexity


Yet, it is done, thousands of times every day, all over the world.

You must be misunderstanding the work if you believe that what is being done routinely is impossible.


> You must be misunderstanding the work if you believe that what is being done routinely is impossible.

Quite possibly I don't understand what you mean by "work". If by "work" you mean creating provably correct software then no, it is not done routinely.


The work is the paper you cited.

People create wholly deterministic state machines all the damn time.


> People create wholly deterministic state machines all the damn time.

What do you mean by "wholly deterministic"?


All possible states are well-understood, all possible paths through states are mapped.

You can buy an 8-bit adder/accumulator from a catalog.


I just want to point out that you don't need to know the precise bounds of such a computational model to prove things about your programs. There already exist Turing-complete programming languages where you can prove non-trivial properties about your program and which have had practical applications, for example F* and its use in Project Everest.

https://project-everest.github.io/

https://www.fstar-lang.org/


Just having a language that allows you to prove things does not mean it is possible to prove them.

We still don't know if it is possible to build a provably correct multi-tasking OS or a web browser without security vulnerabilities. If it was easy we would already have one written in FStar or Idris.

While admiring all the work done by Project Everest - the fact is that it is a multi-year project (I first read about it about 3-4 years ago and it was already under way) with a very limited scope - it is more of an evidence of difficulties of writing correct software.


This still doesn't exclude the existence of a general method for a well-behaved subset of Turing machines. It is the model of Turing machines that is too powerful.


There are two problems with this:

1. Even for as weak a computational model as an FSM, the problem of verifying its correctness is intractable (NP-hard). And modularisation does not help.

2. We don't know of any computational model that is weaker than a TM and still strong enough to be viable for the problems we want to solve. It is well known that it is surprisingly easy for a language to be (accidentally) Turing complete.


Yet languages with totality checkers and coinduction like Idris exist. These allow you to make useful but provably terminating programs. Humans are able to construct these termination proofs, though presumably not for all programs.

In practice, this means that the reductive take of Turing completeness, the halting problem and Rice's theorem is misleading in some way.

Yes, there is no general algorithm for constructing termination proofs for any arbitrary Turing machine. But there are clearly classes of problems for which a biological computer (a human) is able to construct termination proofs and proofs regarding other semantic properties.
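To illustrate the flavour of definition a totality checker accepts (sketched here in Python for readability; Python itself has no totality checker, so treat this as an analogy rather than a real proof):

    def sum_list(xs: list) -> int:
        # Base case: the empty list.
        if not xs:
            return 0
        # The recursive call is on a strictly shorter list, so every chain
        # of calls shrinks the input and must eventually reach the base case.
        # This structural-recursion pattern is what checkers like Idris's accept.
        return xs[0] + sum_list(xs[1:])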


It is not necessary to verify that random presented machines have some property. It suffices that we can reliably construct machines that perform a desired function and have the desired property.


> It suffices that we can reliably construct machines that perform a desired function and have the desired property.

See above - there is no reliable procedure (algorithm) that would allow you to do that.


Rice’s theorem says no such thing. Rice’s theorem tells you you can’t necessarily prove you’ve built a flawless system, not that you can’t build a flawless system. You can build a perfect program even if you can’t necessarily then know you have.

Perhaps you interpreted my post to suggest that you could build a machine that verifies that any arbitrary software run on it never enters an invalid state? If so, I agree, that is not possible. I made a minor edit to make it clear what I’m talking about.


> Rice’s theorem tells you you can’t prove you’ve built a flawless system, not that you can’t build a flawless system.

If you think about it - it is the same: how do you know the system you just built is flawless?


You don't, but it's still flawless. My point was that it is theoretically, not practically, possible. So your retort that it's not practically possible is kind of missing the point.


My point is that it is _theoretically_ not possible: there is no algorithm that would allow you to build a TM implementing arbitrary requirements.

You have to use ad-hoc methods but then it is not possible to know your ad-hoc method is correct!

So by "flawless" you can only mean "having some property but we cannot say what property it is".

Edit: added "flawless" definition


I'm not sure you're getting this. Let's say you want to build a video chat application with all of the same features as Zoom and not a single bug. That is, in fact, possible. Nothing about Rice's Theorem says you can't build a program without bugs. You can build a flawless video chat application. You won't, but you can. How? Very easy. Every time you're about to write code with a bug, don't. Write that same code but without the bug.


> Every time you're about to write code with a bug, don't. Write that same code but without the bug.

This procedure (algorithm) is ill-defined (is not realisable by a TM) as it would require verifying two things:

a) the code you are about to write has a bug (ie. has a specific property)

b) the code you wrote does not have a bug (ie. has a specific property)

Rice's theorem says both are impossible.


Maybe moving to a specific instance of Rice's theorem is clearer. My point is that while he can't solve the halting problem—i.e. he can't reliably determine whether any arbitrary program will halt—an infallible programmer can reliably write his own programs such that they always halt.

So, to go back to the original point, you can't reliably determine the correctness of every possible program. But you could, if you were not prone to error like all humans, always write your own programs such that they are always correct.


No.

From the fact that we know how to write programs that halt, you cannot deduce that we know how to write programs with an arbitrary (or even just useful) property, because there is no general procedure to do that.

In other words: you cannot use the algorithm of writing programs that halt to write programs computing nth digit of Pi :)


Because you built it in a series of steps, exactly zero of which introduced any undecidability.


See above - there is no reliable procedure (algorithm) that would allow you to build a TM with arbitrary properties (ie. implementing arbitrary requirements).


Programs don't have to have Turing-complete behavior. Compilers do, but not every program needs to implement uncomputable behavior on its inputs.


Actually it's the other way around:

In reality, most interesting programs accept Turing-complete languages - albeit not formally specified ones :)


Physical things fail or have failure done to them. You make a big claim.


Good point. I’m putting aside the physical nature of the machine that must ultimately run the software. Made a minor edit so I’m talking about a “system” not a “machine”.


Anyone old enough to remember IT Crowd? Here's my favorite quote from it: https://www.cipher-it.co.uk/wp-content/uploads/2017/11/ITCro...


I once deleted a 15TB database by temporarily shutting off an AWS i3en instance.

Nobody told me that i3en instances have "ephemeral storage" that gets wiped on shutdown.


I know of an AWS customer that achieved a similar outcome for their entire stack after the ops team was directed by finance to use only spot instances.


I imagine a lot of folks have lost toes to that foot gun.


I once heard someone use the term “therapeutic reboot” for when computers get into a bad operating state. It really stuck with me.


"Not a single living person knows how everything in your five-year-old MacBook actually works. Why do we tell you to turn it off and on again? Because we don’t have the slightest clue what’s wrong with it, and it’s really easy to induce coma in computers and have their built-in team of automatic doctors try to figure it out for us. The only reason coders’ computers work better than non-coders’ computers is coders know computers are schizophrenic little children with auto-immune diseases and we don’t beat them when they’re bad." [1]

[1] https://www.stilldrinking.org/programming-sucks


We thus internalize our oppression, and become our own abusers.


State management!

Basically, how can you know all the variables at one time? Start at initialization.

IMO the fact that everything "works" after reboots is simply because the state is 1. well defined and 2. available to all developers.


Is there a citable source for "The parable of Mike (Michael Burrows) and the login shell"?

I do not think that this whole discussion about returning to a "known good initial state" has much merit, because computers are Turing machines: they modify their own state and then act upon that state. Which means that unless your boot drive is read-only, nobody can predict what your computer will do on the next reboot. (Unless someone solved The Halting Problem and I missed it.)

But the discussion about assertions and initially fragile software that quickly becomes robust did strike a chord with me, because that's exactly what I have been doing for decades now. (Also, I see other programmers using assertions so rarely that I feel kind of lonely in this.) I have a long post about assertions (https://blog.michael.gr/2014/09/assertions-and-testing.html) to which I would like to add a paragraph or two about this aspect, and preferably cite Burrows.


A possible analogue of this for build/packaging systems is the Dockerfile used to build a container for infrastructure. No wonder: it acts as a reliable state-initialization procedure.


So what's everyone's take on restarting servers every now and then?

I think that it might be something you do occasionally (say, once a month or so) for each of your boxes, regardless of whether it's an HA system or not.

Though perhaps I can only say that because the HA systems/components (like API nodes or nodes serving files) won't really see much downtime due to clustering, whereas the non-HA systems/components (monoliths, databases, load balancers) that I've worked with have been small enough that it's okay to do restarts in the middle of the night (or on weekends, or whatever) with very little impact to anyone, even without clustering or switchovers at the DB level.

I mean, surely you do need to update the servers themselves, set aside a bit of time for checking that everything will be okay, and avoid patching a running kernel most of the time, right? I'd personally also want to restart even hypervisor or container-host nodes occasionally to avoid weirdness (I've had Docker networks stop working after lots of uptime) and to make sure that restarts will result in everything actually starting back up afterwards successfully.

Then again, most of this probably doesn't apply to the PaaS folk out there, or the serverless crowd.


The more time passes, the less need I see for periodic restarts. For example, a few years ago I would need to restart some servers due to memory fragmentation that would cause some allocations to fail (we needed some processes to get big chunks of contiguous memory). Right now it doesn't happen. Memory leaks are rarer, and in general things work well. I have some servers with more than a year of uptime that are still working well.


The first law of tech support. The second law is to check (unplug and re-plug) all the cables.


> Is this the best that anyone can do?

No. It's the best that can sometimes be done quickly.

Additionally, this doesn't mention the value of a good postmortem. Or the horrors of cloud computing, where restarting things is deemed a good enough mitigation because in the cloud these things happen, and nobody pushes for a good postmortem and repair items.


The whole article here is an argument for the proposition that, in many cases, this is not just the best compromise taking expediency into account, but the best thing to do, period (there are some caveats in the section titled "Complications.") You are entitled to a contrary view, but you have not said anything to counter the arguments presented here.

The last two sections do include the value of postmortems (or forensic analysis, as the author puts it) though from the perspective that this is more feasible and effective on a system that promptly crashes when things go wrong.


It sometimes really is as good as you can expect. You can have a compiler that checks for increasingly complex errors in intent but it won't defend you against cosmic rays.


One of my first jobs was working on DAB digital radios. Our workhorse low-end model was quite flaky, but it had a watchdog timer that would reboot it if it hung; and it was so simple that it would get back to acquiring a signal in a fraction of a second.

So the upshot was, a system crash just meant the audio would cut out for maybe 6 seconds, then start playing again. For the user, completely indistinguishable from poor radio reception, which is always going to happen from time to time anyway. Crashing wasn’t a problem at all as long as it was rare enough.

(I like making robust software and hate crashes, so I don’t like that approach, but I use that experience to remind myself that sometimes worse really is better, or at least good enough.)
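For anyone curious, the general pattern is easy to sketch in software too (illustrative Python with hypothetical names, not the radio's firmware): a watchdog that kills the process when the main loop stops checking in, relying on an external supervisor, or the hardware, to start it again:

    import os
    import threading
    import time

    WATCHDOG_TIMEOUT = 5.0        # seconds of silence before giving up
    _last_pet = time.monotonic()

    def pet_watchdog():
        # The main loop calls this regularly to signal it is still alive.
        global _last_pet
        _last_pet = time.monotonic()

    def _watchdog():
        while True:
            time.sleep(1.0)
            if time.monotonic() - _last_pet > WATCHDOG_TIMEOUT:
                # Main loop appears hung: exit immediately and let the
                # supervisor (systemd, a hardware watchdog, etc.) restart us.
                os._exit(1)

    threading.Thread(target=_watchdog, daemon=True).start()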


At the application level, this means restarting the application.

Even when no unexpected exception occurs, frequently restarting the application can sometimes be a valid solution, for example when there are potential memory leaks (maybe in your code, but maybe also in external library code) and the application should just run indefinitely.

Btw, a simple method to restart an application: https://stackoverflow.com/questions/72335904/simple-way-to-r...
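One common trick (a minimal Python sketch, assuming a Python application, and not necessarily the method from the linked answer) is to have the process re-exec itself, which discards all in-process state, leaks included:

    import os
    import sys

    def restart_program():
        # os.execv never returns on success: the running process is replaced
        # by a fresh instance of the same script with the same arguments.
        os.execv(sys.executable, [sys.executable] + sys.argv)

Pair it with a simple uptime or memory threshold and the application effectively reboots itself before the leak matters.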


As it is now common to shut down subsystems for reasons of power optimization, this sort of approach may be becoming an accidentally autonomous architecture.


There are 2 hard problems in IT:

    0. Cache invalidation
    1. Naming things
    2. Off-by-one errors
Index 0 is why reboots are so effective.
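A toy Python illustration of that point (hypothetical names): a cache that is never invalidated keeps serving a stale answer until you throw it away, which is the in-process version of turning it off and on again:

    config = {"timeout": 30}
    _cache = {}

    def get_timeout():
        # Cache the value on first use... and never invalidate it.
        if "timeout" not in _cache:
            _cache["timeout"] = config["timeout"]
        return _cache["timeout"]

    print(get_timeout())    # 30
    config["timeout"] = 60  # the underlying value changes...
    print(get_timeout())    # ...but the cache still says 30
    _cache.clear()          # the "reboot"
    print(get_timeout())    # 60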


I find even leaving them off pleasing. Don't run computers 24/7; your brain doesn't either.

The smaller the scale (the number of people affected), the more sense business hours make. Or times of non-operation. Who cares if a personal/small-business web server sleeps for some hours at night?


Back in the 90s I competed with my friends over who had the longest uptime. Most guys I met reset their computers multiple times a day to improve performance. I never turn off my Mac; it takes a software update or something like a general crash.


I don't think it's that unreasonable for a complex system with complex states to reboot. I imagine (I'm not a medical professional) that it works in roughly the same way as when human beings black out: a reboot.


"Before a computing system starts running, it’s in a fixed initial state. At startup, it executes its initialization sequence, which transitions the system from the initial state to a useful working state:"

LOL etc


Before HN, I didn’t know so many things are unreasonably effective.


Works great for bad OSes like Windows


No idea why you got downvoted. As a Linux user I don't restart my whole computer to try fix something. I can't even remember the last time I restarted without having installed a new kernel.

I also don't know why I would: I can restart just the broken components and get a less cluttered debug log by doing only that.


Same, my FreeBSD uptime is approaching 1 year. That's a server, though; it wouldn't be good for a desktop.



