Some of this looks like horrible advice, particularly the defeatist attitude towards what the article calls "programmer errors". Statements to the effect that you can never anticipate or handle a logic error sensibly so the only thing you should ever do is crash immediately are hard to take seriously in 2016. What about auto-saving recovery data first? Logging diagnostic information? Restarting essential services in embedded systems with limited interactivity? This article basically dismisses decades of lessons learned in defensive programming with an argument about as sophisticated as "It's too hard, we should all just give up".
As others have already mentioned, much of the rest is quite specific to Node/JS, and many of the issues raised there could alternatively be solved by simply choosing a better programming language and tools. The degree to which JS has overcomplicated some of these issues is mind-boggling.
Basically the argument is that once you hit a logic error (e.g. NullReferenceException, IndexOutOfBounds, etc.) you have already potentially corrupted the application state, so using any part of that state is dangerous, and saving it to be reloaded once the program has been restarted makes things worse - then you load the corrupted state into your restarted program. So while saving data is prudent, it should be done at regular intervals, so that after a logic/programmer error is detected the program can reload data saved from before the error occurred, not after.
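To make that concrete, here is a minimal C++ sketch of the periodic-checkpoint idea (Document, save_snapshot and load_last_snapshot are invented names, not anything from the article): snapshots are written on a timer while the state is still presumed healthy, never from the crash path, and the restarted process only ever loads data saved before the failure.

  // Invented illustration of periodic checkpointing.
  #include <cstdio>    // std::rename
  #include <fstream>
  #include <iterator>
  #include <string>

  struct Document { std::string contents; };

  // Write to a temp file first, then swap it in, so a crash mid-save
  // never destroys the last good copy.
  void save_snapshot(const Document& doc, const std::string& path) {
      std::ofstream out(path + ".tmp", std::ios::binary | std::ios::trunc);
      out << doc.contents;
      out.close();
      std::rename((path + ".tmp").c_str(), path.c_str());
  }

  // On restart, load whatever was saved *before* the failure; never trust
  // anything written after a programmer error was detected.
  bool load_last_snapshot(Document& doc, const std::string& path) {
      std::ifstream in(path, std::ios::binary);
      if (!in) return false;   // no snapshot yet
      doc.contents.assign(std::istreambuf_iterator<char>(in),
                          std::istreambuf_iterator<char>());
      return true;
  }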
One can also imagine having nested "top level" handlers for the various contexts, where errors in one type of context are not as serious as in others. Example: in a graphical application, an exception arising from a mistake in UI code does not affect the "document" the user has open, so it might be possible to "handle" this error by simply reinitializing the UI and reloading the active document (since we know which document is active).
An exception due to a logic error thrown during a transaction on the document, on the other hand, should probably be considered corrupting, so the application must instead try to reload some document state from earlier. If there is no such state, then the correct thing to do is to tear down the application, even if it means losing the document. It's better to lose the work and let the user start over than to allow them to continue working with data they aren't aware is corrupt.
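A rough sketch of that nesting in C++ (all the types and names here are invented for illustration): the UI gets an inner handler that rebuilds it and re-shows the known document, while a failure inside a document transaction falls back to an earlier snapshot or tears the application down.

  #include <exception>   // std::terminate
  #include <stdexcept>   // std::logic_error

  // Invented stand-ins for the application pieces described above.
  struct Document {
      void apply(int /*edit*/) {}                            // would mutate document state
      bool reload_from_earlier_snapshot() { return false; }  // would load a pre-error copy
  };
  struct Ui {
      void handle(int /*event*/) {}
      void reinitialize() {}
      void show(Document&) {}
  };
  struct App { Ui ui; Document active_document; };

  // Inner "top level" handler: a bug in UI code doesn't touch the document,
  // so rebuild the UI and re-show the document we already know about.
  void run_ui_event(App& app, int event) {
      try {
          app.ui.handle(event);
      } catch (const std::logic_error&) {
          app.ui.reinitialize();
          app.ui.show(app.active_document);
      }
  }

  // A logic error inside a document transaction is treated as corrupting:
  // fall back to an earlier snapshot, or tear the application down.
  void apply_transaction(App& app, int edit) {
      try {
          app.active_document.apply(edit);
      } catch (const std::logic_error&) {
          if (!app.active_document.reload_from_earlier_snapshot())
              std::terminate();   // losing the work beats silently corrupting it
      }
  }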
The "let it crash" philosophy assumes that there is some external system monitoring & restarting the program that crashes. They mention this explicitly in the article, but it's worth repeating: you need this external system anyway. Your program may stop executing for all sorts of reasons other than a bug in your program, from bugs in your dependencies to uncaught errors to infinite loops to cosmic rays to someone tripping over the power cord to an earthquake destroying the entire U.S. west coast. Your distributed system needs to handle these as operational errors, and in extreme cases you might not even have power available for 1000 miles; there is no possible way that a single process could recover from that.
They also recommend configuring Node to dump core on programmer error, which includes (literally) all of the diagnostic information available on the server.
It really depends upon the language and environment used. I work with C (almost legacy code at this point), and if the program generates a segfault, there is no way to safely store any data (for all I know, it could have been trying to auto-save recovery data when it happened). About the best I can hope for is that it shows itself during testing, but hey, things slip into production (the last time that happened in an asynchronous, event-driven C program, the programmer maintaining the code violated an unstated assumption made by the initial developer, who was no longer with the company, and the program went boom in production). At that point, the program is automatically restarted, and I get to pore through a core dump to figure out the problem.
I'm not a fan of defensive programming, as it can hide an obvious bug for a long time (I consider it a Good Thing that the program crashed; otherwise we might have gone months, or even years, without noticing the actual bug).
Logging is an art. Too little, and it's hard to diagnose. Too much, and it's hard to slog through. There's also the possibility that you don't log the right information. I've had to go back and amend logging statements when something didn't parse right (okay, what are our customers sending us now? Oh nice! The logs don't show the data that didn't parse - the things you don't think about when coding).
And then there are the monumental screw-ups that no one foresaw the consequences of. Again, at work, we receive messages on service S, which transforms and forwards the request to service T, which queries service E. T also sends continuous queries (a fixed query we aren't charged for [1]) to E to make sure it's up. Someone, somewhere, removed the fixed query from E. When the fixed query to E returned "not found," the code in T was written in such a way that it failed to distinguish "not found" from "timed out" (because that fixed query should never have been deleted, right?), and thus T shut down (because it had nothing to query), which in turn shut down S (because it had nothing to send the data to), which in turn meant many people were called ...
Then there was the routing error which caused our network traffic to be three times higher than expected and misrouted UDP replies ...
Error handling and reporting is hard. Maybe not cache-invalidation-and-naming-things hard, but hard nonetheless.
> I'm not a fan of defensive programming, as it can hide an obvious bug for a long time (I consider it a Good Thing that the program crashed; otherwise we might have gone months, or even years, without noticing the actual bug).
Not when you do it the right way! You should only mitigate unexpected situations if you also log them, monitor them, and handle them with error callbacks, etc.
> I'm not a fan of defensive programming, as it can hide an obvious bug for a long time (I consider it a Good Thing that the program crashed; otherwise we might have gone months, or even years, without noticing the actual bug).
I've had segfaults "hidden" for a long time because my artist coworkers weren't reporting crashes in their tools. They assumed a 5-minute fix was something really complicated. Non-defensive programming is no panacea here. Worse, non-defensive programming often meant crashes well after the initial problem anyway, when all sane context was lost.
My takeaway here is that I need to automatically collect crashes - and other failures - instead of relying on end users to report the problem. This is entirely compatible with defensive programming - right now I'm looking at sentry.io and its competitors (and what I might consider rolling myself) to hook up as a reporting back end for yet another assertion library (since none of them bother with C++ bindings). On a previous codebase, we had an assert-ish macro
which let code like this (to invent a very bad example) not fatally crash:
..._CHECKFAIL( texture, "Corrupt or missing texture - failed to load [" << texturePath << "]", return PlaceholderTexture() );
return texture;
Instead of giving me a crash deep in my rendering pipeline minutes after loading, with no context as to what texture might be missing. Make it as annoying as a crash in your internal builds and it will be triaged as a crash. Or even more severely, possibly, if simply hitting the assert automatically opens a bug in your DB and assigns your leads/managers to triage it and CCs QA, whoever committed last, and everyone who reviewed the last commit ;)
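The macro itself isn't shown above, but the shape is roughly something like this (a guess at the pattern with invented names, not the actual codebase's macro): report loudly, ideally into your crash/bug backend, then run the caller-supplied recovery action instead of carrying on into a crash.

  // Hypothetical reconstruction of an assert-with-recovery macro.
  #include <cstdio>
  #include <sstream>
  #include <string>

  // In internal builds this could also breakpoint, open a bug, CC QA, etc.
  inline void report_check_failure(const char* expr, const char* file,
                                   int line, const std::string& msg) {
      std::fprintf(stderr, "CHECK FAILED: %s (%s:%d) %s\n",
                   expr, file, line, msg.c_str());
  }

  // If `expr` is false: build the streamed message, report it, then execute
  // the caller-supplied recovery action (e.g. `return PlaceholderTexture()`).
  #define MY_CHECKFAIL(expr, message, recovery)                              \
      do {                                                                   \
          if (!(expr)) {                                                     \
              std::ostringstream oss_;                                       \
              oss_ << message;                                               \
              report_check_failure(#expr, __FILE__, __LINE__, oss_.str());   \
              recovery;                                                      \
          }                                                                  \
      } while (0)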
> Logging is an art.
You're right, and it's hard. However. It's very easy to do better than not logging at all.
And I think something similar applies to defensive programming. You want null to crash your program? Do so explicitly, maybe with an error message describing what assumption was violated, and preferably in release too, instead of adding a possible security vulnerability to your codebase: http://blog.llvm.org/2011/05/what-every-c-programmer-should-... . Basically, always-enabled fatal asserts.
This might even be a bit easier than logging - it's hard to pack too much information into a fatal assert. After all, there's only going to be one of them per run.
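For reference, a minimal always-enabled fatal assert might look like this (invented names; the point is just that it doesn't compile out in release the way assert() does under NDEBUG):

  // Stays enabled in release builds, unlike <cassert>'s assert() under NDEBUG.
  #include <cstdio>
  #include <cstdlib>

  #define FATAL_ASSERT(cond, what_was_assumed)                               \
      do {                                                                   \
          if (!(cond)) {                                                     \
              std::fprintf(stderr, "fatal assert: %s (%s) at %s:%d\n",       \
                           #cond, what_was_assumed, __FILE__, __LINE__);     \
              std::abort();   /* crash here, with context, not later */      \
          }                                                                  \
      } while (0)

  // Usage: make the violated assumption explicit instead of dereferencing null.
  //   FATAL_ASSERT(user != nullptr, "session lookup must not return null");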
Please, please, don't roll your own. It seems like an easy problem at a glance, but it's far from it. The more fragmentation in these communities, the worse off we all are. Sentry's totally open source, and we have generous free tiers on the hosted platform. Happy to talk more about this in detail, but if there are things you don't feel are being solved, let us know.
> Please, please, don't roll your own. It seems like an easy problem at a glance, but it's far from it. The more fragmentation in these communities, the worse off we all are.
I've rolled my own before, for enough of the pieces involved here, to confirm you're entirely correct. There's a reason I'm looking at your tech ;)
> Happy to talk more about this in detail, but if there are things you don't feel are being solved, let us know.
No mature/official C or C++ SDK. Built-in support for native Windows and Android callstacks would be great - I see you've already done some work for handling OS X symbols inside the Cocoa bindings at least. Plus hooks to let me integrate my own callstack collection for other platforms you haven't signed the NDAs for (e.g. consoles) and whatever scripting languages we've embedded.
All the edge cases. I want to receive events:
* When my event reports a bug in my connection loss handling logic (requiring resending it later when the connection is restored.)
* When my event reports I've run out of file handles (requiring preopening files or thoroughly testing the error handling.)
* When I run out of memory (requiring preallocating - and probably reserving some memory to free in case writing a file or socket tries to allocate...)
* When I've detected memory corruption.
* When I've detected a deadlock.
Some of these will be project-specific - because it's such an impossibly broad topic that Sentry's SDKs can't possibly handle them all.
No hard crash collection - this might be considered outside of sentry.io's scope, though? It's also hideously platform-specific, to the point where some of the tools will be covered by console NDAs again. Even on Windows it's fiddly as heck - I've seen the entire pipeline of configuring registry keys to save .mdmp files, using scripts to run ngen to create symbols for the unique-per-machine mscorlib.ni.dll and company - so you can resolve crash dumps with mixed C++/C# callstacks - and then using cdb to resolve the same callstack in multiple ways... it's a mess. I could still use the JSON API to report crash summaries, though.
On a less negative note, I see breadcrumbs support landed in unstable for the C# SDK.
EDIT: And then there are all the fiddly nice-to-haves, ease-of-use shortcuts, local error reporting, etc. - some of which will also be project-specific - but rest assured, the last thing I want to do is retread the same ground that sentry.io already covers. And where there are gaps, pull requests are one of the easier options...
At work, we regard exception collecting as essential for both development and production - if an application reaches internal QA, it's already reporting to an exception collector. This is separate from whatever logging is going on.
Sentry.io is one of the services that we use, but I don't have any connection beyond being a customer. I would echo the sentiment about not rolling your own, though: you want your exception collector to be a thoroughly battle-tested bit of code, and if it's reporting to a remote service, you want that to be as separate as possible from the application infrastructure, and extremely reliable.