Hacker News
Sherlock Holmes Debugging (seancassidy.me)
114 points by bqe on Nov 9, 2014 | 28 comments



I feel like this developer is trying to present a false dichotomy between his "Sherlock" method of debugging and the scientific debugging method. To be more specific, the example he gave fits the definition of the scientific debugging method exactly.

1. The author identified a problem he wanted to fix

2. He tried to guess the cause of the problem

3. He tested that guess

4. He analyzed the results of multiple trials.

5. When he was done with analysis, he worked out a fix and then tested again.

The only thing of note to the "Sherlock" method is that the author explicitly decided to spend a great deal of time in the guessing phase. What is my point here? Unless you have already experienced a bug and/or know exactly what is causing it right off the bat, the scientific method is still the most effective tool in your debugging arsenal.
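
To make the parallel concrete, here is a minimal sketch of that loop (the function and its arguments are hypothetical, not from the article):

    # A minimal sketch of the guess-test-analyze loop above.
    # test_for_bug() and candidate_fixes are hypothetical stand-ins.
    def scientific_debug(test_for_bug, candidate_fixes):
        """test_for_bug() returns True while the bug still reproduces.
        candidate_fixes is a list of (name, apply, revert) callables."""
        for name, apply_fix, revert_fix in candidate_fixes:
            apply_fix()              # steps 2-3: guess a cause and test it
            if not test_for_bug():   # step 4: analyze the result
                return name          # step 5: the fix held; keep it and retest
            revert_fix()             # guess falsified; try the next hypothesis
        return None                  # no hypothesis explained the bug

The only "Sherlock" twist is spending longer building the list of candidate fixes before running the loop.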


This was my thought as well. Sherlock's method can be seen as a special case of the scientific method that focuses on gathering all the facts before making your HYPOTHESIS. (You can't call it a theory until you've tested it, and even then, the theory explains the facts, not the other way around.)

Also, completely missing from this story is how deductions are made and the difference between deductions, hypotheses, and facts.

For example, fact: The message isn't sent. Deduction: cURL works, so the problem isn't the system. Hypothesis: The problem is in the API.


> I use the Sherlock Holmes method of debugging software.

Everyone does, surprise! Finding a cause from its effects is exactly what debugging is :)

That said, does anyone else remember the "Undo Step Forward" command in the Turbo Pascal 5.5 IDE? You could un-step through the program being debugged, essentially moving back in time. That was an incredible feature, a true engineering curiosity.


Reverse execution has been available in GDB for five years now:

https://sourceware.org/gdb/current/onlinedocs/gdb/Reverse-Ex...
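
A minimal session with GDB's built-in process record might look like this (the binary name is hypothetical):

    $ gdb ./myprog            # built with -g
    (gdb) break main
    (gdb) run
    (gdb) record              # start recording execution from here
    (gdb) continue            # run forward until the crash
    (gdb) reverse-step        # step backwards, one line at a time
    (gdb) reverse-continue    # or run backwards to the previous breakpoint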


... and I actually knew that, but forgot. Thanks. To TP's credit, it had this feature back in the early 90s, over 20 years ago. Though I suspect it wasn't the first to have it either.


There's also Mozilla's rr https://github.com/mozilla/rr.
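
The workflow is roughly (the program name is hypothetical): record an execution once, then replay it deterministically under gdb, where the reverse-execution commands work:

    $ rr record ./myprog   # records the execution to a trace
    $ rr replay            # replays the latest trace under gdb; reverse-step,
                           # reverse-continue, etc. work during replay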


"Everyone does, surprise!"

Not true. You need a deep understanding of the workings of the system and a logical mind to be able to work backward from the symptom to the cause. (Patience helps too.) Lots of programmers who don't have these traits just try fixes at random. Either they find the solution (after a long time), or they give up and ask someone who is better at debugging to help them. I frequently get asked to debug problems that other people have failed to debug.


I'm not sure it's that simple. For example, you push code that breaks something. You have two options (after rolling back the code, of course): either you try to accomplish what you were doing in a sufficiently different way that you think it will avoid the problem (essentially the random-guessing strategy you mention, but with purpose), or you dig into the problem to understand it well enough to fix it at the lowest level of abstraction where the issue is introduced. Both strategies can work. In certain scenarios, however, the former can be much, much quicker to enact, but you won't walk away with much reusable knowledge beyond the "the doctor says don't do that" kind.

I tend to just "go around" problems in the quick manner if I have the intuition that there is some top-level fork in the road I can take, and that anything I'd learn by performing proper diagnostics has minimal likelihood of being useful or reusable, being just some mindless incidental complexity of the task at hand. It's hard to be 100% correct in this prediction - often, as in the OP, digging can produce unexpectedly useful knowledge - but I think you can do better than random at guessing when it will. Part of the skill of being an engineer is learning which issues are worth tackling head-on and need to be understood deeply to insure against future peril, and which should simply be maneuvered around without much understanding.


The risk of not finding the root cause of the bug is that you don't really know for sure whether your fix is good - it may have just changed the state of the system in a way that masks the underlying problem, perhaps by causing it to appear under different conditions that you haven't encountered yet (but which your users may find very quickly).

Also, the code you just pushed may not have been the cause of the bug - it may have just exposed a pre-existing bug. If you just back out your change and try something else, you'd just be sweeping the bug under the rug.


Right, it may have just changed the state of the system to mask the problem, or it may have just swept it under the rug. But you also may have just avoided a complex, long debugging session which would not result in reusable knowledge.

Debugging something that ultimately turns out to be a 24-hour one-time glitch in AWS is a waste of time, no matter how fun the journey, unless you learn something useful in the process or have sufficient reason to believe you sped up the fix. Sure, if it leads you to decide you need to be more resilient to AWS failures, that's great. But often you've just burned man-hours and lost the opportunity to work on something else. I don't think it's reasonable to say that all debugging results in useful learning. (I'd probably argue the contrary, given the number of times I've ended up grokking incidental implementation details of some third-party software or system during debugging, only for that knowledge to become useless a few short months later when the code or service is paved over.)

Trying to tilt the odds in your favor so you make educated guesses around when this will be the case is part of the skill. Part of this skill is also being able to recognize when a bug you thought had been routed around was simply swept under the rug, and changing course to understand it properly instead of sweeping it under the rug again. ("Fool me once, shame on you, fool me twice... you can't get fooled again.") I find these are the exception not the rule, and are probably worth the cost. The joys of shipping outweigh the joys of understanding every facet of a misbehaving system, at least for me.


> But you also may have just avoided a complex, long debugging session which would not result in reusable knowledge.

I find this is becoming increasingly rare for me, to the point where I'm having difficulty remembering the last time it occurred. I run into the same "once off" problems more frequently than the name suggests, and I'm getting more value out of debugging them. I'm getting better at using my tools, better at tying symptoms to causes, better at coming up with appropriate defensive coding techniques, and writing more (reusable) code that makes them easier to debug when they (and entire categories of similar bugs) recur.

> Part of this skill is also being able to recognize when a bug you thought had been routed around was simply swept under the rug, and changing course to understand it properly instead of sweeping it under the rug again.

Frequently not possible on two fronts:

On recognizing it: I've seen a distressing number of bugs which are hideously context-sensitive (meaning "works fine on my machine" is the punchline, because it's 100% broken on another machine - only now you've got a terrible bug report and no good repro case).

On changing course: Any offline distribution channel. Any embedded system. Any onerously regulated environment. Once it's in the wild, it can become very expensive to fix.

> The joys of shipping outweigh the joys of understanding every facet of a misbehaving system, at least for me.

That's fair. I simply hope you avoid the sorrows of burned customers and data loss.


For debugging JavaScript there's tracegl [1], and you can always restart a frame in devtools [2].

1: https://github.com/traceglMPL/tracegl

2: https://twitter.com/ebryn/status/443080437485682689


This was a nice story about thorough debugging, but I can't see how the discovery method is anything different from what most people use.


Differential Diagnosis is an excellent approach to debugging. Watching debugging in the wild, I've seen two, maybe three, different approaches:

1) Magical thinking: This is the fear that the computer system has suddenly changed its mind and decided to stop working. The approach here is typically to apply some cargo-cult knowledge about how to fix the system without knowing what's wrong, then start hitting random things and see if it comes back.

2) Panic / Fixation: These developers usually get into a state where they panic over what's wrong and fixate on one component of the system where they believe the failure to be. This is slightly more productive than the former, but refusing to change your assessment based on new information is counterproductive.

3) DDX: This is very much the Sherlock / House type of debugging. It tends to require a lot of wide, shallow knowledge, as opposed to in-depth knowledge of the system. Typically, you work backwards from the symptoms, essentially doing a binary search from the widest set of possible problems to the narrowest.

DDX, in fact, is very difficult for most people to do under stress. Nelson from Stripe wrote a post on keeping a notebook to store all that wide information, effectively caching your search through the tree of possible causes. This is a very good approach.
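
As a concrete sketch of that widest-to-narrowest search (the host, port, and checks here are hypothetical), you can rule out whole layers in order before suspecting the application itself:

    # A sketch of DDX-style elimination for a "service unreachable" symptom.
    # Each check rules out an entire layer before moving to a narrower one.
    import socket
    import ssl

    HOST, PORT = "api.example.com", 443   # hypothetical service

    def dns_resolves():
        """Widest layer: does the name resolve at all?"""
        try:
            socket.getaddrinfo(HOST, PORT)
            return True
        except socket.gaierror:
            return False

    def tcp_connects():
        """Narrower: is there a network path, and is the port open?"""
        try:
            socket.create_connection((HOST, PORT), timeout=3).close()
            return True
        except OSError:
            return False

    def tls_handshakes():
        """Narrower still: does the TLS layer work?"""
        try:
            ctx = ssl.create_default_context()
            with socket.create_connection((HOST, PORT), timeout=3) as sock:
                with ctx.wrap_socket(sock, server_hostname=HOST):
                    return True
        except (OSError, ssl.SSLError):
            return False

    for layer, check in [("DNS", dns_resolves), ("TCP", tcp_connects),
                         ("TLS", tls_handshakes)]:
        if not check():
            print(f"failing at the {layer} layer")
            break
    else:
        print("transport layers fine; suspect the application")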


And if you want to program like Holmes as well...

http://www.amazon.com/Elementary-Learning-Program-Computer-S...

This was the first programming book I read, back in the eighties. Lately I got a copy for nostalgia's sake, and it's all still spot-on good practice today - problem definition, algorithm design, avoiding global state, etc. - although the language (both the Pascal syntax and the Conan-Doyle-style prose) probably wouldn't suit today's "Dummy's Guide" market.


Sometimes I have tried to use the scientific method to resolve hard-to-reproduce bugs.

Write a hypothesis, or a series of them. Consider what the hypothesis would imply about the behaviour of the bug or the code. Design an experiment to test the hypothesis, then continue with more experiments. List the methodology in the bug report so others can repeat the experiments. Finally, a proof of the cause of the bug is found.


I don't understand what makes this "Sherlock Holmes" debugging. Don't we all do this kind of stuff every day?

Cool story though.


Yeah. Eliminate possible causes until you've found the real one. There aren't many other reasonable approaches to debugging.


Also known as the "Five-Whys" elsewhere.

http://en.wikipedia.org/wiki/5_Whys
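
A made-up chain for a software failure might look like this:

    1. Why did the site return 500s? The app server ran out of memory.
    2. Why did it run out of memory? An in-process cache grew without bound.
    3. Why did the cache grow without bound? It was keyed on full URLs,
       including unique query strings.
    4. Why were unique URLs being cached? The cache key function was never
       reviewed.
    5. Why was it never reviewed? There's no review step for caching changes -
       that's the process gap to fix.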


Well "Sherlock Holmes Debugging" may be redundant, since this is really just "debugging" (though who would have clicked on that link?), I was most excited by this line:

> There is no sin in software engineering more serious than thinking some behavior of a computer system is magical or beyond our understanding.

I've been trying to find a good test case to get my junior engineers to step out of their code and see more of the layers. This is a pretty good mantra for improving your debugging skills.


Only loosely related but for lovers of weird programming texts, there's a bizarre gem called "Elementary Basic - Learning To Program Your Computer In Basic With Sherlock Holmes" by Henry Ledgard and Andrew Singer.

http://www.amazon.com/Elementary-Chronicled-Learning-Compute...


MTU problems are not unheard of; it's one of the things you always check for when you have this type of network problem. Especially when you're running jumbo frames, which used to be quite troublesome when the technology was new.

Another thing to check for is funny-looking TCP flags. Some firewalls tend to drop such traffic, and it may not end up in the logs you usually check.

That's why the first thing you do, when one connection works and another doesn't, is to tcpdump them both and compare. Just last week I had an application that ran SSL directly in one environment and did a STARTTLS-type thing in another, purely because of the underlying libraries.

It was immediately obvious from looking at it, but it would have been terribly difficult to guess. Don't start with Sherlockian reasoning; start by getting all the data.
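
Concretely (the interface and host names here are hypothetical), capture the working and the broken connection and compare them; on Linux, an MTU black hole can also be probed directly with ping's don't-fragment option:

    $ tcpdump -ni eth0 -w good.pcap host api.example.com   # working environment
    $ tcpdump -ni eth0 -w bad.pcap host api.example.com    # broken environment
    $ ping -M do -s 1472 api.example.com   # 1472 bytes of payload + 28 bytes of
                                           # headers = 1500; "message too long" or
                                           # silence suggests an MTU black hole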


What the article says: Go down the stack in a linear fashion and check everything

What a Sherlock Holmes story is: Seemingly-unrelated characters could be responsible for the crime

How the story would've been if it was really similar to Sherlock Holmes:

- Check your own software (Already done)

- Go down the stack and check all dependency and platform issues (Already done)

- Check if your clipboard works properly (New)

- Check if the monitor outputs pixels properly (New)

- Check if your eyes see clearly (New)

That would be Sherlock Holmes debugging!

On a serious note, I like the story. I also like the way things are done, but I assumed that's how every developer works. I looked at the link to the scientific approach and, IMO, that shouldn't be considered the de facto scientific approach. This should be the scientific approach.


It's always bugs like this that end up eating all of your time. That's why I find making small client websites or apps (as opposed to larger software projects) so hard to profit from.


This is a nice story, but I think something much more important is missed here... application of the scientific method.

All of this logical deduction is worthless unless you verify it with experiments. That is very much not what Sherlock Holmes does... but it is exactly what enabled the deduction in the story to be cemented into a reliable conclusion.


Debugging is detective work.


No shit.


What a terrible title/premise



