As someone who found four compiler bugs in three weeks - in a five-nines fault-tolerant OS, yet! - and who found a PostgreSQL optimizer bug within weeks of learning SQL, I think the key to being "that guy" is playing five-whys with every single bug you encounter.
I work with some very talented developers who, when they try something and it doesn't work, try something else. I am fundamentally incapable of that. If it doesn't work, I MUST KNOW WHY. Even if that requires building a debug version of my entire stack, adding all sorts of traces, and wolf-fence debugging until I have a minimal fail case.
It's a real limitation; if I hit an undebuggable brick wall, I have no ability to attack the problem from a different angle. Luckily, there are few things that are fundamentally undebuggable.
I also looked this up, and found the original paper that coined the term, which starts:
> The "Wolf Fence" method of debugging time-sharing programs in higher languages evolved from the "Lions in South Africa" method that I have taught since the vacuum-tube machine language days. It is a quickly converging iteration that serves to catch run-time errors.
It stipulates that the state of Alaska has exactly one wolf, so you build a fence across the middle of the state to find out on which side the wolf howls, then subdivide that half, and so on.
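To make that concrete, here is a toy C sketch of the idea (the phases and the invariant are invented for the illustration): put one fence - a cheap sanity check - halfway through the suspect code, and whichever side of the fence the wolf howls on is the half you subdivide next.

```c
/* Toy wolf-fence example: the program's state is a single balance that
 * must never go negative. The fence in the middle tells us which half
 * of the run lets the wolf in, and we then bisect that half further. */
#include <stdio.h>

static int balance = 100;
static int invariant_ok(void) { return balance >= 0; }

static void phase_a(void) { balance -= 30; }
static void phase_b(void) { balance -= 90; }  /* the wolf: drives balance negative */
static void phase_c(void) { balance += 10; }
static void phase_d(void) { balance += 5; }

int main(void)
{
    phase_a();
    phase_b();
    if (!invariant_ok()) {                    /* fence across the middle */
        fprintf(stderr, "wolf is in phase_a or phase_b\n");
        return 1;                             /* next: fence between a and b */
    }
    phase_c();
    phase_d();
    if (!invariant_ok())
        fprintf(stderr, "wolf is in phase_c or phase_d\n");
    return 0;
}
```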
I'm assuming in your case the wolf got replaced with a lion and Alaska with South Africa.
My experience with "fault-tolerant" OSes and their toolchains is that they are buggier than their off-the-shelf counterparts, probably due to limited use ...
Well... "fault tolerant" doesn't mean "correct". It only means it can tolerate faults and, presumably, recover from them. And since they are used in very limited contexts, the odds of having bugs that nobody noticed are much higher.
First off, I would say that is some pretty awesome work by this guy to chase this down. Including his work with the manufacturer to help them reliably recreate the issue.
Second, I would say that over the course of my 10-year career managing developers, I've heard many, many times that the bug was in the kernel, or in the hardware, or in the compiler, or in some other lower-level thing the developer had no control over. It has been the correct diagnosis exactly once; if I had to put a number on the overall rate, I would guess about 5%.
I have been the first to trigger two CPU bugs and came across a third a few days after it was discovered, before it was published. Once errata are published, software workarounds are usually put in place quickly, and tripping over them is rare.
Compiler bugs are another story entirely. I have found dozens of them (confirmed), and I can find more whenever I feel like it.
Out of curiosity, if someone paid you to find compiler bugs for a day, how would you go about it?
(I've found several missed-optimization bugs in gcc, but I found them while working on a project where I examine assembly frequently; I have no idea how I'd go about looking for a compiler bug).
One way of actively looking for compiler bugs is using a tool like Csmith[1]. Another is to compile some known-difficult code (e.g. Libav[2]) with various combinations of (optimisation) flags until the test suite fails. Most of the bugs I've found were during routine testing of Libav.
While I don't consider missed optimisations to be bugs as such, they are easy to find. Simply compile some non-trivial function and look at the output. There's usually something that could be done better, especially if some exotic instruction can be used.
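To illustrate the kind of inspection meant here, a small hypothetical example: compile the function below with something like `gcc -O2 -S peek.c` and read the resulting assembly. A byte swap written out long-hand is a classic pattern; a good optimiser collapses it into a single bswap (x86) or rev (ARM) instruction, and a chain of shifts and ORs in the output would be a missed optimisation worth a closer look.

```c
/* peek.c - a tiny function whose generated assembly is worth reading. */
#include <stdint.h>

/* A 32-bit byte swap written the "obvious" way. */
uint32_t swap32(uint32_t x)
{
    return  (x >> 24)
         | ((x >>  8) & 0x0000ff00u)
         | ((x <<  8) & 0x00ff0000u)
         |  (x << 24);
}
```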
> While I don't consider missed optimisations to be bugs as such, they are easy to find. Simply compile some non-trivial function and look at the output.
Perhaps you'll give me a little credit :) if I mention that I found missed optimization bugs in extremely trivial functions. One of them involved gcc generating several completely useless stores even at -O3: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44194
I've been a developer for over 30 years, from mainframes to micros, and a hardware problem has been responsible for a bug in my code exactly zero times.
This is because OS devs (and compiler devs) have suffered them for you. I've run into many x86 bugs, both documented and undocumented. Have you done much assembly?
Usually bugs would involve unlikely sequences of operations or operations in unexpected states. But there have been very serious bugs involving wrong math (Intel Pentium) or cache failures leading to complete crashes (AMD Phenom). These two made it to production and were show-stoppers because OS devs could do very little about them (in the Phenom bug, they could, but with a noticeable performance hit). I don't think I've seen any production CISC chip completely free of bugs. OS devs have to do the testing and the circumventing.
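For the Pentium FDIV erratum there was even a famous quick test that circulated at the time; a minimal version, using the classic constants from the reports, looks like this (on a correct FPU the residual is exactly zero, while an affected Pentium returns a slightly wrong quotient and the residual shows it):

```c
/* Quick FDIV sanity check; `volatile` stops the compiler from doing
 * the division at compile time, which would hide a hardware bug. */
#include <stdio.h>

int main(void)
{
    volatile double x = 4195835.0, y = 3145727.0;
    double residual = x - (x / y) * y;

    printf("residual = %.17g (%s)\n", residual,
           residual == 0.0 ? "FDIV looks correct" : "possible FDIV bug");
    return 0;
}
```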
I mean... typically x86 chips have DOZENS of documented bugs.
It's interesting how with significantly worse (or at the very least, comparable) complexity to manage, and uniformly horrific costs for repairing production bugs, the ASIC design industry and Intel/AMD in particular have managed to scrape by with something like <20 bugs between them in the past decade.
Perhaps we need to incentivize software developers with fear of execution, or something.
A recent x86 processor model has at least 20 "errata" that they'll tell you about; in the last decade they must have had hundreds. But most of them are just worked around so they don't affect you.
Modern x86 CPUs can trap arbitrary instructions to microcode. This means that most hardware bugs can be fixed with a firmware update that just slows down the CPU somewhat when it encounters the offending instruction.
There certainly are a lot of hardware bugs in CPUs -- it's just that most of them get fixed before anyone outside the CPU company ever sees them.
The amount of effort spent on what is generally called "functional verification" is much higher for hardware than for software. Also, the specifications tend to be clearer and the source code size is smaller than you might imagine.
But really, I have only ever experienced one bug in a compiler (that I hadn't written), and it was such an odd experience, like the patient having lupus.
Yeah, when I find myself drifting towards the 'maybe it is a bug in the compiler/OS/debugger' territory I know it is time to take a break from the debugging as it is rarely true[1]. Nice to see it occasionally happens :)
[1] I work on Visual Studio so I have found compiler/debugger bugs as I am generally using 'in development' bits, but far more often than not the bug turns out to be mine and mine alone :)
I had a few bugs traced down to the kernel on Windows. ALL of them were in third-party antivirus packages. In fact, if a machine blue-screened after installing our stuff, you could rest assured it had an antivirus installed, and that the antivirus was Kaspersky.
Didn't Raymond Chen write once how most Windows crashes (of some version, during some year) were due to beta nvidia drivers installed by gamers to increase FPS?
Disagree. In fact, were I to come up with a rule of thumb, I'd say the opposite is true.
Want to find bugs in Sun's Java 6 compiler for x64 Linux? Use annotations (yeah, I found one in their V30 release last week). Want to find bugs in MS' C++ compiler? Write your own templates (this was a few years ago; maybe it's better now?). The best programmers push the limit of their tools because they know what's "supposed to happen".
Poor programmers hit something that doesn't work and just try something else, because, well, they're just trying shit. I would go so far as to say that poor programmers are, in fact, unable to find compiler, optimizer, OS, or hardware bugs because, by definition, they probably don't have a firm handle on what's "supposed to happen".
I think what VBprogrammer meant is that thinking you've found a bug in a compiler / OS / CPU is often a warning sign you're a poor programmer. Often a beginner will have a bug in their code that is too subtle for them to identify, so they end up attributing it to some external factor. Actually finding a bug in a compiler / OS / CPU is, as you suggest, likely a sign you're doing something advanced or unusual, and therefore that you're perhaps more knowledgeable than most.
Yeah, that's exactly what I meant. Sorry if the sarcasm didn't quite carry.
I know these things can and do happen. I've come across one or two of these strange ones before, but too often I've seen people jump to the conclusion that someone or something else was to blame, without any real evidence other than that they had exhausted their shallow bag of talent.
I remember scripting on a MUD that used a customized version of the standard MPROG[1] patch for ROM-based MUDs. Whoever originally "documented" it just grabbed some docs from some other ROM-based MUD that used their own customized version.
The documentation had been completely wrong for several years before I started programming there. Once I realized that the documentation was lying to me, I started methodically examining how things actually worked by writing lots of very simple test programs and documenting the actual behavior.
The others didn't care why it was broken. Most of them were just trying to build cool areas, they weren't really programmers at all. They would just tweak things until it appeared to work or they were frustrated enough to give up.
[1] I'm sure a lot of HN knows this, but to save the non-gamers the hassle of looking it up, a MUD is a type of text-based online game and an MPROG is a script used to control the actions of the characters in the game.
Way too complicated to describe concisely, and I'm trying to get a release out just now; in the end, the wrong object was returned by a casting operator (a very, very wrong object). I added a named method that does the cast (the very same implementation), called that instead, and all was well.
Depends on the compiler. When I took my "hardware for CS students" class as an undergrad, our big project for the semester was to write a CPU simulator in C, and the campus labs had just rolled out the upgrade of gcc from 2.x to 3.x. I had a bug in my program that I just couldn't isolate, and after about eight hours of chasing it, I realized that the compiler had allocated space for an integer variable right in the middle of an already-allocated array, so the two variables were stomping on each other. I changed the name of the integer variable and my problems went away.
Apparently I wasn't the only one, because within a week all the labs were back on GCC 2.95.
Actually, that sounds a lot like a bug in your build system. Did you have a custom Makefile (either hand-written or provided by a teacher)? If you don't track dependencies carefully, so that every .c file that depends on a shared .h gets recompiled when that header changes, you can wind up with different object files disagreeing on the layout of structures -- which could cause exactly the sort of problem you describe. Changing the name of the variable could force the file to be recompiled, thus appearing to solve the problem.
To further support your point, the rollback to gcc 2.95 happened across many Linux distributions due to incompatible changes in the language introduced with gcc 3.0.
Many distributions rolled back so that the default "stable" compiler matched the one they had to use to build the packages - i.e. common sense.
Once the packages were updated to deal with the gcc 3.x language changes, the compiler and packages started appearing together.
My first programming job was doing VB programming in Access 2 programs that had to run on Windows 3.1. (Yes, this was in the last millennium.) I kept on running into bugs that I could demonstrate were in Access, not in my code. It was very frustrating.
My next job was in Perl. I went several years before I found an actual bug in the language. Which then went unfixed for years because someone might be using it. Despite the fact that in every significant Perl code base that I've seen since, there are real bugs in the code that nobody has noticed which trace back to the bug that I found. Why do you ask whether I am bitter?
So your suggestion failed glaringly for me when I was using VB, but it has held up much better since.
It did not use Visual Basic for Applications (VBA), but it did use Access Basic, which was a dialect of Visual Basic.
Access 95 had the ability to upgrade from Access 2, and that included the ability to migrate from Access Basic to VBA. The tool was not flawless (very little from Microsoft is), but mostly worked pretty well.
"Finding" them is a warning signal of a poor programmer if and only if the scare quotes mean that they have not actually researched the problem sufficiently to prove that the problem is in one of those areas.
These legitimate bugs do exist, and some of us have a talent for finding them with annoying frequency.
Less experienced programmers often "want" to find bugs in the compiler/OS/whatever because that way it's not their fault, and they lack the skill to track down difficult problems in their own code. More experienced programmers realize that finding bugs outside of your own code is often a disaster because frequently there's nothing you can do to fix it.
Nah. Finding a bug in the dev stack is a sign that you're running on the edge. I've encountered one or two myself. I sent one in and the company wrote back and said "yep, it's a bug, fixed next build".
Now, blaming without reproduction on the dev stack is a sign of a lamer. :-)
If that were true then there would be no need to ever release new versions of these things!
I have personally found bugs in Linux (kernel, libc), Oracle, various JVMs, etc, usually cases in which algorithms optimized for "normal" loads became pathological under extreme load. It's much more common than perhaps you'd think.
When I'm working, it's always a bug in the compiler, kernel or hardware. The semicolon was implied. Just give me a minute to work around the compiler bug.
I have only twice thought one of my problems was due to a compiler bug, and I was right one of those times (and that was because my company was stuck with a 4-year-old version of the compiler; the bug had already been fixed in the latest version).
Please stop referring to him as "this guy", he's well known in the BSD and Linux worlds. He had commit access to FreeBSD before many things we take for granted today even existed. His name is Matt Dillon and he's one hell of a hardware/OS hacker. http://en.wikipedia.org/wiki/Matt_Dillon_%28computer_scienti...
I'm an offender of the 'this guy' thing. I have heard of Matt before, but come on, there is almost no context as to who he is given in the posting, and do you really expect everyone in this community to know every semi-significant kernel hacker of the last 2 decades?
The link is nice for everyone's education, but I, for one, would appreciate a little less condescension.
Referring to Matt Dillon as "this guy" is akin to referring to Linus Torvalds, Theo de Raadt, Jony Ive, Zed Shaw, John Gruber, etc. as "this guy" - particularly in this community - everyone should know who Matt Dillon is.
And, yes, I would expect everyone in this community to recognize who these people are, and roughly what their contributions have been.
I'm sorry but no, it's not on the same "scale" of being universally known. You see, I've been in this industry since the late 80s, and I know who Linus is, but I have no idea who Matt Dillon is. I was never much into BSD, mostly using Linux, but I think even most Windows guys know who Linus is, while I doubt that many of them even know who Theo is (I do).
> And, yes, I would expect everyone in this community to recognize who these people are, and roughly what their contributions have been.
I'm sorry, but I (and everyone else) don't owe anyone any such thing. This is expecting too much from people. Yes, not knowing who Linus, Bill Gates, or Woz is would be strange, but it's ridiculous to say that Matt Dillon is as well known. I can't (and don't want to) know every significant Linux/BSD contributor. I have enough information filling up my limited brain as it is.
My mother knows who Linus, Bill Gates, and Woz are. These are people who have MSM (Main Stream Media) recognition.
Within the hacker community, anybody who's been "in this industry since the late 80s" should have awareness of a couple hundred or so major figures, most of whom my mother would not know. Pundits like John Gruber (MG Siegler, Sarah Lacy, Michael Arrington) are perhaps better known inside the HN community than outside of it - but I would expect everyone who has been around any hacker community for a couple of decades to recognize names like Dennis Ritchie, Guido van Rossum, Richard Stevens, Larry Wall, Tanenbaum, Bill Joy, etc...
In addition to this group of people, people like Matt Dillon should be on your peripheral radar, even if you don't follow BSD that closely. I've never run FreeBSD/DragonFly BSD, and even I recognized his name.
I'm not suggesting you memorize every last hacker/pundit of any note, but there is just a core _canon_ of people that we reasonably should be expected to know - it's knowledge like this that ties the broader hacker communities together.
I only know of John Gruber because he is mentioned/linked to on HN so often, same with Zed Shaw.
If I google Matt Dillon, I only get things about an American actor. So I can't really think of any reason I would have heard of him, since this is the first thing I have read (afaik) that has mentioned him at all. If he were writing a popular technical blog on top of his work, then I'm sure I would have.
Of course that is not to disparage his work at all, there are many people who do extremely impressive, valuable things but receive little fame. Even amongst educated people you would probably struggle to find someone who could tell you who invented the combustion engine (Étienne Lenoir) but I'm sure many could name the members of the latest fad pop music group.
It's just that some people are pushed into the spotlight more than others, or, more likely, they get there on purpose. Some people are happy doing good work in the background and only want recognition amongst a small group of peers.
- Dennis Ritchie of course, and Kernighan too (though I had to google for how to write it)
- Gruber and Arrington - yes
- Siegler and Lacy - no
- Van Rossum, Wall - of course
- Tanenbaum - yes
- Stevens - no idea
- Joy - rings a bell, but still no idea.
You see, I'm not really into remembering names, especially of people I'm not in close contact with. I only remember them if they are pounded into me for a while ;)
I'd say three, although the work of the third one (which I'd assert is Theo) is more obscure, it's pretty darn critical and influential[0]. Unless the two you were thinking of are Linus and Theo?
Zed Shaw gets props from me for being a great teacher. Lamson/Mongrel2 alone are huge contributions to the community. And regardless - I wasn't suggesting that he's on par with Linus or Theo, I'm just suggesting that in our community, he's a well known name. Definitely someone my mother would not know, but I'd expect anyone who's been on HN for a couple years to recognize.
Even before FreeBSD, I had the pleasure of interacting with him as a customer of Best Internet. He invented the concept of having multiple domains hosted on the same server back in '94.
My hat's off to this guy for the work he did, and indeed, finding a CPU bug is quite the accomplishment.
That said - what is it about the hardware manufacturers that makes them relatively immune to this sort of thing? Is it formal verification and rigid engineering process? Is it that they spend so much money developing these things that they better do them right, god dammit?
Sometimes I think that the whole industry would be much better off if everyone up the stack was held to these kinds of standards. If that were the case though, where would we be? We'd have rock solid systems, but how sophisticated would they be? Would UNIX exist? What about (a more bulletproof and less feature complete) Java?
> Is it formal verification and rigid engineering process? Is it that they spend so much money developing these things that they better do them right, god dammit?
All of the above - with the minor correction that it's not about money spent developing per se. Producing silicon masks is obscenely expensive, so catching a bug before tape-out vs. after tape-out can be a difference of hundreds of thousands of dollars. So, think of it as "you better get it right the first time, god dammit"
There's little open development, so there's little incentive to write up public articles. You could try something like Bob Colwell's "The Pentium Chronicles".
Considering it takes years to design a CPU, I don't think it would be worth it. The point of software is that you can change it quickly, and so it evolves faster than the hardware that runs it.
IMO that's an advantage that easily overcomes any instability that new software brings.
That said, there was an interesting article/interview (can't find it, sorry; maybe someone else can) with one of the creators of Hotmail. Off the top of my head, he said that because he came from the hardware side, the Hotmail software he created never broke, thanks to the processes and practices he followed. Perhaps there is a middle ground that we can take and get benefits all around.
On the other hand, you've got FPGAs, which bring the software stuff to the hardware guys - and based on my (limited) understanding of the hardware industry, they've improved certain prototyping exercises by an order of magnitude.
Agreed that the rapid development cycles and malleability of the product is part of what's made software great - but my feeling is that maybe we've let that slip a little too much - leading to the bloated, slow, buggy software that everyone runs all the time.
As an integrated circuit designer I can say that a lot of hardware is somewhat less complex than the software world. At the chip level anyway.
In the hardware world at the chip level, the environment is fairly rigid. A designer has a pretty good idea of what the environment is going to be for their chip. Linux is designed to operate in all sorts of environments: multiple cpu architectures, multiple versions of those architectures, the myriad peripherals, and all the other software that is going to run on linux.
That's not really the case with hardware at the chip level. We know what chip we are going to be talking to, or what family of chips. And most of those are designed in house, so there can be a lot of give-and-take going on.
It's just not a lot of combinations (compared to software, in my opinion.)
An exception would be memory, which is usually made by a third party. And it's also where you find incompatibilities... some manufacturer's ram doesn't work in macbook pros, etc. That's a bug.
And because the hardware can take a long time to iterate through design->implementation->verification->fabrication->prototype verification, there is a strong force driving designs toward being the simplest possible implementation that accomplishes the goals.
So you have these very well defined blocks of functionality that get pieced together and form a chip. The interfaces between blocks are very rigidly defined, and the amount of functionality in a given block is usually pretty limited.
At the level I work at, design is done in VHDL or Verilog (and now SystemVerilog is becoming a viable option). So that's a limited space. That could possibly be the biggest contributor to "less bugs" in hardware (if that is a true supposition). I get flustered with all the different software programming languages.
A big disadvantage of hardware design is that testing is first done in event based simulation. The tools available for this are pretty awesome, but it's still extremely slow. My last design would take about 100 hours to simulate about 300 ms of hardware time. That's a large part of why the iteration times are so long.
Now, with AMD and Intel, and designing these ridiculously highspeed CPUs, it's a whole different ball of wax. I expect a lot of their design is transistor level, full custom. Maybe they prototype in a higher level language, but you aren't going to 3GHz doing standard cell designs in VHDL.
Chip simulation seems like it would be a pretty difficult problem to multithread, since everything depends on everything else - can you run those simulation tools on clusters (or even GPGPU)?
Apparently Bulldozer, AMD's latest chip, is their first to start using automated design tools; one ex engineer claims that it resulted in 20% bigger and 20% slower designs:
He already had it. Matt Dillon is the business, and that's not a fake name of his. He quit FreeBSD core because they chafed at his awesomeness, after which he went on to start a FreeBSD fork, DragonFly BSD, one of whose goals is process and state portability across CPUs and machines. He was also the technical stud behind Best Internet, one of the earliest, largest, and highest-performance ISPs of the Mom&Pop era of the Internet (~93-97).
If you're interested in the types of bugs that are present in modern CPUs, AMD makes their errata documentation publicly available. (As far as I know, Intel's errata are not public. Edit: See tedunangst's comment below for a correction.)
AMD should be commended, and the guy who found the bug especially. This is how I got into software from the hardware world: in a complex custom system, sometimes the bug is in the hardware.
Patch around it in microcode (applied by the OS or BIOS on every boot; releasable as an OS update), disable the related CPU feature if possible (twiddling bits during OS initialization; sketched below), or trap any related exception the CPU throws, detect the bug's condition, and patch up the running task's state from the exception handler (again, another OS update).
If all else fails, issue a product recall or downplay the bug's severity.
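As a rough illustration of the "disable the related CPU feature" option sketched above: a chicken-bit workaround usually ends up looking something like the ring-0 snippet below, run once per core during early boot. The MSR index and bit here are placeholders invented for this example, not a real erratum fix; a real workaround uses whatever register and bit the vendor's errata document names.

```c
/* Hypothetical x86 chicken-bit workaround (kernel/ring-0 code). */
#include <stdint.h>

static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile("wrmsr" :: "c"(msr),
                     "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

#define MSR_HYPOTHETICAL_FEATURE_CTL 0xDEADC0DEu   /* placeholder index */
#define FEATURE_DISABLE_BIT          (1ull << 3)   /* placeholder bit   */

/* Called once per core during early boot to set the disable bit. */
void apply_erratum_workaround(void)
{
    uint64_t v = rdmsr(MSR_HYPOTHETICAL_FEATURE_CTL);
    wrmsr(MSR_HYPOTHETICAL_FEATURE_CTL, v | FEATURE_DISABLE_BIT);
}
```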
or recall all CPUs and replace them with new repaired ones. That's what they do in Wonderland. :D
On a more serious note, I wonder what auto manufacturers do. (There are many CPUs in modern automobiles, and auto manufacturers are often compelled to do recalls.)
Traditional embedded devices (i.e. not smartphones) tend to use very mature and well-understood CPUs. More so for things with lives and property at stake, like cars. If a bug does occur and causes a crash, normally a watchdog timer (sketched below) will reset the CPU quickly enough to avoid unrecoverable problems.
They're certainly not pushing the bleeding edge at all like the 3 GHz desktop/laptop processors.
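The watchdog pattern mentioned above looks roughly like the sketch below. The register address and reload value are invented, since every microcontroller has its own watchdog peripheral, but the shape is always the same: configure a timeout once, "kick" the timer from the main loop, and let the hardware reset the chip if the kicks ever stop.

```c
/* Minimal watchdog-kick sketch with made-up hardware addresses. */
#include <stdint.h>

#define WDT_RELOAD_REG (*(volatile uint32_t *)0x40001000u) /* hypothetical register */
#define WDT_KICK_VALUE 0xA5A5A5A5u                         /* hypothetical magic    */

static void watchdog_kick(void)
{
    WDT_RELOAD_REG = WDT_KICK_VALUE;   /* restart the countdown */
}

void control_loop(void)
{
    for (;;) {
        /* do_control_work();  -- the application's real job */
        watchdog_kick();        /* must run at least once per timeout period,
                                   or the watchdog resets the CPU */
    }
}
```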