It Is Never a Compiler Bug Until It Is (r6.ca)
256 points by nullc on Sept 30, 2020 | 132 comments



My first compiler bug was in my first year at Google. I'd just introduced a new system for the animation that updates your position while driving in Google Maps. It was perfectly buttery smooth as planned, except on my manager's commute the next day, where it constantly lurched back and forth. The others on the team were convinced that it had to be something with my code, but I didn't think it could be, because my code had no conditional statements and should either always be right or always be wrong.

It kind of looked like it was being fed nonsense speed values, so I got the GPS log from my manager and checked - but no weird speed values, actually a remarkably clean GPS log. Replayed his GPS on my phone - worked perfectly fine, buttery smooth. Eventually it came out that it only happened on my manager's phone. Borrowing said phone and narrowing things down with printf, I showed that my core animation function was being called with the correct values (a, b, c, d) but was being run with the wrong ones (a, a, c, d). This is when my manager thought to mention that he was running the latest internal alpha preview of Android.

Searching Android's bug tracker for JIT bugs, I found that they had a known register aliasing bug. Honestly I have no idea how it ran well enough to get to my code in the first place. But I tagged my weird animation bug as related to that (they didn't really believe me) and ignored it until they fixed their thing, at which point it went away.


JIT bugs must be horrible to debug because they happen in full running programs and might not always result in crashes. .NET's RyuJIT had a bug a few years ago that caused certain calculations to go wrong under specific conditions.


I ran into several other JIT bugs working on Google Maps. Since I worked on location, my code was the first complex thing that ran under most flows, and rather prone to being the first thing to crash when the system behaved in a very dumb way. About once a year we'd get a spike of crash reports from some ancient version of Android where a third party had built a custom JIT to get the system to run on very cheap hardware. Then the crash would be something absurd, like a null pointer on a value that was never assigned null, or an ArrayIndexOutOfBounds on a statically sized array that was only ever accessed with constants.

Sometimes someone would actually try to solve these, but I preferred the times when someone just changed the Proguard config and hoped it shuffled things around enough that it didn't trigger whatever bug the JIT had.


> spike of crash reports from some ancient version of Android

Quasi-related question: app installs/updates have been very very consistently SIGBUSing system_server with BUS_ADRALN on my ancient Galaxy Note 3 for about a year, but downgrading the Play Store app to the factory version (5.5.12 :D) makes the problem go away completely.

I've tentatively considered writing something up for issuetracker.google.com, but besides tons of logcats I'm not sure what info to provide, and I also wonder if the info would be considered useful due to the age of the device. Any advice on whether/how to proceed would be appreciated!


I think the biggest problem you're going to run into is that there are very few people assigned to maintaining old systems, and it's quite hard to get these issues brought to anyone's attention. The ones I mentioned only got fixed because they crept up to become some of the most common causes of crashes.

A bug report would be the first thing I'd want for something like this - in addition to logcat info it includes a bunch of process information, and apps can add custom data to the report. It'd also be nice to know anything to narrow down when the problem started - whatever bounds you can provide on when you were last absolutely certain it was up to date and working, plus your best guess. Someone's going to have to dig a Galaxy Note 3 out of a drawer somewhere, go into the archives of old Play Store versions, and try installing them until they find the version that broke it, so anything that makes that binary search process less terrible would be a start.


Thanks for replying!

> there are very few people assigned to maintaining old systems, and it's quite hard to get these issues brought to anyone's attention. The ones I mentioned only got fixed because they crept up to the top most common causes of crashes.

Thanks for this info. It kinda matched my own intuition about what to expect, but I wasn't sure where/how I might find out for sure. It's nice to be able to calibrate how to optimize effort.

> A bug report would be the first thing I'd want for something like this - in addition to logcat info it includes a bunch of process information, and apps can add custom data to the report.

Ah, of course. That's straightforward to do; I can just watch logcat for system_server crashes, then automatically take bug reports when they occur.

> It'd also be nice to know anything to narrow down when the problem started - whatever bounds you can provide on when you were last absolutely certain it was up to date and working, plus your best guess.

IIRC™, this has been happening from day 1 when I was given this phone (it was previously sitting unused in a drawer, yay).

> Someone's going to have to dig a Galaxy Note 3 out of a drawer somewhere, go into the archives of old Play Store versions, and try installing them until they find the version that broke it, so anything that makes that binary search process less terrible would be a start.

Oh meep. Of course.

I'd have no problem loaning the device (perhaps the carrier-specific image it's using is implicated), but that's probably a huge pile of overhead to deal with. I wonder if I can help out with the bisection process myself?

Hmm. Given that this is the Play Store and the device is (currently) not rooted, I couldn't do it the "Russian" way with random APK sites even if I wanted to.

I'm very curious what the internal path is here - does it depend on rooting or is there a way to isolate devices and send them specific versions of apps?

(Now I'm imagining being given a (bespoke) HTTP endpoint to hit (just for this) to switch versions...)

Maybe the best path forward would just be to root this thing already. I'd be very surprised if rooting affected the situation.

Thanks very much for the thought about bisecting GPS though. Definitely hadn't thought of that myself at this point, and don't think I would have anytime soon.


We just used adb install. It might've involved installing a special version of Android, but I deliberately forgot about that process because it sucked. Either way, I'd be kind of surprised if Play wasn't a special case, no clue what they do. (I'm no longer at Google, can't even check.)

Good luck with your quest though!


> I'd be kind of surprised if Play wasn't a special case

It probably is, but there's also the

  Failure [INSTALL_FAILED_ALREADY_EXISTS]
brick wall which makes perfect sense, is not especially specific to the Play Store, and will likely require root (then some careful `mv`s) to squish. Yayyy.

> Good luck with your quest though!

Much appreciated :) thanks again for the insight/feedback!


I caught a JIT bug in .NET back when I worked at Microsoft. Unfortunately I don't remember the specifics, but it was doing something wrong when trying to optimize a bit of code using AVX instructions.

I was immediately sure it was a compiler bug when I realized I could make the code work correctly if I changed the order of variable declarations. It would happen on something like:

  int a; int b; float c;

(again, that's only illustrative - too long ago for me to remember the specifics)

But not with:

  int a; float c; int b;

Folks in my team didn't want to believe it at first :)

Edit: found it here - https://github.com/dotnet/runtime/issues/17395

Not quite as I described but close.


That not only shows it's a compiler bug (assuming the program had no undefined behavior), it gives a general technique for generating compiler tests. Take any program, randomly change the order of declarations in a way that doesn't alter the meaning of the program, and see if both versions have the same behavior on some set of inputs.

This is an example of "metamorphic testing".
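
A hypothetical sketch of the idea in C (the toy function and names are mine, not from the bug report): the two variants below must behave identically, so any divergence points at the compiler, or at undefined behavior in the source.

    #include <assert.h>
    #include <stdio.h>

    /* Variant A: one declaration order. */
    double f_a(unsigned n) {
        int i; int acc = 0; double scale = 0.5;
        for (i = 0; i < (int)n; i++) acc += i;
        return acc * scale;
    }

    /* Variant B: same body, declarations permuted. */
    double f_b(unsigned n) {
        double scale = 0.5; int acc = 0; int i;
        for (i = 0; i < (int)n; i++) acc += i;
        return acc * scale;
    }

    int main(void) {
        /* Metamorphic check: the permutation must not change behavior. */
        for (unsigned n = 0; n < 1000; n++)
            assert(f_a(n) == f_b(n));
        puts("ok");
        return 0;
    }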


Mozilla engineer Yulia Startsev has a series of live-coding YouTube videos called "Compiler Compiler" about developing and debugging Firefox's SpiderMonkey JS interpreter and JIT:

https://www.youtube.com/results?search_query=Mozilla+Hacks+C...


Yep, JIT bugs are a bear.

We use LuaJIT pretty extensively and some codepaths trigger a crash-on-assert which we've confirmed is a JIT bug. But not consistently, the weather has to be just right. 95% of the time, randomly inserting `assert(true)` into the code un-breaks it.

We're using the beta, so this is an expected sort of thing. I'd love to be the guy who squashes it, but that's way outside my wheelhouse. I'm left crossing my fingers and hoping one of the updates makes it go away.

Globally disabling the JIT is a nice way to confirm that using LuaJIT was a good idea, though. Code that was instant starts taking seconds to complete.


Back when I worked on IBM's J9, the nightmare scenario was either a dropped memory fence, or the map keeping track of links between register / stack references and GC heap objects getting screwed up. Things could go on for a very long time before those errors manifested visibly.


Compiler bugs are indeed pretty frightening. A few years ago I bumped into one in some code that had potential to have a big impact. Unfortunately I am not at liberty to give details about the business setting except to say that we had processes in place that prevented any danger.

In the end I whittled it down to the following tiny C# program:

  namespace UhOh
  {
    internal class Program
    {
      private static void Main()
      {
        System.Console.WriteLine(Test(0, 0));
      }
      private static bool Test(uint a, uint b)
      {
        var b_gte_a = b >= a;
        var b_gt_a = b > a;
        System.Console.WriteLine(b_gte_a);
        return b_gte_a && b_gt_a;
      }
    }
  }
Compiling and running this with Microsoft's .NET stack with versions 4.7.0 and below, the output was incorrectly: "True, True" instead of "True, False". (IIRC, it also had to be a 64-bit Release build.)

The intermediate language was correct; it was a bug in RyuJIT.


I imagine some really interesting bugs must be encountered in the HFT/MM world. Risk controls are there for a reason :)


A couple of decades ago I worked with GSM handsets, and because you certify the GSM stack along with the compiler, you're pretty much tied to a specific compiler version.

At the time we were working on a new "feature phone" with a 160x120 pixel display in 4 shades of grey, which was a huge upgrade compared to our previous models. Another feature was full screen images for the various applications, and we'd been implementing it into the software and testing it for weeks without problems. After the development cycle came to an end, our release team created a new software release and sent it to our test department, which almost instantly reported graphical errors back to us. We tested the software image on our own handsets, and half the screen was "garbage".

We spent weeks looking over the merged code, as well as debugging pointer code, and found nothing. It wasn't until we were stepping through the paint code with a Lauterbach debugger that we noticed something was "off" with a pointer.

The platform was a 16-bit platform and memory was addressed using a page pointer pointing to a 64 KB memory page. As we traversed the bitmaps, fate would have it that this particular bitmap, in this particular build, was split between two pages, and when incrementing a pointer beyond the page limit, instead of incrementing the page pointer, it simply overflowed and started from the beginning of the current page.
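
A sketch of that failure mode with hypothetical types (the real page/offset handling was generated by the compiler):

    #include <stdint.h>

    /* Hypothetical far pointer on a 16-bit platform with 64 KB pages. */
    typedef struct {
        uint8_t  page;    /* selects a 64 KB memory page */
        uint16_t offset;  /* position within that page   */
    } far_ptr;

    /* Correct increment: carry into the page number on overflow. */
    void inc_ok(far_ptr *p) {
        if (p->offset == 0xFFFF) { p->page++; p->offset = 0; }
        else p->offset++;
    }

    /* The bug: the 16-bit offset silently wraps and the page never
       moves, so walking a bitmap that straddles a page boundary
       re-reads the start of the same page; hence half a screen of
       garbage. */
    void inc_buggy(far_ptr *p) {
        p->offset++;  /* 0xFFFF + 1 == 0x0000, page unchanged */
    }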

Another interesting bug we chased in the compiler was its inability to add more than 2 variables in an initial assignment.

i.e.

  int a = 1+2;        // a = 3
  int b = 1+2+3;      // b = 3
  int c = 10 + a + b; // c = 13
That took a while to figure out.


Long ago, I bought the MIX C compiler for DOS. It had a tendency to miscompile basic things like the ++ operator. Lost hours of my life on it.

One day I found out about DJGPP, and even if the download cost a fortune in phone time, life was so much better.

Now the C reference book included with MIX C turned out to be much better than the compiler itself, and served me well for a decade.


The first example sounds like a bug that could sneak in anywhere, but the second one makes me question how they laid out the compiler architecture.

Did they write a special expression parser just for declarations? I mean, this shouldn't be possible if there was just one expression parser?


Probably more to do with compile time static expression evaluation (optimization) than parsing. Seems like it's not actually recursing down, just evaluating the root.


That could be, but my first thought was it's only compiling it to a single assembly instruction (which can only have two operands). Or it generates ADD instructions using a succession of registers or stack locations, but the initialization code blindly assumes the result is in the first register or stack location.
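
A toy model of that second guess, which reproduces all three observed values (the chain of adds is computed correctly; only the final read is wrong):

    #include <stdio.h>

    /* Lower a left-associated chain v[0]+v[1]+...+v[n-1], n >= 2. */
    int lower_and_init(const int *v, int n) {
        int temp[16];
        int ntemp = 0;

        temp[ntemp++] = v[0] + v[1];            /* first pairwise add */
        for (int i = 2; i < n; i++, ntemp++)    /* temp2 = temp1 + v[2], ... */
            temp[ntemp] = temp[ntemp - 1] + v[i];

        return temp[0];   /* BUG: assumes the result is in the first
                             temporary; should be temp[ntemp - 1] */
    }

    int main(void) {
        int e1[] = {1, 2};      /* int a = 1+2;    -> 3  (correct)      */
        int e2[] = {1, 2, 3};   /* int b = 1+2+3;  -> 3  (should be 6)  */
        int e3[] = {10, 3, 3};  /* int c = 10+a+b; -> 13 (should be 16) */
        printf("%d %d %d\n", lower_and_init(e1, 2),
               lower_and_init(e2, 3), lower_and_init(e3, 3));
        return 0;
    }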


Of note to me: the certified compiler was actually non-performing :) Lauterbach, the savior of embedded wizardry. Still in use today with chips one would not consider "embedded".


I've encountered one genuine compiler bug in my (now 14+ year) career.

I was working on a defense contract, on a government system, where I was constrained by local IA policy to specific versions of various tools, including a relatively ancient version of gcc.

I can't recall exactly what the problem was, but I do remember figuring out after doing some research that the bug that was biting me had been identified and fixed in a later version of gcc. Which I was not allowed to install. So I had to implement a hack-tastic workaround anyway.

One of the best parts about that job - I was integrating the tool I was writing (in Python with Tk - it was the only installed and approved GUI library I could use) with a really old signal analysis library that had originally been written for VMS back in the day - then ported to SPARC/Solaris - then ported again to x86 (yes, VMS heritage was evident in places). Through many years of flowing through many different maintenance contractors, the library had become a giant amalgamation of Ye Olde FORTRAN, C, C++, and Python. To build it I needed a specific version of the Intel FORTRAN compiler, which my employer would not purchase, and the client IA policy would not allow on their system anyway. With much hackery, I managed to coax the damn thing into building using the "approved" gfortran that was already on the network.

Egad, what a horrible job that was.


I slightly regret leaving the writeup of this bug to my old employer, but it was a spectacularly hard-to-identify one.

Windows (including CE) has a crash detection system called "structured exception handling". You can use this to route segfaults to either "try/catch" (don't) or "__try/__except" special handlers. We had one of these to log the error and show a crash screen. It worked fine on the desktop. It worked fine on the MIPS systems. On ARM systems, it sometimes didn't work. At these times, the debugger didn't work properly either.
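
For readers who haven't used SEH, a minimal sketch of such a handler (MSVC-specific; the real one logged the error and showed a crash screen):

    #include <windows.h>
    #include <stdio.h>

    /* Runs as the filter expression; decides to take the handler. */
    int log_crash(DWORD code) {
        fprintf(stderr, "crash: exception 0x%08lx\n", (unsigned long)code);
        return EXCEPTION_EXECUTE_HANDLER;
    }

    int main(void) {
        __try {
            volatile int *p = NULL;
            *p = 42;                          /* access violation */
        }
        __except (log_crash(GetExceptionCode())) {
            /* show the crash screen, flush logs, etc. */
        }
        return 0;
    }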

I eventually found the kernel stack unwind code (with WinCE you get some, but not all, of the kernel source). For some reason this was looking at the jump instructions on the way down, and the particular problem was that virtual method calls were implemented as "LDR pc, [r3]" after computing the target vtable location in r3.

r3 is in the "clobberable" set of registers in the ARM calling convention, so if it got overwritten lower down the call stack the unwind handler would read the wrong value and fail to follow the stack.

Fortunately it turns out there were two different versions of the ARM compiler, shipped in different toolkits by Microsoft (why? who knows) and using the other one didn't trigger the bug.

We checked the known-good compiler into source control and kept it there as critical build infrastructure.


I worked with a guy who had in fact once found a javac compiler bug, and he was maddening to try and get to fix bugs, because he'd just always point at the compiler.


What's worse is that those policies usually mean the product will have more defects and be easier to attack.

I've heard of a story where a government standard explicitly demanded a smaller encryption key size, way below the current industry standard, for 'secure' applications.


The same happens in medical devices. Changing to newer versions can be very expensive in terms of testing and documentation so you stick to old stuff which may not work well. It's a tricky balance. Switching to a newer compiler may fix your current problem but introduce a ton of other subtle problems.

I often think that we create inferior solutions because it’s too hard to get newer methods approved. But then the newer methods may also cause problems.


I have encountered two compiler bugs in my (almost) 40 year career.

In about 1988, a bug in Apple's MPW Pascal compiler. I refused to believe it was a compiler bug, until I finally inspected the generated code. IMO the only way to be taken seriously about a compiler bug is to distill the defective compiler behavior down to a short (like one page) example. Also helpful is to show the generated code and how it is wrong. Bug was acknowledged and fixed.

In the mid 1990s, dabbling with C++ on (classic) Mac, I upgraded to a (name brand withheld) version 8.0 C++ compiler. The generated code behavior was obviously wrong. To make matters worse, it was possible to crash the compiler. The problems with that compiler were so bad that I simply ditched it, and it didn't last much longer commercially. Sad, because its predecessor compilers (mostly C and Pascal) had been very good.


Vernor Vinge's software archaeologists are already among us.

I wonder what percentage of the job market that niche will be in fifty years.


> I've encountered one genuine compiler bug in my (now 14+ year) career.

you're so lucky. between just 2014 and 2018 I reported something like 30 bugs to msvc, gcc, clang (in decreasing order)


When something doesn't work as expected, I'll often check disassembly. That can massively cut troubleshooting time when something "smells" like a compiler bug.

This is why, ideally, everyone should learn to read assembler output. This is not limited to C/C++/Rust/etc. native code; similar output is typically also available, for example, for JVM and JavaScript JITs.

Haven't found any miscompilations so far (unless you count braindead codegen), but quite a few hardware bugs. Including one CPU bug.


I agree. Also, depending on the compiler you're using, you may have some additional opportunities to look beneath the covers.

For example:

- With Clang, you can dump the C/C++ AST and/or the LLVM IR.

- With GCC, you can dump the Gnu Assembly (with source-level annotations).

These views can be helpful, especially for someone unfamiliar with the target machine's instruction architecture and ABI.


Do you have any recommendations on resources for picking up assembly and learning more about JITs?


Hmm...

First of all, https://godbolt.org/. C/C++/Rust to assembler, very useful.
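
Paste in a trivial function and the calling convention becomes visible immediately. Roughly what gcc -O2 for x86-64 shows (exact instructions vary by compiler version):

    int square(int x) { return x * x; }

    /* roughly:
     *   square:
     *       imul edi, edi    ; first integer argument arrives in edi
     *       mov  eax, edi    ; the return value goes in eax
     *       ret
     */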

Learn the calling conventions, basic arithmetic, flags, conditional branching. Understand stack management.

Single step through functions in a debugger to see how things work, and for example how a stack frame is set up. Pay attention how registers and flags are affected by the execution. See how conditional branches are affected by the flags.

When you're looking at the code, remember that there are often weird looking details, like unused portions in stack and in compiled function codegen (loops, entry points) for alignment purposes — modern CPUs hate unaligned things.

Note: Some x86 instructions have implicit register use that might not be directly obvious. Like PUSH, POP, IMUL, IDIV, LOOP, STOS[BWD], MOVS[BWD], etc. They can affect registers that are not mentioned in the instruction operands.

In general, if things look weird, just google the instruction. There are far fewer surprises in other mainstream architectures, like ARM. All architectures do have vector instructions that might confuse you at first, like x86's vectored double-precision add, VADDPD. Again, just web search them. No one remembers all of the instructions by heart, there's just no point.

Web search for assembler tutorials and simulators. They're too many to list, just pick something suitable for your taste.

In short, play around. Don't get scared by something weird, just look it up.

Don't stress if you don't understand everything, you can always look it up or try it out in a debugger. Even a little bit can help quite a bit.

JITs usually have some way to display generated assembler. For example, to see the native x86/ARM/whatever code generated by JVM you'd say something like:

  java -Xbatch -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
(-XX:+PrintAssembly needs the hsdis disassembler plugin installed; -XX:+PrintCompilation merely lists which methods got compiled.) Maybe throw in -Xcomp to force compilation of everything. I don't remember the details, just look them up. :-)

Other JITs have similar ways, once you know this kind of thing exists it should be easy to look it up.


> Again, just web search them. No one remembers all of the instructions by heart, there's just no point.

Another good tip: download the official architecture manuals, they're freely available for the most relevant major architectures (and for x86, both Intel and AMD have their own version). They're monstrous doorstoppers (several volumes with many thousand pages), but include in excruciating detail the description of every single instruction, with the advantage that they'll work even if your Internet connection is offline.


> but include in excruciating detail the description of every single instruction

well, at least the official ones with the officially supported operands. Sandsifter has a lot to say about how many undocumented instructions exist on just the "exposed" x86 part of the CPU, nevermind the ARM and other sub ring-0 stuff.


Yes, but in the context of this thread: undocumented instructions can change from generation to generation of a processor architecture, while documented instructions are much more stable; and because of that, compilers will not generate undocumented instructions. When reading compiler output, the official documentation of the instructions (plus documentation of the calling conventions) should be enough.


I'm still a bit of an assembly/JIT newbie, but a few thoughts:

When you write assembly code, it's generally specific to several details, which in turn affects your choice of tutorials from which to work:

1) The target architecture (x86-64, AArch64, MIPS, etc.) This defines what instructions you have available, the memory model, etc.

2) The architecture ABI. E.g., what's the protocol for passing parameters to functions you call, and for receiving the return code. [0]

3) The particular assembler you plan to use. Gnu As (and I assume others) provides some directives that don't map directly to machine instructions, but do some book-keeping to make your life easier.

4) (Potentially) the file format for the resulting code [1]. I think this is one area where the assembler and linker utilities can shelter you from the ugly details.

FWIW regarding JIT: Lately I've started playing around with Xbyak [2]. It's probably a bit light on the documentation, but for the most part I've found it to be an easy way to get started with JIT.

[0] https://en.wikipedia.org/wiki/X86_calling_conventions#System...

[1] https://en.wikipedia.org/wiki/Comparison_of_executable_file_...

[2] https://github.com/herumi/xbyak


There are lots of online talks, hard to pick something on the spot, here are a couple of quick ones.

Maybe start with JF Bastien's talk at CppCon 2020. He did a very nice historical overview of JITs.

"Just-in-Time Compilation"

https://www.youtube.com/watch?v=tWvaSkgVPpA

Then some quick stuff:

"Understanding HotSpot JVM Performance with JITWatch"

https://www.infoq.com/presentations/jitwatch/

In Visual Studio you can directly see CLR JIT assembly in debug mode (F12) or via WinDbg and the SOS plugin.

https://docs.microsoft.com/en-us/windows-hardware/drivers/de...

https://docs.microsoft.com/en-us/dotnet/framework/tools/sos-...

You can also play with it online, https://sharplab.io/

V8 Blog

https://v8.dev/


I look at a fair amount of disassembly. One tip to understand something complex in the debugger is to skip to callsites if they are available in the interesting part of the code. The compiler usually has to put values in conventional locations, either as arguments or live-through variables, so your variables are easier to track.


Make it more accessible from your build system. That way, when your curiosity is piqued (or frustration has peaked...), it's easy to pick it up and see for yourself.

With GCC, it's as simple as adding `-save-temps=obj`. You get preprocessed source (.i) and assembly (.s) emitted alongside the object files.


Do you have a specific CPU family or ISA in mind? (ARM, x86,...) And what is your background in terms of general programming skills?


I am also interested in this.


A very junior engineer who had just started reported a compiler bug to our all-technical-staff mailing list. He was testing the not-yet-released next version (of Tcl), so it was possible, but we had appropriate skepticism and someone on the list asked for the smallest reproduction case.

A few hours later, he verified it and produced on-list a reproduction case where a variable could not be incremented by 1 but could by 2 or any other number.

Turns out he’d been taught in typing class that l (lowercase L) could be used for 1 and carried that into computing.

WONT_FIX


> Turns out he’d been taught in typing class that l (lowercase L) could be used for 1 and carried that into computing.

A good way to keep them busy would have been to demand they type an exclamation point.

(Hint: Backspace-and-overstrike a period with a single quote.)


When did this happen? It's hard to imagine this happening anytime in, say, the past 15 years...


Summer of 1997 or 1998. Fresh college grad. Pre-release of some patch level of Tcl 8 (which introduced byte compilation, making it slightly more plausible there would be a compiler/interpreter/language bug).


Oh ha. That is like, just about the latest that could happen, I imagine...


Yeah. I grew up typing on computer keyboards so I couldn't believe the explanation. Some of the older employees were much more understanding as they learned to type on typewriters that literally didn't have a 1 key at all.


Oh, compilers are fun. Just recently I was reading through Rust's bug tracker, as one does, and learned that comparing function pointers is not deterministic. Compiling this code [0] in Debug mode yields different results than in Release mode. You can read the whole discussion about whether it's an LLVM bug, a Rustc bug, undefined behavior, intended behavior, a pretty serious bug, or nothing to worry about over at [1].

[0] https://play.rust-lang.org/?version=stable&mode=release&edit...

[1] https://github.com/rust-lang/rust/issues/54685


> comparing function pointers is not deterministic

The comparison is deterministic - the perhaps unexpected part is that two distinct but identical functions in the source code are folded into one in the binary.


That bug was actually a bit deeper; it merged the functions but would statically fold some comparisons to false while letting others persist to runtime (where they would evaluate to true) in the same build. You could do `x == y` twice and get different results, one of which was a lie.


God, that issue is a mess. Lots of people missing the forest (Rust and LLVM are breaking constant folding guarantees) for the trees (function pointer equality is weird).


It should be noted that function pointer comparison is defined in a way that need not be deterministic across different compiler option sets; it's at least for now possible, but mostly useless.

Though there are bugs involved, too ;=)

(Due to const-folding some comparisons vs. doing them at runtime, there was/is non-determinism between different call sites in the same build ... )


Just a heads up, you linked to the main Rust Playground page, not an actual gist.


Damn, you're right. I didn't notice because Playground remembers the last content you had in the editor... Anyway, the correct link is https://play.rust-lang.org/?gist=62d362e2bf72001bd3f3c4a3ed0...


Since we are sharing stories about bugs we ran into in compilers...

I once ran into a bug where "bash" would run commands out of order. It wasn't hard to trigger the bug, but it wasn't deterministic either.

When I first noticed the bug on my production systems it drove me insane, since the logs being generated were impossible. It took me weeks to figure out that bash was running commands out of order.

Then, when I tried to report this bug, I ran into a lot of resistance. First over IRC, nobody believed this could possibly be happening -- and I was eventually directed to the mailing list [0], where the maintainers were initially not able to replicate it, but eventually more required elements were identified and the bug was fixed.

[0] https://lists.gnu.org/archive/html/bug-bash/2015-06/msg00010...


Some years ago I worked on a part of a specialized steering system for a car. This was done with certified everything (certified compiler, certified processor, a lot of paperwork, etc.).

This was a 16-bit processor and the C compiler had a "funny" bug. If you had a struct with three 8-bit values in a row and a 16-bit value afterwards, it would overlap the last 8-bit value with the 16-bit value:

  struct {
    int8 a;
    int8 b;
    int8 c;
    int16 d;
  }
In this case the variables c and d would have the same address. This was on a CPU where we didn't have a debugger (not enough memory left for it); we only had a serial port for debugging.


I guess someone told the compiler authors that the automotive industry was a unionized industry.


For anyone not well-versed in C: a union does something similar to the bug mentioned. It's like a struct, but using the same address for each member.
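
For example (this prints 1200 on a little-endian machine, because writing c clobbers the low byte of d):

    #include <stdio.h>

    union overlap {
        char  c;   /* all members start at the same address, */
        short d;   /* so writing one clobbers the other       */
    };

    int main(void) {
        union overlap u;
        u.d = 0x1234;
        u.c = 0;                          /* zeroes the low byte of d */
        printf("%x\n", (unsigned)u.d);    /* 1200 on little-endian    */
        return 0;
    }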


You mean a few variables in a union doing the work of one variable? Yes, poor joke, I can see my karma burn....


A serial port is all you need, if you have room for a wee GDB stub. Then you get the full power of GDB.

I do this routinely where the target has 256GB of RAM, and (not incidentally) specialized network hardware, but no dev infrastructure except gdb-server (which provides the stub) and sshd. I build in a docker image that matches the target, but with dev tools, with the output bin directory sshfs-mapped to a directory on the target. I run the binary on the target under gdb-server, opening a socket listener. Then I run gdb natively on my dev machine, and `target remote server:61231` to attach to that socket. If I didn't have easy access to listening ports on it, I could ssh-tunnel one in.

So, a serial port and small RAM doesn't have to mean you have no debugger.


It sounds like the poster was targeting a microcontroller rather than a more generally capable "embedded" CPU. 16 bit CPU, probably on the order of 100KB RAM and code space total if you're lucky. No operating system in the common sense, although you might have some notion of task switching if you're fancy. "Wee" in that context would imply a footprint on the order of 1KB of code and maybe 100 bytes of RAM.

I assume, by saying there wasn't room for debugging functionality, the poster meant that the "jtag" or equivalent hardware port simply couldn't work for single stepping due to the particular architecture requiring compiled-in cooperation of the firmware, and they didn't have the kilobytes of memory to spare.

These days, it's becoming more reasonable to throw Linux based compute nodes at problems previously best served by microcontrollers. A more powerful CPU isn't a superset of a microcontroller, though. Microcontrollers are still necessary when you have "hard" timing requirements and you need to account for where your CPU cycles are going. Even a seemingly "solved" problem like participating on a CAN bus is difficult for a Linux based node. For example, while you can easily purchase CAN interface boards for a Raspberry Pi and send and receive messages, you are pretty much guaranteed to drop some percentage of incoming messages at realistic bitrates. All the boards use MCP2515 SPI CAN controllers, and the Linux driver simply can't schedule SPI transfers in response to interrupts fast enough to avoid mailbox overruns inside the controller. Maybe it's somehow been cleverly fixed since I last looked at it though?


It is fairly common nowadays to run Linux as one task on an RTOS, and have other tasks manage the CAN controller and other devices that need a low latency response. Or, just to use a coprocessor for low-latency work, as is commonly done to manage wifi.

Routing gdb stub traffic through a hosted Linux to an RTOS task or coprocessor is not an elementary exercise, but is something an engineering student might be expected to implement, even as just part of the real project.


That's a nice way to get gdb!

In our case it would probably not have helped. We had a fixed old MCU board where the functionality grew over the years. We were fighting over bytes...


My first was at my first job out of school. It was a bit of an adventure telling my manager. It was in C, but with an old GCC version on an architecture like MIPS. My code would seemingly never run through a switch statement correctly, but it worked fine with if statements. Luckily and unluckily, the company was large, ran a custom GCC build from a third party and had a support contract. When I filed the bug, they said "there's a known issue with large jump tables on that GCC version, disable some optimization with this flag."

I think that made me just a little paranoid. I generally trust things, but depending on their popularity and how likely it is that my code path is run by lots of users, I realize library (and compiler!) bugs happen.


I have found bugs in Gcc, and reported them. I check on them once every few years to see if anything at all has happened on any of them. It seems worth distinguishing code-generation bugs from other compiler bugs. Most of my Gcc bugs are not code-generation bugs.

Back in the '80s, the C++ compiler was `cfront`. We spent half of every day bisecting source files to identify the line that would crash the compiler, and doctor it to step around the bug.

People who used to use the Lucid compilers said they were happy when Lucid flopped, because from then on their compiler only had known bugs, instead of a new crop every few months.

Things are better, nowadays, with compilers.


The problem with compiler and standard lib bugs is it's the last thing you suspect. You're always going to look at your own code first, because 99/100 it's you and not them. You're never going to immediately think "compiler bug", your first port of call is gonna be "I must be using the API wrong".

I discovered a bug in the Swift standard lib once, and it took ages before I got to the point where I decided to strip out my own code, just to make sure it was me. And it wasn't, there was genuinely something wrong in the lib that other people on SO were also able to reproduce.

Good on him for finding a bug in secp256k1 too. When it comes to cryptography code, it can be very hard to know what the right answer is. I always find some examples on the internet and put them in a unit test to make sure I'm not misusing the API, because if you do, your answer looks the same: a bunch of numbers in a byte array. To know that your numbers are wrong, you need to be sure you are testing them correctly. Which you can't be if you don't know whether you're using the API correctly.


I've worked with people for whom "compiler bug" is their first port of call...


That's a rookie mistake. After being wrong enough times, most people will catch on that the chances of a compiler bug are near enough to zero to not worry about. The last actual compiler bug I can remember was probably 30 years ago.


If we expand looking for a compiler bug to looking for a bug in dependencies, I still don't think that should be the first step of searching for a problem (unless your dependencies are that bad, which sometimes they are). However, it's not a bad thing to do. Sometimes you will have bugs in your dependencies, so you should know how to debug them; it's a skill you should practice. Also, any time spent debugging your dependencies is time spent getting deep knowledge of your dependencies, which is important to have. Anyway, what else are you going to do when you're stuck on a debugging task where it looks to you like the code is right, but the answer is wrong?


there is a telltale signature for compiler bugs:

1) your test suite works over several revisions of the compiler

2) after you upgrade, suddenly one test fails (especially if it's out of hundreds)

2a) you can isolate the minimum condition and it really leaves you scratching your head

3) rewriting the code in a slightly different way makes it pass

The time I found a compiler bug it was because of an optimization that missed a corner case that I just happened to be using.


Or it was undefined behavior in your code, like depending on the evaluation order of function parameters, or one of mine: testing an array element before testing that the index was within bounds.

  Broken: table[table_i].values[j] && j<category_map_count && to_i<category_array_len;
  Fixed:  j<category_map_count && table[table_i].values[j] && to_i<category_array_len;

Optimizer: "j was used to access values array, j outside array would be undefined behavior, therefore it must have been within the array boundaries and does not need to be tested."

Oh, and threading bugs where some variable that should have been mutex locked or made atomic gets moved in or out of a register depending on the phase of the moon, causing other threads to fail to notice changes to that variable...but add a debug line and it suddenly works.
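
That last failure mode is easy to sketch in C (a hypothetical reduction, not code from any particular bug):

    int done = 0;   /* should be _Atomic, or protected by a mutex */

    void spin_until_done(void) {
        /* Without atomics, the compiler may load `done` once and keep
           it in a register, turning this into `if (!done) for (;;);`,
           so another thread's store is never observed.  Adding a debug
           print forces a reload, and the loop "suddenly works". */
        while (!done) { }
    }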


A wonderful example of this was some Linux kernel code that ended up causing a security vulnerability. It was something approximating:

  int foo(struct data *data) {
    struct member *member = &data->member;
    
    if (data == NULL) {
        return -EINVAL;
    }

    // Do stuff
  }
Computing the address of member is just a matter of taking data and adding the offset of member, so it won't explode if data is NULL, since it's not actually dereferencing that address. But this is still undefined behaviour territory, so gcc assumed that we must know that data could never be NULL and optimised out the check.

This sort of thing is why I have definite feelings about the use of C in security sensitive contexts.


hahah good point. The bug I discovered was in the erlang compiler, which does not have UB (in the language being compiled), and threading bugs are either rarer or obvious.


Back in the day Red Hat decided to ship GCC 2.96 which was unstable and not meant for general release. I've seen a handful of compiler bugs but all (or almost all) of them can be traced back to that decision, but because that happened right when I started writing software full time it took a bit before I started trusting the compiler.


> The problem with compiler and standard lib bugs is it's the last thing you suspect.

Well, until you start suspecting hardware bugs.


I got a fun compiler ignorance bug.

Me: I have memory corruption when I call your API.

IBM: Trust us, our API DLL is perfectly compatible with your old 32-bit Windows client program! We changed nothing!

Me: I have stack overruns. 4 bytes of return value from you overwrite 4 bytes of variables, whatever I declare last in my function.

IBM: Look at the source of our API façade! It's unchanged! (It was, except for harmless additions.)

Me: Your compiled code is fairly similar, but the return value is bigger. (At this point, I was already on very friendly terms with Ghidra and with the Visual Studio remote debugger.)

IBM: We just recompiled our code!

But they recompiled it with a newer compiler: time_t had changed from 32 to 64 bits, changing the size of the returned unions in their DLL but not in my client.
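
A hypothetical sketch of the shape of that mismatch (names invented):

    #include <time.h>

    /* Returned by value from the DLL.  Its size depends on time_t:
       4 bytes under the old compiler, 8 under the new one. */
    union api_value {
        long   status;
        time_t when;
    };

    union api_value get_value(void);  /* implemented in the rebuilt DLL */

    /* The rebuilt DLL writes the larger union into the caller's return
       slot, but the old client only reserved space for the smaller one;
       the extra 4 bytes land on whatever local was declared last. */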


> As I rushed to recompile my computer system using GCC 8, I contemplated how vast the consequences of such a bug could be, and pondered how it was possible that computers could function at all.

This hits home.


I maintain embedded development C and C++ toolchains for a living. I have seen my share of compiler bugs. For example, some optimization pass in a popular open-source compiler that would lose track of dereferences of pointer variables if they were more than 12 bytes deep in the stack, meaning that a reference capture in a C++ lambda would get converted to a value capture if it was the third or later capture in order and changes to the referenced capture would be lost....

Anyway, my experience is that compiler bugs do exist, but maybe 99% or so of "compiler bugs" reported by my users turn out to be undefined behaviour in their code.


Note also that "it's never a compiler bug" applies more to things like GCC and so on.

If you're working with a new language or quickly changing, e.g. Nim, Crystal, etc, or even something as old as Rust, then it can much more easily just be a compiler bug...


Yes, the older and more popular a compiler is, the less likely it is to have bugs.

The buggiest compiler I ever used was a C compiler that ran on a PC and generated code for the 68000 processor. We seemed to trip over something about once a month.


Yes, the Nim compiler is a disaster area, especially if you do lots of templates and macros.


Maintaining an application that still has Symbian users (some people are really conservative and like their Nokia E52s, plus there isn't as much malware for this dead OS), I find bugs in old GCCE rather annoying.

Sometimes, for no reason, GCCE just crashes compiling totally innocent code. Usually, a minor rewrite of the logic helps, or even weird edits such as adding a new (useless) parameter to a method.

The last GCCE toolchain for Symbian was released by CodeSourcery in March 2012. It contains GCC version 4.6.3. It is theoretically possible to adapt and compile a newer version, but the sources need so many edits that I gave up after a few days.


Symbian in Brazil was rapidly climbing in popularity when MS killed it.

It had 60% of phone market share and it was RISING despite the launch of iPhone and Android.

The reason is that:

  1. It worked great.
  2. Brazil had a vibrant dev community (people would even port PC games to Symbian O.o)
  3. It was much cheaper than iPhone and clones.
  4. Nokia phones were just solid and awesome.
After MS made that memo that killed Symbian, it died almost instantly. People got so disappointed with MS that they started to switch to Android, even if it was some Chinese-made "shit-phone" instead of a "feature-phone". The amount of really, really crappy Androids that flooded the market was mind-boggling; many didn't even work right, for example they wouldn't complete calls properly or wouldn't connect to some Wi-Fi channels.


A couple of years back I ran into a JDK JIT bug during a project. The code ran fine until I ran it through benchmarks, which triggered JIT compilation of a method, causing it to return incorrect results.

Took a long time to find, because there were no errors, just wrong results (a specific if statement taking the wrong branch).

Trying to get assistance from others was mostly met with responses along the lines of "It's probably a race condition" (in single-threaded code) / "very unlikely to be a bug in the JIT". I did end up finding a way to disable JIT for the specific method, which solved the issue, and never got around to finding the root cause. I do believe it has been fixed in the meantime at least.
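
(For what it's worth, HotSpot can exclude a single method from JIT compilation via a CompileCommand flag, which is handy for confirming or working around this kind of bug; something like the line below, with com.example.Foo::bar standing in for the real method:)

    java -XX:CompileCommand=exclude,com.example.Foo::bar ...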

I haven't run into major compiler bugs since then, but often have to dive deep into libraries to find obscure bugs (database drivers and web servers most often).


Ah kids these days. New compilers used to be written every year or so, and had the most horrific bugs. For instance, the one that reordered complex 'if' conditions to evaluate in chunks, ignoring precedence. Or the compiler that stored parameter context while compiling with a different lifetime than the actual one - resulting in references to deleted memory during compilation. And on and on.

Used to be, a compiler bug was right up there with a memory issue in your list of 'what might be wrong'.


My immediate tangential thought is about Ken Thompson's paper, "Reflections on Trusting Trust".

The gist is that our security issues could come layers away from where we may expect them to, all the way up to the compiler. It's a great paper, but who would have expected anything less from Ken Thompson.

https://dl.acm.org/doi/10.1145/358198.358210


I ran into a compiler bug in Clang by accident about 10 years ago.

The story is that I checked in a code change that passed unit tests locally, that then broke an automated build. This was bizarre because this was at Google and our check-in process guaranteed that we had run the test suites successfully, which required the very compile that broke.

It turns out that the local compile was with GCC, and the automated one was Clang. The construct that they treated differently went like this.

There was a class A with a protected property bar, and a subclass B. I also had unit test code in a class TestB which was a friend class to B. And in that unit test code I accessed foo.bar where foo was of class B.

GCC looked at that access, decided that bar is protected, foo is of class B, and TestB is a friend, so TestB has access.

Clang looked at that access, found that bar is protected and from class A, TestB is NOT a friend, so TestB had no access.

The problem went to a local expert who read the spec and decided that GCC was per the spec, Clang was not, and submitted a bug report to Clang.

As for me I figured that if the very first thing that I thought to try with protected, friend and subclasses was an edge case that nobody could agree on, perhaps C++ wasn't the right language for me...


The most difficult bug of my career turned out to be a compiler bug. This was in the late 1980s. We were building autonomous robot navigation code using T, a dialect of Lisp. We were running the same code on Sparc workstations and an embedded system running vxWorks. The vxWorks code would intermittently crash with massive heap corruption, but only when certain devices were being used. The problem turned out to be two out-of-order instructions emitted by the compiler that decremented the stack pointer while a value in the current frame was still live. That value was read by the very next instruction, but if an interrupt happened to occur in between those two instructions, on vxWorks the interrupt handler used the stack of the current task, and so would clobber that live value. Add a few milliseconds more of run time and -- kablooey!

It took several months to figure out what was happening even though in retrospect it should have been pretty obvious.


I might be wrong, but that doesn't sound like a compiler bug. Re-ordering operations without regards to interrupts is generally permitted in compilers. If you want the order to be maintained because you are dealing with volatile values then you have to tell the compiler that!


This was the late 1980s. A 680x0. Instruction re-ordering was not yet a thing.

But it was definitely a compiler bug either way. There was a POP (or maybe it was an UNLK) instruction, followed by a dereference of the PREVIOUS value of SP. That was bound to fail on any architecture that used the current process stack to handle interrupts. It just so happened that most of the development of both the compiler and the application code was being done on unix machines, which happens not to be such an architecture (interrupts are handled on the kernel stack) and so the bug never manifested itself there. But such architectures were common then, and are still common in embedded systems today.


In JDK 1.0.1, the GridBagLayout would make an absolute hash of your layout. I found this out when attempting to write an application in early Java. I read and reread the API spec, I tried everything I could think of, but every time I got a scrambled mess on the screen instead of a nice layout. I was convinced that I had done something wrong, because the first rule about bugs in the runtime is: it's probably not a bug in the runtime.

It was a bug in the runtime.

JDK 1.0.3 came out, and ran that code just fine.


Once I learned MigLayout, I never went back.


Reminds me of an incident in 2015 where a colleague and I stayed late trying to figure out a strange case of our physics engine producing NaN values out of nowhere. Even when we found the culprit it took us so long to believe our findings. There was a bug in Visual Studio's intrinsics for certain geometric functions (I believe it was sin). There was even a bug report for it with a reply saying they were aware of the issue and had no immediate plan to fix it.


In my personal experience you haven't really completely tested your software until you've hit a couple compiler bugs. :) Though they've gotten better in recent years.

The hits in the implementation of draft-brezak-win2k-krb-rc4-hmac-04.txt seem ... interesting.


There is a compiler bug in GCC 10 that causes VLC to crash with some files but not others:

https://bugs.debian.org/971027#10


Wow, this is like finding a glitch in one of knuth's books.


Nah, those are much rarer.


Long, long ago I hit what presumably was a compiler bug in Visual Basic. Ran perfectly in the IDE, data corruption when compiled to disk. (I didn't have the experience to pin it down well back then.) Forget this, stay with the previous version.

I've also had a library + debugger bug that really left me pulling my hair out. Delphi, protected mode, the library dealing with real mode data. Running the code it would normally segment fault but occasionally work correctly. Single-stepping in the debugger would work correctly 100% of the time.

The library turned out to be riddled with the bug, they were using pointers to point to the real mode addresses. The mere act of loading an invalid pointer is a segment fault and the code emitted by the compiler would copy the pointers by loading them. Note that they were not being followed, if they pointed to nonsense it didn't matter, but if it wasn't a valid address, boom.

Somehow the debugger was successfully executing the invalid load when single stepping. I never investigated exactly what it was up to, I figure it might have been simulating the command to avoid having to write a breakpoint into memory that conceivably could have been read only.


I found and reported a nasty compiler bug with basic arithmetic just a few weeks into learning C++. My previous programming experience was in BASIC and 6502 assembly, so it was my first experience with a compiled language. My bug report was accepted and the vendor issued a quick patch.

After this formative experience, it took me years to stop instinctually assuming that I was less error prone than the compiler.


I once fought a Java compiler which did not produce the proper bytecode when I used the "@Loggable" annotation. (https://aspects.jcabi.com/annotation-loggable.html) Worse still, the incorrect bytecode was on the exception path. I spent at least 2 days making sense of it.


It is scary that gcc or glibc manage to break memcmp or memmove on a regular basis.

I begin to understand the people who write their own libc for security reasons.


I can't see that working out the way they expect.


Well, on one hand, writing correct albeit slow memcmp() is easy, on the other hand, it has some gotchas...

    int my_memcmp(const void * ptr1, const void * ptr2, size_t num) {
        const unsigned char * p1 = ptr1;
        const unsigned char * p2 = ptr2;
        for (size_t i = 0; i < num; i++) {
            if p1[i] < p2[i] {
                return -1;
            }
            if p1[i] > p2[i] {
                return 1;
            }
        }
        return 0;
    }
For example, technically speaking, unsigned char can be as wide as an int, so "p1[i] - p2[i]" may actually evaluate to unsigned int which is not what you want.


Nitpick: those if statements need parentheses.

> unsigned char can be as wide as an int, so "p1[i] - p2[i]" may actually evaluate to unsigned int which is not what you want

Would this matter?


I don't think it would; `<` and `>` aren't syntactic sugar in C.


> unsigned char can be as wide as an int

No, that is not true for almost all non-embedded systems. And char cannot be as wide as int; it is the opposite that is true: int can be as wide as char.


int is guaranteed to be at least as wide as char.


That's exactly what I was saying, I just made a wrong choice of words.


"Int can be as wide as char" is exactly the same as "char can be as wide as int".


I meant "int can be as narrow as char".


With a non-mainstream language, it is pretty much always a compiler bug

I use Pascal, because it is very fast and has automated reference counting (ARC) for most types, so it is almost memory safe. It was the only way to get C-like speed, fast compilation and no memory issues. With ARC you never get uninitialized values, you never get a use-after-free, and you never get a double-free.

A few days ago, I ran my program in valgrind's memcheck: double-free detected

That was hard to debug. Valgrind told me where the value was created, but not where it was freed the first time.

Turns out, FreePascal just put an unmatched refcount decrement after a string comparison: https://github.com/benibela/internettools/commit/4c510e8c977...

Time to update FreePascal. I use "nightly" builds of FreePascal because the stable version did not have Android aarch64 support (although the newest stable does support it, there were other issues with its standard library). Last time I tried to update it, it stopped working on Android x86 and some floating point computations failed. But those issues have been fixed now.

--

Even worse than compiler bugs are CPU bugs. Or emulator bugs if you run it on an emulated CPU. I just had the problem that my app did not start on the aarch64 Android emulator. JNI ExceptionCheck always returned 1. This function here: https://github.com/benibela/internettools/commit/d9fafc9274a... Two instructions added to the assembly and it is fixed and returns 0, but those instructions should not have changed anything


I hit a tricky compiler bug a few years back, only after the bad code had been released to the public. Fortunately the bug manifested as some wasted performance instead of anything too damaging.

The issue was that extern "C" functions didn't have proper type checking of the parameters. The header said one thing, the cpp file something else, and they were very subtly different. The compiler didn't complain. At runtime the caller would pass some values and the callee would get complete garbage, but only on 64-bit architectures. I tested on 32-bit at the time and never saw it myself until it was too late.
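
A minimal sketch of that trap with hypothetical names; extern "C" is exactly what disables the name mangling that would otherwise have caught it:

    // api.h -- what the callers were built against:
    extern "C" int get_offset(int id);

    // api.cpp -- what actually shipped (note: it never includes api.h,
    // otherwise the compiler would have diagnosed the mismatch):
    extern "C" long get_offset(long id) { return id * 16; }

    // extern "C" suppresses C++ name mangling, so both sides produce
    // the plain symbol `get_offset` and the linker binds them without
    // complaint.  On LP64 targets int and long differ in size, so the
    // caller's argument and return slot no longer match the callee's:
    // garbage at runtime, nothing at build time.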

To be fair this one is mostly my own bug. I call it a compiler bug too because I expected the compiler to have my back and it didn’t.


This reminds me of a brilliant coder I worked with while in grad school. We were doing a lot of low-level kernel hacking for the SnowFlock project[0] although his knowledge and skills were (and I'm sure still are) significantly deeper than mine in this area. He had spent a significant amount of time on a particular bug and eventually started digging through the assembly code and found the cause was a compiler bug. Such a thing blew my mind at the time.

[0] http://sysweb.cs.toronto.edu/projects/1


Like every 6 months something impossible happens in my Visual Studio. For example, the program enters an if it shouldn't: the condition evaluates to false but it still enters.

This happens because somehow it uses old cached intermediate files that are no longer valid. No matter what I do it refuses to build correctly. Even clean and rebuild does not work; it forces me to delete the project files and recreate them.

It's probably an error on our side, especially since clean doesn't work. But it is quite annoying when it happens, and it gets hard to figure out what is going on.


Maybe store the code base in a git repository and use git clean -xdf?

Visual Studio's build system is horrible. The compiler and msbuild are actually quite ok if using the command line.


"It's not the compiler" is always a good first-approximation. And then bizarre things happen..

I ran into this with MSVC just a few months back. After an update of the compiler, a bizarre display issue emerged with lines criss-crossing the screen. Turning off optimizations to debug made the problem go away. Eventually tracked down the issue to the following line of code:

  if (abs(delta.y) > abs(delta.x)) { …

changing this to:

  bool yDominant = abs(delta.y) > abs(delta.x);
  if (yDominant) { …

fixed the problem. Yikes!

I don't know if they fixed the optimization in question by now.


We may have found a java compiler bug in high school cs class. The program was all of 2 lines doing some very basic arithmetic and comparison. Half the class and teacher were huddled around the computer trying to figure out what the hell was going on. I was new to programming so it's hard to say with confidence. But there was nothing 'tricky' in the code, just wrong output. This was over 10 years ago but java wasn't exactly new.


Back in 2015, GCC was already[0] over 14 million lines of code. It's surely much bigger now (although Google didn't immediately provide a number). No one should be surprised when bugs crop up.

[0] https://www.phoronix.com/scan.php?page=news_item&px=MTg3OTQ


The only time I ever experienced a legit compiler bug was back in the heady days of Adobe Flex, where compilation would fail for no reason. After probably 2 days, I decided to just start making random changes. I added a bunch of white space to one of the many files and it started working again. Removed the white space - wouldn't compile. Re-added it, it compiled.


When I used to do firmware, we used to see tons of bugs in the commercial compilers. We'd try out a new version, and one bug would be fixed but a new one would inevitably pop up. My first boss used to diff the assembly output of both versions for the same source code to identify the differences and determine if any new bugs had been added.


I've only encountered a compiler bug once. During undergrad in a data structures course my Scheme code wasn't working correctly. Took a long time to convince the TA to even acknowledge the possibility of the problem not being my code. Felt such vindication at the time.


I've encountered lots of compiler bugs. But I aggressively search for them, by various kinds of random testing. Before this became popular compilers were frightfully full of low hanging bugs, waiting for someone with a high volume test generator to flush them out.


my only compiler bug was a doozy: pre-egcs gcc, on a strange MIPS chipset running an odd mix of at&t and bsd. the machine had been retired, and was being used (with blessing) as a host for a MUD.

it turns out that gcc was compiling with an opcode that was invalid on this particular architecture, causing the application to crash in weird ways. being pre-egcs, it was easy to track down and fix by changing the opcode to two instructions instead.

it was then I understood why this very cool (to me) machine with a fast multi-cpu and high memory had been retired: it's hard to reliably run binaries that you compile that crash randomly, and I seemed to have been the only one to take the time to figure out why.


All of the compiler bugs I have run into were related to recently added features. Things like intrinsics for new CPU instructions, or new optimization options (e.g. ThinLTO).

I have a healthy skepticism of new compiler features as a result.


I found a compiler bug in Julia within my first 5 minutes of using it, and it took some courage to report it as a bug since I was so convinced it was my fault.


That's why I blacklisted gcc 9 and 10 for a while. Not just this one which I also encountered, but also several similar bugs.


If prints don't work... Christ, I don't know what I'd do. Great post.


Try writing in Nim ... at least one compiler bug per day.


Hardly. This might have been the case years ago but these days it is pretty rare, unless you’re running the development version and using the newest features which is pretty understandable imo.


I'm speaking of personal experience, these days. Templates, macros, quote -- full of bugs. There is a very large number of open bugs, and I encounter them. I've been through the compiler code--a lot of low quality stuff, numerous bugs never reported.


Two thoughts:

1. If anyone has significant experience or even just interest in this, they should collaborate on how compiler bugs could be taken advantage of for unintended purposes. Otherwise, there will continue to be a lot at risk.

2. You think compiler bugs are bad? How about hardware hacking, quantum effects, etc.

Similar to Schrödinger's cat, security is relative, never absolutely exists, but always exists. It's a Platonic ideal.

"Maturity" of understanding the nature of security is variable and multi-state, like a spiritual journey; greater understanding of security may lead onto to great loss of faith or you may go beyond into the light of true awakening. Most will only get a firewall, though.



