“Unexplainable” core dump (2011) (stackoverflow.com)
193 points by curling_grad on Jan 3, 2023 | 64 comments



> Our code and compilers are constantly changing, and the problem disappeared as suddenly as it appeared ... only to happen again 2 years later in a completely unrelated executable.

It does not encourage me how much this sounds like the short story "Coding Machines". The original post even happened right about 2 years after the short story was posted, and then, per that comment, it recurred after another 2 years.

https://www.teamten.com/lawrence/writings/coding-machines/


Great story. It reminded me of Ken Thompson's Reflections on Trusting Trust; I assume it was inspired by that: https://cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Reflect...


Damn, what a story.

Though in reality it definitely would've been some guy on the compiler dev team adding it before publishing binaries. I wonder if you could set it up to inject a halt into compiled code that runs if some conditions are met, crashing most of the world's infrastructure on a predetermined date.


I have to say this is a really good story. I think whoever wrote it really understood computers and SE culture; it's extremely realistic except for the metal switch part and the AI.

(Or…maybe not except?)


That was a great read, thanks.


Excellent story, mildly terrified!


One of the best bugs I've seen had a description fairly similar to this. Hot routine run at scale (floating point math for ads ML training) fails at a rate of about 0.000000001. Turned out to be a very obscure bug in the context-switching code in the Linux kernel: the FP registers weren't being restored properly.

In another one, the debugging was aided by the fact that the developers had ensured everything was accessed through const pointers, so it wasn't their code corrupting their memory.


I had one of these back in the 90s that turned out to be a compiler bug. It was code that ran a mobile robot with an arm. Exact same code running on a Sun workstation never failed, but running on an embedded system running vxWorks crashed intermittently, but only when the arm was moving. Entire heap was corrupted, so by the time the crash occurred there was no hope of getting a stack trace or any hint of what went wrong upstream. Turned out to be two mis-ordered instructions that accessed a value on the stack after the stack pointer had been popped. On vxWorks, interrupts used the same stack as the currently running process, so if an interrupt occurred exactly between these two instructions it would clobber that value, and chaos ensued.

Took a full year to figure it out. Good times.


How did you end up piecing together what happened?


Long story but the tldr is that it happened in two stages. First someone figured out a way to reliably reproduce the problem. And then I spent a very long time single stepping through machine instructions until I had a eureka moment.


And the compiler was emitting the two instructions in the wrong order?


Yes.


I hope someone bought you a beer


Discovery is its own reward ;-)

Actually, I remember reporting the bug to the compiler authors and being stunned when they told me that they were not going to issue a new version with the bug fix because the project was no longer being funded. (This was the T dialect of Lisp in case you're wondering.)


Sounds like quite the arduous process!

What was done in the time period between discovering the bug and its cause?


A lot of rebooting and cursing.

Fortunately, it only happened when the robot's arm was moving, and we were mostly doing mobility research so we were able to be productive simply by not using the arm.


It has been a while, but a switch to kernel mode followed by a switch back to the same user mode process doesn't actually mess with FP registers. The idea being, the kernel should not be using those anyways.

Also minor point: a const pointer is a pointer which always points at the same address. You can still change what is pointed at. You probably meant "a pointer to const"
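
A minimal sketch of the distinction (variable names made up):

  int x = 1, y = 2;

  void demo() {
      const int *p = &x;        // pointer to const: can't write through p...
      p = &y;                   //   ...but p itself can be repointed
      // *p = 3;                //   (would not compile)

      int *const q = &x;        // const pointer: q always points at x...
      *q = 42;                  //   ...but you can still write through it
      // q = &y;                //   (would not compile)
  }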


Not a switch back to the same process- context switching during normal process switching.

People have been using the term "const pointer" to refer to "a pointer to const" for 20+ years (as long as I've been doing C++), although that's probably more out of laziness than incorrectness. Certainly the language definition didn't do anybody favors.


OK, that's interesting. Have any details on how FP wasn't being restored?


This CPU was the first in which AMD introduced the "stack engine", and this is likely a bug in that feature. It has a speculative stack address delta register in the front-end that is updated directly with push/pop instructions, and that delta is dispatched with the stack memory uop to be added to the original stack address register when doing address generation in the load/store units.

The delta has to be small, because you don't want big adders in the front end and because the delta has to be sent down with the push/pop memory uops. That means it can overflow or underflow, at which point it has to be reset by sending a synchronize operation to the back-end to update the original stack register (Agner Fog has a better description).

So the delta register is probably 10-12 bits on Barcelona, and this bug is probably a corner case where the stack register update is happening, hence 1024 bytes off. Perhaps there is a window where a uop can get the old delta value but the new base value (or vice versa) while a sync operation is in flight.

Setting that MSR value possibly disables that stack engine feature entirely, or it could be it disables some aggressive and complicated detail where the bug is, e.g., allowing stack operations to run concurrently while flush operations are in progress.

It's not a coincidence there just happens to exist a way to disable this at runtime. The way processors are designed means that everything must be able to be observed, debugged, and fixed in the field. That means everything has to have fine-grained controls to disable features, enable safer fallback paths, and even engage additional logic to reduce the state space in some cases (e.g., serializing the pipeline while a particular operation occurs).

Usually these bug fixes decrease performance (except in cases where a performance bug is found and the fix actually increases performance), so you want the switches to be very fine-grained. So it's possible they fixed the stack engine bug without disabling it entirely.

It would be like shipping software and providing support and bug fixes for it for the next 5-10 years without patching the software, only updating the config file. It's quite amazing. For every one of these issues that hits the field, there will be many found internally during the internal hardware bring up and verification (which will be ongoing for at least part of the life of the CPU).


Yup, these are often colloquially called "chicken bits" because you put them in if you're afraid your new feature won't work (It will generally just fall back to the previous battle-tested implementation). I often wonder how far back you can pare a modern CPU with these.


Reminds me of what we dubbed "the cosmic ray incident."

During college, we were arranged in groups of 2 or 3 to do some pair programming for the more complicated exercises.

I had attached the debugger and had set a variable to 99, so the loop would execute one more time and we could test if our changes would work.

Went through the instructions step by step, and suddenly we got a segmentation fault. Between setting it to 99 and the code accessing it later on, the value had changed to 107.

Quite a bit of confusion ensued. I made a backup of the executable before recompiling. Running them again, both the new version and the backup worked perfectly. The files matched bit for bit. To this day we have no idea what caused that bit flip.


Another good source of "impossible" bugs is overclocking.

https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...


All chips degrade over time, so it doesn't have to be overclocking: after running a CPU for many years, it may turn out that it can no longer work reliably at stock voltage levels. It is very unlikely but it does happen.


> One of the best bugs I've seen had a description fairly similar to this. Hot routine run at scale (floating point math for ads ML training) fails at a rate of about 0.000000001. Turned out to be a very obscure bug in the context-switching code in the Linux kernel: the FP registers weren't being restored properly.


I suspect that Windows still has a subtle FP restoration bug. We do large-scale validation of floating point data and occasionally get ever so subtly different results.


I would be interested to hear the use cases for large-scale validation of floating point data. I used to work with processors that occasionally corrupted operations due to hardware manufacturing defects, and these kinds of problems are exceptionally hard to debug, so I'm curious what techniques are used.

In our case, we built programs that ran enormous numbers of semi-random programs on the accelerator and compared the results to reliable results computed offline. About 1 in 1000 chips would - reproducibly - fail certain operations. Identifying this helped solve problems many of our researchers reported on specific accelerator clusters - they would get a NaN in their gradients, which would kill training, and it was almost always explainable by a single processor (out of ~thousands) occasionally corrupting a float.


This is a known thing that happens with bad drivers, they can mess with user-mode FP flags. For a while (IIRC) Cisco's VPN software was corrupting FP state and causing Firefox to hang/crash, for example.

At a previous company we had to run a DLL in our web servers (provided by the payment processing company) for PCI compliance reasons, and we later discovered it was messing with our FP flags; as a result, serialization code was producing invalid floats. That was a fun one.
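
To illustrate how little it takes - a minimal sketch, not the actual DLL's behavior - flipping one piece of FP control state (here the rounding mode, via standard <cfenv>) changes results computed afterwards in the same thread:

  #include <cfenv>
  #include <cstdio>

  // compile without -ffast-math; strictly this also wants FENV_ACCESS / -frounding-math
  int main() {
      volatile double a = 1.0, b = 3.0;  // volatile defeats compile-time folding

      double before = a / b;
      std::fesetround(FE_UPWARD);        // the kind of state a buggy DLL/driver can leave behind
      double after = a / b;
      std::fesetround(FE_TONEAREST);     // restore the default

      std::printf("%.20f\n%.20f\n", before, after);  // last bits differ
      return 0;
  }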


Given that you say "subtly", have you ruled out rounding/precision errors? I wouldn't be surprised if some processors would play fast-and-loose with the number of significant bits they really honour.


I had something similar to some of the things described here, with a dumber cause. Started getting hardfaults in my firmware after a context switch. Thought bad array access, incorrect FPU settings, misbehaving interrupt. All dead ends.

Realized that it was only happening when a particular motor was running, then further isolated it to when the motor was running at higher duty cycles. I had tested the motor in isolation but at low speeds, so I didn't see issues until I tried to run the full application. Turns out the EE had royally screwed up the current sense circuit, and the ADC pin on the MCU was seeing voltages lower than -1 V, well beyond the absolute maximum ratings for the chip.

I guess ST doesn't make guarantees about what happens when you violate those ratings, but a corrupt stack pointer is not what I would have expected.


This is great. Reminds me of a crash I saw early in my career. It was a null-pointer exception, except it occurred right after confirming the address was non-null. This was on a single core with a non-preemptible kernel. So the processor just took the wrong branch! There was simply no other explanation.


Are you sure the compiler didn't say "since having a null pointer gives undefined behavior, we can optimize out the part that confirms the address is non-null"?


This is likely your answer. C++ story. I worked at a large company that had a "no exceptions" policy and a custom operator new. If a new expression failed it would return nullptr instead of throwing. So lots of people wrote "checking" code to make sure the result wasn't nullptr, except that the compiler would always just elide that code since the standard mandates that the result cannot be nullptr. Many weird crashes ensued.
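
A minimal sketch of the trap (names invented, not the actual codebase); with optimizations on, the compiler may fold the check because a plain, potentially-throwing operator new is never allowed to return null:

  #include <cstddef>
  #include <cstdlib>

  // Hypothetical company-wide replacement: returns nullptr on failure
  // instead of throwing -- which a non-noexcept operator new must not do.
  void* operator new(std::size_t size) { return std::malloc(size); }
  void operator delete(void* p) noexcept { std::free(p); }

  struct Widget { int x = 42; };

  Widget* make_widget() {
      Widget* w = new Widget;             // ordinary new expression
      if (w == nullptr) return nullptr;   // compiler is entitled to delete this check:
      return w;                           //   "new" here cannot yield null, per the standard
  }

When an allocation actually fails, the constructor then runs on a null pointer, which lines up with the weird crashes.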


There are non-throwing operator new overloads that can return nullptr, but I'm not sure if those are a relatively recent development. Did the non-throwing operator new overloads not exist at the time?


Hard to say. Most of the uses probably predated the custom operator new and so nobody thought about it. Not to mention the places you cannot sneak into to switch to std::nothrow.


Ah, that's fair. Didn't think of code that couldn't be changed.


`new (nothrow)` was in C++98 and in ARM C++ before that.
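
i.e. something like this (Widget is a made-up type), where the null check is meaningful and won't be optimized away:

  #include <new>

  struct Widget { int x; };

  Widget* try_make() {
      Widget* w = new (std::nothrow) Widget;  // returns nullptr on allocation failure
      if (w == nullptr)                       // this check is real and stays
          return nullptr;
      return w;
  }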


Ah, I didn't know that. Thanks!


The idea that a compiler should just silently omit anything that it’s pretty sure won’t be needed is one of the most bafflingly daft decisions I’ve ever encountered.

If a supplier sent you out-of-spec parts because they didn’t think your spec was actually important, you’d call them fraudulent, not clever.


> The idea that a compiler should just silently omit anything that it’s pretty sure won’t be needed is one of the most bafflingly daft decisions I’ve ever encountered.

It's not merely "pretty sure" - the language is specifically defined this way. C++ in particular requires as a minimum standard from practitioners a perfect knowledge of the vast, complex language standard. Anything less and you'll write a program which is ill-formed or has undefined behaviour, ie your program is nonsense.


I think the key word here is "silently"; it would be one thing if the compiler informed the developer it was going to skip a statement (and said "if you really want this statement kept in, add a preprocessor directive here").


Redundant NULL checks happen all over the place. So the result of your revised requirement would be a huge pile of useless diagnostics. Whereupon, as with similar diagnostics C++ programmers demand a way to switch them off because they're annoying, then they're back to being annoyed that the compiler didn't do what they expected.


Yeah, that's the problem.


Are there any good tools to see how the compiler is rewriting your code? Almost a compiler coupled with a decompiler to show me the diff of what I wrote and what's happening?


(You almost certainly know this, but I presume the parent doesn't.)

GCC can be passed `-fno-delete-null-pointer-checks` precisely to prevent this. Linux uses it, see e.g. https://lkml.org/lkml/2018/4/4/601 where it's discussed.


Wait, the compiler can do that? I thought only dereferencing a null pointer gives undefined behavior; just checking whether a pointer is 0 or not should be valid?

Yeah based on https://joelaro.wordpress.com/2015/09/30/gcc-optimization-fd... the assumption is that if the pointer was previously dereferenced, it can be assumed to be non-null, and hence all subsequent null-checks can be elided.

Based on https://news.ycombinator.com/item?id=17360772, the case in which this actually makes a difference is an address offset like

   int *n = &param->length;
The compiler knows the struct layout, so it can just add the field's offset without actually loading anything, but the optimizer still treats param as having been dereferenced, and hence non-null, from that point on.
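
A minimal sketch of that pattern (struct and field names invented); whether the check really disappears depends on the compiler and on -fdelete-null-pointer-checks, which GCC enables by default and the kernel turns off per the lkml link above:

  #include <cstddef>

  struct msg { int length; char data[64]; };

  int read_length(msg *param) {
      int *n = &param->length;  // no load happens, but the optimizer now treats
                                // param as already dereferenced, hence non-null...
      if (param == NULL)        // ...so this check may be deleted entirely
          return -1;
      return *n;
  }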


You can check whether the address is 0, but the representation of the null pointer is implementation-defined. You can have a system where the null pointer is 4.

And address 0 isn't too special on a machine level either. Embedded systems have valid data to read and write there.


Yes. This was upwards of 20 years ago on a DEC Alpha, and we were inspecting the compiled build in gdb using the core file from the crash.


What hardware/platform was this? I worked on AIX on POWER a long time ago and it had to map the zero page read-only just to support speculative execution of dereferencing the NULL pointer, if I remember right.

If you were on a platform that did this wrong, speculative execution could have been dereferencing the NULL pointer.


It was a DEC Alpha chip, over two decades ago. (Or MIPS?) Pretty sure there was no speculative execution. In any case, the senior engineer I was working with would have known that—getting to "it just did the wrong thing" was our final and least satisfying conclusion!


The Alpha does quite aggressive instruction reordering; you need more memory barriers than on x86 in parallel code, for example.


The DEC Alpha had speculative execution.

I found lots of bugs porting code to it. It was my first 64-bit platform.


I found a bug in the compiler on it! I was writing some fractal generating code, and found that something I did caused the compiler to output the error message: “Dave doesn’t think this should compile. Please send an email to dave<someone-or-other>@dec.com”. So I did. They replied that they found the problem and had a fix for it.


Got to read the assembly to really know what happened.

E.g., if the architecture has pointers with non-address bits (modes or segments or whatever), those bits were set while the rest of the address was 'null', and the check was for 'all bits zero', then you could conceivably get that situation.


Interesting, how did you fix it? Negate the comparison with an appropriate comment?


There was no fix! It was a one-off hardware error.


I recognize this StackOverflow user name. I have read many of their answers. They are usually excellent.

See more here: https://stackoverflow.com/users/50617/employed-russian


This reminded me of a problem that I saw almost 17-18 years ago. There was a file that we were trying to download, but the download kept being interrupted. When we tried to download the file on another computer, it was fine. Long story short: we changed the ethernet card of that computer and it started downloading the file. I don't remember all the details, but it was probably something wrong with the ethernet card driver.


I think the closest I ever came to this was when I stumbled into a bug in git.

We were asking git to do something utterly trivial, like clone, and it was segfaulting. We installed the debug symbols or something (as I recall you can install these separately in Debian) and started trying to see where the function was crashing. The crash was in a parser, and the code was doing,

  char *foo = strstr(input, "something constant");
and segfaulting later (but not chasing null!), during the first use of foo. I figured that input must, therefore, be a bad pointer, since the string literal is by definition fine, and the only way to get a bad pointer out of strstr was to give it bad input in the first place.

So, in gdb, I print the input pointer. It's valid. "Damn" I think to myself "we've gotten unlucky, and whatever triggers the cascade of UB hasn't happened this run. Freakin' heisenbug." So, I told gdb to just continue running the program: it crashed. Same backtrace: dereferencing foo triggered a segfault, and not with a null pointer.

If the input pointer to strstr is valid, how can the output be a bad pointer? strstr is documented as,

       These functions return a pointer to the beginning of the located
       substring, or NULL if the substring is not found.
So, what gives?

I attempted to debug strstr … that was almost a mistake. strstr is, unfortunately, heavily optimized. It might have been this one: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86...

I'm okay with a debugger, but I'm not going to be able to follow where vectorized code is going wrong. Almost stupidly, I told gdb to (f)inish the function's execution, at which point it printed the return value: it was a valid pointer!

"Odd. Is this run going to succeed?" I continue execution: it crashes.

I restart the debugging, and run to the completion of strstr: the pointer is correct, but again, crash. How. The segfaulting address isn't the pointer strstr is returning, either, and nothing modifies foo after the strstr.

I single step out of strstr, and immediately print foo. The resulting pointer is wrong. strstr is working fine … but what, the assignment operator is broken? At this point, I'm sure I'm crazy. I disassemble the source, and lo and behold, the disassembly is trivially wrong. Forgive me as I can't amd64 unless I'm looking at it, but the disassembly is something like,

  call strstr
  cltq
  <store foo>
We look the odd instruction up; it sign extends eax into rax. That's … not a valid operation on a supposed char pointer … there's not even technically a value in eax at this point. strstr returned a pointer in rax. eax is just the lower 32 bits of that pointer, and sign extending that back into rax makes no sense. It's like the compiler thought the return from strstr was a signed int … and that's where it hits me: C is shooting us in the foot again.

> and if no declaration is visible for this identifier, the identifier is implicitly declared exactly as if, in the innermost block containing the function call, the declaration

> extern int identifier();

> appeared.

The default return type for an undeclared function, in C89, is int. (I'm actually not sure what C11 says about this. The C89 rule appears to be gone, but there doesn't seem to be anything in its place, which is bizarre. "Undeclared" does not appear, except for an unrelated footnote.)

Slap in the proper include for strstr, recompile. No segfault, disassembly is correct.

Upgrade to latest version, disassemble: disassembly is correct. Someone else had fixed the bug in the meantime. Should have just tried upgrading in the first place…

(After typing this up: had to dig up the debugging session. It was strchr, not strstr. Potato, potato. Someday … this should be a blog post…)


Sounds fun, wish I had the talent to do some serious debugging.


If you are lucky, you'll debug a CPU bug once in a lifetime.

If you're really lucky, you'll never see one at all!


Debugging that must've been a PITA for sure.


If you want to experience this in a slightly more controlled way, I'd recommend you give the game "Turing Complete" a spin.

It lets you build your own Turing-complete processor and define a simple assembly language, starting from NAND gates, and you can create your own arbitrarily wild edge cases for specific opcode combinations.


https://turingcomplete.game/ to save other curious parties like myself the search.



