GCC null pointer check removed (gcc.gnu.org)
239 points by Tomte on Aug 28, 2019 | 88 comments



Wow, I am in awe of Jeff Law at the moment :-). I am glad they found the issue and are fixing it.

I was helping a high school student learn how compilers work and we started by opening up GCC and I was astonished at just how mindbogglingly complex it had become. We closed that down and opened up the delightful Bellard "Tiny C"[1] which was comprehensible at the level where this student was working. We created a new 'bool' type for that compiler.

[1] https://bellard.org/tcc/


If you want a simple modern compiler backend intended to be production-ready, consider Cranelift [1]. It's largely written by Dan Gohman, who is a longtime LLVM contributor (and a colleague of mine). It does a lot of the things LLVM's backend passes do, but it's cleaner and better factored, with the benefit of a decade of hindsight [2]. Cranelift should be great for teaching purposes.

[1]: https://github.com/CraneStation/cranelift

[2]: https://cranelift.readthedocs.io/en/latest/compare-llvm.html


I was wondering where I had heard of Cranelift before -- seems like it is the default backend for wasmer[1].

[1] https://github.com/wasmerio/wasmer#building


Thanks. I'll check it out. In that vein, what do you think of QBE (open-source) and LCC for understandable compilers?

https://c9x.me/compile/

https://en.wikipedia.org/wiki/LCC_(compiler)


That is a pretty neat backend for compilers. Definitely worth using when going to the 'next level': level 1 is "how do we take text and turn it into binary", and level 2 is "how do we improve the binary we're going to generate by understanding what the ASTs are telling us".


Any plan to use this in rustc? If so, is there a plan to maintain cross language LTO w/ C++?


> Any plan to use this in rustc?

One of the primary expectations / use cases for cranelift is its use for rustc's debug builds: https://github.com/CraneStation/cranelift/blob/master/rustc....


Definitely, for debug builds. Eventually perhaps for release—LTO is one of the many things we would have to figure out if we go down this road.


For a compiler that's even more compact and easier to "chew on", I recommend C4[1] --- it even comes with an interpreter VM, and is only ~500 lines of code. It's also extremely understandable despite being quite terse.

[1] https://news.ycombinator.com/item?id=8558822


>we started by opening up GCC and I was astonished at just how mindbogglingly complex it had become

It's the price of being good ;)


Well, and just being ancient. LLVM is good too, but a lot cleaner. For instance (from GCC's docs):

> Reload is the GCC equivalent of Satan. See [gccsource:reload.c], [gccsource:reload1.c], and [gccsource:reload.h] if you have a brave soul.

https://gcc.gnu.org/wiki/reload


AFAIK, gcc still runs circles around clang.


No. Each compiler has its own set of strengths and weaknesses, which is probably why, for any of the big 3 (4 if you count Intel?) compilers, there are people who think it's the best. The circle of internet people I happen to see a lot of all think MSVC is terrible, for instance, but on the one really low-level project I happened to work on, it produced the best code for it.

I wouldn't say any of the compilers runs circles around the others in general.

There is also a tendency for whatever compiler you mainly used while developing your thing to appear the best, since you have been tweaking for that target the whole time.


It would be pretty hard to say objectively that MSVC is terrible without knowing what's in it.


No. You can evaluate compiler speed, memory use, quality of error output, generated code quality, i.e. most of what anyone cares about, without knowing what's in it.


On what code? Any code you pick is an arbitrary benchmark. There is an infinite number of valid programs you could feed the compiler, and for any of them it might make some unknown optimization X that is superior to all other compilers. The number of programs you can write is infinite, therefore you cannot say for certain that it is better or worse for all programs, or in other words, objectively say it is terrible.


The idea that you need to test with infinite inputs to be "objective" is preposterous. You can in fact test with one input and the result is still objective, and you can conclude that it is objectively terrible or not on that one input. In reality, a suite of real world tests within your application domain provides actual useful information -- those aren't arbitrary benchmarks if that's the actual code you care about.


No. All optimizations have tradeoffs; if you don't know what it's doing, then you cannot know the tradeoffs made. Empirical black-box testing works to a degree, past which it only gives false assurance, as it would in this case.


But you can know what it's doing. Just have it output the .s/.asm file.

You can also look at the flags and the different optimizations you can turn on and off.

MSVC had some issues with C++ compatibility (a long time ago) but overall it is not bad. A compiler that can compile a whole operating system probably has most of its issues ironed out.


> But you can know what it's doing. Just have it output the .s/.asm file.

That's still black box testing. There ~may be~ almost certainly is hidden state that could mean that the optimization you expect to be applied is not (or an optimization is misapplied) under certain circumstances, and you can't/won't discover this in advance of hitting those circumstances.

> A compiler that can compile a whole operating system probably has most of its issues ironed out.

History has shown this to be mostly wrong, given how many critical bugs have been discovered in GCC since it was capable of compiling Linux.


g++ drove me insane with its baroque error messages, but clang++ did not. However, I hear that g++ has improved.


I avoid C++ and stay in straight C as much as I can. GCC's C error messages have definitely improved in response to clang (especially around deep preprocessor stuff). I had assumed that the C++ messages had likewise improved, and thought that I had heard that as well.

Then I actually had to test that working on Envoy proxy. I was disappointed to get utterly incomprehensible error messages. I'm "GCC-by-default", so a coworker suggested I try switching to clang++... and sure enough, clang++ gave me decent error messages.


Yes and yes. I have seen g++ error messages, especially template-related ones, go from being horrendously inscrutable to on par with, and now better than, clang's. Of course this is my subjective judgement. One area where I have found g++ to shine over clang is explaining the vectorization choices it makes -- great for understanding why a loop did not get SIMD'ed.
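On the vectorization point, a quick sketch of how you might poke at that (the flag is real; the file name and exact diagnostic wording are just for illustration):

    /* Compile with something like:
         gcc -O3 -fopt-info-vec-missed -c scale.c
       to have GCC report which loops it failed to vectorize and why.
       The exact wording of the diagnostics varies by version. */
    void scale(float *out, const float *in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * 2.0f;   /* may or may not vectorize: out and in could alias */
    }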

Competition has definitely served us well


Nope. It's not 2010 anymore.


They could remove reload tonight and most of HN wouldn't have a reason to care, but it'd hurt some old architectures.

I personally tried to convert one of them, but damn is it hard. Bring out all the bugs!


GCC was also deliberately designed to be very difficult to modularize, I believe, back when copyleft was still in its infancy.

https://gcc.gnu.org/ml/gcc/2007-11/msg00460.html https://gcc.gnu.org/ml/gcc/2004-12/msg00888.html

In retrospect I agree more and more with Stallman that perhaps we should be actively avoiding helping to build proprietary software, but the tactic they used failed badly, hurt the quality of the software, and turned away prospective contributors such as myself.


"It's the price of being good ;) "

Not to mention producing code for a huge list of architectures.


With GCC it's more of a price of being old and good.

LLVM is a lot cleaner (and they are basically even in terms of 1* benchmarks)


I wouldn't be too sure of that. I remember reading libc source and thinking "wow this is so complex it must be written by geniuses and be really well optimized". Only years later did I hear that a lot of it is shitty for pretty much no reason and there is a lot of low hanging fruit to pick.

I mean gcc is really advanced and it does work very well. But if there is something that intuitively seems bad in there I wouldn't assume it's that way for a really deep reason beyond my comprehension.


As the other commenters have pointed out, this is a bug in GCC and not an aggressive optimization typical of compilers exploiting undefined behavior. Perhaps the title can reflect this somehow? "Bug in GCC removes null pointer check", etc.?


> aggressive ["]optimization[" ...] exploiting undefined behavior

That's still a compiler bug.

But yes, this is a bug in GCC and the title should show that.


Exploiting undefined behavior is something compilers can and will do; that's not a bug. (Specifically, a compiler can choose to omit code where it can prove that going down that branch would lead to undefined behavior. That's not what happened here.)


To be clear, it's a bug for code to exploit undefined behavior.

A compiler may exploit the assumption that source code does not invoke undefined behavior.
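A contrived sketch of that kind of exploitation (unrelated to this bug): signed overflow is UB, so the compiler may assume it never happens.

    /* Because signed integer overflow is undefined behavior, an optimizing
       compiler is entitled to assume x + 1 never overflows and may compile
       this whole function down to "return 1;". */
    int plus_one_is_bigger(int x) {
        return x + 1 > x;
    }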


Comparing a pointer to null isn't UB. Dereferencing invalid pointers is UB.


I think I am already not alone in saying this on this thread, but that was a very fast investigation and bug fix. I am impressed.

Not at all the expectation I would have for facing a compiler bug. I guess it also helps that the report seemed clear and high quality. (Even had a revision where it was introduced.)


"Silently incorrectly compiles code in a very bad way" is a fairly unusual bug, particularly for a well known platform (also a credit to gcc).


Yes, but humans have natural limits in time and attention. This is fast turnaround.


This one is interesting... it looks like sibling call optimization was a bit too aggressive, or non-nullness is propagating incorrectly.

For those who understand UB, the problem is not that 'node' is NULL. The compiler can "prove" that 'node' is non-NULL, but the results are propagated to 'module' from main, which is not correct because 'main' is not the only call site of 'walk'.


I think the incorrect propagation is actually from one instance of walk() to the next recursive invocation (which has been replaced by a loop).


But the proof that it's non null is the deref in main?

I'm not sure I understand why people say this has nothing to do with the "already dereferenced, therefore not null" UB optimization. How else is the compiler inferring that the pointer could not be null?


> But the proof that it's non null is the deref in main?

This is an incorrect interpretation; it only proves that 'node' is non-null. Recall that 'walk' is called twice—once for 'node', and once for 'node->child'. The compiler bug (and this IS a compiler bug, make no mistake) is that the proof that 'node != NULL' is incorrectly treated as a proof that 'module != NULL', which is not sound, and the bug manifests as incorrect program behavior during the recursive call 'walk(module->child, ...)'.
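Roughly the shape of the reproducer, as I read the details quoted in this thread (a paraphrase -- names and structure are approximate; see the bug report and the godbolt links below for the real test case):

    struct node { struct node *child; };

    /* The NULL check below is the one the miscompiled build drops. */
    void walk(struct node *module, int cleanup) {
        if (module == NULL)
            return;
        if (cleanup)
            walk(module->child, 1);   /* module->child may legitimately be NULL */
    }

    int main(void) {
        struct node n;
        n.child = NULL;
        walk(&n, 1);                  /* the recursive call then passes NULL */
        return 0;
    }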


Yeah. The bug is that the compiler propagated not null from main.node to walk.module.

But how did the compiler determine main.node is not null?


> But how did the compiler determine main.node is not null?

The test case in the bug happens to use UB to generate the constraint, but this is just an example of how to trigger the bug. The bug itself is not related to UB, and there are other ways to generate the same constraint.

Here is an example on godbolt showing what I mean: https://godbolt.org/z/pRW97L

You'll notice that I removed the UB, but the bug is still there. You can see that the 'return;' statement is not emitted because it is colored with a white background. So the bug is not related to UB at all.


> You can see that the 'return;' statement is not emitted

Uh, what? The code looks correct to me:

  func:
    test rdi, rdi
    je .L7
    mov esi, 1
    jmp walk
  .L7:
    ret
There's a conditional branch on $rdi that jumps to .L7 (which returns) when node is NULL.


You're looking at 'func', the error is in 'walk'. Here is the top of 'walk':

  walk:
    push rbx
    mov rbx, rdi
    test esi, esi
    je .L3
    mov rdi, QWORD PTR [rdi]
    call walk
Note that esi = cleanup = 1 so the branch to .L3 is not taken, and the recursive call to 'walk' is taken. The crash will occur on the instruction before the call,

    mov rdi, QWORD PTR [rdi]
Because rdi = 0 on the second call to 'walk'.


Oh, yes, walk is still broken. I was focusing on func because you had added that and it coincidentally also had a white return statement.


Ok, this is the clarification I was seeking.


There's a deref of main.node, and dereferencing a null pointer is UB, so it couldn't have been null?


Ouch, this is annoying. But it's a bug. Sometimes those happen, and they get fixed.

This is totally unlike the user hostility that results from playing gotcha with UB which has also resulted in removing null pointer checks. Those aren't bugs and don't get fixed.


[flagged]


Relax, it's an honest bug in open source software and they're working on fixing it (edit: they've already fixed it in trunk). What more do you expect?


[flagged]


This bug has nothing to do with undefined behavior, as many of the other commenters (including myself) have pointed out.


Only up to a point. This bug is a misfiring attempt at detecting undefined behavior to allow for optimizations. If compilers did not attempt to remove null pointer checks at all, this bug in GCC could not have happened.

Hence, the practice of optimizing based on UB is relevant here, even if this was a compiler bug, rather than a compiler optimizing based on UB.


> This bug is a misfiring attempt at detecting undefined behavior to allow for optimizations.

It's closer to a conditional constant propagation that was accidentally applied unconditionally. The code was guarded by an if statement that asserted that a variable did not have a particular value, which lets the compiler delete further checks that the variable does or does not have that value. No UB in play, unless you're complaining about the UB of "variables can't be changed unless the programmer changes them directly or indirectly."

If you look at the change that introduced the bug, it was a change in the constant propagator that caused it.
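A contrived illustration of that failure mode (not the actual reproducer):

    /* Inside the first branch, "p != NULL" is a valid, context-sensitive
       fact.  The bug class described above is promoting such a fact to a
       context-insensitive one, so a later, genuinely needed NULL check can
       get folded away as "always false". */
    void example(int *p) {
        if (p != NULL) {
            *p = 1;           /* fine: p is known non-null here */
        }
        if (p == NULL)        /* must NOT be deleted: p may still be NULL here */
            return;
        *p = 2;
    }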


> This bug is a misfiring attempt at detecting undefined behavior to allow for optimizations.

I don't think it is. It's the same as removing the conditional here:

  int foo = 1;
  if (foo) {
      // whatever
  }
The compiler (incorrectly) deduced that the pointer was non-NULL and removed the check because it saw it as unnecessary.


I'm curious why you think that? The origin is absolutely the usual exploitation of UB, in the sense that the compiler supposes that code paths that lead past an instance of UB can be elided (because programmers never write bugs, that kind of interesting reasoning...)

Except it has a bug (the irony :p ) in the logic that computes that, and actually elides more code paths than it should. That shows that their architecture to do that kind of dangerous transformation is fragile, and they are not able to guarantee that that kind of transformation is really restricted to buggy source code (which would even then still be a bad idea IMO, but that is another story).


There's no undefined behavior, though. The code provided is well formed: it passes NULL to a function that checks for it, and the NULL check is being removed erroneously because GCC thinks it's not necessary. Here's an explanation of the bug: https://news.ycombinator.com/item?id=20822378


Of course there is no UB in the source code. That makes the problem worse, not less bad... Non-nullness is inferred from 'node->child = NULL;', which would be UB were node NULL. At this point the compiler thinks it is superfluous to check whether 'node' is null and starts to propagate that, except it messes up.

It seems that it messes up extremely badly because even -fno-delete-null-pointer-checks does not seem to fix the issue :/
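For contrast, here is the legitimate version of that inference, as a generic sketch (not the code from the report); this is the behavior that -fno-delete-null-pointer-checks is meant to disable:

    struct s { int x; };

    void setup(struct s *p) {
        p->x = 0;             /* UB if p were NULL ...                     */
        if (p == NULL)        /* ... so this later check may legitimately  */
            return;           /* be treated as dead and removed            */
    }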


> Non-nullness is inferred from 'node->child = NULL;' which would be UB would node be NULL.

There’s also a dereference inside of walk, except it’s protected by a NULL check.


Yes, I see that; I was not describing what the compiler should observe to work properly, I was describing the erroneous reasoning it was making prior to the fix. I should have written: non-nullness of 'node'.

And TBH I'm starting to be more concerned by the fact that -fno-delete-null-pointer-checks did not suppress that codegen bug than by its existence in the first place...


> I was describing the erroneous reasoning it was making prior to the fix. I should have written: non-nullness of 'node'.

I think you're misinterpreting the error in the compiler's reasoning: unless I am misunderstanding you, it doesn't seem to match the comment I linked earlier. The reason -fno-delete-null-pointer-checks did not help is because it only kicks in when you use a NULL pointer in a way that is undefined, but that's not what's occurring here.


> And TBH I'm starting to be more concerned by the fact that -fno-delete-null-pointer-checks did not suppress that codegen bug than by its existence in the first place...

-fno-delete-null-pointer-checks doesn't suppress it because it has nothing to do with it. -fno-delete-null-pointer-checks means "don't infer from the presence of *x that x is non-NULL." In this case, the non-nullity of x was inferred directly from the check "if (x != NULL)".


> The origin is absolutely the usual exploitation of UB in the sense that the compiler supposes that code paths that are past one can be elided (because programmers never write bugs, that kind of interesting reasoning...)

This is demonstrably false. The logic follows from:

   if (val != NULL) {
     ... = val; /* val is not null because the if statement says so */
   }
This is not coming from anything that smells of "you said *val, so val is not NULL." It is merely a straightforward bug where the compiler attached some deduced facts that were context-sensitive, and then accidentally promoted them to context-insensitive facts without checking that they were true in all contexts.


Those are bugs and the compiler developers' refusal to fix them is precisely the user hostility you are complaining about.


No, they are not. When your program invokes UB, it is a bug in your program, not in the compiler.

The compiler authors are helping you with things like -fsanitize=undefined. They even let you define some behavior which is normally undefined with things like -fwrapv, -fno-strict-aliasing, and -fno-delete-null-pointer-checks.

But you are asking the compiler to optimize your program; if you give it an invalid program to optimize, you get an invalid result. There is no bug there.


Whether it's a bug or not depends on what the desired behavior is. People disagree on what the desired behavior is, and thus people disagree about whether these are bugs.

If your desired behavior is to obey the spec and run as quickly as possible then they are not bugs. The user hostility could be argued to be in the spec.


I couldn't reproduce with 9.1.0 on Arch Linux since it's patched: https://git.archlinux.org/svntogit/packages.git/commit/trunk...

However the gcc 9.1.0 in nix/nixos is still affected (but not the default gcc, which is still gcc7).

EDIT: nix just updated to gcc 9.2.0. I was using a slightly outdated release: https://github.com/NixOS/nixpkgs/pull/66836


If anyone wants to try this themselves: https://godbolt.org/z/j4YJ0c


Reading that thread, I'm really impressed by the knowledge shown by the people that worked on it.


I have been following the development of some open source Scheme compilers (Chez and Guile) and I'm always impressed by the work people do there. The Chez code is amazing in most places, and I can understand why Racket chose it as their base.

Guile has been getting some really neat optimizations lately (jit, declarative modules, efficient local recursive bindings and much more) and it is really fun to try to follow the code that gets published. Andy's blog over at wingolog.org is also great fun to read.


Compiler engineers are a smart bunch :)


I wish I were a compiler engineer...


Stop wishing, go building. It is hard but doable.

If you want an easy start, re-implement an interpreted language first, then write a compiler for it.


Things turned out well this time.

The last GCC bug I submitted was a regression that made my program take fully twice as long to run. For that case, they decided making it slow was the correct choice. (Anyway it matched Clang afterward. :-)

The program's main loop, where it had spent 90+% of its time, was just four instructions, which Haswell and later can do in one cycle per iteration, provided the instructions are presented right. Occasionally, it would encounter something that would take 50+ cycles to deal with, and then continue looping. The regression was that it re-arranged the instructions to save a cycle on the slow path and add a cycle to the fast path, so it now takes two cycles per iteration instead of one.


Stuff like this is why I really like seL4's approach to verification. They take the compiler out of the TCB by parsing and verifying the underlying binary.


I believe NaCl and PNaCl analyze machine instructions (NaCl) or 'bitcode' (PNaCl) to sandbox binaries.


They do, but it's a little different. (P)NaCl still allows memory-safety and data-verification bugs that don't compromise the integrity of the sandbox. seL4 goes a lot further by verifying that the total behavior of the application matches the machine-readable formal specification. So (P)NaCl probably would have allowed the bug in question to get into production, whereas it would have been caught as a build error by seL4.


Scary, does this have a chance of introducing security issues in (mis)compiled code?


Obligatory Dan Bernstein presentation: https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf

(note: portrait slides!)


> Wikipedia: “An optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program.”

> So the algorithm designer (viewed as a machine) is an optimizing compiler?

> Nonsense. Compiler designers have narrower focus. Example: “A compiler will not change an implementation of bubble sort to use mergesort.” — Why not?

(No answer is given.)

An optimising compiler is permitted to replace your bubble-sort by a merge-sort, provided apparent program-behaviour is unchanged.

The only reason real compilers aren't likely to do this, is that the transformation is too sophisticated/too specific, for current compiler technology.

> compiler designers take responsibility only for “machine-specific optimization”. Outside this bailiwick they freely blame algorithm designers

This is wrong.

It's wrong in the shallow sense: in a typical modern compiler, most optimisations are performed at the level of an IR, not at the level of the particular target machine language. Common subexpression elimination, for instance.
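A contrived source-level sketch of what CSE does (the real transformation happens on the IR):

    int before_cse(int a, int b, int c, int d) {
        return (a * b + c) ^ (a * b + d);   /* a * b computed twice */
    }

    int after_cse(int a, int b, int c, int d) {
        int t = a * b;                      /* computed once, reused */
        return (t + c) ^ (t + d);
    }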

It's also wrong in a deeper sense: compilers may apply optimisations which improve time complexity. That is to say, they may transform the algorithm itself. The target machine has no bearing there.

Consider the following unoptimised loop:

    int found = 0;
    for (size_t i = 0; i != n; ++i) {
        if (arr[i]) {
          found = 1;
          // break;
        }
    }
An optimising compiler may be capable of essentially uncommenting the commented-out 'break'. This changes the time-complexity of the scan in an obvious way.

This optimisation could even be applied by a source-to-source optimiser, with no awareness of the target machine.

(We can nitpick about possible tradeoffs regarding branch-prediction when adding the 'break', but I think my point is clear.)


Thanks for sharing this


I wonder if they'll get around to fixing this miscompliation bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348


It isn't particularly helpful to downvote a comment to -3 without actually expressing what you found to be incorrect about it.

I linked to an issue, which has been sitting idle for almost 4 months, where all current GCC versions emit incorrect, memory-corrupting output from a pretty boring, correct input. The issue -- a morally similar issue of applying a constraint to the wrong scope -- was found in production code, in the wild; it's not like it's some machine-generated test case. Currently the finders are working around it with -fno-stack-reuse, but it's unclear how much other software is impacted... Should I be lobbying Linux distros to recompile everything with -fno-stack-reuse? Hard to say: there isn't an easy way to instrument the compiler to detect when the bug is being triggered.

A rapid and highly knowledgeable response to miscompilation is what I've historically experienced from GCC, and that's what's exhibited on the headline issue, but seemingly not in the case I linked to.


That's definitely an interesting issue - I don't think it's too similar to the one in the original post, though. It's not about applying a constraint to the wrong scope; it's a case where the liveness analysis has correctly deduced that a block-scope variable in a loop is not live from one iteration of the loop to the next, but the address of that block-scope variable has then been incorrectly kept live across loop iterations.


There is a reason people are still staying with gcc-4.8. The pass in question (tree-ssa-copy) is just horrible, especially in the latest versions. I had to blacklist gcc-9 because of similar issues with it. I didn't know that it already started with gcc-7.


Again: There really needs to be a warning (elevate-able to an error like usual) for all removed / "optimized out" code. A programmer added it for a reason, and either that reason is invalid and it can be removed OR the compiler's doing something the programmer did not intend and is thus exhibiting undesired behavior.


This is not possible.

   void foo(int x) {
       if (x)  do_something();
       else do_something_else();
   }

   void fn1() { foo(0); }
   void fn2() { foo(1); }

When the code is inlined, the compiler can propagate the constant and optimize the `if`. Do you really expect two warnings there?

These warnings would happen all the time.

And anyway, this does not apply to this particular bug which is just a compiler bug (which has been fixed). The warning was there all along:

    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


So you’re looking for a warning that will always show up in any non-trivial program?


This is a bug, not an optimization.



