How Should You Write a Fast Integer Overflow Check? (regehr.org)
93 points by cremno on April 29, 2014 | 47 comments



We can take advantage of two's complement arithmetic, plus the knowledge that the overflow flag is the xor of the carry into the sign bit with the carry out of the sign bit, to devise a simple expression:

    int checked_add_n(int_t a, int_t b, int_t *rp) {
        uint_t s = (uint_t)a + (uint_t)b;
        *rp = s;
        /* top bit of (a ^ s) & (b ^ s) is the overflow flag; the arithmetic
           shift smears it, so the result is nonzero exactly on overflow */
        return ((a ^ s) & (b ^ s)) >> (BITS - 1);
    }
This is essentially doing the same thing as checked_add_3. First, we extract the carry into the sign bit with

    ci = s ^ a ^ b; // upper bit of ci contains the relevant bit
Carry out can be found via the usual majority expression:

    co = (ci & a) ^ (ci & b) ^ (a & b);
Xoring the two and simplifying with De Morgan's laws and their relatives yields the expression above.

Although the metric is far from ideal, clang gives me 6 arithmetic instructions (not counting the ret): http://goo.gl/Amdo93
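
As a sanity check, the identity can be verified exhaustively at 8 bits (a quick sketch, with int8_t/uint8_t standing in for int_t/uint_t):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        for (int i = INT8_MIN; i <= INT8_MAX; i++) {
            for (int j = INT8_MIN; j <= INT8_MAX; j++) {
                int8_t a = (int8_t)i, b = (int8_t)j;
                uint8_t s = (uint8_t)((uint8_t)a + (uint8_t)b); /* wrapping sum */
                /* bit 7 of (a^s)&(b^s) is the overflow flag */
                int flag = ((a ^ s) & (b ^ s) & 0x80) != 0;
                int wide = i + j; /* exact sum, no overflow at int width */
                assert(flag == (wide < INT8_MIN || wide > INT8_MAX));
            }
        }
        puts("overflow-flag identity holds for all 8-bit pairs");
        return 0;
    }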


Minor nit: the result of converting an unsigned integer to signed is implementation-defined if the unsigned value cannot be represented in the destination type. This can be fixed with the following bizarre expression:

    *rp = s <= INTN_MAX ? s : (intN_t)(s + INTN_MIN) + INTN_MIN;
"What the hell?” I hear the internet ask. if s <= INTN_MAX, then the result is representable as an intN_t, so we’re fine; otherwise s + INTN_MIN is guaranteed* to be representable as an intN_t, and the second addition of INTN_MIN “completes" the twos-complement (no-op) conversion to signed.

"But you can't assume that the signed type is two's complement!" the pedants respond. Fortunately, the <stdint.h> intN_t types, if they exist, are required to be two's-complement.

“But more operations and—oh-horrors-a-branch!” say the premature optimizers. Not to fear. Compilers are smart. Both gcc and clang are happy to look through this and no-op the whole thing away.

[*] For the <stdint.h> intN_t types, which are two's-complement with no padding.
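
To make the dance concrete, here it is at 8 bits, where the numbers are small enough to follow by hand (a sketch with made-up values; at 32 bits the s + INTN_MIN addition would wrap in unsigned arithmetic instead of promoting to int, with the same result):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t s = 200; /* not representable as int8_t */
        /* s + INT8_MIN: 200 + (-128) = 72, which is representable;
           (int8_t)72 + (-128) = -56, the two's-complement reading of 200 */
        int8_t r = s <= INT8_MAX ? (int8_t)s
                                 : (int8_t)(s + INT8_MIN) + INT8_MIN;
        printf("%d\n", r); /* prints -56 */
        return 0;
    }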


You can't take advantage of two's complement arithmetic if you want portable C, though.


Are there any machines in widespread use today that don't use two's complement? I guess if there were, we'd hear a lot more about this issue... I know that some mainframes are an example, but they're somewhat of a niche.


Signed integer overflow is undefined in C, so optimizers can do really weird things to code that has overflows.
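
A standard illustration (not from the article, but the canonical example): a wrap-around test that the optimizer may legally delete, precisely because it may assume the overflow that would trigger it never happens:

    #include <limits.h>

    /* Since signed overflow is undefined, the compiler may assume
       x + 1 never wraps and fold this whole test to 0. */
    int increment_will_overflow(int x) {
        return x + 1 < x; /* gcc/clang at -O2 typically emit "return 0" */
    }

    /* A well-defined check compares against the limit instead: */
    int increment_will_overflow_safe(int x) {
        return x == INT_MAX;
    }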



And see John Regehr's ( http://blog.regehr.org/, of the headline article) other posts. He's written a great deal about undefined behaviour in C/C++ and how modern compilers handle it. Check the ‘Compilers’ and ‘Software Correctness’ categories.


I think "smallest number of instructions" is a pretty weird metric to use if you are going for "fast" solutions.

checked_add_3() has a lot of instructions that can be executed in parallel. checked_add_4() has a lot of conditional code which can be skipped. (Branching might be expensive in that case, though on the other hand overflow is rare, so these branches can probably be predicted accurately.)

I have little doubt that the assembly version is fastest in practice, but to compare the other versions, they should be benchmarked.


EDIT 2: Editing...

EDIT 1: as spotted by @pbsd, this was not working well with some values...

I have added a way of checking the overflow and I have tested it:

  #define BITS (8*sizeof(int))

  int checked_add(int a, int b, int *rp)
  {
      *rp = a+b; /* note: signed overflow here is technically UB, as pointed out below */

      return (a^b) < 0 ? 0 : ((a^(*rp)) < 0 ? 1 : 0);
  }
It is around 30% faster than the fastest in the blog, and I believe the code is way cleaner. What we are doing is checking whether all of the values have the same sign (the "^" is just a xor): if a and b have different signs there can be no overflow; otherwise there is an overflow exactly when the result's sign differs from theirs.

This is the testing program (g++ compliant):

  #include <stdio.h>
  #include <time.h>
  #include <stdlib.h>

  #define BITS (8*sizeof(int))

  int checked_add2(int a, int b, int *rp) {
    unsigned ur = (unsigned)a + (unsigned)b;
    unsigned sr = ur >> (BITS-1);
    unsigned sa = (unsigned)a >> (BITS-1);
    unsigned sb = (unsigned)b >> (BITS-1);
    *rp = ur;
    return
      (sa && sb && !sr) ||
      (!sa && !sb && sr);
  }


  int checked_add(int a, int b, int *rp)
  {
      *rp = a+b;
      return (a^b) < 0 ? 0 : ((a^(*rp)) < 0 ? 1 : 0);
  }


  int main(int argc, char* argv[])
  {
      int a, b, c;
      long clong; // assumes long is wider than int (true for g++ on LP64)
      srand(time(NULL));


      for(unsigned int i = 0; i < 50000000; ++i)
      {
          bool overflow = false;
          a = rand();
          b = rand();
          clong = (long)a + (long)b;

          overflow = checked_add(a,b,&c);

          if(clong != (long)c && !overflow)
              printf("Overflow not detected %i + %i = %i \n", a, b, c);

      }

      return 0;
  }

Time for "checked_add": 2.694s

Time for "checked_add2": 3.200s

This is the dump (6 instructions so far): http://goo.gl/cnBKdS


Unfortunately signed overflow behavior is undefined in C so you can't guarantee your XOR with rp will give the expected result. This is partly why the article is so careful in its casting and the order of its checks.

This is actually easier to do in Java and many other languages because signed arithmetic is precisely defined.


One of the points of the article is that if you write a+b, the compiler is free to assume that no overflow happens. If the compiler were sufficiently smart, then if you wrote

   if(checked_add(a,b,rp)) {
       //something 1
   } else {
       //something 2
   }
it would be free to assume that no overflow happens and replace the code with:

    checked_add(a,b,rp);
    //something 2


Your function is incorrect. Consider the case a = b = 0x80000000.


I might be missing something, but I have tested it and it gives me "0" and no overflow.

edit: Ok, I did miss one spot (my mind tricked me); the answer shouldn't be 0.


Ok, I think this should do the trick:

return ((a ^ b) | (a ^ (*rp))) < 0 ? 1 : 0;

doesn't it?


Not yet. Consider a = -1, b = 1. Due to the OR, any a and b with different signs will return overflow.


Ok, this is my dumbest moment xD (I just didn't take that part into consideration because it never overflows)...


Most people don't appreciate how complicated C's mix of signed and unsigned arithmetic is, especially once the undefined behaviour of signed overflow is included.


Definitely.

Child, this is why I always add unit tests.


It has always made me wonder why there is no direct way of checking for overflow in most languages, since most CPUs do have an overflow flag. Dedicating an operator or keyword is probably too much, but it could at least be a routine in the standard library that usually gets inlined.
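
For what it's worth, clang already exposes checked arithmetic as builtins (gcc added the equivalent __builtin_add_overflow family in gcc 5), which compile down to an add plus a flag read. A minimal sketch:

    #include <stdio.h>

    int main(void) {
        int r;
        /* true on overflow; the wrapped result is still stored in r */
        if (__builtin_sadd_overflow(2000000000, 2000000000, &r))
            puts("overflow detected");
        else
            printf("sum = %d\n", r);
        return 0;
    }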

I have also come to think "hey, this could perhaps be a security hole" every time I run into unchecked overflow. Does anyone know of such exploits?


The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.

        — Henry Petroski



A good example of an integer overflow leading to a vulnerability was the 2002 vulnerability in OpenSSH: http://www.openssh.com/txt/preauth.adv


The assembly subroutine is not optimal because the compiler cannot inline it. An asm expression [1] is much better.

Also, it is weird to write the result to memory just to avoid repeating the addition. A memory operation is comparatively expensive, while few things are simpler than adding two registers. A better abstraction would be a function that returns the overflow bit and nothing else; the addition itself is (almost) free anyway.

[1] http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Extended-Asm.htm...
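
Something like this, perhaps (a sketch of GCC-style extended asm for x86, with a hypothetical add_overflows helper that returns only the flag, as suggested above; the caller redoes the nearly free addition itself):

    static inline int add_overflows(int a, int b) {
        unsigned char of;
        __asm__("addl %2, %1\n\t"   /* a += b, setting the overflow flag */
                "seto %0"           /* capture OF as 0 or 1 */
                : "=q"(of), "+r"(a)
                : "r"(b)
                : "cc");
        return of;
    }

A caller can then write `if (!add_overflows(a, b)) r = a + b;`, and because the helper is an inlinable expression rather than an opaque subroutine, the compiler can schedule it freely.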


The assembly version looks simpler and more maintainable than any of the C versions. What's the overhead of using it? Having to rewrite for each architecture you target (likely to be only 2 or 3)? If this level of optimisation matters to you then I guess you'd be down in assembly anyway?


It's always better to have a single C implementation for all architectures than an assembly stub for each; it reduces the chance of different behaviours across implementations (or subtle bugs in one implementation but not another).

Don't let the small, clean assembly for this simple function fool you: writing optimized assembly on modern architectures is getting harder and harder. You have to take the cache into account, for instance (both instruction and data), as well as the particular implementation of the instruction on the CPU (microcode etc.). Fewer instructions do not always mean faster code; maybe TFA should have been more explicit about that instead of just saying it's "a crude measure of code quality". Look at the assembly generated by a compiler for a modern x86-64 architecture and you will see things that seem to make no sense, such as NOPs in the middle of functions.

Also, in general, it's useful to know what the compiler can and cannot optimize when deciding whether to handcraft some assembly. It's always better to benchmark first instead of doing premature optimization.

For instance, C does not have a bitwise rotation operator (while many architectures support a rotate instruction) but gcc easily recognizes rotation patterns and optimizes them without trouble.
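
For example, this portable rotate idiom is recognized by gcc (and clang) and compiled to a single rotate instruction (a sketch; the & 31 masks keep both shift counts in range for 32-bit values, including n = 0):

    #include <stdint.h>

    /* Portable left-rotate that gcc/clang turn into a rol instruction. */
    uint32_t rotl32(uint32_t x, unsigned n) {
        return (x << (n & 31)) | (x >> (-n & 31));
    }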


>It's always better to have a single C code for all architectures than an assembly stub for each, it reduces the chance to have different behaviours across implementations (or subtle bugs in one implementation but not the other).

>Don't let the small clean assembly for this simple function fool you, writing optimized assembly on modern architectures is getting harder and harder

Maybe in the general case, but this is literally a single instruction over unchecked arithmetic. You just add and then branch on overflow. Kinda hard to screw that up.

It's really sad that people are so afraid of assembly these days that they can't even bother to write, say, a couple lines of assembly for their language VM's overflow-checking add instruction, and rely on hacks like this instead. Who cares if pure C is more portable if the assembly is shorter and it takes two minutes at worst to look up what the opcode for "branch on overflow" is for any other platform you want to port to? At some point, you're not playing it safe or even saving much time, you're just wasting CPU for the sake of laziness.


The project I work on compiles for eight different target architectures. Debian supports at least fourteen. Every program in a Linux distribution which has any inline assembly code means extra work that has to be done to support porting that distro to a new architecture, and code that has to be audited if an existing architecture gets a new calling convention (like x86's "x32", or ARM "hardfloat"). A portable 10-line implementation that gets reviewed and tested once is often going to be better than half a dozen 5 line architecture specific implementations.


You'll probably still have to write the C version anyway, as a fallback if you build on a new architecture and as a reference implementation to test the asm.

So at that point it's probably valuable to see if you can't get the compiler to generate the correct code by itself in a portable way.

I'm not afraid of writing assembly when I have to but I always consider it a last resort scenario when I really can't get the same result with some good old C.


Even if you could write perfect assembly, sticking it in your program makes it more difficult for optimizations to flow across the assembly you have written because it is opaque to the compiler.

Better yet, imagine that the compiler has performed value range propagation [1] to determine that the range checks, after inlining, were redundant or simply elidable. This would be infeasible if it were written in assembly.

[1] http://llvm.org/devmtg/2007-05/05-Lewycky-Predsimplify.pdf
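
To make that concrete, a sketch (whether a given compiler actually elides the check here is not guaranteed, but it is exactly the kind of thing value range propagation enables, and an opaque asm block would rule it out):

    /* An inlinable C version of the check (the unsigned-cast approach
       discussed above; the conversion back is implementation-defined
       for out-of-range values, as noted elsewhere in the thread). */
    static inline int checked_add(int a, int b, int *rp) {
        unsigned s = (unsigned)a + (unsigned)b;
        *rp = (int)s;
        return ((a ^ (int)s) & (b ^ (int)s)) < 0;
    }

    /* After inlining, VRP can prove total and i never approach INT_MAX,
       so the overflow branch is dead and can be deleted. */
    int sum_to_100(void) {
        int total = 0, r;
        for (int i = 0; i < 100; i++) {
            if (checked_add(total, i, &r))
                return -1; /* provably unreachable */
            total = r;
        }
        return total;
    }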


If compilers don't understand inline asm, then instead of avoiding it, why not improve the compiler so it can? I know this can't be done if, e.g., you use newer instructions that it doesn't know about and wouldn't emit anyway, but for faster, shorter sequences that use existing instructions and have an equivalent in C, I don't see why it can't "decompile" the instructions you write and use that information for optimisation. Even better if the compiler could learn from it and use that faster/shorter instruction sequence when it sees the same expression in the future.

> Better yet, imagine that the compiler has performed value range propagation [...] This would be infeasible if it were written in assembly.

VRP doesn't depend on the instruction set; it could be LLVM IR or it could be x86 or ARM or 6502 or anything else. It's just looking at values and operations on them.


Huh? ... decompile the optimized instructions to what? There is no "equivalent C". That's the entire point. Otherwise someone would just write it in C and the codegen would emit the optimized instructions.


> Huh? ... decompile the optimized instructions to what?

To the compiler's own IR, whatever that is.


Yes - the original article mentions this at the end of the first paragraph.


The phrase "premature optimization" gets thrown around a lot, but it seems to me we'd avoid this whole discussion (and benchmarking, and testing) if we could safely and easily inline the assembly version of the function. After all, a machine instruction is just a library function burnt into hardware.

Maybe a better term is "premature nonportability".


From `The CERT Oracle Secure Coding Standard for Java`

    public static int safeAdd(int l, int r) throws ArithmeticException {
        if (r > 0
                ? l > Integer.MAX_VALUE - r
                : l < Integer.MIN_VALUE - r)
            throw new ArithmeticException(String.format(
                    "Integer overflow: %d + %d", l, r));
        return l + r;
    }

https://www.securecoding.cert.org/confluence/display/java/NU...


In Java 8 you should probably use the ...Exact methods in java.lang.Math: in cases where the type is int or long and overflow needs to be detected, the methods addExact, subtractExact, multiplyExact, and toIntExact throw an ArithmeticException when the result overflows.


On a similar but slightly different problem, I wrote this article about testing whether a value can be converted from one type to another without loss: http://blog.reverberate.org/2012/12/testing-for-integer-over...


I would have expected to see the "jo" instruction in the assembly. Does anyone know why it was not used?


Because the function doesn't have anywhere to jump to. Instead it generates zero-or-one from the overflow flag (with seto) and the caller will then do something like test eax, eax/jnz handle_overflow.

In other words, the abstraction boundary introduced by the function call mechanism gets in the way of using the overflow flag directly with jo. Inlining would, presumably, re-expose the opportunity.
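
In C terms the split looks something like this (a sketch; checked_add stands for any of the flag-returning versions, and the comments describe the typical x86 codegen):

    extern int checked_add(int a, int b, int *rp); /* callee: add; seto; movzx */
    extern void handle_overflow(void);

    int caller(int a, int b) {
        int r;
        if (checked_add(a, b, &r)) /* caller: test eax, eax; jnz */
            handle_overflow();
        return r;
    }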


That is another part I missed: the asm implementation optimises the function while potentially deoptimising the whole program.

For most code overflows are rare; jumping to deal with the rare cases but otherwise falling straight through the normal path is often best for whole-program performance.

But to be honest I am not current in assembly, as I like managed languages too much.


Assuming that means "jump on overflow": branches are bad for your pipelines, so one should, in general, avoid them if possible. These functions can easily do without jumps (but, presumably, the caller will branch depending on their result).


Allow me to tilt at a windmill: even if you are optimizing code at this level, there is no reason to avoid branches.

On a modern processor, a correctly predicted branch-not-taken will almost always be free. Not just inexpensive, actually free, since a superscalar processor can execute it on an otherwise unused port. A correctly predicted branch-taken will not take more than a single extra cycle. And branch prediction is good enough that you can presume that if a pattern exists, it will be correctly predicted.

The thing that can be expensive is a mispredicted branch in a tight loop. That will cost you about 15 cycles (about twice as much as accessing L1, and 1/10th the price of hitting RAM). In a case like this, unless you are expecting an unpredictable mix of overflow and non-overflow, branches are not a problem. If the overflow is going to happen less than 10% of the time, using "jo" (jump overflow) after a numeric operation will be the fastest option.
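
A sketch of that shape, using the overflow builtin shown earlier plus a gcc/clang branch hint so the hot path falls straight through:

    #include <stdlib.h>

    /* Overflow is rare, so mark its branch as unlikely; a correctly
       predicted not-taken branch costs (almost) nothing. */
    static inline int add_or_abort(int a, int b) {
        int r;
        if (__builtin_expect(__builtin_sadd_overflow(a, b, &r), 0))
            abort(); /* cold path */
        return r;
    }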


For the referenced libo, the file overflow.c (https://github.com/xiw/libo/blob/master/overflow.c) surprised me: very short, and with a curious #include as its last line.

Anybody know the logic behind this?


Some compilers (MLton, for example) have static analysis in place to skip the check when no overflow can occur.


Note that MLton compiles a language which requires int arithmetic to raise an exception on overflow, the overhead of which provides a strong motivation for such analyses.


I saw Regehr and knew it had to be the U prof.


Lol, -3 for this? Must be some BYU fans up late...



