We can take advantage of two's complement arithmetic, plus the knowledge that the overflow flag is the xor of the carry into the sign bit with the carry out of the sign bit, to devise a simple expression:
int checked_add_n(int_t a, int_t b, int_t *rp) {
    uint_t s = (uint_t)a + (uint_t)b;
    *rp = s;
    return ((a ^ s) & (b ^ s)) >> (BITS - 1);
}
This is essentially doing the same thing as checked_add_3. First, we extract the carry into the sign bit with
ci = s ^ a ^ b; // upper bit of ci contains the relevant bit
Carry out can be found via the usual majority expression:
co = (ci & a) ^ (ci & b) ^ (a & b);
Xoring the two and simplifying with De Morgan's laws and related identities yields the expression above.
Although the metric is far from optimal, clang gives me 6 arithmetic instructions (minus the ret): http://goo.gl/Amdo93
Minor nit: the result of converting an unsigned integer to a signed type is implementation-defined if the value cannot be represented in the destination type. This can be fixed with the following bizarre expression:
*rp = s <= INTN_MAX ? s : (intN_t)(s + INTN_MIN) + INTN_MIN;
"What the hell?” I hear the internet ask. if s <= INTN_MAX, then the result is representable as an intN_t, so we’re fine; otherwise s + INTN_MIN is guaranteed* to be representable as an intN_t, and the second addition of INTN_MIN “completes" the twos-complement (no-op) conversion to signed.
"But you can't assume that the signed type is two's complement!" a chorus of pedants responds. Fortunately, the <stdint.h> intN_t types, if they exist, are required to be two's complement.
“But more operations and—oh-horrors-a-branch!” say the premature optimizers. Not to fear. Compilers are smart. Both gcc and clang are happy to look through this and no-op the whole thing away.
[*] for the <stdint.h> intN_t types, which are two's complement with no padding.
Are there any machines in widespread use today that don't use two's complement? I guess if there were, we'd hear a lot more about this issue... I know that some mainframes are an example, but they're something of a niche.
And see John Regehr's ( http://blog.regehr.org/, of the headline article) other posts. He's written a great deal about undefined behaviour in C/C++ and how modern compilers handle it. Check the ‘Compilers’ and ‘Software Correctness’ categories.
I think "smallest number of instructions" is a pretty weird metric to use if you are going for "fast" solutions.
checked_add_3() has a lot of instructions that can be executed in parallel. checked_add_4() has a lot of conditional code which can be skipped. (Branching might be expensive in that case, though on the other hand overflow is rare, so these branches can probably be predicted accurately.)
I have little doubt that the assembly version is fastest in practice, but to compare the other versions, they should be benchmarked.
EDIT 1: as spotted by @pbsd, this was not working correctly for some values...
I have added a way of checking the overflow and I have tested it:
int checked_add(int a, int b, int *rp)
{
    *rp = a + b;
    return (a ^ b) < 0 ? 0 : ((a ^ *rp) < 0 ? 1 : 0);
}
It is around 30% faster than the fastest in the blog, and I believe the code is way cleaner. What we are doing is checking the signs (the "^" is just a xor): if a and b have different signs, overflow is impossible; if they have the same sign but the result's sign differs, we have an overflow.
This is the testing program (it compiles with g++):
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define BITS ((int)(8 * sizeof(int)))

int checked_add2(int a, int b, int *rp) {
    unsigned ur = (unsigned)a + (unsigned)b;
    unsigned sr = ur >> (BITS - 1);
    unsigned sa = (unsigned)a >> (BITS - 1);
    unsigned sb = (unsigned)b >> (BITS - 1);
    *rp = ur;
    return (sa && sb && !sr) ||
           (!sa && !sb && sr);
}
int checked_add(int a, int b, int *rp)
{
    *rp = a + b;
    return ((a ^ *rp) & (b ^ *rp)) < 0 ? 1 : 0;
}
int main(int argc, char* argv[])
{
    int a, b, c;
    long clong;
    srand(time(NULL));
    for (unsigned int i = 0; i < 50000000; ++i)
    {
        bool overflow = false;
        a = rand();
        b = rand();
        clong = (long)a + (long)b;
        overflow = checked_add(a, b, &c);
        if (clong != (long)c && !overflow)
            printf("Overflow not detected %i + %i = %i \n", a, b, c);
    }
    return 0;
}
Unfortunately, signed overflow behavior is undefined in C, so you can't guarantee your XOR with *rp will give the expected result. This is partly why the article is so careful about its casting and the order of its checks.
This is actually easier to do in Java and many other languages because signed arithmetic is precisely defined.
One of the points of the article is that if you write a+b, then the compiler is free to assume that no overflow happens. If the compiler were sufficiently smart, then if you wrote
It has always made me wonder why there is no direct way of checking for overflow in most languages, since most CPUs do have an overflow flag. Dedicating an operator or keyword is probably too much, but it could at least be a routine in the stdlib that usually gets inlined.
I have also come to think "hey this perhaps could be a security hole" every time it comes to unchecked overflow. Does anyone know such exploits?
The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.
The assembly subroutine is not optimal because the compiler cannot inline it. An asm expression [1] is much better.
Also, it is weird to write the result to memory just to avoid repeating the addition. A memory operation is quite complex, but there are few things simpler than adding two registers. A better abstraction would be a function that returns the overflow bit and nothing else; the addition itself is (almost) free anyway.
The assembly version looks simpler and more maintainable than any of the C versions. What's the overhead of using it? Having to rewrite it for each architecture you target (likely only 2 or 3)? If this level of optimisation matters to you, then I guess you'd be down in assembly anyway?
It's always better to have a single C implementation for all architectures than an assembly stub for each; it reduces the chance of different behaviours across implementations (or subtle bugs in one implementation but not another).
Don't let the small, clean assembly for this simple function fool you: writing optimized assembly on modern architectures is getting harder and harder. You have to take the cache into account, for instance (both instruction and data), as well as the particular implementation of the instructions on the CPU (microcode etc...). Fewer instructions does not always mean faster code; maybe TFA should have been more explicit about that instead of just calling it "a crude measure of code quality". Look at the assembly generated by a compiler for a modern x86-64 architecture and you will see things that seem to make no sense, such as NOPs in the middle of functions.
Also, in general, it's interesting to know what the compiler can and cannot optimize to decide when it's interesting to handcraft some assembly. It's always better to benchmark first instead of doing some premature optimization.
For instance, C does not have a bitwise rotation operator (while many architectures support a rotate instruction) but gcc easily recognizes rotation patterns and optimizes them without trouble.
>It's always better to have a single C code for all architectures than an assembly stub for each, it reduces the chance to have different behaviours across implementations (or subtle bugs in one implementation but not the other).
>Don't let the small clean assembly for this simple function fool you, writing optimized assembly on modern architectures is getting harder and harder
Maybe in the general case, but this is literally a single instruction over unchecked arithmetic. You just add and then branch on overflow. Kinda hard to screw that up.
It's really sad that people are so afraid of assembly these days that they can't even be bothered to write, say, a couple of lines of assembly for their language VM's overflow-checking add instruction, and rely on hacks like this instead. Who cares if pure C is more portable, if the assembly is shorter and it takes two minutes at worst to look up the opcode for "branch on overflow" for any other platform you want to port to? At some point you're not playing it safe or even saving much time, you're just wasting CPU for the sake of laziness.
The project I work on compiles for eight different target architectures. Debian supports at least fourteen. Every program in a Linux distribution which has any inline assembly code means extra work that has to be done to support porting that distro to a new architecture, and code that has to be audited if an existing architecture gets a new calling convention (like x86's "x32", or ARM "hardfloat"). A portable 10-line implementation that gets reviewed and tested once is often going to be better than half a dozen 5 line architecture specific implementations.
You'll probably still have to write the C version anyway, as a fallback if you build on a new architecture and as a reference implementation to test the asm.
So at that point it's probably valuable to see if you can't get the compiler to generate the correct code by itself in a portable way.
I'm not afraid of writing assembly when I have to but I always consider it a last resort scenario when I really can't get the same result with some good old C.
Even if you could write perfect assembly, sticking it in your program makes it more difficult for optimizations to flow across the assembly you have written because it is opaque to the compiler.
Better yet, imagine that the compiler has performed value range propagation [1] to determine that the range checks, after inlining, were redundant or simply elidable. This would be infeasible if it were written in assembly.
If compilers don't understand inline asm, then instead of avoiding it, why not improve the compiler so it can understand it? I know this can't be done if, e.g., you use newer instructions that the compiler doesn't know about and wouldn't emit anyway, but for faster, shorter sequences built from existing instructions with an equivalent in C, I don't see why it couldn't "decompile" the instructions you write and use that information for optimisation. Even better if the compiler could learn from it and use that faster/shorter sequence when it sees the same expression in the future.
Better yet, imagine that the compiler has performed value range propagation [...] This would be infeasible if it were written in assembly.
VRP doesn't depend on the instruction set; it could be LLVM IR or it could be x86 or ARM or 6502 or anything else. It's just looking at values and operations on them.
Huh? ... decompile the optimized instructions to what? There is no "equivalent C". That's the entire point. Otherwise someone would just write it in C and the codegen would emit the optimized instructions.
The phrase "premature optimization" gets thrown around a lot, but it seems to me we'd avoid this whole discussion (and benchmarking, and testing) if we could safely and easily inline the assembly version of the function. After all, a machine instruction is just a library function burnt into hardware.
Maybe a better term is "premature nonportability".
From `The CERT Oracle Secure Coding Standard for Java`
public static int safeAdd(int l, int r) {
    if (r > 0
            ? l > Integer.MAX_VALUE - r
            : l < Integer.MIN_VALUE - r)
        throw new ArithmeticException(String.format(
            "Integer overflow: %d + %d", l, r));
    return l + r;
}
In Java 8 you should probably use the -Exact methods in java.lang.Math: in cases where the size is int or long and overflow needs to be detected, the methods addExact, subtractExact, multiplyExact, and toIntExact throw an ArithmeticException when the result overflows.
Because the function doesn't have anywhere to jump to. Instead it generates zero-or-one from the overflow flag (with seto) and the caller will then do something like test eax, eax/jnz handle_overflow.
In other words, the abstraction boundary introduced by the function call mechanism gets in the way of using the overflow flag directly with jo. Inlining would, presumably, re-expose the opportunity.
Assuming that means "jump on overflow": branches are bad for your pipelines, so one should, in general, avoid them if possible. These functions can easily do without jumps (but, presumably, the caller will branch depending on their result).
Allow me to tilt at a windmill: even if you are optimizing code at this level, there is no reason to avoid branches.
On a modern processor, a correctly predicted branch-not-taken will almost always be free. Not just inexpensive, actually free, since a superscalar processor can execute it on an otherwise unused port. A correctly predicted branch-taken will not take more than a single extra cycle. And branch prediction is good enough that you can presume that if a pattern exists, it will be correctly predicted.
The thing that can be expensive is a mispredicted branch in a tight loop. That will cost you about 15 cycles (about twice as much as accessing L1, and 1/10th the price of hitting RAM). In a case like this, unless you are expecting an unpredictable mix of overflow and non-overflow, branches are not a problem. If the overflow is going to happen less than 10% of the time, using "jo" (jump overflow) after a numeric operation will be the fastest option.
Note that MLton compiles a language which requires int arithmetic to raise an exception on overflow, the overhead of which provides a strong motivation for such analyses.