Implicit Overflow Considered Harmful, and how to fix it (polybdenum.com)
40 points by danny00 on Oct 4, 2021 | 59 comments



This article expresses some opinions I've also held about integer types. If we express integer types in terms of what we need them to be capable of rather than what they are, then the compiler can choose something like int_fast32_t on our behalf, without us having to deal with promotion or typecasting.

I've wondered too if it's possible to have more information about the possible values of a type. Imagine you declare an Int<1, 10>. You can only initialize it with 1-10 inclusive or face a compiler error. What happens if you then double it? Well, the range must change to <2, 20>, but you can also prove that the value will be even. If you increment it, could it become an Int<3, 21, Odd, ...>?
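Roughly what I have in mind, as a C++-ish sketch (the names and the template encoding are made up; tracking things like evenness would need more machinery):

    // Hypothetical ranged integer: the bounds live in the type, and arithmetic
    // produces a new type with recomputed bounds.
    template<long Lo, long Hi>
    struct RangedInt {
        long value;  // invariant: Lo <= value <= Hi (checked at construction)
        constexpr explicit RangedInt(long v) : value(v) {}
    };

    // Adding two ranged ints adds the bounds, so doubling an Int<1,10> yields an Int<2,20>.
    template<long ALo, long AHi, long BLo, long BHi>
    constexpr RangedInt<ALo + BLo, AHi + BHi>
    operator+(RangedInt<ALo, AHi> a, RangedInt<BLo, BHi> b) {
        return RangedInt<ALo + BLo, AHi + BHi>(a.value + b.value);
    }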

My guess is there are an infinite number of similar traits an integer can have so, for this to be possible, the compiler would have to be able to evaluate only what traits are used by the code.

Does anything like this already exist?


Off the top of my head, this sounds like dependent typing: https://en.wikipedia.org/wiki/Dependent_type . Simply put, it allows you to define types that are generic on specific values, rather than other types. So with normal typing you have generics like List<T>, but with dependent typing you can do Int<1, 10> like you described. Similarly, you can define types like the type of odd integers, or the type of lists with at least 3 elements, etc. They are very cool, but as you can imagine complicated to implement. In general dependent types can make type checking a program undecidable.

You might also be interested in the numeric subtypes you can declare in some Pascal-family languages, like Ada. See http://www.isa.uniovi.es/docencia/TiempoReal/Recursos/textbo... for some examples. In Ada you can define subtypes, like the set of integers from 1 to 10. You can also define subtypes on floats, characters, and enums. You can also define the overflow rules for your type to a certain degree. It isn't quite as powerful as dependent types; I don't think you can really make it as generic, but it's still an interesting feature. One very interesting property of Ada is the ability to make arrays which are indexed by a non-integer type. So you can make an enum for the days of the week, then declare an array of strings indexed by the days-of-the-week type. Then you can do things like "array[Saturday]" to access values. Effectively this lets you make an array of 7 values with strongly-typed indexing.


> Simply put, it allows you to define types that are generic on specific values, rather than other types

Dependent types are far more general than that. They allow a type to be parameterized on a "value" where the "value" isn't merely a program constant, but rather an arbitrary expression that's type correct at that point in the program. This is what allows them to express refinement types: the refinement type bundles a value with a description of a function that takes any given value to a correct-by-construction assertion that the value satisfies some interesting property.

So if you wanted to prove that a value x in your program is in the range [m, n], you "simply" have to show how you could compute (x - m) and (n - x) at that point in your program, where the - operator can only return non-negative numbers by construction. The flip side is that m and n can be arbitrary run-time values, and there's no run-time check at all; these are essentially 'phantom' values and any computation involved in these proofs is elided: checking that a proof is correct is a special case of type checking. This is also why such languages cannot be Turing-complete, since that would allow a non-terminating function to be a valid "proof" of any property.


It does exist already as a compiler optimisation method, where the compiler infers possible values at different locations in the code. It's called Value Range Propagation or Value Range Analysis [1]. Loops complicate it as a sibling comment mentions, but they don't prevent it from being useful.
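A trivial example of the kind of fact such an analysis derives (a C++ sketch):

    #include <cstdint>

    // Range propagation: a and b are each known to lie in [0, 255], so the sum
    // lies in [0, 510] and provably fits in int32_t; any overflow check on the
    // addition could be elided.
    int32_t add_bytes(uint8_t a, uint8_t b) {
        return int32_t{a} + int32_t{b};
    }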

There are indeed an infinite number of similar traits. Compilers have to limit themselves to a reasonable subset, because if you take the idea of general numeric traits far enough, you end up with general purpose theorem proving.

[1] https://en.wikipedia.org/wiki/Value_range_analysis


I think the hard part will be loops. What type would you then give to your integers? If you want it to be precise, the type checking might become undecidable. Too loose, and you end up with Int without any additional info.


I'd envisage that working like Rust failing to compile when you have a mutable reference inside a loop but working if you clone it or make it immutable, except with the additional option that you can just promote it to the catch-all type. If it's a loop over a range of ints, it might be possible to use that range to determine the necessary type.

All that said, if you're using unbounded loops doing arithmetic on signed integers then you're already dealing with UB in C/C++ land.


Ada has integer types that are basically defined by just their range, and if they are wrapping. So for example you might define

    type My_Tiny_Int is range 1 .. 20;
    type Unsigned_Wrapping_int is mod 2 ** 5;
I don't think it does any fancy inference based on that, just overflow checks.


It should exist, in theory, but the type becomes moot in the presence of any dynamic input.


Well... either the type becomes meaningless, or the type becomes something that must be enforced in the presence of dynamic input.

If your type says "an integer from 1 to 10", and you ask the user to enter a number from 1 to 10, and they enter 20, then something has to happen. If the compiler has a routine that parses user input into "integer from 1 to 10" (or any range), then it has to either clip, wrap, or fail. If it doesn't, then you just parsed the user input into an int. When you try to stick that in your "integer from 1 to 10", one of the same things must happen. (Or, as you say, the type becomes moot.)
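For instance, the "fail" option might look like this (a C++ sketch; the function name and the use of std::optional are just for illustration):

    #include <optional>
    #include <string>

    // Boundary check: parse user input into "an integer from 1 to 10",
    // failing (nullopt) instead of clipping or wrapping.
    std::optional<int> parse_one_to_ten(const std::string& s) {
        try {
            int v = std::stoi(s);
            if (v < 1 || v > 10) return std::nullopt;  // out of range
            return v;
        } catch (const std::exception&) {
            return std::nullopt;  // not a number at all
        }
    }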


"If you’re designing a high level language, the obvious solution is to use true integer semantics (aka bigints) everywhere like Python does. However, this approach is not popular in low level languages due to the difficulty of optimizing bigint code."

I feel like this is a bit of a leap - in a great many scenarios, you don't ever want a number bigger than will fit in a 64-bit int. A number that big likely indicates a bug or corrupt data coming from somewhere else. It can also create a bunch of larger scale problems - a 256-bit int creeps in somewhere and now all your ints are getting promoted from 64-bit register ints to heap allocations, all your operations slow down, the application runs out of memory or just hangs...

To me, the obvious solution is to throw/panic on overflow, as some runtime environments can: https://docs.microsoft.com/en-us/dotnet/csharp/language-refe...
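Something similar is easy to write by hand in C/C++ too; a sketch using the GCC/Clang builtin (a portable version would need manual limit checks):

    #include <cstdint>
    #include <stdexcept>

    // Rough equivalent of C#'s "checked" arithmetic: throw instead of wrapping.
    int64_t checked_add(int64_t a, int64_t b) {
        int64_t result;
        if (__builtin_add_overflow(a, b, &result))
            throw std::overflow_error("integer overflow in checked_add");
        return result;
    }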

The elaborate CS solutions via things like range analysis are very cool, though, and I look forward to being able to use a language in production that does stuff like what's described in the article.


Sounds like a good use case for refinement types or general dependent types, where you can elide overflow checks if (and only if) you can prove to the compiler that a computation won't overflow.

That way you can put explicit overflow checks at the boundaries of your application, and then internally you pass around compile-time "proofs" of safety and/or perform explicit overflow checks where needed.


> in a great many scenarios, you don't ever want a number bigger than will fit in a 64-bit int.

You might think that, but remember that during math operations the number of bits isn't all that matters: even if your final result fits in a 64-bit int, you can get errors in calculations when the intermediate results need extra precision.


Adding at least a factor of two for safety is reasonable. So e.g. if you compute with 64-bit integers, you shouldn't ask whether you ever need numbers larger than (2^64)-1, but whether you need numbers larger than 2^62 (one bit for allowing signed ints, one for intermediates). That catches the most common examples like overflowing when calculating the average of two numbers or when calculating an offset.

The smaller your safety factor, the more carefully you have to consider what your intermediates are doing. But going up one level of precision probably won't hurt performance too badly, if your language makes it easy (like 128-bit ints).
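For example (a sketch relying on GCC/Clang's __int128): the inputs and the result fit in 64 bits, but the intermediate product may not, so that one step is done a level up:

    #include <cstdint>

    // value * num can exceed 64 bits even when the final quotient fits,
    // so compute the intermediate product in 128 bits.
    uint64_t mul_div(uint64_t value, uint64_t num, uint64_t den) {
        return static_cast<uint64_t>(
            (static_cast<unsigned __int128>(value) * num) / den);
    }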


Yep, integer overflow suddenly turns into heap overflow. Not good.


I was reading a 1987 article from Byte magazine that reminded me that many of the plots of "fractals" that you've seen are wrong.

Specifically, when people make a plot like

https://galileo-unbound.blog/2018/12/10/the-wonderful-world-...

you see these areas that look like television static that are wrong. The reason why you see chaotic motion in those areas (an infinite number of unstable fixed points) is that there are an infinite number of stable fixed points and they breed unstable fixed points on the separatrices between them.

If you look at the edges of the unstable areas you see a hierarchy of stable islands that come out of resonances having resonances with other resonances that have resonances with even more resonances, etc. If you could run the simulation long enough AND you ran it accurately, you'd see that throughout the whole unstable region the chaotic motion weaves around numerous islands of stability, which would make a much more detailed and beautiful plot.

If you run this with, say, 64-bit fixed point math, you'd find that the dynamics after 64 doubling times is dominated by rounding errors. So a plot like that is showing just a fraction of the detail that's there.

Now you could increase the precision as you run over time, and it ought to be possible to use interval math to do a grid search and prove the calculations are accurate, but it turns something that is O(N) in time, where N is the number of iterations, into something that is O(N^2). In terms of storage it goes from O(1) to O(N).

It's one of those problems that usually gets ignored, and that's some of the reason why chaos and fractals get less respect and less progress than they deserve. I've been thinking about better ways to draw them, however, not because I care about the dynamics themselves, but because I want to make more beautiful images, and I see improved dynamics as the royal road to that.


I think those cases are a minority, and in most cases you would have been fine with just slightly larger ints. It doesn't seem to cause any issue in Python, for example, because the cases where integers grow exponentially (which only makes their size grow linearly, so memory use still grows slowly) and without bounds are not all that common.

For example, an overflow bug was famously present in the Java standard library for a very long time: in a core function performing a binary search, the operation computing the middle of the two bounds was a simple average, (max + min)/2, which overflows when max and min are big enough, even though the final result always fits in a standard integer.
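The usual fix avoids ever forming max + min; the same shape works in Java, sketched here in C++:

    // Midpoint without overflowing the intermediate sum
    // (safe for 0 <= lo <= hi, which holds for binary-search indexes).
    int midpoint(int lo, int hi) {
        return lo + (hi - lo) / 2;
    }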

In a lot of other cases, integers are used to address memory (or to represent array indexes, etc.), so in most code they will naturally be restricted by the available amount of memory, even if intermediate computations can temporarily overflow.


Converting a 'numeric overflow exception' into a 'you eventually run out of memory' failure is much worse, because exhausting memory causes system-wide impact instead of just killing the individual process handling a request/operation. OOM is also generally handled much worse in software than throwing an exception at the point where data first exceeds the expected range. Do you really want to kick off the Linux OOM killer every time someone uploads a corrupted PNG to your server?

In general, it's much better to catch out-of-range/corrupt data early, and silently promoting to a 256-bit int is effectively hiding corrupt data in the majority of cases. The maximum 64-bit int is really big.


If people are interested in how compilers can automatically remove expensive overflow checks I wrote a blog post about it in the context of Ruby and Java.

https://chrisseaton.com/truffleruby/stamping-out-overflow-ch...


So stamps are like dependent types?


Yes - but invisible to the user and inferred automatically.


Oh cool. That's pretty neat!


So much software depends on implicit overflow that it isn't funny.

We can live with Java not having unsigned ints because, with implicit overflow, math operations on signed and unsigned ints are mostly the same.


>> So much software depends on implicit overflow...

I use it all the time in embedded systems. It's very handy to represent angles as signed 16bit integers with a scaling of 2^15 = pi radians. The "overflow" naturally wraps around from pi to -pi right where you want it. I do consider this a bit of a special case, but it's one that I've come to rely on where it makes sense.


I like Zig's approach here: normal arithmetic has undefined overflow behavior, but if you want wrapping there are the overflow arithmetic operators +% -% *% and ÷%, and I think saturating arithmetic operators (+| -| *| ÷|) just landed in master.


Dedicated operators are OK, but IME the semantics belong at the value level, e.g. one value is wrapping while another is saturating. In that sense, I think Rust's `Wrapping<T>` is really nice, though the ergonomics are not ideal.


> It's very handy to represent angles as signed 16bit integers with a scaling of 2^15 = pi radians.

And it is incorrect, as *signed* integer overflow is undefined behavior (at least in C++). It's only defined for unsigned integers. It's not a theoretical bug: modern compiler optimizers assume that overflows never happen.

You can build with `-fsanitize=undefined` and then run to find all those bugs.


>> And it is incorrect, as signed integer overflow is undefined behavior (at least in C++).

I'm well aware that it's undefined behavior, but every compiler I've used it on produces "correct" results. The reason it's undefined is probably related to the prevalence of computers that did not use 2's complement arithmetic 40 years ago. In practice it's very reliable today. It seems like they could eliminate this piece of UB from the standard simply by defining it to behave the way everyone does it.


You are probably well aware, but this is a very contentious approach. The problem isn't that the math is going to be changed so that the wraparound doesn't occur, but that the compiler is (eventually) going to do a larger scale optimization that silently removes your check to see whether a wrap around has occurred. Compiler writers argue that such optimizations are necessary to get good performance, so it's unlikely to ever become defined behavior.

While you might continue to get away with it, and while I agree with you that the standard should be changed, I personally think you are being extremely foolish by relying on this sort of UB in anything other than toy programs that you expect to compile once and verify for correctness. The fix of using "unsigned" for cases where you expect wraparound is so simple that you should switch your approach.


Embedded systems compilers already have a loose regard for the language spec. If you're only building for one system, you'll only use one compiler, and the reality of what it does beats the theory of what it should do.


>> The problem isn't that the math is going to be changed so that the wraparound doesn't occur, but that the compiler is (eventually) going to do a larger scale optimization that silently removes your check to see whether a wrap around has occurred.

I don't need to check for wrap around; that's the entire reason for using that particular scaling for an angle measure. Same for calculating signed changes in angle: diff = a - b just works.


Embedded projects typically never upgrade the compiler - not for the lifetime of the product.


Compilers have taken advantage of undefined behavior in signed overflow for many years now. So unless you use a very ancient compiler, your code is buggy. You are just lucky if it "appears" to work. What happens is that adding signed integers will appear to work most of the time... except not always. For example, your compiler may detect that 2 numbers A and B are positive (they may come from constants, or there was a previous 'if' that checked that they are positive), and when you multiply `A * B` your compiler will assume that the result is also positive, because multiplying 2 positive numbers is always positive if we assume that overflow cannot happen. So if you do `if (A * B < 0)` the compiler can and probably will remove it entirely (dead code) when turning optimizations on. Here it seems quite obvious, but there can be many other less obvious ways in which the optimizer will break your code if you rely on undefined behavior. Don't rely on UB in your code.
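Concretely, something like this (a sketch; the defined version uses the GCC/Clang builtin):

    // Assume the caller has already established a > 0 && b > 0.
    // UB version: the optimizer may conclude a * b can never be negative
    // (positive times positive, and signed overflow "cannot happen"),
    // so the check can be folded to "false" and removed.
    bool product_overflowed_ub(int a, int b) {
        return a * b < 0;
    }

    // Defined version: explicitly ask whether the multiplication overflows.
    bool product_overflows(int a, int b) {
        int result;
        return __builtin_mul_overflow(a, b, &result);
    }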

About embedded projects never upgrading the compiler, that is just wrong in general anyway.

In your case, use uint16_t rather than int16_t if you want defined behavior when wrapping with 16 bits.


It's not. Chrome's first paid bug was an overflow issue. Like you, I had always used it, so I wrote code like this:

    if (intA + intB < intA) {
       // stop because overflow
    } else {
       // do thing
    }
This worked until a new compiler took advantage of the fact that overflow is undefined, so the check always failed, allowing an exploit.
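For reference, an overflow pre-check that doesn't rely on UB (just one way to write it; a checked-add builtin works too):

    #include <climits>

    // Test against the limits *before* adding, so no signed overflow ever
    // occurs and the compiler has nothing to optimize away.
    bool add_would_overflow(int a, int b) {
        if (b > 0) return a > INT_MAX - b;
        return a < INT_MIN - b;
    }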


Just compile with -fwrapv.


gcc will easily miscompile code with signed overflow.

But you do not have to use signed integers; you can map your range to an unsigned integer range (for example [-pi, pi) -> [0, 2^16)).
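Roughly like this (a C++ sketch; converting the unsigned result back to a signed value is guaranteed to be modular in C++20, and every mainstream two's complement compiler already behaved that way):

    #include <cstdint>

    // Angle stored in 16 bits with 2^15 == pi radians. Arithmetic is done on
    // uint16_t, which wraps with defined behavior; the result is viewed as
    // signed only when a value in [-pi, pi) is needed.
    int16_t angle_diff(uint16_t a, uint16_t b) {
        return static_cast<int16_t>(static_cast<uint16_t>(a - b));
    }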


If phkahler is using gcc or clang, they can add -fwrapv to their compiler options to ensure it will work as they expect.


> And it is incorrect...in C++

Hardware converged on two's complement for signed arithmetic decades ago. C/C++ has really polluted people's thinking in a profound and devious way.

The real reason that C++ continues to insist on undefined behavior is because compilers optimize code in ways that are incorrect w.r.t. overflow and compiler writers have too much power in C++ standardization. What is the value of such optimizations? Mum.


This particular UB is used for optimisation in the following manner: when you write a loop that iterates over an array using a signed index variable, the compiler doesn't have to account for the possibility that the index wraps around. Apparently doing it 'properly' would cost something like 6% performance loss in such code.
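One common shape of the argument, as a sketch:

    // With a signed 32-bit index, the compiler may assume "i += 4" never wraps,
    // so it can keep i in a 64-bit register for the a[i] address computation
    // instead of re-sign-extending it (or re-checking for wraparound) on every
    // iteration.
    long long sum_every_fourth(const int* a, int n) {
        long long s = 0;
        for (int i = 0; i < n; i += 4)
            s += a[i];
        return s;
    }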

Even so, I feel the proper solution would be to use the correct type for array indexing (either one that has the same word size as a pointer, in which case it cannot overflow, or just a special index_t type defined for this purpose). Forcing everything else to have to put up with UB in the least expected places is not, in my mind, a good way to go.

Unfortunately I don't think this battle can be won. They won't even remove the UB from functions like tolower(), even though it would cost precisely nothing...


I've heard many variations of the above, and the solution to almost all of them is loop versioning as appropriate to the particular case. In this case, a single check before the loop would suffice, and then that 6% disappears.

The battle can't be won because of a perspective difference. It's really difficult to communicate the importance of a sane programming model when a sane programming model has not been a priority for C or C++ since their respective inceptions. They've prioritized low-level manipulations and performance over everything else and ironically painted themselves into silly corners like this one; even when you want the hardware instruction that does signed two's complement arithmetic, you can't get it, because of the other firmly held belief that compilers are never wrong.


Implicit overflow is how your processor works. Until that changes it will always be appropriate in some programming languages.


Not really. Every processor has an Overflow Flag [1] that can be used to check for overflows in arithmetic operations.

Most programming languages don't have a way to explicitly act on the contents of that flag, since it would complicate optimisations and program flow, but it's still a pity that this has caused so many bugs.

[1] https://en.wikipedia.org/wiki/Overflow_flag


I don't have enough time for "Considered Harmful" articles any more. https://en.wikipedia.org/wiki/Considered_harmful


Related: Deep Impact Deadly Embrace: Beware of Register Overflow Conditions https://llis.nasa.gov/lesson/10701


This is basically the solution py2many uses for handling overflow from fixed-width integer types when transpiling from Python to low-level systemsy languages such as Rust/C++ and Go.

Please file bugs if something deviates from what's expressed here.


Link to the test that's relevant for this discussion:

https://github.com/adsharma/py2many/blob/main/tests/cases/in...

This is an explicit deviation from Python's bigints, which don't map very well to systemsy languages. The next logical step is to build on this to have dependent and refinement types.

Work in progress here:

https://github.com/adsharma/Typpete


We need somebody to write:

“Considered harmful” considered harmful


Been done already, as early as 2002: https://meyerweb.com/eric/comment/chech.html

Now do "The unreasonable effectiveness of 'considered harmful' articles"

edit: can't find a good self-aware "Passive voice considered harmful" essay...


Oh, all I ever see is people stating the rule "passive voice should always be avoided" and leaving it at that.


Sorry, that's a not-terribly-funny joke: "blabla considered harmful" is passive voice. Somebody in my life is extremely strict about passive voice, and it's fresh on my mind...


It's cool. "Passive voice considered harmful" is a much more interesting phrasing. It's not as absolute as the normal joke, but what it lacks on certainty it makes up on intricacy!


While the article discusses possible solutions to an important problem, and it may have some good ideas, I strongly dislike the author's choice of words, because such inappropriate terminology is one of the main causes of the problems with handling integers.

When using a bit string with N bits, e.g. a 64-bit bit string, it is possible to implement 3 distinct integer types.

Two types are subtypes of the integer numbers in the mathematical sense, i.e. interval types a.k.a. range types, which can be either signed, e.g. with values in the interval [-2^63, 2^63-1], or unsigned, e.g. with values in the interval [0,2^64-1].

The third type is modular numbers, e.g. numbers modulo 2^64.

Unfortunately I am not aware of any programming language that either allows the use of all 3 kinds of types or at least uses appropriate names for the available types.

For example, the types named "unsigned" in C and C++ are not unsigned numbers, but modular numbers. They should never be used for a variable that is intended to belong to an interval of integer numbers, unless there is certainty that during any computations with that value it is impossible for reductions by the modulus to appear.

The wrapping behavior of modular numbers is not an "implicit overflow", it is their normal behavior according to their definition. There is no overflow of modular numbers.

On the other hand, for the interval numbers (e.g. C/C++ "int" or "long") a decent programming language should define what happens on integer overflow.

The fact that many modern programming languages have chosen to specify that the behavior on integer overflow is undefined is extremely annoying and this is the root cause of all bugs caused by overflows.

There are 2 acceptable behaviors, the same as for floating-point numbers: either integer saturation, where the interval limits are treated as infinities, are produced by overflows, and are then preserved by all subsequent operations until they are tested later; or an integer overflow exception signaled immediately.
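Both are cheap to express in software; e.g. a saturating signed add in C++ (a sketch using the GCC/Clang checked-add builtin):

    #include <climits>

    // Saturating add: on overflow, clamp to the interval limit instead of wrapping.
    int add_saturating(int a, int b) {
        int result;
        if (!__builtin_add_overflow(a, b, &result))
            return result;
        return b > 0 ? INT_MAX : INT_MIN;  // overflow direction follows the sign of b
    }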

Most ancient computers and programming languages always generated exceptions on integer overflow.

Ignoring integer overflow is a modern method of cost reduction which moves the burden from CPU designers & manufacturers to the programmers who must now either be able to predict correctly whether integer overflows will ever happen or compile with options for overflow checking.

Due to the poor support for integer overflow checking in most modern CPUs, this option causes a much greater performance penalty than really necessary.

Fortunately the existence of the IEEE 754 standard has prevented such cost reduction measures for floating-point operations, but for integer operations there is no standard to force the CPU makers to do the right thing.


> Unfortunately I am not aware of any programming language that either allows the use of all 3 kinds of types or at least uses appropriate names for the available types.

Ada. You can specify your integer type range (not just integers, but sticking with that here):

  subtype Celsius is Integer range -273 .. 1000; -- permits implicit, but range checked, assignment with Integer types and other subtypes of Integer like Natural or Positive
  type Celsius is range -273 .. 1000; -- no implicit conversion
  type Byte is mod 2**8; -- permits overflow, 255+2 = 1
  type Non_Modular_Byte is range 0..(2**8 - 1); -- No overflow, 255 + 2 => Exception
https://learn.adacore.com/courses/intro-to-ada/chapters/stro...


You are right.

Ada has had interval types since the beginning (1979), but I had forgotten that modular integer types were added to Ada in the 1995 Ada standard.

So yes, Ada is probably the only one that got this right.


I want to add that what the author of the article has correctly noticed, though not expressed clearly, is that between any 2 modular number types with distinct moduli there cannot be a subtype relationship.

That means that in languages with modular number types, e.g. C/C++, there should not exist any implicit conversions between modular numbers of different sizes or between a modular integer type and an interval integer type.

That means that all promotions between a smaller unsigned type and a larger unsigned type or between a signed integer type and an unsigned integer type, which exist in C/C++, are wrong and they should be deprecated.

On the other hand, the implicit conversions between an interval type and another interval type that includes it are correct.

So only the implicit integer promotions

signed char => short => int => long => long long

are right in C/C++, all the other implicit integer conversions are wrong.


While uint32_t is not a uint64_t in the is-a sense, I don't see why any specific value of uint32_t can't be converted to a uint64_t. What kind of error would be prevented by disallowing this conversion?


For example, if you have a uint8_t with the value 200, that denotes either the number 200 or the number 456 or the number 712 and so on.

If you have a uint16_t with the value 200, that denotes either the number 200 or the number 65736 or the number 131272 and so on.

So it is obvious that the two "200" do not mean the same thing.

If you do a longer sequence of operations with uint8_t or uint16_t, inserting an implicit integer promotion in the middle of the chain of operations will change the end result, depending on the place where it is inserted. (E.g. 200*200 will be either 64 or 40000, depending on whether the promotion is done before or after the multiplication.)

Of course most programmers do not actually use the "unsigned" numbers as modular numbers but as a surrogate for the lack of proper "unsigned" interval numbers and they attempt to avoid any operations which will cause modular reduction.

If the programmer knows that no modular reduction has occurred, he/she can write an explicit type conversion from a smaller "unsigned" type to a larger type.

The compiler cannot know which was the intention of the programmer. Because the "unsigned" numbers are defined as modular numbers, not as interval numbers, any implicit conversion can be erroneous. If the conversions are actually correct, the programmer should write them explicitly.


I think what’s meant is this: if you do

  uint32_t foo = 42
you’re effectively saying

  foo = 42 (mod 2³²)
So, foo stands for the equivalence set

  {…, 42-2³², 42, 42+2³², …}
If you say

  uint64_t bar = 42
you’re making the different statement that

  bar = 42 (mod 2⁶⁴)
You can’t say that foo and bar are the same number, because (for example)

   foo * 2³² ≠ bar * 2³²


> Unfortunately I am not aware of any programming language that either allows the use of all 3 kinds of types or at least uses appropriate names for the available types.

Rust.

u64 - Unsigned ints

i64 - Signed ints

Wrapping<u64> - Ints mod 2^64

There's also Wrapping<i64>, i64 with two's complement wrapping arithmetic


So what does u64 do when overflowing? Panic?



