This article expresses some opinions I've also held about integer types. If we express integer types in terms of what we need them to be capable of, instead of what they are, then the compiler can choose something like int_fast32_t on our behalf, without us having to deal with promotion or typecasting.
I've wondered too if it's possible to have more information about the possible values of a type. Imagine you declare an Int<1, 10>. You can only initialize it with 1-10 inclusive or face a compiler error. What happens if you then double it? Well the range must change to <2, 20> but you can also prove that the value will be even. If you increment it, could it become an Int<3, 21, Odd, ...> ?
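A minimal sketch of how the bound-tracking part of that might look with C++ templates (my own illustration, not from the article; the name RangedInt is made up, and the range check here happens at run time, whereas a consteval constructor would give the compile-time error described above):

    #include <cstdint>
    #include <stdexcept>

    // Hypothetical ranged integer: the bounds are part of the type.
    template <int64_t Min, int64_t Max>
    struct RangedInt {
        int64_t value;
        explicit RangedInt(int64_t v) : value(v) {
            if (v < Min || v > Max) throw std::out_of_range("out of range");
        }
    };

    // Doubling maps RangedInt<Min, Max> to RangedInt<2*Min, 2*Max>.
    template <int64_t Min, int64_t Max>
    RangedInt<2 * Min, 2 * Max> twice(RangedInt<Min, Max> x) {
        return RangedInt<2 * Min, 2 * Max>(x.value * 2);
    }

    // RangedInt<1, 10> d{5};
    // auto e = twice(d);   // e has type RangedInt<2, 20>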
My guess is there are an infinite number of similar traits an integer can have, so for this to be possible the compiler would have to evaluate only the traits actually used by the code. Does anything like this already exist?
Off the top of my head, this sounds like dependent typing: https://en.wikipedia.org/wiki/Dependent_type . Simply put, it allows you to define types that are generic on specific values, rather than other types. So with normal typing you have generics like List<T>, but with dependent typing you can do Int<1, 10> like you described. Similarly, you can define types like the type of odd integers, or the type of lists with at least 3 elements, etc. They are very cool, but as you can imagine complicated to implement. In general dependent types can make type checking a program undecidable.
You might also be interested in the numeric subtypes you can declare in some Pascal languages, like Ada. See http://www.isa.uniovi.es/docencia/TiempoReal/Recursos/textbo... for some examples. In Ada you can define subtypes, like the set of integers from 1 to 10. You can also define subtypes on floats, characters, and enums. You can also define the overflow rules for your type to a certain degree. It isn't quite as powerful as dependent types, I don't think you can really make it as generic, but still an interesting feature. One very interesting property of Ada is the ability to make arrays which are indexed by a non-integer type. So you can make an enum for the days of the week, then declare an array of strings indexed by the days-of-the-week type. Then you can do things like "My_Array (Saturday)" to access values. Effectively this lets you make an array of 7 values with strongly-typed indexing.
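Ada aside, here is a rough C++ analogue just to illustrate the strongly-typed-indexing idea (my own sketch with made-up names, not Ada semantics):

    #include <array>
    #include <cstddef>
    #include <string>

    // An enum for the days of the week, and an array type that can only be
    // indexed by that enum, never by a bare integer.
    enum class Day { Mon, Tue, Wed, Thu, Fri, Sat, Sun };

    template <typename T>
    struct DayArray {
        std::array<T, 7> data;
        T& operator[](Day d) { return data[static_cast<std::size_t>(d)]; }
    };

    // DayArray<std::string> names;
    // names[Day::Sat] = "Saturday";   // names[5] would not compile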
> Simply put, it allows you to define types that are generic on specific values, rather than other types
Dependent types are far more general than that. They allow a type to be parameterized on a "value" where the "value" isn't merely a program constant, but rather an arbitrary expression that's type correct at that point in the program. This is what allows them to express refinement types: the refinement type bundles a value with a description of a function that takes any given value to a correct-by-construction assertion that the value satisfies some interesting property.
So if you wanted to prove that a value x in your program is in the range [m, n], you "simply" have to show how you could compute (x - m) and (n - x) at that point in your program, where the - operator can only return non-negative numbers by construction. The flip side is that m and n can be arbitrary run-time values, and there's no run-time check at all; these are essentially 'phantom' values and any computation involved in these proofs is elided: checking that a proof is correct is a special case of type checking. This is also why such languages cannot be Turing-complete, since that would allow a non-terminating function to be a valid "proof" of any property.
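To make the "value bundled with a proof" idea concrete, here is a tiny Lean 4 sketch (my own, using explicit ≤ proofs rather than the non-negative-subtraction encoding described above):

    -- A natural number together with proofs that it lies in [m, n].
    structure InRange (m n : Nat) where
      val   : Nat
      lower : m ≤ val
      upper : val ≤ n

    -- Constructing one requires supplying the proofs; for literals they
    -- can be discharged automatically.
    def five : InRange 1 10 := ⟨5, by decide, by decide⟩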
It does exist already as a compiler optimisation method, where the compiler infers possible values at different locations in the code. It's called Value Range Propagation or Value Range Analysis [1]. Loops complicate it as a sibling comment mentions, but they don't prevent it from being useful.
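A toy example of the kind of facts value-range analysis can derive (illustrative only; what a given compiler actually folds away depends on its optimizer):

    // The compiler can propagate ranges through the code:
    int f(unsigned char c) {
        int x = c;        // x is known to be in [0, 255]
        if (x < 256)      // always true, so the branch can be folded away
            return x * 2; // result is known to be in [0, 510]
        return -1;        // provably unreachable
    }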
There are indeed an infinite number of similar traits. Compilers have to limit themselves to a reasonable subset, because if you take the idea of general numeric traits far enough, you end up with general purpose theorem proving.
I think the hard part will be loops. What type would you then give to your integers? If you want it to be precise, the type checking might become undecidable. Too loose, and you end up with Int without any additional info.
I'd envisage that working like Rust, which fails to compile when you have a mutable reference inside a loop but works if you clone it or make it immutable, except with the additional option of just promoting to the catch-all type. If it's a loop over a range of ints, it might be possible to use that range to determine the necessary type.
All that said, if you're using unbounded loops doing arithmetic on signed integers then you're already dealing with UB in C/C++ land.
Well... either the type becomes meaningless, or the type becomes something that must be enforced in the presence of dynamic input.
If your type says "an integer from 1 to 10", and you ask the user to enter a number from 1 to 10, and they enter 20, then something has to happen. If the compiler has a routine that parses user input into "integer from 1 to 10" (or any range), then it has to either clip, wrap, or fail. If it doesn't, then you just parsed the user input into an int. When you try to stick that in your "integer from 1 to 10", one of the same things must happen. (Or, as you say, the type becomes moot.)
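For example, a boundary parser might look like this C++ sketch (made-up names; the point is that the only real choices are clamp, wrap, or fail):

    #include <optional>
    #include <string>

    // Parse user input into a value in [1, 10]; here we choose to fail
    // rather than clip or wrap.
    std::optional<int> parse_one_to_ten(const std::string& s) {
        try {
            int v = std::stoi(s);
            if (v < 1 || v > 10) return std::nullopt;  // out of range
            return v;
        } catch (...) {
            return std::nullopt;                       // not a number at all
        }
    }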
"If you’re designing a high level language, the obvious solution is to use true integer semantics (aka bigints) everywhere like Python does. However, this approach is not popular in low level languages due to the difficulty of optimizing bigint code."
I feel like this is a bit of a leap - in a great many scenarios, you don't ever want a number bigger than will fit in a 64-bit int. A number that big likely indicates a bug or corrupt data coming from somewhere else. It can also create a bunch of larger scale problems - a 256-bit int creeps in somewhere and now all your ints are getting promoted from 64-bit register ints to heap allocations, all your operations slow down, the application runs out of memory or just hangs...
The elaborate CS solutions via things like range analysis are very cool, though, and I look forward to being able to use a language in production that does stuff like what's described in the article.
Sounds like a good use case for refinement types or general dependent types, where you can elide overflow checks if (and only if) you can prove to the compiler that a computation won't overflow.
That way you can put explicit overflow checks at the boundaries of your application, and then internally you pass around compile-time "proofs" of safety and/or perform explicit overflow checks where needed.
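In a language without refinement types you can still approximate the pattern; a rough C++ sketch (my own, with made-up names) where constructing the wrapper is the one place the range check happens and holding one stands in for the "proof":

    #include <cstdint>
    #include <optional>

    class SmallIndex {
        uint32_t v_;
        explicit SmallIndex(uint32_t v) : v_(v) {}
    public:
        // The boundary: the only place the range is checked.
        static std::optional<SmallIndex> make(uint64_t raw) {
            if (raw > 1000) return std::nullopt;
            return SmallIndex(static_cast<uint32_t>(raw));
        }
        // Internal code can use the value without re-checking.
        uint32_t get() const { return v_; }
    };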
> in a great many scenarios, you don't ever want a number bigger than will fit in a 64-bit int.
You might think that, but remember that during math operations the final width isn't the only thing that matters: even if your final result fits in a 64-bit int, you can get errors in calculations when the intermediate results need extra precision.
Adding at least a factor of two for safety is reasonable. So e.g. if you compute with 64-bit integers, you shouldn't ask whether you ever need numbers larger than (2^64)-1, but whether you need numbers larger than 2^62 (one bit to allow signed ints, one for intermediates). That catches the most common examples, like overflowing when calculating the average of two numbers or when calculating an offset.
The less safety factor you have, the more carefully you have to consider what your intermediates are doing. But going up one level of precision probably won't hurt performance too badly, if your language makes it easy (like 128-bit ints)
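As a small illustration of the "one level of extra precision for intermediates" point (portable C++; nothing here is from the original comment):

    #include <cstdint>

    // Average of two int32_t values: the sum is computed in 64 bits so the
    // intermediate cannot overflow, and the result always fits back in 32 bits.
    int32_t average(int32_t a, int32_t b) {
        int64_t sum = static_cast<int64_t>(a) + b;
        return static_cast<int32_t>(sum / 2);
    }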
You see these areas that look like television static, which are wrong. The reason you see chaotic motion in those areas (an infinite number of unstable fixed points) is that there are an infinite number of stable fixed points, and they breed unstable fixed points on the separatrices between them.
If you look at the edges of the unstable areas you see a hierarchy of stable islands that come out of resonances having resonances with other resonances that have resonances with even more resonances, etc. If you could run the simulation long enough AND you ran it accurately, you'd see that throughout the whole unstable region the chaotic motion weaves around numerous islands of stability, which would make a much more detailed and beautiful plot.
If you run this with, say, 64-bit fixed point math, you'd find that the dynamics after 64 doubling times is dominated by rounding errors. So a plot like that is showing just a fraction of the detail that's there.
Now you could increase the precision as you run over time, and it ought to be possible to use interval math to do a grid search and prove the calculations are accurate, but it turns something that is O(N) in time, where N is the number of iterations, into something that is O(N^2). In terms of storage it goes from O(1) to O(N).
It's one of those problems that usually gets ignored and that's some of the reason why chaos and fractals get less respect and less progress than they deserve. I've been thinking about better ways to draw them however, not because I care about the dynamics themselves, but I want to make more beautiful images and see improved dynamics as the royal road to that.
I think those cases are a minority, and in most cases you would have been fine with just slightly larger ints. It doesn't seem to cause any issues in Python, for example, because the cases where integers grow exponentially (which is the same as making their size grow linearly, so it's still slow memory-wise) and without bound are not all that common.
For example, an overflow bug was famously present in Java's standard library for a very long time: in a core function performing a binary search, the operation computing the middle of the two bounds was a simple average (max + min)/2, which overflows when max and min are big enough, even though the final result always fits in a standard integer.
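The bug and the standard fix, sketched in C++ (the original was in Java's binary search, but the arithmetic is identical):

    #include <vector>

    // (low + high) / 2 can overflow int when low and high are both large;
    // low + (high - low) / 2 computes the same midpoint without overflowing.
    int binary_search(const std::vector<int>& a, int key) {
        int low = 0, high = static_cast<int>(a.size()) - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            if (a[mid] < key)      low = mid + 1;
            else if (a[mid] > key) high = mid - 1;
            else                   return mid;
        }
        return -1;  // not found
    }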
In a lot of other cases, integers are used to address memory (or to represent array indexes, etc.), so in most code they will naturally be restricted by the available amount of memory, even if intermediate computations can temporarily overflow.
Converting a 'numeric overflow exception' to a 'you eventually run out of memory' failure is much worse, because exhausting memory will cause system-wide impact instead of just killing the individual process handling a request/operation. OOM is also generally handled much worse in software vs throwing an exception at the point where data first exceeds the expected range. Do you really want to kick off the linux oom killer every time someone uploads a corrupted png to your server?
In general, it's much better to catch out-of-range/corrupt data early, and silently promoting to a 256-bit int is effectively hiding corrupt data in the majority of cases. The maximum 64-bit int is really big.
If people are interested in how compilers can automatically remove expensive overflow checks I wrote a blog post about it in the context of Ruby and Java.
>> So much software depends on implicit overflow...
I use it all the time in embedded systems. It's very handy to represent angles as signed 16bit integers with a scaling of 2^15 = pi radians. The "overflow" naturally wraps around from pi to -pi right where you want it. I do consider this is a bit of a special case, but it's one that I've come to rely on where it makes sense.
I like Zig's approach here: normal arithmetic has undefined overflow behavior, but if you want overflow there are the overflow arithmetic operators +% -% *% and ÷%, and I think saturating arithmetic operators (+| -| *| ÷|) just landed in master.
Dedicated operators are OK, but IME the semantics are value-level, e.g. one value is overflowing while another is saturating. In that sense, I think Rust's `Wrapping<T>` is really nice, though the ergonomics are not ideal.
> It's very handy to represent angles as signed 16bit integers with a scaling of 2^15 = pi radians.
And it is incorrect, as *signed* integer overflow is undefined behavior (at least in C++). It's only defined for unsigned integers. It's not a theoretical bug; modern compiler optimizers assume that overflows never happen.
You can build with `-fsanitize=undefined` then run to find all those bugs.
>> And it is incorrect, as signed integer overflow is undefined behavior (at least in C++).
I'm well aware that it's undefined behavior, but every compiler I've used it on produces "correct" results. The reason it's undefined is probably related to the prevalence of computers that did not use 2's complement arithmetic 40 years ago. In practice it's very reliable today. It seems like they could eliminate this piece of UB from the standard simply by defining it to behave the way everyone does it.
You are probably well aware, but this is a very contentious approach. The problem isn't that the math is going to be changed so that the wraparound doesn't occur, but that the compiler is (eventually) going to do a larger-scale optimization that silently removes your check to see whether a wraparound has occurred. Compiler writers argue that such optimizations are necessary to get good performance, so it's unlikely to ever become defined behavior.
While you might continue to get away with it, and while I agree with you that the standard should be changed, I personally think you are being extremely foolish by relying on this sort of UB in anything other than toy programs that you expect to compile once and verify for correctness. The fix of using "unsigned" for cases where you expect wraparound is so simple that you should switch your approach.
Embedded systems compilers already have a loose regard for the language spec. If you're only building for one system, you'll only use one compiler, and the reality of what it does beats the theory of what it should do.
>> The problem isn't that the math is going to be changed so that the wraparound doesn't occur, but that the compiler is (eventually) going to do a larger scale optimization that silently removes your check to see whether a wrap around has occurred.
I don't need to check for wraparound; that's the entire reason for using that particular scaling for an angle measure. Same for calculating signed changes in angle: diff = a - b just works.
Compilers have taken advantage of undefined behavior in signed overflow for many years now. So unless you use a very ancient compiler, your code is buggy; you are just lucky if it "appears" to work. Adding signed integers will appear to work most of the time... except not always. For example, your compiler may detect that two numbers A and B are positive (they may come from constants, or there was a previous 'if' that checked that they are positive), and when you multiply `A * B` it will assume that the result is also positive, because multiplying two positive numbers is always positive if we assume overflow cannot happen. So if you write `if (A * B < 0)`, the compiler can and probably will remove it entirely (dead code) when optimizations are turned on. Here it seems quite obvious, but there are many other, less obvious ways in which the optimizer will break your code if you rely on undefined behavior. Don't rely on UB in your code.
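A contrived sketch of that failure mode (whether a given compiler actually deletes the check depends on its optimizer, so treat this as illustrative):

    #include <cstdio>

    // With optimizations on, a compiler may reason: a > 0 and b > 0, signed
    // overflow is UB, therefore a * b > 0, therefore the branch below is dead.
    void report(int a, int b) {
        if (a > 0 && b > 0) {
            int product = a * b;   // may overflow: undefined behavior
            if (product < 0) {     // the "overflow check" the optimizer may remove
                std::puts("overflow detected");
            }
        }
    }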
As for embedded projects never upgrading their compiler, that's just wrong in general anyway.
In your case, use uint16_t rather than int16_t if you want defined behavior when wrapping with 16 bits.
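For the angle use case, a sketch of the well-defined version in C++ (unsigned arithmetic wraps by definition; the final conversion back to int16_t is implementation-defined before C++20 and two's complement from C++20 on):

    #include <cstdint>

    // Angle stored in 16 bits with 2^15 counts = pi radians.
    // The subtraction wraps modulo 2^16, and interpreting the result as
    // int16_t yields the signed shortest-path difference.
    int16_t angle_diff(uint16_t a, uint16_t b) {
        return static_cast<int16_t>(static_cast<uint16_t>(a - b));
    }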
Hardware converged on two's complement for signed arithmetic decades ago. C/C++ has really polluted people's thinking in a profound and devious way.
The real reason that C++ continues to insist on undefined behavior is because compilers optimize code in ways that are incorrect w.r.t. overflow and compiler writers have too much power in C++ standardization. What is the value of such optimizations? Mum.
This particular UB is used for optimisation in the following manner: when you write a loop that iterates over an array using a signed index variable, the compiler doesn't have to account for the possibility that the index wraps around. Apparently doing it 'properly' would cost something like a 6% performance loss in such code.
Even so, I feel the proper solution would be to use the correct type for array indexing (either one that has the same word size as a pointer, in which case it cannot overflow, or just a special index_t type defined for this purpose). Forcing everything else to have to put up with UB in the least expected places is not, in my mind, a good way to go.
Unfortunately I don't think this battle can be won. They won't even remove the UB from functions like tolower(), even though it would cost precisely nothing...
I've heard many variations of the above, and the solution to almost all of them is loop versioning as appropriate to the particular case. In this case, a single check before the loop would suffice, and then that 6% disappears.
The battle can't be won because of a perspective difference. It's really difficult to communicate the importance of a sane programming model when a sane programming model has not been a priority for C or C++ since their respective inceptions. They've prioritized low-level manipulation and performance over everything else and ironically painted themselves into silly corners like this one; even when you want the hardware instruction that does signed two's complement arithmetic, you can't get it, because of the other firmly held belief that compilers are never wrong.
Not really. Every processor has an Overflow Flag [1] that can be used to check for overflows in arithmetic operations.
Most programming languages don't have a way to explicitly act on the contents of that flag, since it would complicate optimisations and program flow, but it's still a pity that this has caused so many bugs.
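You can get at that behaviour from C or C++ through compiler intrinsics; a sketch using GCC/Clang's __builtin_add_overflow, which typically compiles down to an add followed by a check of the overflow flag:

    #include <cstdint>
    #include <stdexcept>

    // Checked addition: the builtin returns true if the mathematically
    // correct result did not fit in the destination type.
    int64_t checked_add(int64_t a, int64_t b) {
        int64_t result;
        if (__builtin_add_overflow(a, b, &result)) {
            throw std::overflow_error("signed overflow in checked_add");
        }
        return result;
    }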
This is basically the solution py2many uses for handling overflow from fixed-width integer types when transpiling from Python to low-level systemsy languages such as Rust, C++, and Go.
Please file bugs if something deviates from what's expressed here.
This is an explicit deviation from python's bigint, which doesn't map very well to systemsey languages. The next logical step is to build on this to have dependent and refinement types.
Sorry, that's a not-terribly-funny joke: "blabla considered harmful" is passive voice. Somebody in my life is extremely strict about passive voice, and it's fresh on my mind...
It's cool. "Passive voice considered harmful" is a much more interesting phrasing. It's not as absolute as the usual joke, but what it lacks in certainty it makes up for in intricacy!
While the article discusses possible solutions to an important problem and it may have some good ideas, I strongly dislike the author's choice of words, because such inappropriate terminology is one of the main causes of the problems we have with handling integers.
When using a bit string with N bits, e.g. a 64-bit bit string, it is possible to implement 3 distinct integer types.
Two types are subtypes of the integer numbers in the mathematical sense, i.e. interval types a.k.a. range types, which can be either signed, e.g. with values in the interval [-2^63, 2^63-1], or unsigned, e.g. with values in the interval [0,2^64-1].
The third type is modular numbers, e.g. numbers modulo 2^64.
Unfortunately I am not aware of any programming language that either allows the use of all 3 kinds of types or at least uses appropriate names for the available types.
For example, the types named "unsigned" in C and C++ are not unsigned numbers, but modular numbers. They should never be used for a variable that is intended to belong to an interval of integer numbers, unless there is certainty that during any computations with that value it is impossible for reductions by the modulus to appear.
The wrapping behavior of modular numbers is not an "implicit overflow", it is their normal behavior according to their definition. There is no overflow of modular numbers.
On the other hand, for the interval numbers (e.g. C/C++ "int" or "long") a decent programming language should define what happens on integer overflow.
The fact that many modern programming languages have chosen to specify that the behavior on integer overflow is undefined is extremely annoying and this is the root cause of all bugs caused by overflows.
There are two acceptable behaviors, the same as for floating-point numbers: either integer saturation, where the interval limits are treated like infinities, are produced by overflows, and are then preserved by all subsequent operations until they are tested later; or an integer overflow exception signaled immediately.
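A sketch of the saturating flavour in C++ (C++26 adds std::add_sat for this, but it is easy to write by hand):

    #include <cstdint>
    #include <limits>

    // Saturating signed addition: on overflow, clamp to the nearest limit
    // instead of wrapping or trapping.
    int32_t sat_add(int32_t a, int32_t b) {
        int64_t wide = static_cast<int64_t>(a) + b;
        if (wide > std::numeric_limits<int32_t>::max())
            return std::numeric_limits<int32_t>::max();
        if (wide < std::numeric_limits<int32_t>::min())
            return std::numeric_limits<int32_t>::min();
        return static_cast<int32_t>(wide);
    }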
Most ancient computers and programming languages always generated exceptions on integer overflow.
Ignoring integer overflow is a modern method of cost reduction which moves the burden from CPU designers & manufacturers to the programmers who must now either be able to predict correctly whether integer overflows will ever happen or compile with options for overflow checking.
Due to the poor support for integer overflow checking in most modern CPUs, this option causes a much greater performance penalty than really necessary.
Fortunately the existence of the IEEE 754 standard has prevented such cost reduction measures for floating-point operations, but for integer operations there is no standard to force the CPU makers to do the right thing.
> Unfortunately I am not aware of any programming language that either allows the use of all 3 kinds of types or at least uses appropriate names for the available types.
Ada. You can specify your integer type range (not just integers, but sticking with that here):
subtype Celsius is Integer range -273 .. 1000; -- permits implicit, but range checked, assignment with Integer types and other subtypes of Integer like Natural or Positive
type Celsius is range -273 .. 1000; -- no implicit conversion
type Byte is mod 2**8; -- permits overflow, 255+2 = 1
type Non_Modular_Byte is range 0..(2**8 - 1); -- No overflow, 255 + 2 => Exception
I want to add something the author of the article has correctly noticed but not expressed clearly: between any two modular number types with distinct moduli there cannot be a subtype relationship.
That means that in languages with modular number types, e.g. C/C++, there should not exist any implicit conversions between modular numbers of different sizes or between a modular integer type and an interval integer type.
That means that all promotions between a smaller unsigned type and a larger unsigned type or between a signed integer type and an unsigned integer type, which exist in C/C++, are wrong and they should be deprecated.
On the other hand the implicit conversions between an interval type and another interval type including it are correct.
So only the implicit integer promotions
signed char => short => int => long => long long
are right in C/C++; all the other implicit integer conversions are wrong.
While uint32_t is not an uint64_t in the is-a sense, I don't see why any specific value of uint32_t can't be converted to an uint64_t. What kind of error would be prevented by disallowing this conversion?
For example, if you have an uint8_t with the value 200, that denotes either the number 200 or the number 456 or the number 712 and so on.
If you have an uint16_t with the value 200, that denotes either the number 200 or the number 65736 or the number 131272 and so on.
So it is obvious that the two "200" do not mean the same thing.
If you do a longer sequence of operations with uint8_t or uint16_t, inserting an implicit integer promotion in the middle of the chain of operations will change the end result, depending on the place where it is inserted. (E.g. with uint8_t, 200*200 will be either 64 or 40000, depending on whether the promotion to int is done before or after the multiplication.)
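Concretely, a toy C++ illustration of that difference (nothing here is from the comment itself):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint8_t a = 200, b = 200;

        // Promotion first: the operands widen to int, so the product is 40000.
        int promoted = a * b;

        // Modular interpretation: forcing the product back into uint8_t
        // reduces 40000 mod 256, giving 64.
        int modular = static_cast<uint8_t>(a * b);

        std::printf("%d %d\n", promoted, modular);   // prints "40000 64"
    }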
Of course most programmers do not actually use the "unsigned" numbers as modular numbers but as a surrogate for the lack of proper "unsigned" interval numbers and they attempt to avoid any operations which will cause modular reduction.
If the programmer knows that no modular reduction has occurred, he/she can write an explicit type conversion from a smaller "unsigned" type to a larger type.
The compiler cannot know which was the intention of the programmer. Because the "unsigned" numbers are defined as modular numbers, not as interval numbers, any implicit conversion can be erroneous. If the conversions are actually correct, the programmer should write them explicitly.
> Unfortunately I am not aware of any programming language that either allows the use of all 3 kinds of types or at least uses appropriate names for the available types.
Rust.
u64 - Unsigned ints
i64 - Signed ints
Wrapping<u64> - Ints mod 2^64
There's also Wrapping<i64>: i64 with wrapping (two's complement) arithmetic.