I wouldn't describe this as "no-one knows" the type of char + char.
I know what the type of char + char is. I know that it's either int or unsigned int, depending on the ranges of values supported by types char and int. I know what it is for any given implementation. And I know that it's int, not unsigned int, for every implementation I've ever used or am likely to use.
Implementation-defined features are not some unsolvable mystery. They're just implementation-defined.
And we can count the implementations behind 99.999% of real-world use on one hand. If you're on a strange platform, there's a reason for that, and chances are you're acutely aware of any differences or will be writing assembly anyway.
This confirms one of the guidelines I've always been taught: char is not an arithmetic type, and you should never treat it as such. It represents ASCII characters, and nothing else.
char is an arithmetic type, but it rarely makes sense to treat it as one, because its signedness is implementation-defined. If you want a very narrow integer type, both signed char and unsigned char are arithmetic types, and can reasonably be used that way. (Arrays of unsigned char are also used for raw memory.)
And you should understand how char, signed char, and unsigned char behave when you do use them as arithmetic types.
Promotion to int or unsigned int, depending on the range of the type, can be confusing. The same applies to all integer types with lower rank than int, including short, unsigned short, and intN_t and uintN_t for N==8 (and probably for N==16, and maybe for larger N).
Note also that this:
char c = '0';
++c;
is guaranteed to set c to '1'. (This guarantee applies only to decimal digits, not to letters.)
'0' through '9' are guaranteed to be contiguous, so doing arithmetic around that fact is legitimate. And char does not generally represent ASCII characters; it could be some other charset.
(u)int8_t has the same problems, including aliasing, because in practice it's just an alias for (signed/unsigned) char. Sometimes it's nice to have modular arithmetic mod 256, or a compact memory layout for e.g. count sketches.
If int8_t exists[1], then you know that char is 8 bits[2], and therefore that char in char + char always promotes to int, because int must have at least 16 value bits and a 16-bit int can represent any 8-bit char value regardless of signedness.
[1] int8_t is not required.
[2] char is the fundamental unit of addressability. sizeof (char) always evaluates to 1, sizeof (int8_t) must be nonzero, char must be at least 8 bits, and int8_t must be exactly 8 bits; therefore sizeof (int8_t) == sizeof (char) and CHAR_BIT == 8.
It's worth mentioning explicitly that while `c - 'a'` is the more obvious application of character addition in your example, `c >= 'a'` is another one that's even more common. Pretty much everyone immediately understands that we want to be able to sort characters.
Yeah- but the problem of poorly-defined result type isn't present for comparison operators, since a bool is a b... oh, hold on, C. Since an int is an int. Sigh.
Adding two characters isn't strictly needed for that -- you're relying on the assumption that (c - 'a') is of type character, but it's actually the offset between two characters. The rules for those two types would be:
char + char = invalid
char + offset = offset + char = char
offset + offset = offset
char - char = offset
char - offset = char
offset - char = invalid
offset - offset = offset
Given that, (c - 'a') + 'A' is perfectly valid without adding two characters.
That relationship is valid for ASCII, and for character sets derived from ASCII, but it's not guaranteed by the language. In particular, in EBCDIC the alphabet is non-contiguous.
It does not. Subtracting a char from a char involves usual arithmetic conversions as well and the result is typically an int. Next, you have addition between an int and a char.
That's a good question, now that we have [u]int8_t, [u]int16_t, etc. for explicit bitness values. (Although both can have more than specified number of bits on some platforms.)
But `uint16_t + uint16_t` has exactly the same problem -- if it's a typedef for `unsigned short`, there will be promotion to either `int` or `unsigned int`.
A multiplication `uint16_t * uint16_t` can still cause an overflow after promotion to signed int, which is undefined behavior!
So "unsigned types wrap around" doesn't apply to `uintN_t`, because you can never know for sure whether those types are "smaller than int" and thus get promoted to signed types when you do any arithmetic.
Of course, in practice this just means: every C and C++ program relies on tons of implementation-defined behavior. An `int` wider than 32 bits would break most code in existence (e.g. hash code computations using `uint32_t`).
> because you can never know for sure whether those types are "smaller than int" and thus get promoted to signed types when you do any arithmetic
You can deduce the width (number of sign + value bits) of the standard integer types from their limits (e.g. INT_MAX, INT_MIN, etc). The problem has been that this is non-trivial if not impossible to do from the preprocessor. The next C standard will include width constants (e.g. INT_WIDTH) for the standard integer types.
How would you form strings by adding 'char' type values together? This is not about a concatenation operation; we're talking about C/C++, which have no syntactic sugar for concatenating chars.
(In C++, you of course have operator overloading, that's how std::string concat sugar works.)
'1' + '1' == 'b'. Because 49 + 49 == 98. ASCII '1' == 49, and 'b' == 98.
>I think that char – char should definitely be legal. The distance between characters is well defined. Same for char + numeric. Both logically makes sense. I think a good analogy might be floors in a building. Asking what’s the distance between the second and seventh floor makes sense, or what’s two floors above the 4th. But the question ‘what’s the 5th floor plus the 6th floor’ doesn’t make sense.
>Affine space describes these kind of relationships in mathematics. Eg position and disposition in n dimension, or count and offset in buffers, even timestamp and duration.
I've been writing C for 25 years... and while I technically know "the answer," it's effectively a closed door in my mind because I don't always know where my code will end up.
A sadistic part of me would prefer if it was interpreted as a bitwise and... not because that's good or reasonable or smart... but to punish the behavior. But then that backfires when people use it for underhanded code.
Yes, yes. The spec is filled with anachronisms that are no longer pertinent on today's machines. char + char gets promoted to int every time in today's compilers. Try it out here: https://godbolt.org/z/V5HEvV
There is what the standard says, and there is what people actually do. If everyone promotes char to int in practice, then any machine where this doesn't happen is going to have a tough time running the bulk of code out there.
In a standards committee, the standard is the standard, in practice common practice is the standard.
50-50 anachronisms vs flexibility to allow C to run on novel machines that we don't currently envision. Sure, it would be nice for developers on today's machines to reduce it to the conventional subset.
If by novel you also imply compatible, then sure. The moment you create a machine that's incompatible with the conventions adopted by the most popular compilers and architectures, you break a ton of software built upon those conventions, and sink your hardware in the market because it's a portability nightmare.
Specs don't matter beyond the conventions they inspire.
A machine where char is as large as int is unlikely in practice, as it isn't very useful. C11 (at least) requires int to cover at least the range of an int16_t type (INT_MAX >= 32767, INT_MIN <= -32767).
That said, the int promotion alone may be surprising / nonobvious to some people (it was to me, when I learned about it!).
Also, if char is signed, char + char may be UB and with known overflowing values the compiler may deduce it's a can't-happen situation, generating code accordingly. Or when encountered at runtime, it may hose your program state arbitrarily, etc.
There are hardly any systems relevant today for which adding two char would result in an unsigned int. So basically just treat it as int and call it a day.