
I had an online discussion some years back in which I suggested that C nail the size of char to 8 bits. The other party responded that there was a CPU whose chars were 32 bits, and wasn't it great that a C compiler for it would be Standard compliant?

I replied by pointing out that nearly every non-trivial C program would have to be recoded to work on that architecture. So what purpose did the Standard's allowing that actually serve?

I also see no problem with a vendor of a C compiler for that architecture creating a reasonable dialect of C for it. After all, to accommodate the memory architecture of the x86, nearly all C compilers in the '80s adopted near/far pointers; while not Standard compliant, that really didn't matter, and the approach was tremendously successful.

D made some decisions early on that worked out very well:

1. 2's complement wraparound arithmetic

2. sizes of the basic integer types are fixed: 1 byte for char, 2 for short, 4 for int, 8 for long

3. floating point is IEEE

4. chars are UTF-8 code units (*)

5. chars are unsigned

These five points make for tremendous simplicity gains for D programmers and, ironically, increase the portability of D code.
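
For illustration, a minimal sketch in D itself of points 1, 2 and 5 (assuming only the guarantees listed above):

    // Point 2: the basic integer sizes are the same on every D target.
    static assert(char.sizeof == 1 && short.sizeof == 2);
    static assert(int.sizeof == 4 && long.sizeof == 8);

    // Point 5: char is unsigned.
    static assert(char.min == 0 && char.max == 255);

    void main()
    {
        // Point 1: signed overflow wraps around (2's complement); it is not UB.
        int x = int.max;
        assert(x + 1 == int.min);
    }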

After reading the paper, I'm inclined to change the definition of UB in D to not mean it can be assumed to not happen and not be unintended.

(*) thanks for the correction




> So what purpose did the Standard's allowing that actually serve?

I believe the situation was that there were C implementations for DSPs (32-bit-addressable only) and IBM mainframes (36-bit addressable only), and when ANSI/ISO C was established, they naturally wanted their implementations to be able to conform to that new standard. So the standard was made flexible enough to accommodate such implementations.

The same applies to why signed overflow is undefined behavior: there were existing implementations that trapped (with a CPU interrupt) on signed overflow.

I might have gotten the details wrong, but that's what I remember from reading comp.std.c (Usenet) in the 90s.


I had another such discussion where I suggested that C abandon support for EBCDIC. I was told it was great that C supported any character set! I said C certainly does not, and gave RADIX50 as an example.

How many C programs today would work with EBCDIC? Zero? There's no point nowadays in C not requiring ASCII, at a minimum.


FWIW, C++ dropped support for EBCDIC when it dropped trigraphs. IBM complained till the last minute.


What is it about RADIX50 that makes it incompatible with the C standard? The lack of a reserved null value?


https://en.wikipedia.org/wiki/DEC_RADIX_50

Note the lack of lower case, characters like { } ( ) [ ] * : & | # !, and on and on.


Oh, as a source file encoding. Well, it lacks the characters needed for trigraphs, so maybe we could add support with a quadgraphs extension...


> I was told it was great that C supported any character set

I'm all for extensibility mechanisms to shut down people like this. You want EBCDIC? It's now a GitHub repository; go ahead and maintain it, or shut up.


> char's are UTF-8 code points

I'm guessing you mean that char is a UTF-8 code unit, since you keep saying they're only one byte and a code point is far too large to fit in a byte/octet.

But that still seems very weird, because a UTF-8 code unit is almost but not quite the same as a byte. Users might be astounded when they can't put a byte into a char in this scheme (because it isn't a valid UTF-8 code unit), and yet almost no useful value is gained by such a rule.


Ah, code unit is correct. I am always getting code unit and code point conflated. Sorry about that.

> a UTF-8 code unit is almost but not quite the same as a byte

D dodges that problem by having `byte` for signed 8 bits, and `ubyte` for unsigned 8 bits. That gets rid of the "is char unsigned or signed" confusion and bugs.
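
A small sketch of the three 8-bit types (nothing assumed beyond their built-in properties):

    // byte is signed 8 bits, ubyte is unsigned 8 bits, and char is unsigned
    // and intended for UTF-8 code units; they are three distinct types.
    static assert(byte.min == -128 && byte.max == 127);
    static assert(ubyte.min == 0 && ubyte.max == 255);
    static assert(char.min == 0 && char.max == 255);
    static assert(!is(char == ubyte) && !is(char == byte));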


Aha, so if my D code doesn't care about text at all, I only need to worry about ubyte (or if I want them signed, byte) and char is irrelevant to me. That makes some sense.

Still it seems weird to me to separate the concerns "This byte isn't a valid UTF-8 code unit" and "This string isn't valid UTF-8". I don't know what flavour of error handling D has, but I can't conceive of too many places where I want to handle those cases differently, so if they're separate I have to do the same work twice.

Also, one very natural response to invalid UTF-8 is to emit U+FFFD (the replacement character �) but of course that doesn't fit into a UTF-8 code unit, so an API which takes a ubyte and gives back a char must error.


I think you'll find that once you get used to the idea that char is for UTF-8, and ubyte is for other representations, it not only feels natural but the code becomes clearer.

As for invalid code points, you get to choose the behavior:

1. throw exception

2. use the replacement char

3. ignore it

All are valid in real life, and so D is accommodating.
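
For example, here's a sketch of all three using std.utf from Phobos (my reading of the documentation is that byDchar substitutes U+FFFD for invalid sequences; treat this as an illustration rather than a definitive recipe):

    import std.algorithm : filter;
    import std.utf : byDchar, decode, replacementDchar, UTFException;

    void main()
    {
        ubyte[] raw = [0x68, 0x69, 0xC0];  // "hi" followed by an invalid lead byte
        auto s = cast(string) raw;         // reinterpret as UTF-8, no checking yet

        // 1. throw an exception: decode() throws at the bad sequence
        size_t i = 2;
        try
        {
            auto cp = decode(s, i);
        }
        catch (UTFException e)
        {
            // reject, log, or repair as appropriate
        }

        // 2. use the replacement char: the bad sequence comes out as U+FFFD
        foreach (dchar d; s.byDchar)
        {
            // d is 'h', 'i', then replacementDchar
        }

        // 3. ignore it: drop the replacement characters entirely
        auto cleaned = s.byDchar.filter!(d => d != replacementDchar);
    }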


What is D's type for a Unicode character (code point)?


A `string`, which is an alias for `immutable(char)[]`: a read-only view of an array of chars (UTF-8 code units).


Note that, because of the above definition of char (any UTF-8 code unit), these "strings" aren't necessarily valid UTF-8, and you might be better off with D's dchar type (which is UTF-32 and thus can hold an entire Unicode code point).

It's unclear to me whether there's actually any enforcement to ensure a char can't become, say, 0xC0 (this byte can't occur in UTF-8, because it claims to be the prefix of a multi-byte sequence, yet it also implies that sequence should only be one byte long... which isn't multi-byte).

I spent a while reading the D website and came away, if anything, more confused on this topic.


Checking to see if the string is valid UTF-8 requires code to execute, which gets in the way of performance. Therefore, you can check with functions like:

https://dlang.org/phobos/std_utf.html#validate

when it is convenient in your code to do so.
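
A small sketch of how that looks in practice (as far as I can tell, nothing stops an individual char from holding 0xC0; the checking happens at the string level):

    import std.utf : validate, UTFException;

    void main()
    {
        char c = 0xC0;              // accepted: there is no per-char validity check
        ubyte[] raw = [0x61, 0xC0]; // 'a' followed by an invalid lead byte
        auto s = cast(string) raw;  // constructing the string does not check either

        try
        {
            validate(s);            // throws UTFException at the 0xC0
        }
        catch (UTFException e)
        {
            // handle the malformed input here
        }
    }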


I don't see much value in a distinct char type which has no real semantics beyond being an unsigned byte.

Calling validate on a whole string makes at least some sense; I don't love it, but it's not crazy. But it's obviously never worth "validating" a single char, so why not just call them bytes (or ubytes, if that's the preferred D nomenclature)?


There are some extra semantics that strings have in D, but you have a point. Why make it special?

It's because strings are such a basic thing in programming that baking them into the language has a way of standardizing them. Instead of everyone having their own string type, there is one string type.

One of the problems with strings not being a built-in type is: what does one do with string literals? (C++ has this problem with its library string type.) An array of bytes in a function signature gives no clue whether it is intended to be a string or an array of other data.

It's just a big advantage to build it in and make it interact smoothly with the other core language features.
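
A small sketch of the signature point (the function names are invented for illustration):

    import std.stdio : writeln;

    void logMessage(string text)             // clearly expects UTF-8 text
    {
        writeln(text);
    }

    void sendPacket(const(ubyte)[] payload)  // clearly expects raw binary data
    {
        writeln(payload.length, " bytes");
    }

    void main()
    {
        logMessage("hello");                 // string literals already have type string
        sendPacket([0xDE, 0xAD, 0xBE, 0xEF]);
    }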


> I had an online discussion some years back in which I suggested that C nail the size of char to 8 bits. The other party responded that there was a CPU whose chars were 32 bits, and wasn't it great that a C compiler for it would be Standard compliant?

Back in C's infancy, there existed architectures that C compilers had to be written for where a byte held 9 bits. The 36-bit PDP-10 architecture springs to mind, and some Burroughs or Honeywell mainframes had them. I remember reading a little C reference book authored by Kernighan, Ritchie and somebody else that explicitly called out that a C implementation could not rely on a byte always being 8 bits long, and also stressed that the `sizeof` operator reports the number of bytes in a type irrespective of the bit width of the byte.

9-bit-byte architectures have all but perished; C, however, has carried along the legacy of those more creative days of computer architecture design.


I know. I've programmed on 36-bit machines (the amazing PDP-10) and on machines with 10-bit bytes (the Mattel Intellivision).

But that was over 40 years ago.

Time to move on.


> After reading the paper, I'm inclined to change the definition of UB in D to not mean it can be assumed to not happen and not be unintended.

What's the current definition of UB in D?


Same as in C.


I don't think it would be a very good outcome if people forked C such that everyone working on DSP platforms and new platforms that you just haven't heard of had to use a fork with flexible CHAR_BIT while the standard defined it to be 8. Who is served by this forking? Plenty of software works fine with different CHAR_BIT values, although some poorly-written programs do need to be fixed.


It'd almost certainly have to be a fork anyway.

> Plenty of software works fine with different CHAR_BIT values,

Most won't. And good luck writing a diff program (for example) that works with CHAR_BIT == 32.

> although some poorly-written programs do need to be fixed.

Blaming all the problems on poorly-written programs does not work out well in real life. Good language design makes errors detectable and unlikely, rather than blaming the programmer.


Somehow it never occurred to me to run `diff` on a SHARC. Horses for courses.


> Plenty of software works fine with different CHAR_BIT values, although some poorly-written programs do need to be fixed.

Since C traditionally doesn't have much of a testing culture, I'm not sure you can trust any decent-sized project here. I'd even be surprised if you could change the size of `int`. And moving off 2's complement is definitely out.


`long` is such a mess in C that I never use it anymore. I use `int` and `long long`. `long` is so useless today that it's a code smell, and the Standard should just delete it.


Both Linux (in C) and Rust choose to name types based on the physical size so as to be unambiguous where possible, although they don't entirely agree on the resulting names.

Linux and Rust agree that u32 is the right name for a 32-bit unsigned integer type and u64 is the name for the 64-bit unsigned integer type, but Rust calls the signed 32-bit integer i32 while Linux names it s32, for example.

It would of course be very difficult for C itself to declare that it is now naming the 32-bit unsigned integer u32, due to compatibility. But Rust actually could adopt the Linux names (if it wanted to) without issue, because of the Edition system: simply say that in the 2023 edition we're aliasing i32 as s32, and it's done. All your old code still works, and raw identifiers even let any maniacs who named something important "s32" still access it from modern "Rust 2023" code.


I didn't choose the i32-style names for the integer types because:

1. they are slower to type

2. they just aren't pretty to look at

3. saying and hearing `int` is so much easier than `eye thirty-two`. Just imagine relying on a screen reader and listening to all those eye-thirty-twos

4. 5 minutes with the language, and you know what size `int` is, as there is no ambiguity. I haven't run across anyone confused by it in 20 years :-)


If I were designing a new language, I'd rather specify integer types by their range of expected values and let the storage size be an implementation detail.

(Not sure if this would ever come up, but if a variable had, e.g., values 1000-1003, then technically you could optimize it down to a 2-bit value.)
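
For what it's worth, here's a hypothetical sketch of that idea as a D library type (names invented; it only chooses among the built-in widths rather than packing down to the minimal bit count):

    struct Ranged(long lo, long hi)
    {
        // Pick the smallest built-in type that can represent the declared range.
        static if (lo >= ubyte.min && hi <= ubyte.max)      alias Store = ubyte;
        else static if (lo >= short.min && hi <= short.max) alias Store = short;
        else static if (lo >= int.min && hi <= int.max)     alias Store = int;
        else                                                alias Store = long;

        private Store raw;

        void opAssign(long v)
        {
            assert(v >= lo && v <= hi);  // range check at the assignment boundary
            raw = cast(Store) v;
        }
    }

    unittest
    {
        Ranged!(1000, 1003) year;
        static assert(typeof(year).Store.sizeof == 2);  // 1000..1003 fits in a short
        year = 1001;
    }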


Depending on what your language is for, you probably want both.

The WUFFS type base.u8[0 ..= 42] fits in one byte but only has values zero through 42 inclusive, and it's a different type from base.u32[0 ..= 42], because that one takes four bytes.

Rust cares about this stuff to some extent, but only internally, for types with a "niche" where it can squirrel away the None case of Option in an invalid value without needing more space. For example, Rust's optional pointer is the same size as a traditional C-style pointer, because NULL isn't a valid pointer value so that means None, and Rust's optional NonZeroU64 is the same size as a u64 (64 bits) because zero isn't valid so that means None.


Ada is what you're looking for!


Don't. Every type has its place in a well-defined structure that has survived the major architecture changes. Use `int` if the expected range of values fits in 16 bits, use `long` if it fits in 32 bits, and use `long long` otherwise. `int` is guaranteed to be the fastest at-least-16-bit type and `long` is guaranteed to be the fastest at-least-32-bit type for any architecture (matching the register width on most 32/64-bit platforms, except Windows, and being 32 bits wide even on 8/16-bit architectures), and each of these types has its own promotion class.

Don't use fixed-width types by default, as printing them or converting them to strings correctly gets ugly: printf("... %" PRId32 " ...");

Generally avoid the use of fixed-width 8-bit or 16-bit types in calculations, or you might shoot yourself in the foot due to the integer promotion rules.

Use fixed-width types only for things that cross a system boundary, like HW registers, serialization, database storage, or network packets; but cast to the fixed-width type only at the end of a conversion (optionally followed by bitwise manipulation), and cast immediately to another type when processing (after bitfield extraction and such things, of course).



