Some things every C programmer should know about C (2002) (archive.org)
223 points by kick on Dec 9, 2019 | 108 comments



> Bitfields may appear only as struct/union members, so there are no pointers to bitfields, arrays of bitfields

You can certainly have a pointer to a struct member, so this isn't the reason why you can't have pointers to bitfields. The reason is that bits are not addressable, full stop.


Bits are addressable, just not with a normal pointer. It would have been possible to have a special fat pointer for bits, similar to how C++ sometimes has fat pointers for member functions (depending on compiler implementation).

The restriction in C can only be explained as a limitation of the language itself -- although probably motivated by the implementation complexity it would require, for a niche use case.


I didn't mention pointers regarding bits, I mentioned addressability - a bit cannot have an address (in any language I'm aware of), though of course you can have any number of ways of accessing it.


Pointers, as a language concept, don't have to correspond to the addressing schemes of the hardware or ISA. On some architectures instructions may only be able to address aligned whole words. Some microcontrollers (e.g. Intel MCS-51) feature bit-addressable memory. Apparently, there's a special __bit type supported by the Small Device C Compiler for using bit addressable memory on such devices, although I don't know if it has support for taking pointers to these.


They do not have to. But then it wouldn't be C, which by design has a straightforward and obvious mapping to the underlying machine.

For example, there are machines (some DSPs) on which individual octets are not efficiently addressable, and a C byte on such machines is usually 16 or 32 bits.


Pointers are very much a language concept and very much not an architecture concept. I enjoy this particular writeup that touches on some of the distinctions. Of particular interest is the fact that the C standard itself states that two pointers are not equivalent simply by virtue of having the same address value.

https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html

I also happen to very much enjoy this piece on how the C abstract machine has very little in common with modern architecture.

https://queue.acm.org/detail.cfm?id=3212479


This exchange was an enjoyable read. C was designed for portability because they had those PDP computers or whatever they were, but the problem was that each had its own unique architecture, switch arrangement for operation, and maybe even endianness. I don't know. The whole point of the matter was to make a computer language portable enough that a person could reasonably write a compiler for the architecture. Why people do not like that, I cannot comprehend.


They don't have to, but they're commonly understood to refer to memory addresses, which, on most ISAs, are locations of octets.

Even if the ISA only allows word- or dword-aligned loads from memory, the addresses still typically enumerate bytes, not words or dwords.

Based on a quick summary of the MCS-51 that I googled up, it looks like its memory addressing scheme still assigns addresses to bytes, and has special operations that allow you to further specify a bit offset within that memory address.


> it looks like its memory addressing scheme still assigns addresses to bytes, and has special operations that allow you to further specify a bit offset within that memory address.

There are also instructions which use an addressing scheme which takes an 8-bit bit address, with the 0x00 - 0x7f corresponding to lower memory, and 0x80 - 0xff corresponding to 16 specific registers in the Special Function Register set.


The 8051 has bit addressable memory.


Isn't a byte supposed to correspond to the smallest addressable unit of memory?


The original usage of the term "byte" was to refer to fields of variable length consecutive bits on a bit-addressable machine: https://en.wikipedia.org/wiki/Byte#History

Nowadays a byte is conventionally eight bits, especially for measures like "megabyte", but the term octet is often used to avoid ambiguity. Byte addresses are commonly what pointers hold, yet often only whole words are addressable by machine instructions (e.g. many ARM instructions take a byte address yet raise a hardware exception on use of unaligned addresses).


Interesting, but I think the notion of a byte in C is different. But I'm not able to look it up at the moment.


It is, but it's defined rather weirdly:

"byte: addressable unit of data storage large enough to hold any member of the basic character set of the execution environment"

Hence why the type that corresponds to it is "char"! Beyond that, the only thing that kinda sorta implies that it's the smallest addressable unit is the definition of CHAR_BIT:

"number of bits for smallest object that is not a bit-field (byte)"


I think what you're saying, in other words, is that the C standard defines sizeof(char) == 1, so char is by definition one byte; however, different architectures can have an addressable unit whose size differs from 8 bits, so 1 byte is not always 8 bits.

This might be why the basic character set is defined by the standard: it at least puts an emphasis on 8 bits == 1 byte.


C definitely doesn't require bytes to have 8 bits - it only requires them to have at least 8 bits. And there are architectures on which C char has as many bits as int (SHARC).

The question, though, was about whether it's the minimum addressable unit of memory. In the C memory model, it is, but by implication - you can't have two pointers that compare non-equal, but differ by less than 1, so a type with sizeof==1 is by definition the smallest you can uniquely address. However, the C memory model doesn't have to reflect the underlying hardware architecture.
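A small sketch of what "smallest addressable unit" means in the C memory model (the addresses printed are simply whatever the implementation gives you):

  #include <stdio.h>

  int main(void)
  {
      int x = 0;
      unsigned char *p = (unsigned char *)&x;

      /* Each char-sized step yields a distinct address; nothing smaller
         than a byte is addressable through a C pointer. */
      for (size_t i = 0; i < sizeof x; i++)
          printf("byte %zu of x lives at %p\n", i, (void *)(p + i));
      return 0;
  }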


SHARC has no such requirement. Having char and int the same size was not universal. The CPU vendor shipped such a compiler, but that was not the only compiler.

The CPU itself used 32-bit addresses to access machine words, the size of which was determined by what was being accessed. External memory was limited to 32-bit. Internal memory had regions that could be 32-bit, 40-bit, or 48-bit. An address increment of 1 would thus move by that many bits.

Mercury Computer Systems shipped a byte-oriented port of gcc. Pointers to char and short were rotated and XORed as needed to reduce incompatibility. Pointers to larger objects were in the hardware format. This allowed a high degree of compatibility with ordinary software while still running efficiently when working with the larger objects. There was also a 64-bit double, unlike the 32-bit one in the other compiler. Data structures were all compatible with PowerPC and i860, allowing heterogeneous shared memory multiprocessor systems.


You can implement byte addressing on any architecture, of course. That's what I meant by "the C memory model doesn't have to reflect the underlying hardware architecture". But as you point out yourself, this requires pointers which are basically not raw hardware addresses, and which are more expensive to work with, because they require the compiler to do the same kind of stuff it has to do for bit fields. So the natural implementation - with no unexpected perf gotchas - tends towards pointers as raw hardware addresses, and thus char as the smallest unit those can address.


It may well vary depending on which C standard you're talking about. ISO C defines both a byte and a char as at least long enough to contain characters "of the basic character set of the execution environment". They must be uniquely addressable. Although it seems their definitions don't preclude them from being different, or from sub-bytes being uniquely addressable by pointers.


> Bitfields may appear only as struct/union members[, so bitfields may NOT appear as referents of pointers, elements of arrays, etc], so there are no pointers to bitfields, arrays of bitfields, etc


Yes, that's straightforward at least when coming to C from assembly (as it used to be the case once upon a time...) In practice, though, bitfields are seldom used, so everything related to them is largely academic.


Bitfields aren't commonly used, especially in public interfaces. But usage is far more prevalent than for typical "largely academic" features. For example,

  $ grep -rE ':[[:digit:]][[:digit:]]*;' /usr/include/ | wc -l
     702
  $ uname -a
  OpenBSD orville.25thandClement.com 6.6 GENERIC.MP#3 amd64

  $ grep -rE ':[[:digit:]][[:digit:]]*;' /usr/include/ | wc -l
  276
  $ uname -a
  Linux alpine-3-10 4.9.65-1-hardened #2-Alpine SMP Mon Nov 27 15:36:10 GMT 2017 x86_64 Linux

  $ grep -rE ':[[:digit:]][[:digit:]]*;' /usr/include/ | wc -l
  532
  $ uname -a
  Linux splunk0 5.0.0-36-generic #39-Ubuntu SMP Tue Nov 12 09:46:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


they're also more commonly used in security tooling


Not C but I used bitfields pretty extensively for the toy 24-bit virtual machine I wrote in D. Internally, it's how I represented registers, bytes and words. There's only 16MB available for addressing so I wanted individual bits to be accessible. It was the easiest way I found to do that.


Actual bitfields are used extensively. I am obviously talking about the 'bitfield' C feature, which isn't.


D's bitfields are fairly similar to C's. They're more portable though, as far as I know, because you can specify the endianness, as opposed to C, which uses the platform's endianness. From what I remember that's the main reason C's bitfields are a poor choice: such code isn't portable to architectures with a different endianness.


Yes, we're discussing C's bitfields...


Largely academic? That is just not true in any sense. Bitfields are extremely important in many cases, not the least of which is making lock free data structures where you need to pack in information that can be atomically swapped.


I have been developing professionally in C on embedded systems for 20 years and never had to interact with the 'bitfield' C feature.

In fact, many internal coding standards forbid its use because it is poorly defined and compiler-dependent.

Moreover, it does not guarantee atomicity so if atomicity is critical then you want to access the actual bitfield 'manually'.


When dealing with lock free algorithms the number of bits on which you can do a compare-and-swap is limited. For x64 it is 64 bits unaligned or 128 bits aligned. Needing to use all of these bits as efficiently as possible is common, and bitfields are a big part of this. Whether you use bitfields frequently in your work is irrelevant.


I think you're still missing the point that we're specifically discussing the 'bit field' C language feature, not bit fields in general. I do use bit fields, everyone does at one point or another when working on embedded systems. But commonly this is done 'manually' (with masks and shifts) without using C's support for bit fields.
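For example, the 'manual' style usually looks something like this (register layout, field names, and widths invented purely for illustration):

  #include <stdint.h>

  /* Hypothetical status word: bits 4..6 hold the IRQ level. */
  #define IRQ_LEVEL_SHIFT 4
  #define IRQ_LEVEL_MASK  (0x7u << IRQ_LEVEL_SHIFT)

  static inline uint32_t set_irq_level(uint32_t status, uint32_t level)
  {
      /* Clear the field, then OR in the new value, masked to its width. */
      return (status & ~IRQ_LEVEL_MASK)
           | ((level << IRQ_LEVEL_SHIFT) & IRQ_LEVEL_MASK);
  }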


I would rather let the compiler do that. If that's what you were talking about, why did you say 'bitfields aren't atomic' ?


They are very common and useful for emulators.

They are indeed trouble for super-portable interfaces. They also tempt some people into trouble with hardware drivers, but then that would already be trouble anyway due to instruction reordering in CPU or memory operation reordering beyond the CPU. (volatile is neither sufficient nor necessary)

In a normal emulator, none of that is a problem. Reasonable uses of bitfields are compatible between the x86_64 ELF ABI (Linux, etc.) and the Windows ABI. What you gain is C code that is simultaneously concise and readable.

For example, a CPU instruction can be represented in a header file by a union of anonymous structs, each of which contains a padding bitfield and a bitfield for part of the instruction. Access in the C file is then just as easy as for normal struct members. It works for non-CPU hardware too, like DMA descriptors and motherboard registers.
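Roughly like this, to sketch the idea (field names and widths invented; the exact bit layout is implementation-defined, and anonymous structs need C11 or a compiler extension):

  #include <stdint.h>

  typedef union {
      uint16_t raw;                 /* the instruction word as fetched */
      struct {                      /* register-register form */
          unsigned opcode : 4;
          unsigned rd     : 4;
          unsigned rs     : 4;
          unsigned funct  : 4;
      };
      struct {                      /* immediate form */
          unsigned        : 4;      /* padding over the opcode bits */
          unsigned imm    : 12;
      };
  } insn_t;

  /* With an insn_t insn, access then reads like insn.opcode or insn.imm. */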


Macros make code concise and readable.

As I said, C's bit fields are often banned in coding guidelines.


How do you get a macro to make an L-value out of a bitfield? That is, it goes on the left of an assignment. I'm doing stuff like this:

ethdev->phy.foo = val; // foo is a bitfield in struct phy

cpu->status.irqlevel = newlevel; // irqlevel is a bitfield in the CPU's status word

No, I don't want a 2-argument macro (or worse, with an implied variable) called something like SET_FOO or SET_IRQLEVEL.

Crummy coding guidelines are not a useful argument against the value of bitfields in the C programming language. Coding guidelines can also ban floating-point types, recursion, or symbol names longer than 6 characters. The problem is not C. The problem is the coding guidelines.


Just because you do not understand something and its reasons for existing (though I hinted at them in previous comments) doesn't make it 'crummy'...


This reminds me somewhat of Deep C (2011), a great read for C/C++ programmers: https://www.slideshare.net/olvemaudal/deep-c


One mistake in the first native C compiler was that all struct arms were in the same namespace, so

  struct a {
    struct *a next;
  };

  struct b {
    struct *b next;
  };
was illegal because the second declaration of "next" is a redeclaration.


This is true. It also explains why a lot of old-time Unix struct member names (including a lot that made it into POSIX or are otherwise in use today) have redundant-seeming prefixes. eg. struct stat has st_mode and not simply mode. struct sockaddr_in has sin_addr and not simply addr.


That's still common today for subsequent interfaces as it makes it possible to write macros for preserving API compatibility. For example, to upgrade (struct stat).st_mtime to sub-second granularity you could do:

  #define st_mtime st_mtimensec.tv_sec
POSIX preserves many prefixes for this reason. C99 brought anonymous unions (a Plan 9 invention), which makes many of these API tricks possible without using macros.


Correction: C11 brought anonymous unions, which perhaps explains why they're not used in standard headers. Standard headers on most unix systems do make use of C99 features these days.


Thanks for looking that up, I didn't know the timeframe on that feature.

I've known for a long time that Microsoft headers make use of anonymous unions, which makes it a rare case where they have implemented a recent ISO C feature ahead of time. I seem to recall GCC in the same timeframe would accept this with a warning that it's nonstandard. [I guess not anymore.]


In addition to C11's anonymous untagged members,

  struct foo { union { int i; double d; }; }
Visual Studio also supports anonymous tagged members,

  union bar { int i; double d; };
  struct foo { union bar; };
Is the latter more common in Microsoft land? I've always thought the latter were more useful and wished they were standardized. Anonymous tagged fields make it possible to reuse plain struct and union definitions, permitting a simple form of inheritance; whereas with untagged fields you have to rely on macros for sharing definitions.

GCC (https://gcc.gnu.org/onlinedocs/gcc/Unnamed-Fields.html) and clang (confirmed macOS clang-1001.0.46.4) support anonymous tagged members with -fms-extensions, but I can't bring myself to make use of it as I still try to keep many of my projects working with Sun Studio and otherwise try to avoid non-standard features.


> Is the latter more common in Microsoft land?

I don't know if I have an authoritative sample on that. I saw it exactly once when I worked at MS and I thought it was weird. But I guess it aids in the "pseudo-inheritance" game that people play with structs and their first member being a "base class". Like GObject or gtk+ do. Or struct sockaddr, which duplicates the first few fields.


Ohhhh... damn. Obviously :)


I wouldn't call it a mistake. That's how the language was defined at the time. If it was a mistake, it was a mistake in the language, not in the compiler. (Admittedly, given the lack of a standard at the time, the line between the two is less clear than it is now.)


Did you mean

  struct a *next;
or was putting the asterisk before the type name legal back then?


If you view member access as a (global) function on a value (possibly an lvalue), then it actually makes a whole lot of sense.

I find code dealing with structures that have their members prefixed with a unique prefix much easier to read, usually.

Given the pervasiveness of APIs that contain struct definitions nowadays, it's also understandable that the unique-constraint was removed.


Indeed and Haskell still has this issue today, because member access is a top-level function.


Still, it creates all kinds of problems for code generation (templates) and coding by convention (as with generics).

I have not seen a case of it creating problems where there weren't already plenty of other problems with the code and metaprogramming had to be abandoned anyway. But it does create problems.


It would make more sense if fields actually were functions, or at least function-like in syntax.

Then again, OCaml still has module-scoped record field names today, even though it also uses the dot syntax.


In an era where line editors were a standard tool, and sed and awk were invented to do bulk changes on code, this is a defensible choice.


  > There is an implicit "x != 0" in: if (x), while (x), ... etc.
  > An explicit "x != 0" in these contexts serves no semantic purpose.
  > And "x == 0" in these contexts might be better written as "!x".
I disagree with this advice because the idea is not portable across languages.

For example, if we let the variable x hold the integer value 0, then:

  Java    'if (x)' is a compile-time type error.
  Python  'if x:'  is falsy, similar to C.
  Ruby    'if x'   is truthy!


While this is technically correct, idiomatic javascript differs from idiomatic ruby, which (that is the point) differs from idiomatic C.

So either no idioms are correct, or it is correct to apply the idioms pertaining to the language of choice.


The advice in the article is too broad because there are times when the comparison is not meant in the truthy/falsy sense. For example, strcmp and friends return 0 if the strings are equal, less than 0 if str1 is lexically less than str2, and more than 0 if str1 is lexically greater than str2.

So if(!strcmp(str1,str2)) is a valid way to check for equality, but I think the negation makes it confusing; it is tempting to intuitively read it as checking if the strings are not equal. if(strcmp(str1,str2) == 0) is the better choice.
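A side-by-side sketch of the two spellings:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *str1 = "abc", *str2 = "abc";

      /* Equivalent tests; the explicit form reads as the three-way
         comparison strcmp actually performs. */
      if (strcmp(str1, str2) == 0)
          puts("equal (explicit)");
      if (!strcmp(str1, str2))
          puts("equal (negation, easy to misread)");
      return 0;
  }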


Calling it (eg) strdiff instead would help here, if(!strdiff(str1,str2)) matches the intuitive interpretation.


It's a three way comparison, not an equality one. If you treat it like that then the return value makes more sense.


Sure - that's why it shouldn't be str_ne or similar. strDIFF (or rather, strcmp) returns the direction in which strings DIFFer - less than (negative), greater than (positive), or equal (zero).


Do you stick to features portable across all languages for all your code?


Yes. This is why I exclusively write polyglot source files, so you can choose to compile my code in either c++ or php or bash or whitespace.

Jokes aside, polyglot source files are pretty cool, and generally things like abuse of the c++ preprocessor are really neat to see, if completely unwise to do in production.


Maybe they only write polyglots :)


It's not meant to be portable. It's just that C is a low-level, weakly-typed language where everything is bytes and bits, and every pointer may be null at any moment. Disregarding type-safety is OK for C, so it's handy to conflate 0 and false (after all, bools are isomorphic to the integers modulo 2). It's these other languages that are in the wrong, because any high- or medium-level language must have proper Option and Bool types without null or false infecting every type out there.


Yeah... Ruby is in the wrong there. Not even js does that

Insisting if takes a bool like java does is perfectly acceptable. But just taking a 0 value and making it "true" makes sense only in Bash?


It depends on what the set of falsey values is for that language.

E.g. in Common Lisp, the only falsey value is NIL (the empty list). This is easy to understand, easy to remember, and straightforward in practice because idiomatic Lisp does a lot of list operations. It wouldn't make sense for 0 to be truthy, because that's a legit value, and NIL represents the absence of a value.

I don't know what values are falsey in Ruby, so I can't comment on it.


Lua does this as well, and I quite prefer it.

If I return a number or nil, I want to check `if ret`, not `if type(ret) == "number"`. The latter suggests that it might be some other type‡, the former that it might be missing entirely.

A language which provides a proper Boolean has no need for the association between zero and falseness, it is a conflation of levels which can lead to subtle bugs.

‡ yes, nil is also a type. I mean some other type of type.


Scheme recognizes all values except #f, including 0, '() which is how you say nil in Scheme, and "", as true in conditionals. As it should be; having to guess whether something might be falsy is a pain in my rear solved by having precisely one falsy value. As a bonus you can use #f as a sentinel value to denote the absence of a valid parameter or return value when a valid value would be some non-boolean object, and then check for validity with a simple (if x ...)


Lua does that too. Lua only has two values considered false, nil and false. All else, in a Boolean context, evaluates to true.


It makes a lot more sense when you consider that all values in Ruby would be pointers in C. And if x is a pointer to int, if(x) is still checking whether the pointer itself is null, not whether the int it points to is 0.
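In C terms, a quick sketch of the distinction:

  #include <stdio.h>

  int main(void)
  {
      int value = 0;
      int *x = &value;

      if (x)            /* tests the pointer itself: non-null, so taken */
          puts("x is non-null");
      if (*x)           /* tests the pointed-to int: 0, so not taken */
          puts("never printed");
      return 0;
  }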


erlang and erlang VM languages come with 0 being truthy, and the only falsey values are false and nil. I had a junior python programmer disagree with that and within 3 weeks of onboarding him we found a place where it made sense for 0 to be a truthy value. I believe it was a place where we had an options list. One of the options was a value that is settable to zero. In elixir:

    timeout = options[:timeout] || :infinity
if zero is falsy, you can be royally screwed with that code, if timeout should be settable to zero (definitely a thing).


Completely agree... but the most serious issue is that it does not read well.

Consider how you would read it: "if grass is green ..." is effectively shortened to "if grass ...".

This makes no sense.

I always use explicit conditions and consider "if(x)" an antipattern, and have done so since I was learning C back in the mid-eighties.


Unfortunately your example doesn't illustrate the point very well:

    if (grass_is_green) { ... }
vs

    if (grass_is_green != 0) { ... }
or

    if (grass_is_green == 1) { ... }
The one with the implicit comparison to zero is more readable


Certain industries also have guidelines that chime in on this issue. Where I work, we have to create defines for TRUE and FALSE, and everything needs to be compared to them. Essentially making things look like

if (bool_name == TRUE)

There is no ambiguity, and it forces you to handle exactly equal to 1 and not equal to zero separately.

if (bool_name == TRUE)

and

if (bit_field_value > 0)

are logically different even if they can be simplified to the same thing.


In C, comparing boolean-like variables to true and false constants is bad practice.

The comp.lang.c FAQ touches upon this:

http://c-faq.com/bool/bool2.html


I find myself both agreeing and disagreeing here. I put comparators compulsively everywhere, especially because I switch between multiple languages, but I also don't trust my memory and keep a close eye on the documentation, therefore I usually end up with

    if (isupper(c) != 0)


   if (isupper(c) != 0)
This code is illegible to me. How do you read it aloud? "if c is upper is not zero" ? It makes no sense. Compare it with the normal way of writing "if(isupper(c))" which is pronounced "if c is upper", which is perfectly clear, grammatical English.


The real tripper-upper with isupper (and siblings) is that the argument is of type int and must have values in the range 0 to UCHAR_MAX (those of unsigned char) or else EOF. Anything else is undefined behavior.
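Which is why the usual defensive spelling converts through unsigned char first (a minimal sketch; starts_upper is just an illustrative helper):

  #include <ctype.h>

  /* Avoid undefined behavior when plain char is signed: convert to
     unsigned char before handing the value to the <ctype.h> functions. */
  int starts_upper(const char *s)
  {
      return s[0] != '\0' && isupper((unsigned char)s[0]);
  }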


That is bad practice IMO because it makes it easy to accidentally introduce bugs and produces code that a reader will misinterpret, making it harder to find bugs.

The better approach is to always compare against FALSE, because that avoids the ambiguity of == TRUE only catching a single truthy value out of the millions of possible truthy values, contrary to the well-known semantics of C.

If you really need to distinguish between 1 and other truthy values, then just compare against 1 explicitly. Don't muddle the concept of truthiness.
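A small sketch of the difference (TRUE/FALSE defined locally here just for illustration):

  #include <stdio.h>

  #define TRUE  1
  #define FALSE 0

  int main(void)
  {
      int looks_set = 2;            /* truthy, but not equal to 1 */

      if (looks_set == TRUE)        /* not taken: only catches the value 1 */
          puts("== TRUE");
      if (looks_set != FALSE)       /* taken: matches C's usual truthiness */
          puts("!= FALSE");
      return 0;
  }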


All times where a Boolean value would be assigned, it must be explicitly checked and then set to either TRUE or FALSE.

The reason is for writing embedded code that doesn't depend on a specific compiler or architecture, which is important when code that was written in the 80s is still getting used everyday in new safety critical systems, where any kind of standard libraries are forbidden.

I agree it's not the best but it's definitely not uncommon.


oh my god, this is horrific. I suppose you are just trolling. There's no way this can be an accepted practice anywhere.


As someone who has actually worked with a codebase with plenty of such expressions, I suspect it was really an "accepted practice" at some point in time, stemming from a mindless interpretation of the "be explicit" guideline taken too far.

The worst part is that sometimes these redundancies grow and combine to create even more confounding expressions, frequently resulting in monsters resembling this:

    if ( ((((!(var)!= FALSE)) != TRUE)) )
    {
        return FALSE;
    }
    else
    {
        return TRUE;
    }


The one with the implicit comparison is more readable because it's closest to simulating the test of a Boolean variable.

The other two leave the reader with the lingering suspicion that the variable takes on two or more values. The latter more so than the former.


> Given two operands to a binary operator, find the first type in this list that matches one of the operands, then convert the other operand to that type.

Does anyone know for sure whether or not the Arduino compiler deviates from this standard, or ever did in previous versions?

Does multiplication qualify as a binary operator?

IIRC, I think I was compound multiplying the product of an int and a char into a long (see below), and I had to explicitly cast the char as int to get the right results, much to my surprise, as I had expected the char to get promoted to an int. But I could also have made some mistakes, it's been awhile...

  longType *= intType * (int)charType;
EDIT: Found my old code snippet, and it kinda invalidates my original question, what I was actually doing was

  longType *= charType * charType * charType
which leads the right side to evaluate first without promotion, thus modulating the end result by 256, which was not intended. I then cast all of them as long, but probably (certainly!...?) only one would've been enough.

EDIT2: Or at least, two of them have to be cast, else it might evaluate two chars first, modulating the result, before actually promoting to long to multiply with another long. (Or not, because it promotes to the most significant data type on the right side? Ugh... this is why I prefer to be extra verbose.)


Arduino boards (the ones based on AVR chips) use the GCC C++ compiler adapted for the AVR. The current IDE version 1.8.10 uses avr-gcc-7.3.0 (https://github.com/arduino/Arduino/blob/master/hardware/pack...). The previous IDE version 1.8.9 used avr-gcc-5.4.0.

The compiler is configured to follow the -std=gnu++11 variant. Here are the flags: https://github.com/arduino/ArduinoCore-avr/blob/master/platf...

If you use other Arduino-compatible boards (e.g. ARM chips, ESP8266 or ESP32), they will use a different C++ compiler and different settings. See for example:

* SAMD uses 'arm-none-eabi-g++' (https://github.com/arduino/ArduinoCore-sam/blob/master/platf...)

* ESP8266 uses xtensa-lx106-elf-gcc (https://github.com/arduino/esp8266/blob/master/platform.txt)

* ESP32 uses xtensa-esp32-elf-gcc (https://github.com/espressif/arduino-esp32/blob/master/platf...)

So Arduino programming is "just" C++ with additional libraries and an IDE. Though I wrote "just" in quotes because they've done a lot of work to make things easier and approachable for beginners, so I don't want to diminish their work.


Thanks, looks like there's an -mmcu flag that gets passed to the compiler as well, telling it what chip the code is intended for.

I think the answers I'm looking for are buried in those specifications. Gotta check them out, one of these days...

> So Arduino programming is "just" C++ with additional libraries and an IDE.

Considering that flag, I think it's something more (or less) than just plain C++...


Which side it is doesn't matter. Assuming signed operands, if either one is long, then the other one is also promoted to long; otherwise, both are promoted to int. So bool * bool or char * char or short * short is actually int * int, and the resulting type is also int.
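You can observe the promotion directly through the type of the expression (a minimal sketch):

  #include <stdio.h>

  int main(void)
  {
      char b = 100, c = 3;

      /* Both operands are promoted to int before multiplication, so the
         expression b * c has type int, not char. */
      printf("sizeof(b * c) = %zu, sizeof(int) = %zu\n",
             sizeof(b * c), sizeof(int));
      return 0;
  }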


Does the Arduino really promote two chars to int when performing on them?

It is an 8-bit chip, and that would take a number of cycles more than just leaving them alone...?


I'm not sure about Arduino specifically, but 8-bit usually refers to memory bus width, not the register size. IIRC Arduino has 16-bit registers? So I'd expect it to have a 16-bit int.

And depending on the exact expression, it doesn't have to actually do anything differently - these integer conversions describe the expected result of the operation, not how it's performed in assembly. So when you write something like this:

   char a, b, c;
   ...
   a = b + c;
the spec requires that b+c is treated as an int addition, but the resulting int is then assigned to a char variable, which basically truncates it. Furthermore, if b+c is larger than can fit in a char, then we have signed overflow (assuming that char is signed, which is usually the case), which is undefined behavior. Altogether, this means that the compiler can actually do an 8-bit addition here while remaining within the bounds of the spec.

OTOH if the result is assigned to an int variable, then yeah, it'd have to promote them - but that would also be far less surprising to you than if it didn't, no?

Where this can potentially lead to more overhead than you'd expect is when the temporary is used as input to another subexpression. E.g. suppose we have this:

   a = (b + c) / 2
Suppose b is 200 and c is 100. The spec requires that the result of this operation is 150, because b+c is performed on ints, and thus won't overflow. If it were an 8-bit addition of signed chars, assuming the typical wraparound behavior on overflow, you'd get 22.


When referring to the Arduino, unless explicitly stating a specific chip, most people are actually referring to the ATmega328P chip, which is a chip with an 8-bit architecture (registers are 8-bit; 16-bit data types are split between a high and a low register). It has 32 KB programmable memory, and 2 KB RAM.

Indeed, though, on the Arduino platform, an int is usually 16 bit, but not because that's the chip's native architecture, but because that's the range required. Though I can't say for sure, I highly doubt (and now am actually almost completely certain in my doubts) that it would promote two chars to ints to perform operations on them, and last night I also realized why, namely, there are counts for the number of cycles required to do operations on various data types [0].

If what you say about integer promotion is true, there would be no difference in clock cycles between an int and a byte, but there is.

So in essence, no, the Arduino is definitely somehow deviating from spec when it comes to promotion.

[0]: https://forum.arduino.cc/index.php?topic=92684.msg696420#msg...

EDIT: I just realized, I think I need to actually look into the (Atmel's) AVR specs, as the Arduino IDE is basically just a wrapper for that.


Interesting. Yeah, you're right, int has to be at least 16-bit by the spec.

But, again, this would only manifest in expressions where you mix operators, or try to assign the result to a variable of a broader type. So (a+b+c) could still be evaluated entirely in 8 bits without becoming non-conforming, if the target is a char, and ditto for (a-b-c) etc. So long as only the last 8 bits of the result matter, the promotion can be disregarded. It's only when you apply some other operator - multiplication, division, bit shift, or comparison - to the intermediate result that you can observe the increased width. But that doesn't actually happen all that often.

(Conversely, it might also mean that it is not conforming, but there are relatively few practical cases where it actually manifests.)


I have seen this happen with pointers on the x86_64 arch and the comparison operator, where two pointers appear to have the same value in the lower 4 bytes but differ in the upper 4 bytes.

I am not exactly certain, but what I think happens is that when you define your charType variable, the upper bytes in the word still hold the same random data, so when you multiply without casting, it is treating the char as an int and multiplying the whole 4 bytes. When you cast, it probably just multiplies the lowest byte and carries if necessary. I am also not certain if the compiler would optimize to using a shift operation if one operand is a power of 2.

Again, not exactly certain if this is the case, but it's similar to what I've noticed.


If I recall correctly, the arduino compiler is actually a C++ compiler so it will require more explicit casts, but it's been some time since I used arduino. Also the declaration:

char foo;

may be signed or unsigned depending on the architecture and compiler, which may lead to surprising results.
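If in doubt, the implementation will tell you which choice it made (a minimal sketch):

  #include <limits.h>
  #include <stdio.h>

  int main(void)
  {
      /* Whether plain char is signed is implementation-defined;
         CHAR_MIN reveals the choice. */
      printf("plain char is %s on this target\n",
             CHAR_MIN < 0 ? "signed" : "unsigned");
      return 0;
  }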


Multiplication is a binary operator; here binary just means it has two inputs. ~x would be a unary operator and x * y would be a binary one.


That shouldn't be necessary. The "usual arithmetic conversions" should cover that.


> Operator Precedence and Associativity

There's also the operator(7) manpage for quick reference.


I'm surprised a bit by the "Translation Steps" part at the end.

I assumed many of the steps listed happened during parsing, such as processing escape characters and converting newlines, but those are done beforehand.

Perhaps it is because C uses a preprocessor and some macros would not be possible if all the steps were performed while parsing?


Remember kids: Bitfields are not thread-safe!


Gotta one-up your pedantry here: kids, "atomic" and "thread-safe" are not the same thing. Bit operations are not in general atomic (though on some architectures they can be), and atomic operations are only a prerequisite for, not a guarantee of, thread safety. Thread safety is an architectural property. You can write terribly racy code using 100% atomic operations.


Nothing in C is thread-safe out of the box.


Certain operations are, such as static variable initialization (which I believe is done during program startup in C).


Static variable initialisation that is not initialisation of such variables local to functions is performed before the program starts running, so there cannot be multiple threads.

I can't remember the case in C, but in C++ the initialisation of a function local static variable is guaranteed to be thread-safe.


It doesn't matter that much whether the variable is global or local, but rather what the valid initializers are.

In C, all statics can only be initialized with compile-time constants. Thus, they can all be initialized before anything in the program starts running (indeed, there's no code initializing them in most implementations - they are just bytes in the data segment).

In C++, initializers can be arbitrary expressions, which means that an initializer for a global variable can spawn a thread, and that thread might still be running when another global is being initialized with a different expression that is not a compile-time constant. There are no thread safety guarantees wrt those kinds of conflicts, for either globals or locals. But for static locals with runtime initializers, the evaluation of initializer is deferred until execution actually reaches the definition of that variable - and thus C++ has a special provision for thread-safe synchronization if that happens on two different threads concurrently.
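To illustrate the C case described above (a minimal sketch):

  #include <stdio.h>

  /* File-scope statics in C take constant initializers only, so they are
     fully formed before main() runs: no initialization code, no races. */
  static int counter = 0;
  static const char *name = "example";

  int main(void)
  {
      printf("%s starts at %d\n", name, counter);
      return 0;
  }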


> Static variable initialisation that is not initialisation of such variables local to functions is performed before the program starts running, so there cannot be multiple threads.

you forget the dlopen case :)


> I can't remember the case in C, but in C++ the initialisation of a function local static variable is guaranteed to be thread-safe.

Only in C++11 onwards.


[flagged]


Letting go of C is not nearly as easy as you’d make it seem.


Show me a non-C, high level bootloader for a non-trivial system.



Will the AS/400 (now IBM i) written in PL/S do?

Or do you prefer the first RISC firmware written by IBM in PL.8?

Then again, maybe the VMS one, written in BLISS.


I'll agree when there's a good replacement.



