There is a need for volatile in portable C programming, in two circumstances:
1. When an asynchronous signal handler modifies a variable that is inspected by the interrupted mainline code (e.g. to set a flag indicating that the signal went off), that variable must be of type "volatile sig_atomic_t". (The article points to some outside discussion by Hans Boehm about this in relation to Unix, but it's not just for Unix; it's in ISO C.)
2. When a function, after saving a context with setjmp(), modifies automatic local variables, and then the context is restored with longjmp(), those variables must be declared volatile, or else they will not reliably have the correct values. (E.g. the longjmp may or may not restore the values which they had at setjmp time, or do it for some of those variables but not others.)
No matter how a C compiler treats volatile, in order to be ISO C conforming, if the program correctly uses volatile in the above situations, it must somehow work. Even if it is useless for anything else: threads, hardware, ...
#include <setjmp.h>
#include <stdio.h>

jmp_buf env;

void demo(void)
{
    volatile int i = 0;
    if (setjmp(env) == 0) {
        i++;
        longjmp(env, 1);
    } else {
        printf("i == %d\n", i);
    }
}
Here, the printf should produce i == 1, which is not required if the volatile is removed.
For instance, if i is located in a machine register, and setjmp/longjmp work by saving and restoring registers (or just some registers, including that one), the longjmp operation will restore the value that existed in the register at the time setjmp was called, which is zero.
If that's a problem in a given compiler, even if it has a garbage implementation of volatile, it has to pay attention to the fact that volatile is being used in setjmp code.
This existed before threads were introduced into the C language. It's only about a single control flow being interrupted by a signal and remaining suspended while the handler executes. As soon as the suspended code resumes, the updated sig_atomic_t value is visible to it, and the whole value (not some torn mix of the old and new values). It has nothing to do with threads or cores; if one thread runs a handler and another looks at the flag, it's not even relevant that a signal handler is involved.
The sig_atomic_t type is specified for that purpose and must be used, which rules out the uncertainty you're referring to in (1). It is not a type qualifier but a type specifier.
For 2. that seems like a defect in the specification of setjmp()? The equivalent "just works" for functions like pthread_mutex_lock() etc. Those calls implicitly add barriers to force reloading.
Think about how the compiler would compile it. The compiler must assume pthread_mutex_lock (like any opaque, otherwise-undecorated function) clobbers memory. So if the address of 'i' escapes the containing function, then the compiler must make sure it is correctly written to memory before the call. If it doesn't escape, the compiler can potentially leave it in a (callee-saved) register and there still wouldn't be any multithreading issues.
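A minimal sketch of that distinction, assuming a lock and a shared pointer defined elsewhere (the names are made up for illustration):

    #include <pthread.h>

    extern pthread_mutex_t lock;      /* assumed to be defined elsewhere */
    extern int *shared_counter;       /* the pointed-to object's address has escaped */

    void update(void)
    {
        int local = 0;                /* address never escapes this function */
        pthread_mutex_lock(&lock);    /* opaque call: assumed to read/write any
                                         object whose address has escaped */
        (*shared_counter)++;          /* must be reloaded and stored around the
                                         call, not cached across it */
        pthread_mutex_unlock(&lock);
        local++;                      /* may stay in a callee-saved register the
                                         whole time; no other thread can name it */
        (void)local;
    }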
setjmp is special: it can return twice and the effect is observable even for non-escaping variables, so values modified between the two returns that must survive the longjmp need to be forced to memory by declaring them volatile.
GCC specifically marks setjmp as returns_twice, which as far as I can tell prevents the surrounding function from being inlined and additionally treats all local variables as volatile around the setjmp call (forcing them to memory even if they don't escape), but that's a GCC extension.
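The attribute can also be applied to user code; a hedged sketch of declaring a custom setjmp-like primitive so GCC is equally careful around it (the names and context layout are invented):

    /* GCC/Clang extension, not ISO C: tell the compiler this call may return
       more than once, so locals modified after the first return aren't kept
       only in registers that the restoring jump would roll back. */
    typedef struct my_context { void *slots[16]; } my_context;

    extern int  my_save_context(my_context *ctx) __attribute__((returns_twice));
    extern void my_restore_context(my_context *ctx, int val) __attribute__((noreturn));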
I'm almost out of my depth here, but I believe this isn't (only) about escape analysis. A function call (like pthread_mutex_lock(), or any other) is running on the same thread.
I can see how failure to prove that a variable hasn't escaped a certain scope must prevent compiler reordering when a function of unknown implementation is called -- but not how that failure should require emitting memory barrier instructions.
However I realize that setjmp()/longjmp() isn't about threading either. What those functions do is quite weird.
The compiler won't emit memory barriers for pthread_mutex_lock. It only needs to make sure that all globally observable values (i.e. those whose address has escaped) are flushed from registers into memory. In practice this means that opaque function calls act as compiler memory barriers. Any additional required hardware memory barrier instruction is inside the implementation of pthread_mutex_lock itself.
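With GCC-style inline assembly you can spell that compiler-only barrier out explicitly; a small sketch:

    /* Compiler barrier only: forbids the compiler from caching or reordering
       memory accesses across this point, but emits no instruction, so it
       provides no hardware (inter-processor) ordering by itself. */
    #define compiler_barrier() __asm__ __volatile__("" ::: "memory")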
Yes. I can see how setjmp()/longjmp() would need additional/special treatment by escape analysis. Now I'm only wondering why the problem would be limited to (syntactically) automatic local variables. If the control flow (returns twice etc) is surprising to the compiler, couldn't that affect optimizations to non-local variables too?
It could, but there is no latitude about it specified in the standard. Only automatic locals are allowed to turn to pixie dust after a longjmp, and only if they have been modified since the setjmp.
Thus, if other things are a problem, the compiler just has to slow down in that section of code where setjmp is being used and not do those optimizations (without being told not to via volatile).
By the way, I have run into a problem, quite recently, where a setjmp-like routine (not setjmp itself) caused a problem with access to a global variable, under gcc.
This was caused by -fPIE builds, enabled in some toolchains and distros.
The global variable in question was accessed, under the hood, via a global offset table or something like that. Basically, a layer of indirection due to the position independence of the code. The invisible pointer variables needed to access the global are, of course, themselves local.
A problem happened whereby code executed after the setjmp-like call had prepared the value of a hidden local used to access that global variable. When the longjmp-like function was executed, this value was trashed. Yet the code assumed the value was stable and that it could access the global variable through that address. The result was a mysterious crash.
Not sure if the issue is reproducible with real setjmp and longjmp.
Calls to functions like pthread_mutex_lock() don't magically add memory barriers. From the perspective of the compiler they are regular function calls. They "work" because of escape analysis, but that applies to any function that is called.
It's true that the C specification could have said that the setjmp() function is "special". Then the only way to implement it is to spill most local variables to the stack before the call. I suppose the C authors didn't want to introduce this special case (is there any other function that the C compiler is required to treat specially?)
Pretty sure that pthread_mutex_lock() etc. have to add memory barriers, in some way or another, depending on the architecture. Regular function calls shouldn't require full inter-thread memory synchronization just because escape analysis doesn't know the callee.
However setjmp()/longjmp() are different beasts entirely, and the problem here isn't related to multi-threading and thus not related to hardware memory ordering.
Those memory barriers are in the function implementation. They don't exist at the call site. Again, from the perspective of the compiler it's a regular function call.
> However setjmp()/longjmp() are different beasts entirely
Yes, exactly, they are not "equivalent" to pthread_mutex_lock() at all, which is what you suggested in the beginning. A call to pthread_mutex_lock() is a regular function call as far as the compiler is concerned.
Nope! The memory synchronizing properties of those functions are at the specification level. POSIX says so, and so the implementation has to make it so, somehow. That could involve recognizing those functions in the compiler. Usually an external function call is enough to act as a compiler barrier (the compiler won't reorder accesses around those locking calls), so the function then just has to contain the hardware memory barriers.
It's not even clear what part you're dismissively replying "Nope!" to.
I'll be explicit: on POSIX systems that implement the POSIX threads extension there is a header file called pthread.h that declares a regular function called pthread_mutex_lock() and that function can be called as a regular function by an ISO C compliant compiler.
(POSIX also allows defining macros that can achieve the same effect, possibly more efficiently, but pthread_mutex_lock() et al. have to exist also as regular function definitions.)
So the point remains: pthread_mutex_lock() works not because the C compiler treats it specially. That makes sense since the C standard doesn't even mention it. Unlike setjmp() it's not part of the C standard, and it doesn't need to be, because none of its behavior requires compiler support, beyond what is already required by the platform ABI.
Here's a better analogy. The equivalent just works if we use C++ exception handling instead of setjmp. You can change variables after a try, and those values will be reliably observed in the catch.
setjmp and longjmp are a module that you can write in a small amount of assembly language, without changing anything in the compiler to support them. (Provided it has a bona fide volatile.)
Exception handling is a fairly complex feature requiring supporting in the compiler, with various strategies that have different performance trade-offs.
> Here's a better analogy. The equivalent just works if we use C++ exception handling instead of setjmp. You can change variables after a try, and those values will be reliably observed in the catch
For some reason, this is an older version of my comment. I am sure I updated it to remove "gaping omissions", and noted that the article references a discussion by Hans Boehm of signal handlers in Unix programs. It is not Unix-specific though; ISO C specifies signals to some extent, including that bit with volatile sig_atomic_t. It can exist on any platform.
In C++ there's actually a lot more freedom. You can access non-atomic, non-volatile std::sig_atomic_t variables as long as you don't violate the data race rules.
Not really. The two languages are similar on this. The 'volatile' before 'sig_atomic_t' is still required in C++ for the same reasons as C. You can access non-volatile sig_atomic_t variables in C too, but in both languages that's not enough for the signalling that type exists for, so you have to use 'volatile sig_atomic_t', in C and C++.
An example of scenario 1) is the loop below. The 'volatile' is required in C++ the same as in C. If you interrupt the loop below in C++ with a signal handler that updates 'flag', the loop is not guaranteed to exit unless 'flag' is declared volatile.
You can test this easily. Just now I compiled the C++ code below with Clang/LLVM with -O on a Mac, and GCC with -O on Linux. On both systems and compilers, Control-C fails to stop the process if 'volatile' is not used. If compiled without -O, Control-C always interrupts the process, but you can't rely on behaviour of disabled optimisations.
#include <signal.h>

#if 0   /* change 0 to 1 to add the volatile qualifier */
volatile
#endif
sig_atomic_t flag = 0;

void handler(int sig) {
    flag = 1;
}

int main(void) {
    signal(SIGINT, handler);
    while (!flag) { /* Spin waiting for flag */ }
    return 0;
}
In C++ this violates the data race rules both with and without volatile, but because it's sig_atomic_t it has a special carve out _only_ if it's volatile. See https://eel.is/c++draft/basic#intro.races-22
C however states:
> When the processing of the abstract machine is interrupted by receipt of a signal, the values of objects that are neither lock-free atomic objects nor of type volatile sig_atomic_t are unspecified, [...] The representation of any object modified by the handler that is neither a lock-free atomic object nor of type volatile sig_atomic_t becomes indeterminate when the handler exits.
This wording is not present in C++, as it instead defines how signal handlers fit into the memory model.
This means that (with adjustments for C atomics):
#include <atomic>
#include <csignal>

int val = 0;
std::atomic<bool> flag{false};

void handler(int sig) {
    if (!flag) {
        val = 1;
        flag = true;
    }
}

int main(void) {
    std::signal(SIGINT, handler);
    while (!flag) { /* Spin waiting for flag */ }
    return val;
}
The post is still accurate, but in 2011 C and C++ added atomics, which are a more portable alternative to uses of volatile for atomicity. They can be more efficient in some cases than the locks suggested by the post, especially in CPUs with higher core counts. (Note that dual-core consumer CPUs were around by 2010 but had only existed for a few years. Linux only finished removing the Big Kernel Lock in 2011.)
C11 did add _Atomic, BUT it is not more portable than using volatile.
In C11 any type of any size can be given the atomic qualifier. That means you can have a 50-byte struct that is atomic. No hardware has a 50-byte atomic instruction, so that is not implementable with plain atomic instructions. The standard gets around this by letting an implementation use a hidden mutex to guarantee that the operations will be atomic.
The problem with this is Windows. Windows lets an application dynamically load shared libraries (DLLs). This breaks the C11 atomic model. Let me illustrate with an example:
Application A creates an atomic data structure, and the implementation creates a mutex for it. Application B does the same thing. Application A wants to share this data structure with DLL X. It then has to share its mutex with the DLL so that the DLL and the application use the same synchronization primitive. Now Application B wants to do the same thing; the problem is that DLL X can't use Application B's mutex, because it is required to use Application A's mutex.
C11's atomics will never be implemented on Windows because they can't be! Besides, all major compilers support intrinsic atomics using volatile that are nearly identical (and in some ways better understood), so that's what I recommend using. Linus has indicated that he thinks the C11 concurrent memory model is broken, so the kernel will continue to use volatile and intrinsics.
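To see the lock-based fallback being described, here is a minimal sketch (assuming a C11 toolchain with <stdatomic.h>):

    #include <stdatomic.h>
    #include <stdio.h>

    /* An over-sized atomic object that no mainstream hardware can handle with a
       single instruction, so the implementation must fall back to some hidden
       locking scheme, versus a word-sized atomic that is typically lock-free. */
    struct blob { unsigned char bytes[50]; };

    _Atomic struct blob shared;
    _Atomic int         counter;

    int main(void)
    {
        printf("50-byte struct lock-free: %d\n", atomic_is_lock_free(&shared));
        printf("int lock-free:            %d\n", atomic_is_lock_free(&counter));
        return 0;
    }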
On Linux, applications are required to load the libatomic shared library to correctly implement the full standard (on some older platforms like ARM, this is true even for small values since the hardware didn’t have an exchange instruction). There are a lot of operations that atomic int can do that volatile can’t do or will do incorrectly (such as seq-cst ordering or exchange). Why is this impossible for Windows? That sounds like a compiler implementation flaw, not an OS level impossibility.
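For example, a minimal sketch of operations that a plain volatile int cannot express but _Atomic int can:

    #include <stdatomic.h>

    _Atomic int ticket = 0;

    int take_ticket(void)
    {
        /* single atomic read-modify-write; a volatile int++ compiles to a
           separate load and store and can lose concurrent updates */
        return atomic_fetch_add_explicit(&ticket, 1, memory_order_seq_cst);
    }

    int swap_ticket(int new_value)
    {
        /* atomic exchange; there is no volatile equivalent */
        return atomic_exchange(&ticket, new_value);
    }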
Because Windows lets a process link to a library that has already been initialized by another process, so the process can't share its mutex during initialization, because the initialization has already happened.
Volatile gives you some desired properties when multi-threading: it is observable and therefore order-dependent, but it does not have release/acquire semantics and it is not required to synchronize changes to other processors, so you need to use atomic intrinsics in conjunction with volatile types. Volatile alone is NOT enough to be thread safe.
How do you link that? I thought I knew the API pretty well, but that is a new one for me.
volatile is only ordered with respect to other volatile accesses and is otherwise UB when combined with non-atomics. Whereas an atomic is well-defined for ordering with respect to other operations too. There are even some operations (e.g. seq-cst on a set of memory locations) which are known to be incorrectly executed in certain scenarios if modeled with volatile+fence, and require the use of atomics.
Volatile is "observable" and all observable behaviour is order dependent. (so for instance a volatile and printef can not be reorderd, because both are obeservable). Howqever this is ONLY in a single threaded context. This is why you need a atomic intrincic operation to operate on a volatile for it to be thread safe.
Even on hardware where loads and srtores are atomic (x86), you still need the atomic ops to manage ordering.
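A hedged sketch of that volatile-plus-intrinsics style, using the GCC/Clang __atomic builtins (which accept volatile-qualified pointers); this is compiler-specific, not ISO C:

    volatile int ready = 0;

    void publish(void)
    {
        /* store with release ordering */
        __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
    }

    int consume(void)
    {
        /* load with acquire ordering */
        return __atomic_load_n(&ready, __ATOMIC_ACQUIRE);
    }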
What is “observable”? Only atomic seq-cst is order preserving, and even then only if there is no data races, and only if the compiler thinks it could even be observed by another thread. Otherwise, the compiler (and CPU) can and will choose to reorder operations. A printf call could even be reordered if the compiler could observe that it does not contain a volatile access. The volatile qualifier forces the operation to occur even if the compiler thinks the result would not be observable. But unless you work a lot with signals or memory mapped device registers, how is volatile even relevant, especially when the atomic ops are required anyways?
Volatile structs are also not guaranteed atomic, so you do not lose any portability by changing questionable volatile atomic word sized variables to _Atomic (except portability to pre-C11 compilers of course, but then you can #define _Atomic volatile and pray for the best).
Sharing the memory of a loaded library is not a problem though. Every major operating system has done that for decades. That is just a consequence of copy-on-write pages though and doesn’t affect process isolation. Unless you meant something different than that?
There is also fork on POSIX systems, which is incompatible with using atomics in the child process for basically this same reason of accidentally, partially sharing a loaded library between two processes. Most libc documentation will state that only async-signal-safe calls are permitted after fork until exec for this reason.
The difference is that in Windows, if two applications load the same DLLs, they do not just share code, they also share state and data. If a DLL has a global variable, it can be accessed by both applications.
This means that you can use DLLs as a mechanism to communicate between multiple applications.
Really? I can't believe that could possibly be true out of the box [1]. It would be a massive violation of process separation: buggy programs would be able to take down other processes, which is not something that really happens after WinME. I have 0 knowledge about Win32, do you have a pointer to some docs describing this behavior?
[1] of course even on unix you can mmap state on demand if you want to share between processes, but it is absolutely not the default.
Yeah, now that quelsolaar clarified, I am fairly certain that claim is not true, for exactly the reasons you describe. Of course, there is also the practical example of the mingw compiler, which does implement C11 for Windows, as a counter-example to their claim that it cannot be done.
So, there is probably a kernel of truth, as far as I know the DLL model in Windows is equivalent of RTLD_LOCAL, so global variables are actually instantiated per DLL (but of course not shared cross process), which for example makes allocation behave differently. So a spinlock pool between the main program and a DLL wouldn't be shared, making cross-DLL emulated atomics problematic. But I guess there are ways around that or simply the expectation is not to share non-address-free atomics across DLL boundaries.
In my experience, being similar to RTLD_LOCAL avoids a whole slew of sharing/unique accidents compared to the pile of hacks that is ELF. It is sometimes both the hardest and easiest platform to work with since it is the conceptually most consistent but also therefore the most primitive linker. But that is just not an issue, as the compiler must work anyways to ensure atomics work correctly per the platform ABI.
Indeed the problem is otherwise not restricted just to memory sharing: even the particular CPU instructions chosen can mean one compiler is incompatible with the output of a different compiler when it comes to atomics even when locks are not involved (the specifics of which barriers are used and where often mean there are multiple valid, but mutually incompatible ways, to emit atomic instructions)
> The post is still accurate, but in 2011 C and C++ added atomics, which are a more portable alternative to uses of volatile for atomicity.
Atomics and volatile solve different problems, though. Atomics ensure a read or write completes uninterrupted. Volatile ensures that a read or write is never optimised away.
I think C11 atomics can be optimised away (for example, reading a value twice in a row might result in only a single actual read).
> Atomics and volatile solve different problems, though.
Yep, that's why atomics are only an alternative to uses of volatile _for atomicity_. For the original use case of accessing hardware registers, volatile is still the correct choice.
It is indeed possible for C11 atomics to be optimized, although interestingly, the three major compilers do very little such optimization. This paper [1] lists some optimizations that are implemented in LLVM and some that aren't; it's from 2015 but from some quick testing it seems like not much has changed since.
If optimizing repeated atomic loads is indeed allowed, waiting for a signal by spinning on an atomic load could loop forever. Yet I have the feeling most people consider such code to be valid. Are they wrong?
While it's true that atomics can solve the issue of atomic operations (increment, compare-and-swap, ...) that volatile doesn't try to solve, they are also required to solve the issue that volatile tried but failed to solve correctly: you have no ordering guarantee between volatile and non-volatile memory accesses (see problem no. 5 of the article).
In that way, atomics complement volatile instead of being orthogonal to it, because they provide the ordering semantics missing from volatile. And they don't replace it completely because, as you said, atomic accesses can still be optimized away.
So in most use-cases of volatile, you actually want to declare your variables atomic+volatile along with the correct memory_order on your atomic operations.
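A sketch of that combination, assuming C11 <stdatomic.h> (the variable names are made up):

    #include <stdatomic.h>

    /* volatile so the accesses are never elided or merged; _Atomic plus explicit
       memory orders so surrounding non-atomic data is ordered relative to the flag */
    volatile atomic_int data_ready = 0;
    int payload;   /* ordinary, non-atomic, non-volatile */

    void producer(void)
    {
        payload = 42;
        atomic_store_explicit(&data_ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&data_ready, memory_order_acquire))
            ;   /* spin until published */
        return payload;   /* sees 42, thanks to the release/acquire pairing */
    }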
In the Linux kernel, instead of marking variables/types as volatile you mark _accesses_ as volatile. There's a pair of macros, READ_ONCE/WRITE_ONCE, that temporarily cast the pointer for you. I think this is a better way to use volatile.
Even then I think it's rarely useful outside of x86-specific code (where the CPU gives you quite a lot of memory ordering guarantees). Would be interesting to check how often it gets used elsewhere.
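A simplified sketch of the idea (the real kernel macros do quite a bit more, e.g. handling non-scalar sizes); typeof is a GCC extension:

    #define READ_ONCE(x)        (*(const volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, val)  (*(volatile typeof(x) *)&(x) = (val))

    int shared_counter;   /* hypothetical: declared without volatile */

    int sample(void)
    {
        /* the volatile qualifier is applied only at the point of access */
        WRITE_ONCE(shared_counter, 1);
        return READ_ONCE(shared_counter);
    }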
This is a very good post; too many people (even myself, sometimes) forget that volatile doesn't mean that the statement containing it cannot be reordered.
[EDIT: the one thing he missed, which I would have liked to know, is about using volatile with int (or sig_atomic_t) as an "eventually consistent" value, for example one global `sig_atomic_t end_flag = 0;`, a single writer (a SIGINT handler to set it to 1), and many threads with `while (end_flag == 0) { ... }` loops.
I've been using this pattern for a while with no obvious problems - access to `end_flag` can be rearranged by the compiler, barriers are irrelevant, the value can be corrupted by a race on every read and it won't matter - the thread will get the eventual value of end_flag on the next loop and end.]
> one global `sig_atomic_t end_flag = 0;`, a single writer (a SIGINT handler to set it to 1), and many threads with `while (end_flag == 0) { ... }` loops.
While unlikely to cause problems, this is a data race (a set of at least two concurrent accesses, of which at least one is not an atomic access and at least one is a write) and therefore constitutes undefined behavior.
sig_atomic_t is only safe to use from one thread, where concurrency is given by a signal handler.
I suspect sig_atomic_t does work fine when we're talking about POSIX signals, but OP was probably thinking more of embedded programming and hardware interrupt handlers, which don't conform to POSIX signal semantics.
> I suspect sig_atomic_t does work fine when we're talking about POSIX signals, but OP was probably thinking more of embedded programming and hardware interrupt handlers, which don't conform to POSIX signal semantics.
It's not the sig_atomic_t that I think is wrong (could be plain int), it's the "Is it safe to have one writer to a zero-initialised volatile value, and many readers checking for non-zero of that value?"
Now I wouldn't use this and expect correctness in the value, but even when the value that is read is corrupted because that single write did not finish (it's zero, one or something else), it will be non-zero eventually, and so the thread will end.
Technically, using volatile between threads is a data race and therefore UB [1]; the guarantees made around sig_atomic_t only apply between a thread and a signal handler on the same thread.
Though, I'd argue that the no-optimization guarantee of volatile actually does justify reasoning of the form "it's not undefined behavior because the hardware guarantees it", which is a mistake anywhere else in C. On essentially all architectures, loads and stores of volatile integers act the same way as loads and stores of atomics using memory_order_relaxed (or stronger, depending on the architecture). So it may be legal to rely on volatile being atomic, as long as you don't expect the code to be compiled on some hypothetical architecture that doesn't have this feature.
At the hardware level, there is no distinction between a regular load, a volatile load, and a relaxed atomic load (assuming small sizes and optimal alignments). But the compiler can still do things that break your code or that miss optimizations when given incorrect ordering annotations.
It seems to me that volatile ends up falling almost between acquire and relaxed ordering, in terms of the behavior of most CPUs and compilers, for small aligned values: it doesn’t synchronize any other operation but does prevent folding of consecutive operations.
A place that volatile shows up in C today that is stunningly handy is when writing eBPF programs.
In eBPF you have to appease a verifier which is trying to prove safety and liveness properties of the compiled program. To prove safety properties you often need to ensure that some offset into a buffer (bpf map) will be in bounds. Even if you judiciously sprinkle such bound checks into your code, the compiler may eliminate them entirely or perform them on some different register or stack value it knows to be semantically sufficient. Unfortunately the verifier is not as smart as the compiler.
Using volatile to reload some offset just before bounds checking it and using it to index the map is a very reliable approach to getting code to verify.
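A hedged sketch of the pattern (names, sizes, and the surrounding eBPF scaffolding are invented or omitted): the volatile reload forces the compiler to keep the bounds check on the very value used for the access, which is what the verifier needs to see.

    #define DATA_SIZE 256

    static int lookup(unsigned char *data, unsigned int offset_in)
    {
        volatile unsigned int offset = offset_in;   /* reload point */
        unsigned int idx = offset;                  /* value the verifier tracks */
        if (idx >= DATA_SIZE)                       /* bounds check on that value */
            return -1;
        return data[idx];
    }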
I'm not entirely convinced that the ninth case is necessarily a miscompilation. My understanding of the standard is that C99 6.7.3 p6 and C11 6.7.3 p7 allow exactly this behavior. One can argue (as the author does in the linked paper) that the last sentence is about what it means hardware-wise, but completely optimizing away statements that have no effect on the abstract machine level ('x;', 'x = x;', ...) is something that seems not only permitted but even reasonable.
But it does not make much sense to actually want to force a read for side-effect in portable code. For a MMIO register of some MCU/SoC the code is not going to be portable in any meaningful sense anyway, and for things like portable OS drivers for weird hardware (like VGA...) you have to use some kind of OS-provided macro that does the right thing wrt. caching and barriers anyway.
In a discussion like this you have to mention that Microsoft’s compiler extended volatile to mean atomic[1], although its default behavior depends on target ISA apparently. Regardless, just use c11/c++11 atomics at this point.
Yes but they consider the additional semantics a mistake:
"we strongly recommend that you specify /volatile:iso, and use explicit synchronization primitives and compiler intrinsics when you are dealing with memory that is shared across threads."
/volatile:iso is the default for ARM as the extended semantics would be extremely penalising.
Reminded me of this article: http://www.ddj.com/cpp/184403766 - volatile: The Multithreaded Programmer's Best Friend
By Andrei Alexandrescu, February 01, 2001
As a former embedded engineer on older Motorola and ARM processors, I have seen reordering happen, and more than once I had to check the generated assembly code. The other items more or less make sense if you don't expect too much from your compiler, for example using volatile to get atomicity.
Using volatile on multi-threaded code is ok as long as you know what you are doing, for example kicking a watchdog at a defined physical address could be fine from different threads.
I've been bitten by reordering. In my case, the toolchain developers implemented the reordering step in the assembler as an extra optimization step (on by default of course), so I had to disassemble the binary to even find the problem. They had redefined the assembly language semantics to require "volatile" keywords wherever you needed ordering maintained. I turned that particular optimization off.
(Meta: As already pointed out by @comex, this post is from 2010 and it would be helpful with a tag in the title to make that clearer.)
That said, it's an awesome post (not surprising considering the source). I found the initial set-up explaining the concept of C's abstract machine very succinct and nice, it's something I would wish more people discussing the language to be (well) aware of.
I can add a bit of historical context here. John wrote that during the development of WINAVR, the first GCC compiler for AVR. The discussions can be found in the AVR-GCC mailing list archives.
When the 'Small' optimization is used in AVR GCC it is aggressive and often leads to broken code if 'volatile' is not properly used.
AVR GCC, using Small optimization, would remove any variable that has no side effects from the compiler's view. Setting a value in an interrupt handler would be outside of the view of the compiler's abstract machine, so 'flag' variables were often removed. Changes in hardware registers fall into the same category.
These removals result in broken code.
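A generic sketch of the embedded pattern being described (hypothetical names; real AVR code would use <avr/interrupt.h> and the ISR() macro): without volatile, the optimizer sees no writer of the flag in the visible call graph and may delete the variable and the loop test entirely.

    volatile unsigned char flag = 0;

    void timer_interrupt_handler(void)   /* installed as an interrupt vector */
    {
        flag = 1;
    }

    void wait_for_tick(void)
    {
        while (!flag)
            ;   /* with volatile, flag is re-read on every iteration */
        flag = 0;
    }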
Volatile is never a replacement for atomics or proper locks/mutexes/semaphores.
If hardware such as an interrupt handler or hardware register is not involved, then using volatile is creating a race condition, which may or may not ever become apparent.
Erich Styger wrote more recently about volatile here.