Why/how are you assuming the reader will read from the cache? The very definition of volatile means that this read will not come from the cache! I'd check the generated ASM before assuming that's how it works. And read this: http://lwn.net/Articles/233479/
Because it does. Are you sure you aren't misunderstanding the meaning of volatile in C99 (and the C++03 standard has the same semantics for volatile as C99, it even refers back to C99 in a footnote)?
The C standard has no notion of the memory hierarchy and therefore does not know or care about the processor cache. Volatile means that reads and writes of volatile variables must strictly follow the rules of the abstract machine, and may not bypass these rules as an optimization. When people say that volatile means the value may not be cached, they mean that the variable MUST be read from or written to every time it is accessed. It may not be kept in a register, and the compiler may not otherwise skip the actual variable access as an optimization. So a memory read or write must be issued, but the existence of processor caches (L1, L2, L3) is outside the scope of the abstract machine and is a platform detail.
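To make the "must be read every time" point concrete, here is a minimal sketch. The names stop_requested and poll_until_stopped are made up for illustration and are not from the code under discussion. It only shows what the compiler is obliged to do; it says nothing about where the hardware serves the load from.

```cpp
#include <cassert>

// Hypothetical flag for illustration; not taken from the original code.
volatile bool stop_requested = false;

// Spins until the flag is set or the iteration cap is reached, and returns
// how many iterations ran. Because stop_requested is volatile, the compiler
// has to issue a real load on every pass; it may not read the flag once,
// keep it in a register, and reuse that value for the rest of the loop.
unsigned poll_until_stopped(unsigned max_iterations) {
    assert(max_iterations > 0);  // a zero cap would never read the flag
    unsigned iterations = 0;
    while (!stop_requested && iterations < max_iterations) {
        ++iterations;
    }
    return iterations;
}
```

Whether each of those loads is served from L1, L2, or main memory is decided by the CPU and does not show up in the source.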
On x86 and x86-64, reading or writing memory with ordinary (temporal) load and store instructions lets the CPU, if it feels like it, keep the value in the processor cache. As far as C knows or cares, it's in memory, but it's up to the CPU to decide whether it actually is. In practice, if the variable is accessed often, it will sit in L1 or L2 cache and reading it will be quite fast. When writing to it (since it is volatile, a write is a store to memory instead of a mov to a register), the cache coherence protocol invalidates the line in the other cores' caches, so their next read misses and fetches the new value.
Note that none of this is visible in the generated ASM (unless the compiler generated non-temporal loads/stores, in which case the processor cache would be bypassed), as it is applied transparently by the processor.
For the record, before I wrote this code, I researched it a lot. Also, Arch Robison, the architect behind Intel Threading Building Blocks, confirmed in the comments to his article on volatile that this works (at least on Intel platforms). Furthermore, nothing I've read in the standard or in other articles (such as the one you linked) contradicts my assumptions. Everywhere I have tested this, it works as expected. I am interested in hearing if I overlooked something fundamental, though, especially when porting to ARM (which, for example, may require additional instructions to make writes visible to other cores, something volatile will NOT do).
The C standard only states:
An object that has volatile-qualified type may be
modified in ways unknown to the implementation or have
other unknown side effects. Therefore any expression
referring to such an object shall be evaluated strictly
according to the rules of the abstract machine, as
described in 5.1.2.3. Furthermore, at every sequence
point the value last stored in the object shall agree
with that prescribed by the abstract machine, except as
modified by the unknown factors mentioned previously.)
What constitutes an access to an object that has
volatile-qualified type is implementation-defined.
The only thing I can see that you'd have to worry about is the reads and writes to "flag". Technically, it's only assumed to be an atomic operation and while that's generally not a bad assumption for 8-bit variables on x86 SMP, it'll probably give you some trouble on Itanic and ARM. You may need to use an explicit memory barrier.
On x86 and x86-64 SMP systems, the hardware already gives ordinary loads acquire ordering and ordinary stores release ordering, so volatile reads/writes on simple data types generally behave as if there were an implicit memory barrier, but I'm fairly sure this doesn't hold for ARM. On the compiler side, Visual Studio 2005 and up treats all volatile reads as having acquire semantics and all volatile writes as having release semantics.
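For comparison, the acquire/release behaviour described above can be requested explicitly and portably with C++11 atomics. This is only a sketch under the assumption that a C++11 compiler is available, which the C99/C++03 code under discussion may not have. Mailbox, publish, and try_consume are hypothetical names, not anything from the original code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-shot hand-off between a writer and a reader thread.
struct Mailbox {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // the flag that guards payload
};

// Writer side: fill in the data first, then publish with a release store.
// The release store keeps the payload write from being ordered after it.
void publish(Mailbox& box, int value) {
    assert(!box.ready.load(std::memory_order_relaxed));  // publish only once
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Reader side: if the acquire load observes true, the payload written before
// the matching release store is guaranteed to be visible as well.
// Returns false when nothing has been published yet.
bool try_consume(Mailbox& box, int& out) {
    if (!box.ready.load(std::memory_order_acquire)) {
        return false;
    }
    out = box.payload;
    return true;
}
```

On x86 and x86-64 these acquire loads and release stores normally compile to plain mov instructions, while on ARM the compiler adds whatever barrier or ordered load/store instructions the target needs.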
If you have access to pthreads, a condition variable would do the trick, but it's probably overkill. It all depends on whether you want to let pthreads worry about the portability and all the CPU-dependent ifdefs and ifndefs, or whether you're willing to code the membars in yourself.
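A rough sketch of the condition-variable route, for reference. The names flag_mutex, flag_cond, flag_raised, raise_flag, and wait_for_flag are made up for illustration; the pthread calls themselves are the standard POSIX ones. The mutex provides the memory ordering, so no volatile or explicit barrier is needed here.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical names; statically initialised so no setup call is required.
pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t flag_cond = PTHREAD_COND_INITIALIZER;
bool flag_raised = false;  // protected by flag_mutex

// Signalling side: set the flag under the lock, then wake one waiter.
void raise_flag() {
    pthread_mutex_lock(&flag_mutex);
    flag_raised = true;
    pthread_mutex_unlock(&flag_mutex);
    pthread_cond_signal(&flag_cond);
}

// Waiting side: sleep until the flag is set, then consume (clear) it.
// The while loop guards against spurious wakeups.
void wait_for_flag() {
    pthread_mutex_lock(&flag_mutex);
    while (!flag_raised) {
        int rc = pthread_cond_wait(&flag_cond, &flag_mutex);
        assert(rc == 0);
        (void)rc;  // silence unused-variable warnings when NDEBUG is set
    }
    flag_raised = false;
    pthread_mutex_unlock(&flag_mutex);
}
```

If the flag is already raised when wait_for_flag is called, it returns without blocking, so a signal that arrives before the wait is not lost.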
If/when I port to ARM I will probably conditionally compile to use atomic operations (or at least an explicit memory barrier) when on ARM, but for my x86/x86-64 code, since I don't need to, I'd rather avoid it. As you said, it would be overkill.
Since this is the only case where I do something strange, I don't mind handling platform-specific code myself. The rest of the codebase delegates such things to libraries.
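The conditional compilation mentioned above might look roughly like the sketch below. Everything in it is an assumption for illustration: the FLAG_NEEDS_HW_FENCE macro, the helper names, and the use of C++11 fences as the barrier (the real code may use compiler intrinsics instead). It also leans on platform behaviour rather than on the standard, since a volatile bool is not an atomic object as far as C++11 is concerned.

```cpp
#include <atomic>
#include <cassert>

// Predefined macros for 32/64-bit ARM on GCC/Clang and on MSVC.
#if defined(__arm__) || defined(__aarch64__) || defined(_M_ARM) || defined(_M_ARM64)
#define FLAG_NEEDS_HW_FENCE 1
#else
#define FLAG_NEEDS_HW_FENCE 0
#endif

volatile bool flag = false;  // hypothetical stand-in for the "flag" discussed above
int shared_value = 0;        // data handed over together with the flag

inline void release_barrier() {
#if FLAG_NEEDS_HW_FENCE
    std::atomic_thread_fence(std::memory_order_release);  // hardware barrier
#else
    std::atomic_signal_fence(std::memory_order_release);  // compiler-only barrier
#endif
}

inline void acquire_barrier() {
#if FLAG_NEEDS_HW_FENCE
    std::atomic_thread_fence(std::memory_order_acquire);  // hardware barrier
#else
    std::atomic_signal_fence(std::memory_order_acquire);  // compiler-only barrier
#endif
}

// Writer: store the data, make sure it cannot move past the flag, set the flag.
void writer(int value) {
    assert(!flag);  // one-shot hand-off in this sketch
    shared_value = value;
    release_barrier();
    flag = true;
}

// Reader: returns false until the flag is set; afterwards reads the data.
bool reader(int& out) {
    if (!flag) {
        return false;
    }
    acquire_barrier();
    out = shared_value;
    return true;
}
```

Even on x86 a compiler-only barrier is kept, because volatile alone does not stop the compiler from moving the non-volatile shared_value access across the flag access. As far as I know, acquire and release thread fences emit no instruction on x86 with GCC and Clang, so using std::atomic_thread_fence unconditionally would probably cost nothing there and remove the need for the #if.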