Programming Language Memory Models

raphlinus · on July 6, 2021

A GPU followup to this article.

While on CPU sequentially consistent semantics are efficient to implement, that seems to be much less true on GPU. Thus, Vulkan completely eliminates sequential consistency and provides only acquire/release semantics[1].

It is extremely difficult to reason about programs using these advanced memory semantics. For example, there is a discussion about whether a spinlock implemented in terms of acquire and release can be reordered in a way to introduce deadlock (see reddit discussion linked from [2]). I was curious enough about this I tried to model it in CDSChecker, but did not get definitive results (the deadlock checker in that tool is enabled for mutexes provided by API, but not for mutexes built out of primitives). I'll also note that using AcqRel semantics is not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it's likely it would need to go to SeqCst.

Thus, I find myself quite unsure whether this kind of spinlock would work on Vulkan or would be prone to deadlock. It's also possible it could be fixed by putting a release barrier before the lock loop.

We have some serious experts on HN, so hopefully someone who knows the answer can enlighten us - mixed in of course with all the confidently wrong assertions that inevitably pop up in discussions about memory model semantics.

[1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...

[2]: https://rigtorp.se/spinlock/

raphlinus · on July 6, 2021

Also: it remains difficult to fully nail down the semantics of sequential consistency as well, especially when it's mixed with other memory semantics. Very likely next time Russ updates his article he should add a reference to Repairing Sequential Consistency in C/C++11[1].

[1]: https://plv.mpi-sws.org/scfix/full.pdf

rsc · on July 7, 2021

Thanks for the GPU insights and links (and the paper link below)!

I based my claim about Rust from https://doc.rust-lang.org/nomicon/atomics.html. ("Rust pretty blatantly just inherits the memory model for atomics from C++20.") Perhaps that is out of date?

spinlocker · on July 7, 2021

I believe your claim is correct: https://news.ycombinator.com/item?id=27758461.

rigtorp · on July 7, 2021

There's even more discussion on the lock memory ordering on Stackoverflow: https://stackoverflow.com/questions/61299704/how-c-standard-...

Taking a lock only needs to be an acquire operation and a compiler barrier for other lock operations. Using seq_cst or acq_rel semantics is stronger than needed. From my reading and discussions with people from WG21 the current argument for why taking a lock only requires acq semantics is that a compiler optimization that transforms a non-deadlocking program into a potentially deadlocking program is not allowed. There's an interesting twitter thread where we discuss this I can't find anymore :(.

rsc · on July 7, 2021

That is an amazing thread. The fact that C++ apparently allows optimizing

    #include <stdio.h>
    
    int stop = 1;
    
    void maybeStop() {
        if(stop)
            for(;;);
    }
    
    int main() {
        printf("hello, ");
        maybeStop();
        printf("world\n");
    }

into

    int main() {
        printf("hello, world\n");
    }

(as Clang does today) does not inspire confidence about disallowing moving the loop in the other example. If the compiler is allowed to assume that this loop terminates, why not the lock loop?

Maybe there is a reason, but none of this inspires confidence.

rigtorp · on July 7, 2021

The standard says that a thread must eventually terminate, do an atomic operation or do IO. So the while(lock.exchange(true)); loop is different.

Also keep in mind that C++11 specifies std::mutex::lock() to have acquire semantics and unlock() to have release semantics on the lock object. In order for std::mutex to actually work the reordering of m1.unlock(); m2.lock(); to m2.lock(); m1.unlock(); must be disallowed. But since m1 and m2 are separate objects m1.unlock() has no happens before relationship with m2.lock(). This seems to be a problem in the C++11 memory model. The arguments I have heard from some WG21 people is that there is no problem since transforming a wellformed terminating program into a non-terminating program is not allowed. I can't find the wording in the C++ standard that asserts this. But oh well, it works right now on gcc/llvm/msvc.

gpderetta · on July 7, 2021

I don't remember the exact wording, but the standard explicitly makes an exception for the always terminating assumptions, for loops accessing atomic variables or having side effects (i.e. volatile or I/O).

spinlocker · on July 7, 2021

> I'll also note that using AcqRel semantics is not provided by the Rust version of compare_exchange_weak (perhaps a nit on TFA's assertion that Rust adopts the C++ memory model wholesale), so if acquire to lock the spinlock is not adequate, it's likely it would need to go to SeqCst.

Is this true? AcqRel seems to be accepted by the compiler for the success ordering of compare_exchange_weak.

raphlinus · on July 7, 2021

https://doc.rust-lang.org/std/sync/atomic/struct.AtomicU32.h...

It's accepted by the compiler, but if provided, it compiles to a panic.

spinlocker · on July 7, 2021

On the page you linked panics are only mentioned for load and store and the code below seems to work just fine?

    let x = atomic::AtomicU32::new(0);
    x.compare_exchange_weak(
        0,
        1,
        atomic::Ordering::AcqRel,
        atomic::Ordering::Relaxed).unwrap();
    println!("{}", x.load(atomic::Ordering::Relaxed));

raphlinus · on July 7, 2021

Ah, you're right. I was using the same ordering for success and failure. It is possible to use AcqRel in the success case.

spinlocker · on July 7, 2021

Looks like in C++ memory_order_acq_rel is treated like memory_order_acquire when it's a load and memory_order_release when it's a store. I would argue that this isn't really a difference in memory model but a difference in API.

raphlinus · on July 8, 2021

I agree. Sorry for the misdirection - the panic is something I observed when I was experimenting with it, and I misinterpreted it.

dragontamer · on July 6, 2021

GPU-spinlocks are a bad idea, unless the spinlock is applied over the entire Thread-group.

Even then, I'm pretty sure the spinlock is a bad idea, because you probably should be using GPUs as a coprocessor and enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs. The kernel-spawn and kernel-end mechanism provides you your synchronization functionality ("happens-before") when you need it.

---------

From there on out: the GPU-low level synchronization of choice is the thread-barrier (which can extend out beyond a wavefront, but only up to a block).

--------

So that'd be my advice: use a thread-barrier at the lowest level for thread blocks (synchronization between 1024 threads and below). And use kernel-start / kernel-end graphs (aka: CUDA Stream and/or OpenCL Task Graphs) for synchronizing groups of more than 1024 threads together.

Otherwise, I've done some experiments with acquire/release and basic lock/unlock mechanisms. They seem to work as expected. You get deadlocks immediately on older hardware because of the implicit SIMD-execution (so you want only thread#0 or active-thread#0 to perform the lock for the whole wavefront / thread block). You'll still want to use thread-barriers for higher performance synchronization.

Frankly, I'm not exactly sure why you'd want to use a spinlock since thread-barriers are simply higher performance in the GPU world.

raphlinus · on July 6, 2021

In general spinlocks are a bad idea, but you do see them in contexts like decoupled look-back. As you say, thread granularity is a problem (unless you're on CUDA on Volta+ hardware, which has independent thread scheduling), so you want threadgroup or workgroup granularity.

In any case, I'm interested in pushing the boundaries of lock-free algorithms. It is of course easy to reason about kernel-{start/end} synchronization, but the granularity may be too coarse for some interesting applications.

dragontamer · on July 6, 2021

This is the first time I've heard of the term "decoupled look-back". But I see that it refers to CUB's implementation of device-wide scan.

I briefly looked at the code, and came across: https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...

I'm seeing lots of calls to "CTA_SYNC()", which ends up being just a "__syncthreads" (a simple thread-barrier). See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...

I admit that I'm looking rather quickly though, but... I'm not exactly seeing where this mysterious "spinlock" is that you're talking about. I haven't tried very hard yet but maybe you can point out what code exactly in this device_scan / decoupled look-back uses a spinlock? Cause I'm just not seeing it.

----------

And of course: a call to cub's "device scan" is innately ordered to kernel-start / kernel-end. So there's your synchronization mechanism right there and then.

raphlinus · on July 6, 2021

I don't think CUB is doing decoupled look-back, the reference you want is: https://research.nvidia.com/publication/single-pass-parallel...

It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

dragontamer · on July 6, 2021

3rd paragraph:

> In this report, we describe the decoupled-lookback method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives

The CUB-library also states:

https://nvlabs.github.io/cub/structcub_1_1_device_scan.html

>> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs have implemented our "decoupled look-back" algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report [1]

Where [1] is a footnote pointing at the exact paper you just linked.

-----------

> It doesn't use the word "spin" but repeated polling (step 4 in the algorithm presented in section 4.1, particularly when the flag is X) is basically the same.

That certainly sounds spinlock-ish. At least that gives me what to look for in the code.

raphlinus · on July 6, 2021

Ah, good then, I didn't know that, thanks for the cite. I haven't studied the CUB code on it carefully.

wcarss · on July 6, 2021

The prior article in this series from ~a week ago is 'Hardware Memory Models', at https://research.swtch.com/hwmm, with some hn-discussion here: https://news.ycombinator.com/item?id=27684703

Another somewhat recently posted (but years-old) page with different but related content is 'Memory Models that Underlie Programming Languages': http://canonical.org/~kragen/memory-models/

a few previous hn discussions of that one:

https://news.ycombinator.com/item?id=17099608

https://news.ycombinator.com/item?id=27455509

https://news.ycombinator.com/item?id=13293290

electricshampo1 · on July 6, 2021

"Java and JavaScript have avoided introducing weak (acquire/release) synchronizing atomics, which seem tailored for x86."

This is not true for Java; see

http://gee.cs.oswego.edu/dl/html/j9mm.html

https://docs.oracle.com/en/java/javase/16/docs/api/java.base...

dragontamer · on July 6, 2021

Its not true in general. x86 CANNOT have weak acquire/release semantics. x86 is "too strong", you get total-store ordering by default.

If you want to test out weaker acquire/release semantics, you need to buy an ARM or POWER9 processor.

rsc · on July 7, 2021

ARMv7 or earlier it appears. On ARMv8 with direct hw support for SC atomics, the SC atomics are the suggested implementation of acq/rel too. See the ARMv8 section of https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.

As I mentioned in the post (https://research.swtch.com/plmm#sc), Herb Sutter claimed in 2017 that POWER was going to do something to make SC atomics cheaper. If it did, then that might end up being cheaper than the old sync-based acq/rel too, same as ARM, in which case we'd end up with SC = acq/rel on both ARM and POWER. It looks like that didn't happen, but I'd be very interested to know what did, if anything.

gpderetta · on July 7, 2021

I would say that acquire/release map very well to x86 (were they are free). Technically x86 is slightly stronger as it doesn't allow IRIW, but seq cst is too expensive to implement by default.

Conversely acq/rel are from somewhat to very expensive to implement on ARM/POWER.

dragontamer · on July 7, 2021

Acq/rel are nonsense on x86, worse than a NOP. It compiles down into nothing.

x86 cannot specify a load/store any more relaxed than total-store ordering (which is even "stronger" than acquire/release)

ARM / POWER9 were originally "consume/release". But upon C++11, the agreement was that consume/release was too complicated, and acquire/release model was created instead.

Java was the granddaddy of modern memory models but focused on Seq-Cst (the strongest model: the one that makes "sense" to most programmers). C++ inherited Java's seq-cst, but recognized that low-level programmers wanted something faster: both "fully relaxed" and acq/rel as the two faster ways to load/store.

knz42 · on July 6, 2021

A lot of the complexity comes from the lack of expressivity in languages to relate variables (or data structure fields) semantically to each other. If there was a way to tell the compiler "these variables are always accessed in tandem", the compiler could be smart about ordering and memory fences.

The idea to extend programming languages and type systems in that direction is not new: folk who've been using distributed computing for computations have to think about this already, and could teach a few things to folk who use shared memory multi-processors.

Here's an idea for ISA primitives that could help a language group variables together: bind/propagate operators on (combinations of) address ranges. https://pure.uva.nl/ws/files/1813114/109501_19.pdf

smasher164 · on July 6, 2021

Even with that expressivity, someone who incorrectly relates or forgets to relate two variables could experience the same issues. It's still important to address what happens when the program has data races or when it is data-race-free but the memory model permits overreaching optimizations. The language and implementation should strive to make a program approximately correct.

dragontamer · on July 6, 2021

That's Java's Object.lock() mechanism.

All variables inside of an object (aka: any class) are assumed to be related to each other. synchronized(foobar_object){ baz(); } ensures that all uses of foobar_object inside the synchronization{} area are sequential (and therefore correct).

--------

The issue is that some people (a minority) are interested in "beating locks" and making something even more efficient.

karmakaze · on July 6, 2021

In Java, any object can be used to synchronize any data, e.g.

  synchronized(foobar_object){ foo(); }
  synchronized(foobar_object){ bar(); }
  synchronized(foobar_object){ baz(); }

Will have foo, bar, baz methods well behaved in any data that they share regardless of whether they are foobar methods or methods of any other class(es). It is exactly analogous to the S(a) -> S(a) synchronizing instruction from the article that establishes a happens-before partitioning each thread into before/after the S(a).

The only time synchronized(explicit_object) relates to anything else is when also using the keyword where `synchronized void foo()` is equivalent (with a minor performance difference) to `synchronized(this) { ... }` wrapping the entire body of the foo method.

pjmlp · on July 7, 2021

Although in highly parallel code, the primitives from java.util.concurrent are to be preferred.

I highly advise reading "Java Concurrency in Practice".

Note that future Java primitive classes don't have monitors.

karmakaze · on July 7, 2021

Seems like a vague way of saying that locks 'don't scale' or aren't composable, which is certainly the case but straying from the topic of memory models.

mahmoudimus · on July 7, 2021

Fascinating article. I've been doing research in this area and I wonder if there was exploration for JinjaThreads - which operate on Jinja (a Java-like language) that does a formal DRF proof guarantee (coincidentally using Isabelle/HOL).

You can read more about this here if you're interested: https://www.isa-afp.org/entries/JinjaThreads.html

romesmoke · on July 6, 2021

I'm wondering: is the fact that a CS PhD finds resources like this as much amusing as educational/pedagogical gold telling something for the Academia, the Culture, or the Self?

AKA why can't I stumble upon such stuff more often. Thanks OP!

jqpabc123 · on July 6, 2021

If thread 2 copies done into a register before thread 1 executes, it may keep using that register for the entire loop, never noticing that thread 1 later modifies done.

Alternative solution: Forget all the "atomic" semantics and simply avoid "optimization" of global variables. Access to any global variable should always occur direct from memory. Sure, this will be less than optimal in some cases but such is the price of using globals. Their use should be discouraged anyway.

In other words, make "atomic" the sensible and logical default with globals. Assignment is an "atomic" operation, just don't circumvent it by using a local copy as an "optimization".

mumblemumble · on July 6, 2021

This problem isn't specific to global variables; it happens with all shared mutable state. I would assume that the author only used global variables because that lets them keep the working examples as short as possible, and minimize irrelevant details.

And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Compared to using a register, accessing main memory typically takes something on the order of 100 times as long. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.

dragontamer · on July 6, 2021

> And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.

Indeed.

Just a reminder to everyone: your pthreads_mutex_lock() and pthreads_mutex_unlock() functions already contain the appropriate compiler / cache memory barriers in the correct locations.

This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.

This is the stuff of cutting edge research right now: its a niche subject. Your typical programmer _SHOULD_ just stick a typical pthread_mutex_t onto an otherwise single-threaded data-structure and call it. Locks work. They're not "the best", but "the best" is constantly being researched / developed right now. I'm pretty sure that any new lockfree data-structure with decent performance is pretty much an instant PH.D thesis material.

-----------

Anyway, the reason why "single-threaded data-structure behind a mutex" works is because your data-structure still keeps all of its performance benefits (from sticking to L1 cache, or letting the compiler "manually cache" data to registers when appropriate), and then you only lose performance when associated with the lock() or unlock() calls (which will innately have memory barriers to publish the results)

That's 2 memory barriers (one barrier for lock() and one barrier for unlock()). The thing about lock-free algorithms is that they __might__ get you down to __1__ memory barrier per operation if you're a really, really good programmer. But its not exactly easy. (Or: they might still have 2 memory barriers but the lockfree aspect of "always forward progress" and/or deadlock free might be easier to prove)

Writing a low-performance but otherwise correct lock free algorithm isn't actually that hard. Writing a lock free algorithm that beats your typical mutex + data-structure however, is devilishly hard.

voidnullnil · on July 6, 2021

> This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.

Actually, most practitioner code has bugs from their implicit assumptions that shared variable writes are visible or ordered the way they think they are.

dragontamer · on July 6, 2021

But the practitioner doesn't need to know the memory model (aside from "memory models are complicated").

To solve that problem, the practitioner only needs to know that "mutex.lock()" and "mutex.unlock()" orders reads/writes in a clearly defined manner. If the practitioner is wondering about the difference between load-acquire and load-relaxed, they've probably gone too deep.

voidnullnil · on July 6, 2021

> To solve that problem, the practitioner only needs to know that "mutex.lock()"

This is true, but they do not know that. If you do not give some kind of substantiation, they will shrug it off and go back to "nah this thing doesn't need a mutex", like with a polling variable (contrived example).

Kranar · on July 6, 2021

Can you explain what you mean by a "polling variable" needing a mutex? Usually polling is done using atomic instructions instead of a mutex. Are you referring to condition variables?

mumblemumble · on July 6, 2021

In a lot of code I've seen, there are threads polling some variable without using any sort of special guard. The assumption (based, I assume, on how you really could get away with this back in the days of single-core, single-CPU computers) is that you only need to worry about race conditions when writing to primitive variables, and that simply reading them is always safe.

Kranar · on July 6, 2021

Okay but the poster mentioned a mutex, which would not be a good way to go about polling a variable in Java. All you need to guarantee synchronization of primitive values in Java is the use of volatile [1]. If you need to compose atomic operations together, then you can use an atomic or a mutex, but it would not occur to me to use a mutex to perform a single atomic read or write on a variable in Java.

[1] https://docs.oracle.com/javase/specs/jls/se8/html/jls-8.html...

mumblemumble · on July 7, 2021

> All you need to guarantee synchronization of primitive values in Java is the use of volatile

I think I know what you mean, but that's a very dangerous way to word it when speaking in public. It would be more correct to say that "all you need to guarantee reads are protected by memory barriers is volatile."

The distinction matters because, to someone who doesn't already know all about volatile, the way you worded it might lead them to believe that `x++;` is an atomic statement if x is volatile, which is not true. That's a specific example of where things like atomic types are necessary.

(For the curious: https://www.baeldung.com/java-atomic-variables)

I think maybe what you're missing about what I'm saying is that I'm trying to mainly talk for the benefit of people who don't have a solid understanding of how to do safe and performant multithreading. Which is the vast majority of programmers. For that sort of audience, I tend to agree with dragontamer that "just use a mutex" is probably the safest advice to start out. Producing results faster doesn't count for much if you're producing wrong results faster.

dragontamer · on July 6, 2021

Java is somewhat cheating, because it got its memory model figured out years before other languages like C or C++.

In C++, you'd have to use OS-specific + compiler-specific routines like InterlockedIncrement64 to get guarantees about when or how it was safe to read/write variables.

Not anymore of course: C++11 provides us with atomic-load and atomic-store routines with the proper acquire / release barriers (and seq-cst default access very similar to Java's volatile semantics).

-----------

Anyway, put yourself into the mindset of a 2009-era C++ programmer for a sec. InterlockedIncrement works on Windows but not in Linux. You got atomics on GCC, but they don't work the same as Visual Studio atomics.

Answer: Mutex lock and mutex-unlock. And then condition variables for polling. Yeah, its slower than InterlockedIncrement / seq-cst atomic variables with proper memory barriers, but it works and is reasonably portable. (Well, CriticalSections on Windows because I never found a good pthreads library for Windows)

------

Its still relevant because you still see these thread issues come up in old C++ code.

Kranar · on July 7, 2021

I don't understand the relevance of your point. The point I originally asked for clarification about was the use of a mutex for a "polling variable".

Java has had volatile variables since the year 2000, I don't see how it's cheating that Java provided a standardized way of accessing a synchronized value before C and C++ did. Can you elaborate on your point that it's cheating?

In C and C++, for 10 years now, there is a standard library providing atomic data types and atomic instructions. Prior to the standardization one used platform specific atomic facilities. boost has provided cross-platform atomic operations that work on virtually every platform since 2002. Prior to 2002 there were no multicore x86 processors. There would have been mainframe computers that were multicore, is it your argument that code written for those mainframes are of relevant use today by fairly typical C and C++ developers?

At any rate, at no point did any of Java, C, or C++ require the use of a mutex in order to properly synchronize access to a "polling variable". Atomic operations were widely available to all three languages in various ways and would have been the preferred method.

gpderetta · on July 7, 2021

While agree with your general point, there were multiprocessor x86 systems well before 2002. Dual and four socket systems were relatively common and the like of SGI and HP would have been happy to sell you x86 systems with even higher socket counts.

dragontamer · on July 6, 2021

Removing optimizations on "global" variables will leave the bug in Singleton objects (which are very similar to global variables, but the compiler doesn't know that they're global)

---------

"Volatile" is close but not good enough semantically to describe what we want. That's why these new Atomic-variables are being declared with seqcst semantics (be it in Java, C++, C, or whatever you program in).

That's the thing: we need a new class of variables that wasn't known 20 years ago. Variables that follow the sequential-consistency requirements, for this use case.

---------

Note: on ARM systems, even if the compiler doesn't mess up, the L1 cache could mess you up. ARM has multiple load/store semantics available. If you have relaxed (default) semantics on a load, it may be on a "stale" value from DDR4.

That is to say: your loop may load a value into L1 cache, then your core will read the variable over and over from L1 cache (not realizing that L3 cache has been updated to a new value). Not only does your compiler need to know to "not store the value in a register", the memory-subsystem also needs to know to read the data from L3 cache over-and-over again (never using L1 cache).

Rearranging loads/stores on x86 is not allowed in this manner. But ARM is more "Relaxed" than x86. If reading the "Freshest" value is important, you need to have the appropriate memory barriers on ARM (or PowerPC).

jqpabc123 · on July 6, 2021

which are very similar to global variables, but the compiler doesn't know that they're global

Since as you say, they are very similar, wouldn't it be reasonable to assume for access purposes that they are effectively global?

dragontamer · on July 6, 2021

Lets do this example in Java (but it should be simple enough that C#, Python, Javascript and other programmers would understand it).

    public void myFunction(FooObject o){
        o.doSomething();
    }

How does the compiler know if "FooObject o" is a singleton or not? That's the thing about the "Singleton" pattern, you have an effective "global-ish" variable, but all of your code is written with normal pass-the-object style.

EDIT: If you're not aware, the way this works is that you call myFunction(getTheSingleton());, where "getTheSingleton()" fetches the Singleton object. myFunction() has no way of "knowing" its actually interacting with the singleton. This is a useful pattern, because you can create unit-tests over the "global" variable by simply mocking out the Singleton for a mock object (or maybe an object with preset state for better unit testing). Among other benefits (but also similar downsides to using a global variable: difficult to reason because you have this "shared state" being used all over the place)

mumblemumble · on July 6, 2021

Is there anything special about singletons here?

In Java, there's no real difference between a singleton and any other object. A singleton is an object that just happens to have a single instance. Practically speaking, they're typically used as a clever design pattern to "work around" Java's lack of language-level support for global variables, so there's that. But I think that that fact might not be relevant to the issue at hand?

The more basic issue is, if you have two different threads concurrently executing `myFunction`, what happens when they're both operating on the same instance of `FooObject`?

dragontamer · on July 6, 2021

> Is there anything special about singletons here?

No, aside from the fact that the root commenter clearly understands the issue with global variables, but not necessarily singletons.

I'm trying to use the singleton concept as a "teaching bridge" moment, as the Singleton is clearly "like a global variable" in terms of the data-race, but generalizes to any object in your code.

The commenter I'm replying seems to think that global-variables are the only kind of variable where this problem occurs. He's wrong. All objects and all variables have this problem.

ekiwi · on July 6, 2021

> Access to any global variable should always occur direct from memory.

What if your function takes a pointer that might be pointing to a global variable? Does that mean that all accesses through a pointer are now excempt from optimization unless the compiler can prove that the pointer will never point to a global variable?

cbsmith · on July 6, 2021

The great thing about simple memory models is they work really well until you think about it. ;-)

jqpabc123 · on July 6, 2021

What if your function takes a pointer that might be pointing to an "atomic" variable?

Pointers can be used to circumvent most safety measures. If you obscure the access, you should assume responsibility for the result.

gpderetta · on July 7, 2021

At least in C++, atomicity is part of the type system, and you would have to explicitly reinterpret cast it away.

voidnullnil · on July 6, 2021

These "memory models" are too complex for languages intended for dilettante developers. It was a disaster in Java/C#. Not even more than a handful of programmers in existence know in depth how it works, as in, can they understand any given trivial program in their language. At best they only know some vague stuff like that locking prevents any non visibility issues. It goes far deeper than that though (which is also the fault of complex language designs like Java and C#).

The common programmer does not understand that you've just transformed their program - for which they were taught merely that multiple threads needs synchronization - into a new game, which has an entire separate specification, where every shared variable obeys a set of abstruse rules revolving around the happens-before relationship. Locks, mutexes, atomic variables are all one thing. Fences are a completely different thing. At least in the way most people intuit programs to work.

Go tries to appeal to programmers as consumers (that is, when given a choice between cleaner design and pleasing the user who just wants to "get stuff done", they choose the latter), but yet also adds in traditional complexities like this. Yes, there is performance trade off to having shared memory behave intuitively, but that's much better than bugs that 99% of your CHOSEN userbase do not know how to avoid. Also remember Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption (in the C / assembly sense, not merely within that array) despite the rest of the language being memory-safe. Multiply that by the "memory model".

Edit: forgot spaces between paragraphs.

filleduchaos · on July 6, 2021

> These "memory models" are too complex for languages intended for dilettante developers.

It would be nice if sometime we stopped pretending that beginners are too slow to know/understand things and instead faced the fact that their instructors and mentors are bad at teaching.

rsc · on July 7, 2021

If "not even more than a handful of programmers" understand something, then it's objectively wrong to refer to all programmers - approximately all programmers - as "dilletantes".

Also, maybe you are different, but I can only keep so much in my head at a time. If I can keep something simple or abstract it away so I can focus on other details, that doesn't make me a dilletante. It makes me more effective at what I'm actually trying to do.

voidnullnil · on July 7, 2021

Not sure why people keep deducing this point I have not made. I said Go markets to dilletante programmers, which is a reason to make it simpler, not more complex. Any language with "the memory model" is complex. I do not fully grasp the memory model, either, as it requires full time investment.

temac · on July 6, 2021

They are not too complex for "languages intended for dilettante developers": they can (and should) use sequential consistency or locks everywhere.

voidnullnil · on July 6, 2021

How many programmers do you know put locks around polling varibles for seemingly no reason (as they are not cognizant of the memory model)?

pjmlp · on July 6, 2021

Well, while it may appear gatekeeping, maybe those dilettante developers should be using something else instead, like BASIC?

formerly_proven · on July 6, 2021

CPython has a pleasant memory model from what I've heard :)

eternalban · on July 6, 2021

Aha. The other day you were arguing for education as a silver bullet /g

pjmlp · on July 7, 2021

Were I?

So here is my argument, maybe those developers should bother to actually learn about what they are trying to do in first place.

gpderetta · on July 7, 2021

All expert developers were dilettante at some point. The only way to become an expert in some specific area is to study and practice it. It might make you a better developer even if you don't end up using it in anger often (or at all).

chakkepolja · on July 7, 2021

All dilettante developers should only use javascript and dart.

sagichmal · on July 7, 2021

> Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption (in the C / assembly sense, not merely within that array)

Source?

rsc · on July 7, 2021

https://research.swtch.com/gorace

voidnullnil · on July 7, 2021

https://blog.stalkr.net/2015/04/golang-data-races-to-break-m...

The "Exploiting the slice type" section.

sagichmal · on July 7, 2021

Er, your summary does not at all describe what is going on there. Like, all of that code violates the memory model, so whatever it accomplishes is irrelevant.

voidnullnil · on July 8, 2021

> Like, all of that code violates the memory model,

and?

> so whatever it accomplishes is irrelevant.

I have no idea what point you are making. _Of course_ there has to be a bug in the code for there to be a buffer overflow vuln. Or are you objecting that they put contrived code to make the race work better (this is the concept of a PoC)? None of the patterns in that code are unlikely in practice.

sagichmal · on July 9, 2021

> I have no idea what point you are making. _Of course_ there has to be a bug in the code for there to be a buffer overflow vuln. Or are you objecting that they put contrived code to make the race work better (this is the concept of a PoC)?

The original claim was that "Go has lots of weird edge cases, like sharing a slice across threads can lead to memory corruption." But that's not the whole picture, you have to violate the memory model, too. And that's not interesting, because if you violate the memory model, literally any consequence is fair game.

Maybe your point is (a) it's easy to violate the memory model, and/or (b) bugs that violate the memory model have surprising consequences? I don't agree with (a); the situation can always be improved, but it's easy to spot and fix data races, and Go provides plenty of tooling for that purpose. And I guess I agree with (b) in the basic sense, but that's just a truism, for the reasons stated above.

voidnullnil · on July 10, 2021

> because if you violate the memory model, literally any consequence is fair game.

Any consequence is _not_ fair game. "Memory Models" only involve stuff like tossing out sequential consistency [1]. They never say or imply something like "if you have a data race, anything can happen [including executing code on the stack]". Go slices exposing implementation details in a way that makes the language memory-unsafe is a completely different issue. If Go was sequentially consistent (so it had no "Memory Model" to violate), it would still not make the language memory-safe, because it would still write the array pointer and be pre-empted before writing the length.

> And that's not interesting

It matters because all programs have bugs (apparently), and so we'd like them to fail in a less harmful way than executing shellcode submitted by a client.

> it's easy to spot and fix data races, and Go provides plenty of tooling for that purpose.

Never used the data race detector but it probably can only identify low hanging fruit, and is not a substitute for the developer education problem.

Okay I think I see your confusion: You can actually avoid slices causing buffer overflows because the language requires you to have a happens-before relationship for all data shared across threads in the first place. That is, even if you share a boolean or across threads, you would be sure to establish a happens-before relationship if you are in the know. However, this does not rebuke my original argument, which assumes that most devs are not in the know. They do not know about slices being unsafe, nor do they know about happens-before. So they are not educated to prevent this mistake. Also, avoiding data races is hard.

1. https://en.wikipedia.org/wiki/Sequential_consistency

sagichmal · on July 10, 2021

> They never say or imply something like "if you have a data race, anything can happen [including executing code on the stack]"

They absolutely do.

https://software.intel.com/content/www/us/en/develop/blogs/b...

Violating the memory model gets you undefined behavior.

> However, this does not rebuke my original argument, which assumes that most devs are not in the know. They do not know about slices being unsafe, nor do they know about happens-before

I just don't agree. Go programmers know that nothing is safe for concurrent access unless explicitly noted otherwise. They don't have any confusion about slices requiring synchronization.

Concurrent programming isn't trivial but neither is it impossible. And data races are critical bugs that can be subtle, but are straightforward to identify, and straightforward to fix.

voidnullnil · on July 11, 2021

Okay, well your (wrong) semantic argument that "GMM violation = UB" is irrelevant, so I'll stop arguing against it other than hinting that there is not one single mention of the word "undefined" in the GMM spec. Back on topic: Go is not memory-safe. Your belief that "GMM violation = UB", where UB = memory corruption literally and by definition implies that Go is not memory-safe. Java (which is where the term "memory model" comes from) is memory-safe, C# is memory-safe. Go is not.

Dilettante programmers certainly do not know the following:

1. Go slices, strings, and interface values are unsafely non-atomic. It's documented on some obscure page (even the spec does not document it AFAIK, which is also broken).

2. What a data race is

Even if they know #1, they will still write code like: modifying a slice within a structure and setting a thread-shared pointer to point to that structure.

Again, most programmers are taught "things need locks, for reasons". At best, they will pointlessly lock things, then another programmer will come "debunk" them and remove the lock because "the thing being locked is atomic". Note how none of this involves any thought of the memory model. That's because they do not know it exists.

As for people who know #2, yes that is enough to avoid memory corruption without needing to know #1, however, they are not sufficiently informed how much data races matter (as executing shellcode is not an expected outcome of writes to your data being non observable).

sagichmal · on July 11, 2021

> GMM violation = UB

This is definitionally correct (shrug)

> Dilettante programmers certainly do not know [that] Go slices, strings, and interface values are unsafely non-atomic.

Yes. They do. As soon as a Go programmer learns that there is such a thing as concurrency and "thread safety" they learn that nothing in Go is "thread safe" by default.

> Go is not memory-safe.

"Memory-safe" is not a precisely defined concept. Go is memory safe by some definitions, not by others.

voidnullnil · on July 11, 2021

Thinking on: "[data races] are straightforward to identify"

It's a the same as the problem of knowing what data is used in what thread, which is hard and unsolvable by automation, so I doubt it's an easy problem.

bullen · on July 6, 2021

In a 100 years the main languages used will still be C on the client (with a C++ compiler) and Java on the server.

Go has no VM but it has a GC. WASM has a VM but no GC.

Eveything has been tried and Java still kicks everythings ass to the moon on the server.

Fragmentation is bad, lets stop using bad languages and focus on the products we build instead.

"While I'm on the topic of concurrency I should mention my far too brief chat with Doug Lea. He commented that multi-threaded Java these days far outperforms C, due to the memory management and a garbage collector. If I recall correctly he said "only 12 times faster than C means you haven't started optimizing"." - Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html

"Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem..." - Travis Downs https://travisdowns.github.io/blog/2020/07/06/concurrency-co...

skohan · on July 6, 2021

I'm sorry is this comment from 1998? I've been working in software for over a decade, and I haven't seen server work being done in Java in ages.

From my perspective, Go in the context of serverless programming seems to currently be the best choice for server-side programming.

In the next 20 years I expect Go will be supplanted by a language which is a lot like go (automatic memory management, simple, easy to learn & write and performant enough) but with the addition of algebraic data types, named parameters, and a slightly higher level of abstraction.

valbaca · on July 6, 2021

> In the next 20 years I expect Go will be supplanted by a language which is a lot like go (automatic memory management, simple, easy to learn & write and performant enough) but with the addition of algebraic data types, named parameters, and a slightly higher level of abstraction.

I'd love for this to be Crystal: https://crystal-lang.org/

> I haven't seen server work being done in Java in ages.

In the meantime, I've been doing a large amount of Java backend server work for the past 10 years.

skohan · on July 6, 2021

I have a feeling it's going to have C-like syntax and frankly I hope so because using an `end` keyword instead of braces makes no sense to me.

filleduchaos · on July 6, 2021

Arguably using curly braces to delineate blocks makes no inherent sense either. We just do it because that's what everybody else does.

skohan · on July 7, 2021

So if I can give my extremely pedantic rebuff: `end` is 3 characters rather than two with `{}` - that's objectively more work to type, and it makes your programs take more space on disk.

Also it's dead simple to write parsers and developer tools which can match open and close braces. Handling `end` with an arbitrary opening token (maybe it's `if <...>`, `while <...>` what have you) is objectively more work for your CPU to work with.

Subjectively, it looks dumb to have code which looks like this:

            end
          end
        end
      end
    end
  end

bullen · on July 6, 2021

Infinite growth does not exist, everything peaks at some point. You have to wonder if a memory model from 2005 still kicks go's ass in 2021 how your your prediction that "there will always be something new and shiny to distract us from the focus we need to leverage the real value of the internet" will play out?

What have you built with go that is interesting?

skohan · on July 6, 2021

It's not about new and shiny. Programming is still a relatively new field when compared to other fields. For instance, mathematics and physics took centuries to land on the right way to formalize things.

C is maybe the only good programming language invented so far. Java was a failed attempt at improving C. I think we're rapidly converging on the second good programming language, and it's not going to have null pointer exceptions.

pjmlp · on July 6, 2021

And I haven't seen server work done in C++ since 2006, we keep replacing those systems with Java and .NET ones.

To each its own.

deterministic · on July 7, 2021

I have seen new server work being done in C++ every year the last 10 years. So yes we all have different experiences.

pphysch · on July 6, 2021

Pretty sure Java is/was popular (and has enormous momentum) because "it just works" at a time when the Internet is taking off, not because it is some linguistic or technological marvel. It will definitely stick around, just like COBOL and C and Go.

bullen · on July 6, 2021

COBOL is not around in anything interesting, and trust me go is not going to be used to build anything that we'll use in 20 years.

ithkuil · on July 7, 2021

I agree that C is not going to die soon.

But don't dismiss Go so easily; it hits an interesting sweet spot that may not go away any time soon. It's a simple language with a simple spec, so simple that people are complaining it's too simple a language, yet also simple to use thanks to the GC. But also compiled and fast enough.

But most important of all, it's memory safe and not plagued by undefined behaviour.

Soon (already?) security will mean real money and life or death situation for companies; keeping that much code in a language where nobody can promise a memory corruption will not be introduced in the next commit, is eventually not going to be considered acceptable anymore.

yes, Go is sponsored by a mega corp, yes, and some people cringe at that, but realistically it's less a walled garden than c#, swift or stuff like that.

Rust is likely going to fill the niche currently occupied by C++ but it's quite hard to learn and use.

So yes, it's quite possible that we'll all flock to something new and shiny in 10 years time and forget Go before 20 years have passed. But, whatever replaces Go needs to fill its niche, which if you think about it doesn't have that much free design space left; yes you can improve a few things here and there, but then you have to fight with the massive code base and libraries, that stays relevant due to the absolutely fantastic backward compatibility promises. I've seen C code rot due to compilers getting "better" over time (yes, sure, the C code in question was obviously "wrong", but nobody noticed, because writing correct C code is an exercise in divination)

pjmlp · on July 7, 2021

Rust still has lots to catch up regarding replacing C++ in compiler toolchains, GUI frameworks, game engines middleware and console SDKs, GPGPU, Machine Learning frameworks, HFT, HPC, ....

It is now where C++ was in the early-1990's.

deterministic · on July 7, 2021

The world banking system is running on COBOL. Perhaps not “interesting” for you but without it pretty much everything would stop working.

pphysch · on July 6, 2021

Do I hear "containers are a temporary fad"?

bullen · on July 6, 2021

butterisgood · on July 7, 2021

That’s like, just your opinion man.