Minor quibble: you *can* linearize small amounts of memory using atomic access. ...

mevric · 2024-05-11T22:32:12 1715466732

I am curious to understand how the following is achieved? Is there a material on this?

"storing two uint32 as a uint64"

JonChesterfield · 2024-05-11T22:36:37 1715466997

Put them next to each other, 8 byte align the first one, use a compiler mechanism to disable alias analysis, do the uint64 store. Attribute((may_alias)) is the local override, fno-strict-aliasing the global one.

I think C++ can now do "these bytes are now that type", called something like start_lifetime_as. C probably can't, though using a union might be legitimate. The language rules in this area are a mess.

MaxBarraclough · 2024-05-12T12:00:37 1715515237

There's no need to flirt with undefined behaviour and non-standard compiler flags. Just convert both uint32_t values to uint64_t type, then combine them into a single uint64_t value using bitwise shift then bitwise inclusive OR.

Rob Pike has blogged about this kind of thing. [0]

Perhaps also of interest: both C and C++ provide a (portable and standard) means of determining whether atomic operations on uint64_t are assured to be lock-free. [1][2] (Assuming of course that the uint64_t type exists - it's in the standard but it's optional.)

[0] https://commandcenter.blogspot.com/2012/04/byte-order-fallac... ( discussion: https://news.ycombinator.com/item?id=3796378 )

[1] https://en.cppreference.com/w/c/atomic/atomic_is_lock_free

[2] https://en.cppreference.com/w/cpp/atomic/atomic_is_lock_free

JonChesterfield · 2024-05-12T17:37:34 1715535454

If you do the loads as uint32, you lose the single atomic operations on two different values which was the whole point of this exercise.

Using a single uint64 as the memory type works, but you no longer have two different names fields and have to pack/unpack them by hand.

There's no ub if you use the compiler extension, just totally clear code that does the right thing

MaxBarraclough · 2024-05-12T18:47:42 1715539662

> If you do the loads as uint32, you lose the single atomic operations on two different values which was the whole point of this exercise.

There's no need for any flirting with undefined behaviour through type-punning.

When doing the atomic write, you prepare the uint64_t value to write by using bitwise operations, and then perform the atomic write of the resultant uint64_t value.

When doing the atomic read, you atomically read the uint64_t value, then use bitwise operations to unpack the original pair of uint32_t values.

Put differently, writing is done by pack-then-atomically-write, and reading is done by atomically-read-then-unpack.

Turns out we're both overthinking it though, there's a more direct way: use a struct containing an array of 2 uint32_t elements, or declare a struct with 2 uint32_t members. Both C and C++ support atomic reads and writes of user-defined types. For a C++ example showing this see [0]. This will be atomic and, presumably, should be lock-free where possible (hard to imagine the compiler would introduce padding in the struct type that would sabotage this).

> Using a single uint64 as the memory type works, but you no longer have two different names fields and have to pack/unpack them by hand.

Yes, the stored variable would hold 2 different meaningful values, which is a little ugly.

> There's no ub if you use the compiler extension, just totally clear code that does the right thing

Anyone with a deep knowledge of the language will quickly recognise it as incorrect per the language standard. I wouldn't call that totally clear code that does the right thing.

Your proposed solution is only assured to behave as expected if the correct compiler-specific flags are used, otherwise it will introduce undefined behaviour. There's no guarantee that a compiler will even offer such a flag. It's also likely to trigger compiler warnings.

[0] https://en.cppreference.com/w/cpp/atomic/atomic

gpderetta · 2024-05-12T00:12:21 1715472741

Note that writing 64 bits and reading 32 (or viceversa) is not a way to get around fences on x86. It is explicitly documented as begin undefined. In most cases it will fail to store-forward that will stall and act as an implicit fence, but in some cases the CPU can do partial store forwarding, breaking it.

AFAIK this trick does work on SPARC though.

moonchild · 2024-05-12T00:29:48 1715473788

It's not documented as being undefined; it's simply not documented at all.

Intel's latest uarch does partial store forwarding.

There was a paper from a few years ago trying to define semantics for mixed-size accesses (not for x86 though) https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf

I don't think the parent was talking about this, though; they were just talking about using a single large physical location, which logically contains multiple smaller values. Accesses to a single location happen order, so there is indeed no need for fencing between accesses to it. Usually you get a full 128 bits (at least amd64/aarch64/ppc64; not riscv yet but I expect they will get there).

That said—mixed-size can be useful despite the lack of semantics (I think linux uses them in a few places?). sooo

gpderetta · 2024-05-12T10:04:39 1715508279

Ah, right, it was about guaranteed total order on all stores in a single memory location.

Re colocation and x86, IIRC the intel memory model has wordings regarding read and writes to a a memory location having to be of the same size to take advantage of the memory model guarantees.

moonchild · 2024-05-12T18:11:17 1715537477

total order on all accesses to a given location—loads from a single location can't be reordered w.r.t. each other either

i don't remember seeing any wording relating to mixed-size accesses in the intel manual (not withstanding that the official models are ... ambiguous, to say the least, compared with what 3rd-party researchers have done)

gpderetta · 2024-05-12T22:28:09 1715552889

> i don't remember seeing any wording relating to mixed-size accesses in the intel manual (not withstanding that the official models are ... ambiguous, to say the least, compared with what 3rd-party researchers have done)

I was probably misremembering the details. The manual has to say this regarding #LOCK prefixed operations:

"Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access"

which is already vague enough, but regarding general atomic load and stores I couldn't find anything.

gpderetta · 2024-05-12T18:17:26 1715537846

For a total store order to be meaningful of course it implies that loads are also non visibly reordered. If a store falls in the forest but nobody is around to load it, was it really ordered :)

moonchild · 2024-05-13T02:00:22 1715565622

Tbf you could say stores happen in order, and loads can happen out of order unless you fence. Personally I don't understand why we need such strong ordering constraints for weakly ordered reads—istm you can go much weaker and maintain sanity.

mevric · 2024-05-11T23:19:13 1715469553

Thank you!!