> Acquire/release is about the ordering of instructions within a thread of execu...

> Acquire/release is about the ordering of instructions within a thread of execution

acquire/release is about visibility. Acquire and release always go in pair. You can't really reason purely about a release and an acquire in isolation and that's why simply thinking about instruction reordering is not enough.

> So if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code. If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically

Whether an explicit memory fence is needed or not depends on the architecture (for example you do not need them on x86). But you do not need to care, if you use the atomic operation with the correct semantic, the compiler will insert any required fence for you.

As an aside, typically fences have nothing to do with caches. One a store or a load operation hits the cache, the coherence system takes care that everything works correctly. If fences had to flush the cache, they would be orders of magnitude slower.

Instead fences (explicit or otherwise) make sure that either memory operations commit (i.e. are visible at the cache layer) in the expected order or that an application can't tell otherwise, i.e. reordering is still permitted across fences as long as conflicts can be detected and repaired, typically this can only happen for loads that can be retried without side effects.

> Whenever I see these concepts discussed, it is in the context of the C++ stdatomic library. If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler

formally in C99 multithreaded programs are UB. Of course other standards (POSIX, openmp) and implementations (the old GCC __sync_builtins) could give additional guarantees; but only C11 gave a model defined well enough to reason in depth about the overall CPU+compiler system; before that people just had to make a lot of assumptions.

> Finally, does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?

It does, but note that the compiler will only respect acquire/release semantics for atomic objects operations with the required ordering, not normal load and stores.