Memory vs synchronization effects

At the attribute level, LLVM currently distinguishes memory and synchronization effects. The former are specified by the memory(...) attribute, the latter by the absence of the nosync attribute. In reality, nosync is essentially a LangRef-only fiction: none of our core infrastructure (say, BasicAA) makes use of it. Instead, synchronizing instructions must also declare memory read/write effects to be handled correctly.
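
For concreteness, a minimal IR sketch of the two attribute families (function names are made up):

```llvm
; May read/write memory and may synchronize with other threads
; (no nosync attribute):
declare void @may_sync() memory(readwrite)

; May read/write memory, but is guaranteed not to synchronize:
declare void @no_sync() nosync memory(readwrite)

; Touches no memory at all and does not synchronize:
declare void @pure() nosync memory(none)
```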

I’m trying to understand both what the purpose of nosync is and what the future plans for it are, because the current state is rather confusing to me. To that end, I have a few questions:

  1. Does it ever make sense to have a non-nosync instruction that is not also (implicitly or explicitly) memory(readwrite)? If the synchronizing instruction can make a write from a different thread visible to the current one, then it must be modeled as a memory write effect. If it can make a write from the current thread visible to a different one, then it must be modeled as a memory read effect. Are there synchronizing instructions that don’t make memory effects visible (what are they synchronizing instead?)

  2. What additional optimization restrictions would a non-nosync function impose relative to a memory(readwrite) function?

  3. How is nosync supposed to be modelled at the API level? For example, AA currently operates in terms of Mod and Ref effects. Do we need an additional Sync effect, and if so, what semantics would it have that are disjoint from Mod and Ref? Does Sync make sense as a per location concept?


Thanks for starting this discussion. I have tried multiple times over the years to bring this up. I was also the reason we have nosync, and I am really hoping for more memory “kinds”. TL;DR: I want us to model synchronization properly, eventually in a fine-grained way, and not shoehorn everything into “some memory effects”.

This is not true, assuming “core” means run by default at O3. The Attributor uses it to do many things, e.g., to argue that store-load propagation through (internal) global or stack memory is valid in a threaded environment. This is essential to remove GPU runtime state. We derive noalias based on nosync, we do “heap-2-stack” based on nosync, …
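
A minimal sketch of the kind of reasoning meant here (identifiers are made up):

```llvm
@state = internal global i32 0

; Does not access @state and, crucially, does not synchronize.
declare void @helper() nosync memory(none)

define i32 @user() {
  store i32 42, ptr @state
  call void @helper()        ; nosync: cannot make another thread's
                             ; write to @state visible at this point
  %v = load i32, ptr @state  ; forwarding 42 is justified
  ret i32 %v
}
```

If @helper were instead a synchronizing barrier (i.e. not nosync), the forwarding would be wrong for memory actually shared between threads, even with otherwise precise memory effects on the callee.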

That said, we also have all sorts of special handling that circumvents memory effects but should not do so without honoring sync/nosync, e.g., D123531 ([GlobalsModRef][FIX] Ensure we honor synchronizing effects of intrinsics).

This has been the general answer, with a few fun exceptions. It is also not a great answer if we don’t introduce “interesting”/“complicated” categories into memory. For one, GPU shfl intrinsics are an existing exception: they are readnone and sync, which describes them reasonably, as they “synchronize” only “registers”. The second thing that comes to mind is values/memory for which we know which threads can share them, so a sync might not affect “threadonly” memory. Third, we can very well model barriers as a concise set of their memory effects to enable low-level and high-level transformations; see Section 3 A) in https://arxiv.org/pdf/2207.00257 (not in tree yet).
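
As a sketch of that first exception (the intrinsic name is a placeholder, not the real NVVM/AMDGPU one):

```llvm
; Exchanges a register value with another lane/thread: it synchronizes,
; yet it neither reads nor writes memory. Deliberately *not* nosync.
; args: value to exchange, source lane
declare i32 @lane.shuffle(i32, i32) memory(none)
```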

We can (and do) argue about the interaction of threads (of execution) in the presence of nosync functions, but not in the presence of non-nosync functions, regardless of their memory effects. We actually do that, e.g., to eliminate aligned GPU barriers and to propagate values through memory. Check out Section IV b) and c) in Co-Designing an OpenMP GPU Runtime and Optimizations for Near-Zero Overhead Execution (Conference) | OSTI.GOV (in tree).
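
A rough illustration of the barrier elimination mentioned above (with a hypothetical @barrier callee; the real transform reasons about aligned barriers as described in the paper):

```llvm
declare void @barrier() convergent

define void @kernel(ptr %p) {
  store i32 1, ptr %p
  call void @barrier()   ; needed: publishes the store to the other threads
  call void @barrier()   ; if no participating thread touches memory in
                         ; between, this second barrier is redundant
  ret void
}
```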

I think the first link gives some answer to this. We generally also want a sync effect. In a lot of places we are currently either wrong or lucky not to do the wrong thing in the absence of nosync. The latest (and still existing) bug was reported a few weeks ago: the OpenMP parallel (outlined) function is currently marked norecurse because it cannot call itself. That makes GlobalOpt think it can move a global variable onto the stack even though it is shared between threads. Here is the code that breaks under LTO; with a few extra statics it should break without LTO: Compiler Explorer
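
Roughly how that miscompile can arise (identifiers are made up; the actual reproducer is behind the Compiler Explorer link):

```llvm
@shared = internal global i32 0   ; logically shared by all threads of the region

; Outlined parallel region: it never calls itself, so norecurse is derived,
; yet many threads execute it concurrently and communicate via @shared.
define internal void @omp.outlined(i32 %tid) norecurse {
  ; cross-thread reads/writes of @shared elided; moving @shared into an
  ; alloca here silently privatizes it per thread
  store i32 %tid, ptr @shared
  ret void
}
```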

  1. Does it ever make sense to have a non-nosync instruction that is not also (implicitly or explicitly) memory(readwrite)? If the synchronizing instruction can make a write from a different thread visible to the current one, then it must be modeled as a memory write effect. If it can make a write from the current thread visible to a different one, then it must be modeled as a memory read effect. Are there synchronizing instructions that don’t make memory effects visible (what are they synchronizing instead?)

A prefetch instruction can touch memory but is not strictly in the read or write class.
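
For instance, LLVM’s prefetch intrinsic takes (address, rw, locality, cache type) and pulls a cache line towards the core without observably reading or writing the location’s value (exactly how its effects should be categorized is part of the question here):

```llvm
declare void @llvm.prefetch.p0(ptr, i32, i32, i32)

define void @warm(ptr %p) {
  ; rw=0 (read), locality=3 (high), cache type=1 (data)
  call void @llvm.prefetch.p0(ptr %p, i32 0, i32 3, i32 1)
  ret void
}
```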

A synchronization event transpires strictly within a single thread, with a number of third-party observers (all the other threads and certain HW devices). {I use the word event to include LL and SC, where the ATOMIC event transpires over a handful (or more) of instructions.}
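
A concrete instance: a single IR-level atomic event that, on LL/SC targets such as ARM or RISC-V, is lowered to a load-linked/store-conditional loop spanning several instructions:

```llvm
define i32 @bump(ptr %counter) {
  ; one ATOMIC event; other threads observe the old or the new value of
  ; %counter, never an intermediate state
  %old = atomicrmw add ptr %counter, i32 1 seq_cst
  ret i32 %old
}
```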

  1. What additional optimization restrictions would a non-nosync function impose relative to a memory(readwrite) function?

ATOMIC stuff, and synchronization in general, falls into the category of “it needs to be correct”, and the speed of the instructions is “not very important” (subject to the first criterion).

  1. How is nosync supposed to be modelled at the API level? For example, AA currently operates in terms of Mod and Ref effects. Do we need an additional Sync effect, and if so, what semantics would it have that are disjoint from Mod and Ref? Does Sync make sense as a per location concept?

There are processor ISAs that allow several memory locations to participate in a single ATOMIC event. All of these locations appear to be modified instantaneously or not at all; no third-party observer can see any of the intermediate state. There may also be memory locations that are touched but are not participating in the ATOMIC event; they are merely necessary to code the event properly. {A timestamp register, for example.}