[RFC] Add memory scope to GPU barrier

Motivation

Currently gpu.barrier is very simple and covers all use cases, but I would like the option to specify the scope of memory synchronization that is required. Specifically, I would like to express whether memory accesses made by one invocation can be observed by all invocations in the same work-group/block or only by invocations in the same sub-group/warp. This could reduce the cost of synchronization.

Implementation

An enum will be added for the scope:

def GPU_Scope : I32EnumAttr<"Scope",
    "Specifies the grouping of invocations that can observe memory accesses from other invocation in the same group.",
    [
      I32EnumAttrCase<"workgroup", 0>,
      I32EnumAttrCase<"subgroup", 1>,
    ]>{
  let genSpecializedAttr = 0;
  let cppNamespace = "::mlir::gpu";
}
def GPU_ScopeAttr : EnumAttr<GPU_Dialect, GPU_Scope, "scope">;

and this enum will be added as an argument to gpu.barrier, with a default value of workgroup.
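
A rough sketch of how this could look in the IR (the exact assembly syntax below is only an illustration, not a settled design):

// Default: current behaviour, accesses become visible across the whole work-group/block.
gpu.barrier
// Hypothetical weakened form: visibility is only required within the sub-group/warp.
gpu.barrier { scope = #gpu.scope<subgroup> }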

I misunderstood the use case I was designing for, and this is not what I want currently. What I actually need is a memory scope for the level of synchronization done with a gpu.barrier. I’ve created a proof of concept here: Gpu barrier memfence by FMarno · Pull Request #3 · FMarno/llvm-project · GitHub

Right now there are a couple of issues:

  1. GPU_StorageClass seems to overlap a lot with address space
  2. “local” memory is an overloaded term and has different meanings in CUDA and OpenCL

Yes, storage class and address space aren’t really orthogonal, but that’s how a lot (most?) GPU APIs tend to lean.

The term “local” seems fine to me. Yes, this thing is called many different things depending on who you talk to: local / LDS / shared / groupshared. That’s just the way things are; as long as the dialect stays internally consistent and you add some comments / docs to clarify how it relates to terminology in other contexts, it should be fine.

Do you perhaps want gpu.address_space? We use that enum for distinguishing global/private/shared, and it could be extended if needed.
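
For reference, the existing gpu.address_space attribute has three cases, which map onto common vendor terminology roughly like this:

#gpu.address_space<global>     // device/global memory
#gpu.address_space<workgroup>  // “shared” (CUDA) / LDS (AMD) / local (OpenCL)
#gpu.address_space<private>    // per-thread private memory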

PR added here [mlir][gpu] Add address space modifier to Barrier by FMarno · Pull Request #110527 · llvm/llvm-project · GitHub

My use case

The use case I would like is SPIR-V’s OpControlBarrier with Workgroup for the Execution operand, Workgroup for the Memory operand, and WorkgroupMemory or WorkgroupMemory | CrossWorkgroupMemory for the Semantics operand. Hopefully this also fits what others want.

OpControlBarrier waits for all active invocations (threads) within the Execution scope to reach the operation, so it acts as a thread barrier, but it also acts as if there were an additional OpMemoryBarrier when the Semantics operand is not None. SPIR-V’s OpMemoryBarrier:

Ensures that memory accesses issued before this instruction are observed before memory accesses issued after this instruction

OpMemoryBarrier has two operands, Memory and Semantics. Memory defines the scope of invocations that will observe the memory changes; so far changing this has not been discussed, and the value is constrained to Workgroup (and I would like to continue with that constraint). The Semantics operand is a flag with many options, but importantly SubgroupMemory and WorkgroupMemory. Semantics controls the address space in which changes are observed, which is what I would like to control.

The two options I’m interested in could be written as:

  • gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>] or just gpu.barrier
  • gpu.barrier memfence [#gpu.address_space<workgroup>]
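
For reference, in the spirv dialect those two would roughly correspond to something like the following (the AcquireRelease ordering bit is my addition for illustration; the exact bit set isn’t settled here):

// fence global (cross-workgroup) and workgroup memory
spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory|CrossWorkgroupMemory>
// fence workgroup memory only
spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory>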

I think it’s important to note that the default gpu.barrier keeps the existing semantics; the proposal only adds the option to weaken the memory fence.

The design of the PR I put up also allowed for these operations:

  • gpu.barrier memfence []
  • gpu.barrier memfence [#gpu.address_space<private>]

They both describe a thread-barrier-only operation with no memory fence, since a memory fence within the private address space should have no effect. These inclusions are incidental.

Conflation of thread barriers and memfence

gpu.barrier already conflates the idea of thread barriers and memory fencing, but I don’t think that is necessarily bad since barriers are often used for communication between threads.
I agree that gpu.barrier matches the semantics of __syncthreads from CUDA, but not those of amdgpu.lds_barrier, since that op does not control how global memory is observed.
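
As a rough side-by-side (the memfence form is the syntax proposed in the PR above):

// __syncthreads-like: workgroup sync plus a fence covering global and workgroup memory
gpu.barrier
// amdgpu.lds_barrier-like: workgroup sync, but only workgroup (LDS) memory is fenced
gpu.barrier memfence [#gpu.address_space<workgroup>]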

Add a memfence operation

I generally support the inclusion of a memfence operation, but that alone is not enough for my situation. I need both a workgroup-wide thread barrier and a guarantee that memory accesses to workgroup memory (shared memory in CUDA) made by an invocation/thread/work-item in a workgroup before the operation can be observed by invocations/threads/work-items in the same workgroup after the operation, importantly, with no guarantee about global memory.
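
To make that concrete, here is a minimal sketch of the pattern I have in mind (names and shapes are made up, and the gpu.barrier memfence [...] form is the syntax proposed in the PR):

gpu.module @kernels {
  gpu.func @tile_exchange(%out : memref<64xf32>) workgroup(%buf : memref<64xf32, #gpu.address_space<workgroup>>) kernel {
    %tid = gpu.thread_id x
    %c1 = arith.constant 1.0 : f32
    // Each work-item writes its slot of workgroup (shared) memory.
    memref.store %c1, %buf[%tid] : memref<64xf32, #gpu.address_space<workgroup>>
    // Thread barrier plus a fence on workgroup memory only; no guarantee about global memory.
    gpu.barrier memfence [#gpu.address_space<workgroup>]
    // Each work-item reads a slot written by another work-item in the same workgroup.
    %c63 = arith.constant 63 : index
    %other = arith.subi %c63, %tid : index
    %v = memref.load %buf[%other] : memref<64xf32, #gpu.address_space<workgroup>>
    memref.store %v, %out[%tid] : memref<64xf32>
    gpu.return
  }
}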

Lowering for backends

I would like guarantees about the observability of specified address spaces, but I’m not concerned about what happens in other address spaces, so the existing lowering of gpu.barrier will suffice in all cases. A different lowering, like amdgpu.lds_barrier, may be preferable when applicable. I was unable to find a CUDA operation equivalent to lds_barrier.

Thank you for providing more context!
Quick question: if your use case is SPIR-V specific, why not use spirv.ControlBarrier?

My main issue is that gpu.barrier is not a low-level thread-barrier-only instruction; it models a very specific parallel programming model concept, i.e. __syncthreads from both CUDA and HIP*. Therefore, its memfence semantics are very explicit and people expect those.

What about adding gpu.thread_barrier and gpu.memfence operations?

I’ll note that if you look at the lowering of gpu.barrier for AMD, it also only adds a workgroup-scope fence (though the reason we have amdgpu.lds_barrier is to work around issues with that lowering).

Now, as to your use case, I’d be open to having gpu.barrier memspace [...], with the caveat that the default isn’t “no fences”; it’s either “workgroup” or “platform defaults”.

That is, assuming that we have a gpu.memfence $scope [memories] and a gpu.thread_barrier (which only synchronizes control flow and makes no guarantees about memory), we’d have the lowerings

gpu.barrier memspace[M1, M2]

=>

gpu.memfence release [M1, M2]
gpu.thread_barrier
gpu.memfence acquire [M1, M2]

and

gpu.barrier memspace []

=>

gpu.thread_barrier

but

gpu.barrier

=>

// Whatever the platform implements as its default barrier fencing, which is typically
gpu.memfence release [workgroup]
gpu.thread_barrier
gpu.memfence acquire [workgroup]

Now, I do think that this is something you’d want in a single operation (possibly called something other than gpu.barrier if we want to avoid confusion), since some platforms might lower “barrier with fences” as a single construct, and you don’t necessarily want the fence and barrier being pulled away from each other by program transformations.

Do you know why only a workgroup fence is added here? The docs for gpu.barrier say

all memory accesses

Is there another construct you are using for a memfence on cross-workgroup memory or is that just not needed?

I don’t really like the idea of platform-defined barrier fencing, since it makes writing platform-agnostic code harder in my opinion. I would think that the default should be all address spaces.

I don’t think there is any reason this needs to be the case, but what I’m suggesting would still have that behaviour as the default.

That might actually fit our use case. I’ll give it a shot and let you know!

I assume when you say “workgroup fence” you mean a fence instruction with a workgroup syncscope.

This means that the memory semantics of the fence make memory available/visible to other threads in the same workgroup, but not necessarily to threads in other workgroups (or other dispatches). Which is what you want for __syncthreads.

The syncscope does not limit the set of memory that is affected. Limiting the set of affected memory can be done using the amdgpu-as memory model relaxation annotation (MMRA).

Sorry, no, I mean a memfence that only affects workgroup memory (Local Data Store in AMD terms, I think?).

I thought that @krzysz00 was saying that they would like the default lowering for gpu.barrier to not necessarily make changes to global memory observable? Could you confirm that?

Perhaps @krzysz00 could clarify what he meant when he wrote:

I’ll note that if you look at the lowering of gpu.barrier for AMD, it also only adds a workgroup-scope fence

Does “workgroup-scope fence” mean “LLVM IR fence instruction with workgroup syncscope”? Or does it mean something else?

Sorry about the delay. I did mean an LLVM fence with “workgroup” scope, as seen in the HIP sources (the flag will be true): clr/hipamd/include/hip/amd_detail/amd_device_functions.h at 939c7887793f8280a3196cebc81ba1d07743f068 · ROCm/clr · GitHub

This does mean that gpu.barrier (which is the MLIR equivalent to __syncthreads(), from where I’m standing, especially since, if you look at the ROCDL dialect to LLVM translation, it includes a literal LLVM fence) is a workgroup-scope memory fence and a synchronization point, but doesn’t make any guarantee about the state of global memory (which is something I actually rely on for performance).

I think we need to be really careful about terminology here.

In the terminology that I’m familiar with, the sequence:

fence release syncscope("workgroup")
call void @llvm.amdgcn.s.barrier()
fence acquire syncscope("workgroup")

does make guarantees about the state of “global” (addrspace(1)) memory.

The guarantee is that all memory accesses to global (addrspace(1)) memory in any of the threads of the workgroup before the barrier happen before all such memory accesses in any of the threads of the workgroup after the barrier.

Perhaps we could say that it’s a “local” guarantee about “global” memory, where the “local” refers to the execution scope (the set of affected threads) and the “global” refers to the kind of memory, if that makes sense.

(It does not make any guarantees about the state of VRAM, which is probably what you meant.)

I’ve been looking into that, and it seems to be working for me but requires this change [mlir][spirv] Add spirv-to-llvm conversion for OpControlBarrier by FMarno · Pull Request #111864 · llvm/llvm-project · GitHub

(having gotten back from vacation) Yep, I agree with you that that code does, in reality, make global memory accesses from before the barrier happen before the ones that come after, but doesn’t make any guarantees about VRAM state, which was what I was thinking of.

That being said, from the perspective of the abstract machine, I think you can take the following code

%v1 = load i32, ptr addrspace(1) %p
fence release syncscope("workgroup")
call void @llvm.amdgcn.s.barrier()
fence acquire syncscope("workgroup")
%v2 = load i32, ptr addrspace(1) %q

and execute it as if it were

%v1 = load i32, ptr addrspace(1) %p
%v2 = load i32, ptr addrspace(1) %q
fence release syncscope("workgroup")
call void @llvm.amdgcn.s.barrier()
fence acquire syncscope("workgroup")

or as if both global loads were after the barrier.

If it weren’t for the async loads/vmcount being what it is, you could even do

%v2 = load i32, ptr addrspace(1) %q
%v1 = load i32, ptr addrspace(1) %p
fence release syncscope("workgroup")
call void @llvm.amdgcn.s.barrier()
fence acquire syncscope("workgroup")

which might even be allowed with a sufficiently clever out of order execution widget.

That is, with my CPU-side “the processor can execute the code however it wants, so long as the programmer doesn’t notice” hat on, neither an s_barrier nor a workgroup-scope memory fence (in terms of our ISA, s_waitcnt lgkmcnt(0)) implies that you can’t reorder global memory accesses across the barrier.

If that’s all the code that’s being executed within this workgroup, then yes, the outcome is the same. However, the compiler would first have to prove that. As soon as there’s a global (or flat/generic) store in there, you can no longer reorder. For example, if the code sequence was:

%v1 = load i32, ptr addrspace(1) %p
store i32 %w, ptr addrspace(1) %r
fence release syncscope("workgroup")
call void @llvm.amdgcn.s.barrier()
fence acquire syncscope("workgroup")
%v2 = load i32, ptr addrspace(1) %q

… then you couldn’t move the loads in the way you indicated (unless perhaps you had some extremely strong cross-thread alias analysis, which we don’t have).

None of that is really specific to the scopes or address spaces in question, though. You could also move LDS/shared loads across a barrier if you could prove that there are no potentially conflicting stores in flight.

Having gone and re-read the manual… you’re right, store; barrier; load to the same address imposes a happens-before relationship between the store and the load from the perspective of the workgroup. Or, equivalently, after a barrier, all workitems will see any stores that took place before the barrier in that same workgroup (but other workgroups can be doing… who knows what, and getting in the way).