[RFC] MMRAs - Memory Model Relaxation Annotations

Hello everyone!

(MMRAs co-author: @ssahasra)

Problem

In order to implement some GPU-specific features in the AMDGPU backend, we need a way to optionally and safely relax the memory model. Relaxation needs to happen primarily at runtime, by modifying the semantics of synchronizing and memory operations to break happens-before, but it also needs to be statically represented in the IR. Otherwise, the information cannot get to the backend, and analysis passes can’t take advantage of the relaxed model.

Proposal

We’re proposing a system called “Memory Model Relaxation Annotations”, or MMRAs. The full specification and source code for the implementation is available here: [RFC] Memory Model Relaxation Annotations by Pierre-vh · Pull Request #78569 · llvm/llvm-project · GitHub

The overview below is very succinct and doesn’t cover all aspects of MMRAs - please read the full specification if you want to contribute to the discussion.

Quick overview

MMRAs are a series of tags attached to memory or synchronizing operations to change their semantics AND establish compatibility rules between them. These rules can eventually be used by optimizations to determine when reordering is safe.

Each instruction can have zero or more tags, currently represented using metadata, and each tag is divided into a prefix and a suffix. As metadata is optional by nature, the system has been designed so that it's always safe to drop all of the !mmra metadata on an instruction. Doing so can only affect performance, never correctness.

For instance, a load with the foo:bar tag would look like this:

%ld.atomic = load atomic i8, ptr %ptr acquire, align 4, !mmra !0

!0 = !{!"foo", !"bar"}

foo:bar is a fully opaque tag, and only the intended target can make sense of it.
For instance, the target can decide that foo:bar means “skip the cache if it’s a full moon” if it wants.

Optimizations - and target-independent IR consumers in general - don't need to know what foo:bar does; they only need to care about the compatibility rules between operations. (The full set of compatibility rules is available in the specification.)

Let’s take a simple example:

   A: store ptr addrspace(1) %ptr2                 # sync-as:1
   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2

A and B have incompatible tags: the prefixes match, but the suffixes don't.
This means that these two operations don't need to be ordered relative to each other (no happens-before), and we could reorder them freely.
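
Spelled out in full IR syntax, the same pair of operations could be written roughly as follows (a sketch; the concrete types, values, and operand names are illustrative):

   store i32 1, ptr addrspace(1) %ptr2, !mmra !1                          ; A
   store atomic i32 2, ptr addrspace(1) %ptr3 release, align 4, !mmra !2  ; B

   !1 = !{!"sync-as", !"1"}
   !2 = !{!"sync-as", !"2"}

Here each !mmra node carries a single tag as a (prefix, suffix) pair of strings.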

Example: Vulkan Memory Model

Implemented in [RFC][AMDGPU] Add vulkan:private/nonprivate MMRAs support by Pierre-vh · Pull Request #78573 · llvm/llvm-project · GitHub
We’ve also confirmed that this system works and passes the Vulkan conformance tests when LLPC emits the metadata (patch for that is not yet available).

The primary use case for this system is implementing the Vulkan memory model for our open-source driver stack. This allows us to generate much better code for both vulkan:private and vulkan:nonprivate operations. It also allows the IR to represent the difference between private and non-private operations so eventually optimizations can take advantage of that.

Example: OpenCL Address Space Fencing

Implemented in [RFC][AMDGPU] Add OpenCL-specific fence address space masks by Pierre-vh · Pull Request #78572 · llvm/llvm-project · GitHub
The added builtins have been tested through the OpenCL conformance tests and work.

MMRAs offer a way to add opaque annotations that carry over all the way to the MIR layer. We’re taking advantage of them in this case to pass opencl-fence-mem tags to the backend, which allows front-ends and libraries to emit more targeted fences that only affect the image, global or local address space (or a combination of those).

While this functionality could also be implemented using a series of intrinsics, MMRAs are a better fit because we can keep using FenceInst; we don't need to teach any passes or frontends about new magic fence intrinsics.
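
As a sketch of what such a targeted fence could look like in IR (the suffix name and syncscope here are illustrative, not necessarily the exact spelling used by the patch):

   ; A release fence that only needs to order the local address space.
   fence syncscope("workgroup") release, !mmra !0

   !0 = !{!"opencl-fence-mem", !"local"}

A backend that understands the tag can then emit a cheaper fence; a backend that doesn't can safely ignore it and emit a full fence.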

Open Issues

Optimizer Awareness

I’ve been busy with the AMDGPU-specific use cases and haven’t dedicated much time to making the optimizer aware of the compatibility rules so that more optimizations can occur. So far I’ve only ensured that the metadata is dropped as little as possible.

As I’m unfamiliar with the optimizer as a whole, I could use some help. What are some passes that could benefit from MMRA compatibility rules to more aggressively reorder instructions? Do such passes exist?

Metadata-based

While the system has been designed so that dropping the metadata is always safe, we can’t avoid performance issues if too much metadata is lost.

In the case of Vulkan, the cost can be high, as vulkan: annotations essentially control whether an operation is cached or not.

I would like to ask whether MMRAs would be better implemented through an instruction operand, the way syncscope is. This would make them impossible to drop. The obvious tradeoff is that this is a bigger, more intrusive IR change that not everyone may agree with.

This is why I started with metadata - it’s not ideal, but it’s less intrusive, so backends and passes that don’t care about MMRAs don’t need to be aware of them at all.

Next Steps

I’m starting this conversation upstream to gather more feedback on MMRAs. For instance, here are a few questions we have for other backends and optimization maintainers:

  • Can your target benefit from MMRAs somehow? I’m curious to learn about other potential use cases for them.
  • Do you know an optimization (theoretical or implemented) that’d benefit from MMRAs?
  • Would you prefer to see MMRAs implemented as metadata, or should they offer stronger guarantees by being more tightly integrated with LLVM IR?

The implementation itself is 95% complete and just needs some finishing touches and more testing coverage. I would avoid reviewing it fully until I update the diff to remove the [WIP] tag.


I haven’t fully digested this, and am not an expert on GPU memory models. But, at first glance, this new mechanism sounds like the same thing as syncscope.

The main difference I’ve understood so far is that you can apply these tags to non-atomic load/store instructions. Perhaps we could just permit syncscope there?

Neither your proposal nor the document in the commit discusses the differences between these mechanisms. I suspect I’m simply not understanding the purpose correctly, so I’d love to see a comparison between these mechanisms and an explanation of why both are required.


The difference is in the “Ordering” section of the document:

Ordering
When two instructions’ metadata are not compatible, any program order
between them is not in happens-before.

MMRAs are used to break happens-before edges in a much more general way than syncscopes. In some sense, syncscopes are only “horizontal”, because they work across threads:

If an atomic operation is marked syncscope("<target-scope>"), where <target-scope> is a target specific synchronization scope, then it is target dependent if it synchronizes with and participates in the seq_cst total orderings of other operations.

MMRAs go one step further and talk about happens-before instead. The really interesting part is that it brings in program order, and one could say MMRAs are “vertical”, acting within the thread. They allow optimizations by saying that two operations in the same thread need not be in happens-before. This is useful for IR optimizations as well as CodeGen.

One could easily implement syncscopes as MMRAs, and we had initially sketched that too, but it felt like we were reaching out farther than necessary for an initial spec.


To make what @ssahasra explained very explicit with an example, if you have a sequence as follows in LLVM’s memory model:

  store i32 999, ptr %data
  store atomic i32 1, ptr %signal release, align 4

… then another thread which does a load atomic acquire on %signal and sees the 1 in its past is guaranteed to also see the store of 999 to %data in its past.

One use case of this is that we have to support the Vulkan memory model, which allows marking the first store in a way such that this guarantee does not hold. Programmers using the Vulkan memory model expect to be able to rely on this for performance.
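
With MMRAs, that relaxation could be expressed roughly as follows (a sketch using the vulkan:private/nonprivate tagging scheme from the RFC; the exact placement is illustrative):

   store i32 999, ptr %data, !mmra !0
   store atomic i32 1, ptr %signal release, align 4, !mmra !1

   !0 = !{!"vulkan", !"private"}
   !1 = !{!"vulkan", !"nonprivate"}

Because the two tags are incompatible, an acquiring thread that observes the 1 is no longer guaranteed to also observe the store of 999.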


I’ve looked around a bit, and with a minor change to MemoryDependenceAnalysis, I can enable the following transformation, which is not possible without MMRAs.

define i32 @test_fenced(ptr %in, ptr %out) {
; CHECK-LABEL: define i32 @test_fenced(
; CHECK-SAME: ptr nocapture readonly [[IN:%.*]], ptr nocapture writeonly [[OUT:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
; CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[IN]], align 4, !mmra !0
; CHECK-NEXT:    store i32 [[TMP1]], ptr [[OUT]], align 4
; CHECK-NEXT:    fence acq_rel, !mmra !1
; CHECK-NEXT:    ret i32 [[TMP1]]
;
  %1 = load i32, ptr %in
  store i32 %1, ptr %out
  fence acq_rel, !mmra !{!"vulkan", !"nonprivate"}
  %y = load i32, ptr %in, !mmra !{!"vulkan", !"private"}
  ret i32 %y
}

The idea is that the fence isn’t considered a dependency for the load of %in, because %y and the fence have incompatible MMRAs, so we can safely load it earlier (and eventually merge it with %1). I think this is safe, but more experimentation is needed.

This also highlights another important point of MMRA’s design: dropping metadata cannot affect correctness, as an empty set of tags is always compatible with any set of tags.
Hence, if the MMRA metadata is dropped anywhere here, the worst that can happen is that compatibility is restored and the optimization is inhibited; the code will always be correct no matter what.

I’ve removed the “WIP” tag from the initial review. It still needs a bit of work on the testing side but I think it’s a good time to get some feedback, so I will also add more reviewers to have a look.

Pinging this as it’s been a while without reviewer activity.

@nikic - about metadata v. instruction operand, did my last comment on the review ([RFC] Memory Model Relaxation Annotations by Pierre-vh · Pull Request #78569 · llvm/llvm-project · GitHub) address your comments? Do you still think MMRAs need to be redesigned as an instruction operand?

I’d like to unblock the review as I’d like to see MMRAs land soon, so our Vulkan/OpenCL improvements can land as well.

The review has been approved. I will land this on Monday unless new issues are raised.

Hello @Pierre-vh , @ssahasra

I’ve been reviewing the MMRA documentation and have a few questions; some of these points might have been covered in previous comments (but I didn’t fully understand them), so I would appreciate some additional clarification.

  1. Regarding the use of syncscope in FenceInst: I understand from @ssahasra’s comments that syncscope operates across threads, essentially defining the scope of thread synchronization. However, given that syncscope is target-specific, is there a reason why we cannot leverage it here? For example, we could specify private/non-private tags as scopes for the Vulkan case, on both the fence and the memory load/store instructions. Although these aren’t technically scopes, if the code only prevents reordering within the same scope, shouldn’t this approach still work? What was the motivation for not utilizing syncscope in this context?
  2. I’ve been looking at the recent MMRA code changes and am trying to locate where the compatibility check (i.e., checking whether two tags are different so the operations can be reordered) is implemented. I haven’t seen it being used anywhere. Is this intended to be a follow-up change? (If not, can you point me to the code where it is implemented?) Also, how would this work in practice? For example, if a fence is marked as side-effecting but includes these tags, will the tags take precedence?
  3. Have we considered other approaches, such as adding an explicit operand on these instructions to indicate the ordering preference? I realize this might be a heavy hammer, but it could eliminate concerns about metadata tags being dropped.
  4. Do we expect the instruction selector or any backend pass to depend on this metadata? I believe not, but then how are the tags translated? For instance, if the private variants are converted to a target instruction “OP_PRIVATE_MEM” and the non-private variants to “OP_NONPRIVATE_MEM”, does this depend on an instruction modifier that also carries the private/non-private (or other) tags?

Thanks
Divya

Sure, it can “work”, but as you observed, they are not scopes. We don’t really want to pollute the set of known scopes with unrelated concepts. Also, these tags are orthogonal to scopes, so we will end up with combinations of actual scopes and tags, creating a really large tag space. Making sure that we can still express the right kind of incompatibility will be a serious headache. In principle, the opposite holds: it is entirely possible to eliminate the syncscope argument and express scopes as MMRA tags.

I don’t think the first PR includes any use of MMRA incompatibility. That will come with future work.

The usual and reasonable way to treat metadata is that it does take precedence when it is present, although it’s equally okay to not give it precedence. We expect the same treatment for MMRA, even if it eventually graduates to being an operand on the instruction.

Yeah MMRA can be an operand, but then it really is a heavy hammer. Right now, this is an experiment relevant only to AMDGPU and we want to keep the impact within bounds. Operands can be considered when more targets are interested.

Which target do these instructions refer to? The private/nonprivate MMRA is intended to be equivalent to NonPrivatePointer operand available on various memory ops. On AMDGPU, they simply get translated to the cache control bits like glc and slc. For fences on AMDGPU, the MMRA will affect the generation of waitcnt instructions including which counts to wait for, or even whether to wait at all.

Sameer.

Thanks @ssahasra

I don’t think the first PR includes any use of MMRA incompatibility. That will come with future work.

Okay, so currently if I just annotate the instructions with these tags, there is no relaxed ordering capability present, right?
Additionally, if we go ahead with this implementation, would the compatibility checks be integrated across various passes (like hoisting and others that may reorder code), or are we considering implementing this as part of a memory-dependence pass?
Will this affect the alias analysis passes in any way, given that the tags can also represent different address spaces?

Which target do these instructions refer to?

No, I was just wondering about the generic case. Essentially, my question is whether isel will rely on this metadata to determine the specific target instruction it should convert to. I suppose that will never be the case.

That’s right. Currently there is no pass in LLVM that actually uses the MMRAs to reorder instructions in the program. But as soon as the AMDGPU backend starts using MMRA to generate suitable ISA, we expect to observe reordering happening during execution on the target GPU.

MMRAs are strictly about specifying which program-order edges need not exist in the happens-before order. They are not intended to replace the address space operand on memory instructions, although such a use is conceivable. I am not sure memory dependencies are the right place to use MMRAs at all.

Target-specific instruction lowering will definitely use MMRA and “depend” on it for emitting high-performance instructions. If some of these uses eventually become general enough to work with target-independent instruction selection, that would be neat! But that’s not the immediate goal.

Sameer.

@ssahasra Can you clarify what you mean by “observe reordering happening during execution on the target GPU”. Will the generated code still be emitted in the original order, but dynamically certain instructions will be able to be reordered w.r.t. others? (Maybe this is an AMDGPU-specific detail, but it would be good to understand at a high level at least.)

Also, with MMRAs, would it be legal to reorder instructions at the LLVM IR level, either by changing passes to look at MMRA annotations, or by changing the relevant analysis passes to prune the dependencies that the MMRA annotations mark as not in happens-before?

I mean that in the memory model sense. The word “happens” carries a lot of semantics with it; the phrase “observed to happen” is more useful. To specifically address my earlier comment, yes it is likely that we will emit instructions in their original order, and it is also likely that they will be started in that same order. But they may appear to finish in a different order, maybe because they really got reordered in flight, or the operations hit different levels of cache, etc. Specifics are important only if you are trying to identify and avoid them. But if you use MMRA to relax happens-before, then expect to see operations happening in any order that is not explicitly prohibited.

Both approaches should be fine … that’s just an implementation detail inside the compiler, right? But I am still wary of that use of the word “dependencies”. MMRAs are strictly about the memory model. They allow certain edges to be removed from happens-before. Dependencies may have some overlap with happens-before, but they are clearly not the same thing.

Sameer.

Thanks Sameer for the clarification.

What are the semantics of this? I’ve looked at [RFC] Memory Model Relaxation Annotations by Pierre-vh · Pull Request #78569 · llvm/llvm-project · GitHub but can’t dig them out of the github thread mode successfully. Inventing concurrency primitives is prone to later turning out to be unsound so treading carefully would be prudent.

For the above store, atomic-store-release pair, I would guess that vulkan wants a way to group memory operations into sets such that a fence acts on a given set of operations and not on any others. That seems likely to be amenable to optimisations. The direct implementation would be a metadata tag on fence, load, store, rmw etc, where the set of operations subject to ordering are those with the same tag.

I can’t guess what the prefix/suffix distinction would be for. I’m not confident the above guess is consistent with this thread either.

Could you link a pdf version of the specification / intended behaviour? Or is this authoritative? I found it after posting this reply: llvm-project/llvm/docs/MemoryModelRelaxationAnnotations.rst at 6a982be73301041bef199ac7a9fc6f4f8a406432 · Pierre-vh/llvm-project · GitHub