Hello everyone!
(MMRAs co-author: @ssahasra)
Problem
In order to implement some GPU-specific features in the AMDGPU backend, we need a way to optionally and safely relax the memory model. Relaxation needs to happen primarily at runtime, by modifying the semantics of synchronizing and memory operations to break happens-before, but it also needs to be statically represented in the IR. Otherwise, the information cannot get to the backend, and analysis passes can’t take advantage of the relaxed model.
Proposal
We’re proposing a system called “Memory Model Relaxation Annotations”, or MMRAs. The full specification and source code for the implementation is available here: [RFC] Memory Model Relaxation Annotations by Pierre-vh · Pull Request #78569 · llvm/llvm-project · GitHub
The overview below is very succinct and doesn’t cover all aspects of MMRAs - please read the full specification if you want to contribute to the discussion.
Quick overview
MMRAs are a series of tags attached to memory or synchronizing operations to change their semantics AND establish compatibility rules between them. These rules can eventually be used by optimizations to determine when reordering is safe.
Each instruction can have zero or more tags (currently represented using metadata), and each tag is divided into a prefix and a suffix. Because metadata is optional by nature, the system has been designed so that it is always safe to drop all of the `!mmra` metadata on an instruction: doing so can only affect performance, never correctness.
For instance, a load with the `foo:bar` tag would look like this:

```llvm
%ld.atomic = load atomic i8, ptr %ptr acquire, align 4, !mmra !0

!0 = !{!"foo", !"bar"}
```
`foo:bar` is a fully opaque tag, and only the intended target can make sense of it. For instance, the target can decide that `foo:bar` means “skip the cache if it’s a full moon” if it wants.
Optimizations - and target-independent IR consumers in general - don’t need to know what `foo:bar` does; they should only care about the compatibility rules between operations. (The full set of compatibility rules is available in the specification.)
Let’s take a simple example:

```
A: store ptr addrspace(1) %ptr2                # sync-as:1
B: store atomic release ptr addrspace(1) %ptr3 # sync-as:2
```
A and B have incompatible tags. The prefixes match, but the suffixes don’t.
This means that those two operations don’t need to be ordered relative to each other (no happens-before) and we could reorder them freely.
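In actual IR, such tags would be carried as `!mmra` metadata. A sketch of what the two stores above could look like (the values and pointers are invented for illustration):

```llvm
; Two stores carrying sync-as tags: same prefix, different suffixes,
; so the tags are incompatible and the stores need not be ordered
; relative to each other.
store i32 0, ptr addrspace(1) %ptr2, !mmra !0                           ; A
store atomic i32 0, ptr addrspace(1) %ptr3 release, align 4, !mmra !1   ; B

!0 = !{!"sync-as", !"1"}
!1 = !{!"sync-as", !"2"}
```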
Example: Vulkan Memory Model
Implemented in [RFC][AMDGPU] Add vulkan:private/nonprivate MMRAs support by Pierre-vh · Pull Request #78573 · llvm/llvm-project · GitHub
We’ve also confirmed that this system works and passes the Vulkan conformance tests when LLPC emits the metadata (patch for that is not yet available).
The primary use case for this system is implementing the Vulkan memory model for our open-source driver stack. It allows us to generate much better code for both `vulkan:private` and `vulkan:nonprivate` operations. It also lets the IR represent the difference between private and non-private operations, so optimizations can eventually take advantage of that.
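As an illustration, a non-private load could be annotated as below. This is a sketch: the `vulkan:nonprivate` tag comes from the linked patch, but the surrounding IR is invented here.

```llvm
; A "nonprivate" access participates in the Vulkan memory model and must
; stay visible to other agents; a "private" access can use faster paths.
%v = load atomic i32, ptr addrspace(1) %p acquire, align 4, !mmra !0

!0 = !{!"vulkan", !"nonprivate"}
```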
Example: OpenCL Address Space Fencing
Implemented in [RFC][AMDGPU] Add OpenCL-specific fence address space masks by Pierre-vh · Pull Request #78572 · llvm/llvm-project · GitHub
The added builtins have been tested through the OpenCL conformance tests and work.
MMRAs offer a way to add opaque annotations that carry all the way to the MIR layer. We take advantage of that here to pass `opencl-fence-mem` tags to the backend, which allows front-ends and libraries to emit more targeted fences that only affect the `image`, `global`, or `local` address space (or a combination of those).
While this functionality could also be implemented using a series of intrinsics, MMRAs are a better fit because we can keep using `FenceInst`; we don’t need to teach any passes or front-ends about new magic fence intrinsics.
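For example, a fence restricted to the local address space might look like the following. This is a sketch: the `opencl-fence-mem` prefix is the one named above, but the exact suffix spelling is defined by the linked patch.

```llvm
; A release fence that the backend may lower to an operation affecting
; only local (LDS) memory, rather than fencing all address spaces.
fence release, !mmra !0

!0 = !{!"opencl-fence-mem", !"local"}
```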
Open Issues
Optimizer Awareness
I’ve been busy with the AMDGPU-specific use cases and haven’t dedicated much time to making the optimizer aware of the compatibility rules so that more optimizations can occur. So far I’ve only ensured that the metadata is dropped as rarely as possible.
As I’m unfamiliar with the optimizer as a whole, I could use some help. What are some passes that could benefit from MMRA compatibility rules to more aggressively reorder instructions? Do such passes exist?
Metadata-based
While the system has been designed so that dropping the metadata is always safe, we can’t avoid performance issues if too much metadata is lost.
In the case of Vulkan, the cost can be high, as `vulkan:` annotations essentially control whether an operation is cached or not.
I would like to ask whether MMRAs would be better implemented as an instruction operand, the way `syncscope` is. That would make them impossible to drop. The obvious tradeoff is that this is a bigger, more intrusive IR change that not everyone may agree with.
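To make the tradeoff concrete, here is the current metadata form next to a purely hypothetical operand-based syntax (the `mmra(...)` spelling below is invented for illustration, mirroring how `syncscope` appears today; it is not part of any proposal):

```llvm
; Today: tags as droppable metadata.
%a = load atomic i32, ptr %p syncscope("agent") acquire, align 4, !mmra !0
!0 = !{!"vulkan", !"nonprivate"}

; Hypothetical: tags as a first-class operand that passes could not drop.
; %b = load atomic i32, ptr %p syncscope("agent") mmra("vulkan:nonprivate") acquire, align 4
```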
This is why I started with metadata - it’s not ideal, but it’s less intrusive, so backends and passes that don’t care about MMRAs don’t need to be aware of them at all.
Next Steps
I’m starting this conversation upstream to gather more feedback on MMRAs. For instance, here are a few questions we have for other backend and optimization maintainers:
- Can your target benefit from MMRAs somehow? I’m curious to learn about other potential use cases for them.
- Do you know an optimization (theoretical or implemented) that’d benefit from MMRAs?
- Would you prefer to see MMRAs implemented as metadata, or should they offer stronger guarantees by being more tightly integrated with LLVM IR?
The implementation itself is 95% complete and just needs some finishing touches and more test coverage. I would avoid reviewing it fully until I update the diff to remove the [WIP] tag.