Representing buffer descriptors in the AMDGPU target - call for suggestions

Yes, that’s what I was thinking of. Though now that you mention it… it’s probably possible in SPIR-V to have a local array of pointers with dynamic indexing that SROA then doesn’t optimize away. You can’t write such code in GLSL or HLSL which would explain why we haven’t seen it, but yeah, it sounds like to fully support SPIR-V we have to support loads and stores of addrspace(7) pointers as values after all…

I’ve started working on one of the two components of this change (migrating buffer intrinsics from <4 x i32> to ptr addrspace(8)). I still haven’t done the AutoUpgrade implementation, but I’ve gotten all the tests updated.

The branch is at Comparing llvm:main...krzysz00:buffer-improvements · llvm/llvm-project · GitHub .

One change (a final test failure, which I haven’t committed) that I’m very confused by is the following required test update

diff --git a/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll b/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
index b2ca2c0a273f..f558f467434c 100644
--- a/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
+++ b/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
@@ -38,7 +38,7 @@ define amdgpu_ps float @_amdgpu_ps_main() #0 {
 ; GCN-NEXT:    v_sub_f32_e32 v8, s0, v1
 ; GCN-NEXT:    v_fma_f32 v7, -s2, v6, s6
 ; GCN-NEXT:    v_fma_f32 v5, v6, v5, 1.0
-; GCN-NEXT:    v_mad_f32 v10, s2, v6, v2
+; GCN-NEXT:    v_fma_f32 v10, s2, v6, v2
 ; GCN-NEXT:    s_mov_b32 s0, 0x3c23d70a
 ; GCN-NEXT:    v_fmac_f32_e32 v1, v6, v8
 ; GCN-NEXT:    v_fmac_f32_e32 v10, v7, v6

Do any folks here have some idea of what could be going wrong here? Or how I could debug this further?

I also haven’t committed to whether addrspace(8) pointers are integral or not - the branch currently says yes, but that’s mostly because we’ve declared that you can’t gep/load/store/… them. (I’m wondering if there’s a way to make that a hard error when the target triple is known…), and, without those operations, the only integer-related optimizations that might come up are stuff like folding inttoptr/ptrtoint pairs, which are correct for buffer resources.

How do you implement this auto-upgrade? Do you insert something that trivially converts from <4 x i32> to ptr addrspace(8)?

I’m concerned that changing the existing intrinsics like this will make life painful for LLPC and perhaps other clients like Mesa, unless there is a really simple drop-in replacement that preserves the current behaviour.

The auto-upgrade is

call @llvm.amdgcn.*buffer*(..., <4 x i32> %rsrc, ...)

goes to

%rsrc.int = bitcast <4 x i32> %rsrc to i128
%rsrc.ptr = inttoptr i128 %rsrc.int to ptr addrspace(8)
call @llvm.amdgcn.*buffer*(..., ptr addrspace(8) %rsrc.ptr, ...)

This feels like a reasonable auto-upgrade.

It’s not really related to your patch. It’s a side effect of running DAG combines again on a DAG which has not changed since the last time – i.e. the DAG combine phase is not idempotent. I don’t know if that counts as a bug or not.

You can see exactly the same effect if you just make this change on top of main:

diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
index 8d1cc04607bf..df7dd758d041 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
@@ -782,6 +782,7 @@ void SelectionDAGISel::CodeGenAndEmitDAG() {
     NamedRegionTimer T("combine1", "DAG Combining 1", GroupName,
                        GroupDescription, TimePassesIsEnabled);
     CurDAG->Combine(BeforeLegalizeTypes, AA, OptLevel);
+    CurDAG->Combine(BeforeLegalizeTypes, AA, OptLevel);
   }
 
   LLVM_DEBUG(dbgs() << "Optimized lowered selection DAG: "

@jayfoad Thanks for the info on what’s going on with that one test.

I’ve added the auto-upgrade code as planned and pushed it up to the branch.

As far as I’m concerned, this is most of the codegen-side phase of the project, and the next steps would be implementing address space 7 in terms of these new address space pointers.

The main things to do further down the stack would probably be to create a PseudoSourceValue that holds %rsrc, %index, %offset, ... bundles for machine-level alias analysis purposes and then to actually do more complete alias analysis on buffer operations.

I’m also (having also seen @nhaehnle’s comments over on [RFC] Replacing getelementptr with ptradd - #15 by jrtc27 as well) thinking that what may be useful - instead of marking these addrspace(8) pointers non-integral, we could mark them … “non-linear” or “non-flat”, but that’s a separate discussion.

Also, since I’m here, why was s.buffer.load a ReadNone intrinsic before?

Making assumptions about graphics workloads and fear of losing optimizations. Could always have done callsite attributes instead

callsite attributes?

(In this branch, I make s_buffer_load use a memory read)

If you wanted to hack and make it always readnone, you can apply it to the specific instance’s callsite without applying it to the definition / declaration

Makes sense, I just don’t know who’s relying on that behavior - doesn’t seem to be any of our tests, for one thing

All of the old weird things may be relied on by Mesa

Would you know how to test that?

Also, should I go ahead and reorganize the branch into a few commits and put them up on Phabricator?

Vaguely, run piglit? Don’t know if that actually covers performance

Yes

The revisions are

⚙ D145441 [AMDGPU] Define data layout entries for buffers - updating the data layout string
⚙ D146761 [AMDGPU] Make buffer intrinsics pointer-valued - making intrinsics pointer-valued
⚙ D146762 [Verifier] Ban GEP, load, store of addrspace(8) on AMDGPU - updating the verifier

I’ve got a vague recollection that it is tagged as readnone because alias analysis won’t work for buffer operations, and s_buffer_load is only really used with const buffers.

I agree it’s not great, and also that we should move to callsite attributes if that’s still required.

Presumably the work to properly represent buffer descriptors will mean that alias analysis will work correctly anyway?

We could run some performance analysis with Vulkan to see if there’s a regression as a result of not marking it readnone.

One unfortunate thing I found from CI runs on those revisions is that - at the very least - MLIR won’t send intrinsic calls it’s creating through auto-upgrade, since it just pulls the new function definition. So not breaking clients will be tricky and might involve use of defm to create <4 x i32> and ptr addrspace(8) versions of the intrinsics or something … not sure exactly what we want to do here.

Users have to be updated in some way or another. I thought we already had type mangling for the resource descriptor for <8 x i32> vs <4 x i32>, but it looks like the new intrinsics hardcode to 4x. If the resource descriptor were type mangled, you wouldn’t need to have a separately named intrinsic. If you want to do it in 2 steps it might help to introduce type mangling for the descriptor

I don’t think any of the buffer intrinsics have type mangling? My understanding is that the “might be 8xi32, might be 4xi32” thing is only for texture descriptors on image intrinsics, which I’m not touching (though I’d be wiling to argue for ptr addrspace(9) for images, especially since they’ve got fun alignment properties).

My theory of updating users is that they could be transparently updated, since, as far as the underlying value is concerned, <4 x i32> -> i128 -> ptr addrspace(8) is a no-op and just some type casts that the compiler could do … and it looks like the compiler can automatically insert those when presented with IR using the old forms.

… and, having checked, the image intrinsics are hardcoded to <8 x i32> as well, so any type mangling is gone now? (which leaves me low on examples on how to make this change backwards-compatibly)

In positive news, my branch makes alias analysis work on buffer resource functions, at least in a basic capacity llvm-project/buffer-rsrc-ptrs.ll at buffer-improvements · krzysz00/llvm-project · GitHub

We might end up wanting a ptr addrspace(7/8) @llvm.amdgcn.pointer.to.resource(ptr addrspace(*) base, iN metadata) to help the analysis along, but that’s future us, I think