Representing buffer descriptors in the AMDGPU target - call for suggestions

My understanding is that it doesn’t. It simply remembers the original LLVM IR pointer via the MachineMemOperand. That’s what the (dereferenceable load (s32) from ...) bit is about in the MIR dump I quoted.

The only real extension on top of that is a constant offset, so that the backend can track cases where a big load/store in LLVM IR is split up into multiple machine instructions.

What I’m suggesting is that we just don’t do that. Instead, we make sure that all the p7 virtual registers get legalized into separate virtual registers, one that is <4 x s32> and one that is s32, before anybody ever asks questions like “how many registers do you need for this”.

Right, and it’s a pretty important improvement. That’s why I think addrspace(8) is a pretty good compromise while we still need to deal with SelectionDAG.

The immediate problem is that the calls in question are happening in LLVM IR → G_MIR translation, so we’d need to intercept up there.

I also, if I’ve understood your suggestion correctly, think this is equivalent to legalizing p7 into struct {p8, i32} and then doing my p7 → p8 rewrite rather early in G_MIR.

I’m slightly unsure what that’ll give us, but I can see what roadblocks I’d run into with that approach.

(and I don’t think we can make that decomposition work with SelectionDAG at all)

I was actually testing this just now, and I think it’s really not bad at all. A simple example like the following will never call the method:

@lds = external addrspace(3) global <2 x float>

define amdgpu_cs <2 x float> @test() {
entry:
  %load = load <2 x float>, ptr addrspace(3) @lds
  ret <2 x float> %load
}

And it looks like the only place where it may get called is during AMDGPUCallLowering::lowerFormalArguments, which we control.

Yes, that’s close to what I’m proposing, except that if we were able to do this, I don’t think there’d be enough real value in having a p8. You could just legalize to {<4 x i32>, i32} and get the same codegen result except with less conceptual complexity to how it all works.
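
To make that concrete (a hand-written sketch, not the output of any existing pass, and reusing the current <4 x i32> intrinsic signature): a load through a fat pointer carried as the pair (%rsrc, %off) would come out as something like

%v = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %rsrc, i32 %off, i32 0, i32 0)

where the trailing i32s are the soffset and cachepolicy operands.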

What it gives us is that all the relevant memory instructions will still have the reference to the LLVM IR addrspace(7) value in their MachineMemOperand, because that gets set up by the IR translator and wouldn’t be changed by this legalization (or, to be precise, when legalizing a G_LOAD into some G_AMDGPU_RAW_BUFFER_LOAD or whatever, preserving the MMO is trivial). So they retain the full fidelity of the alias analysis that was available to LLVM IR.

But yeah, I don’t think it’s feasible to make that work for SelectionDAG, and if we want to be able to fall back from GlobalISel to SelectionDAG, that probably means we can’t use this approach for GlobalISel, either.

If you apply the following patch (pulled off a WIP branch, hence the p8 in there too)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 64dc8604e76a..c80fc7a91e2c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -515,9 +515,9 @@ static StringRef computeDataLayout(const Triple &TT) {
   // 32-bit private, local, and region pointers. 64-bit global, constant and
   // flat, non-integral buffer fat pointers.
   return "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32"
-         "-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128"
-         "-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1"
-         "-ni:7";
+         "-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:"
+         "128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-"
+         "G1-ni:7";
 }

 LLVM_READNONE

(or, more simply, add a p7:160:256:256:32) and then feed the following code

target triple = "amdgcn-amd-amdhsa"

define ptr addrspace(7) @trivial(ptr addrspace(7) %x) {
    ret ptr addrspace(7) %x
}

through ./bin/llc -global-isel, you’ll get a segfault.

That segfault is traceable to CallLowering::getReturnInfo over in lib/CodeGen/GlobalISel/CallLowering.cpp calling ComputeValueVTs, which eventually calls getPointerType(), which returns a null pointer because there’s no MVT::i160. And from what I can tell, there are no interceptable hooks in that chain at the moment.

Hmm, you’re right. I wonder what it would take to change getPointerTy() from an MVT to an EVT? Most uses seem to be via getValueType(), which already returns an EVT.

My look at getPointerTy() shows a lot of uses in files like DAGToDAG, which, unless there’s a different getPointerTy(), means we can’t really change it to an EVT.

@nhaehnle How does the existing LLPC address space 7 implementation handle the alignment of these 160-bit descriptors in memory? Are they being treated as packed structs, or do they get 256-bit alignment?

I think it would be less effort to just add this to MVT.

But then we’re trying to do the address space 7 → address space 8 + offset rewrites in instruction selection (which is probably not that bad in GlobalISel; in SelectionDAG, I have no idea how we’d manage it). Probably viable, though doing things like “now that we’ve split, we have a loop-uniform part we can pull out” might be hard in MIR (I don’t know if anyone’s written those sorts of passes for the various machine IRs yet).

And I’m going to push back some on the proposal to have 128-bit descriptors that work with load and store but not getelementptr or the general “you can’t GEP non-integral things” logic, since that gets needlessly wonky to deal with when, for example, I’m generating LLVM IR from MLIR.

We could, if we wanted, define a family of address spaces, some of which imply properties of the descriptor (e.g. addrspace(10) implies a stride of 4). We could then have a catch-all address space which may alias with buffers of any stride.
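
(As a purely hypothetical sketch, with made-up numbers - none of these address spaces exist - that would look like

%v0 = load float, ptr addrspace(10) %p ; addrspace(10): descriptor implies stride 4
%v1 = load float, ptr addrspace(9) %q ; addrspace(9): catch-all, may alias buffers of any stride

with the catch-all space conservatively assumed to alias buffers in any of the stride-specific spaces.)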

It is unsupported. It’s an error to try to load or store address space 7 pointers. SPIR-V forbids the equivalent action (it also forbids things like storing those pointers into arrays), so that’s okay for our use case.

There is a MachineLICM pass. Though in practice, I think we don’t need LICM as much as just phi cleanups: there may be a phi of an addrspace(7) pointer in a loop, which the lowering turns into two phis: an offset phi which must remain, and a descriptor phi which becomes trivial and could be removed. The load of the descriptor will typically already be outside of the loop.
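
Schematically (made-up value names), the lowering takes a loop-header phi like

%p = phi ptr addrspace(7) [ %p0, %entry ], [ %p.next, %loop ]

to

%rsrc = phi ptr addrspace(8) [ %rsrc0, %entry ], [ %rsrc0, %loop ]
%off = phi i32 [ %off0, %entry ], [ %off.next, %loop ]

and since the %rsrc phi’s incoming values are identical, it can simply be replaced by %rsrc0, whose defining load is already outside the loop.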

The bigger problem on which I agree with you is that doing the lowering in SelectionDAG isn’t feasible. Mostly because of the divergence analysis problem.

I’m not sure 2^14-ish address spaces is a good idea…

Then again, when we’re doing alias analysis, I think we have access to the entire instruction, so we can try to be clever (e.g. checking whether the stride component is a constant).

Ok, so doing the thing I think LLVM requires me to do - aligning these 160-bit values to 256 bits, and preserving that alignment when I lower the loads and stores - won’t break y’all, because it would just be defining behavior that is currently undefined.
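
(Sketching what that looks like in IR, any in-memory slot for one of these would be

%slot = alloca ptr addrspace(7), align 32, addrspace(5)
store ptr addrspace(7) %p, ptr addrspace(5) %slot, align 32

i.e. 32-byte alignment, matching the p7:160:256:256:32 data layout entry above.)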


I’ve put up a revision that kicks off the “address space 7 is 160 bits, address space 8 is 128 bits” approach at D145441 ([AMDGPU] Define data layout entries for buffers).

It is unsupported. It’s an error to try to load or store address space 7 pointers. SPIR-V forbids the equivalent action (it also forbids things like storing those pointers into arrays), so that’s okay for our use case.

SPIR-V allows a “variable pointer” to be stored in a Local/Private variable. So you could have:

%local = alloca ptr addrspace(7), align 8, addrspace(5)
store ptr addrspace(7) %var_ptr, ptr addrspace(5) %local, align 8

I think LLPC gets away with not handling it in any special way because those loads/stores are typically optimized out.

Yes, that’s what I was thinking of. Though now that you mention it… it’s probably possible in SPIR-V to have a local array of pointers with dynamic indexing that SROA then doesn’t optimize away. You can’t write such code in GLSL or HLSL which would explain why we haven’t seen it, but yeah, it sounds like to fully support SPIR-V we have to support loads and stores of addrspace(7) pointers as values after all…
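
Something like this hand-written example, where the dynamic index keeps SROA from removing the alloca:

define ptr addrspace(7) @select_ptr(ptr addrspace(7) %a, ptr addrspace(7) %b, i32 %idx) {
  %arr = alloca [2 x ptr addrspace(7)], align 32, addrspace(5)
  store ptr addrspace(7) %a, ptr addrspace(5) %arr, align 32
  %slot1 = getelementptr [2 x ptr addrspace(7)], ptr addrspace(5) %arr, i32 0, i32 1
  store ptr addrspace(7) %b, ptr addrspace(5) %slot1, align 32
  %slot = getelementptr [2 x ptr addrspace(7)], ptr addrspace(5) %arr, i32 0, i32 %idx
  %p = load ptr addrspace(7), ptr addrspace(5) %slot, align 32
  ret ptr addrspace(7) %p
}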

I’ve started working on one of the two components of this change (migrating buffer intrinsics from <4 x i32> to ptr addrspace(8)). I still haven’t done the AutoUpgrade implementation, but I’ve gotten all the tests updated.

The branch is at llvm:main...krzysz00:buffer-improvements on llvm/llvm-project (GitHub).

One change that I’m very confused by (a final test failure, which I haven’t committed) is the following required test update:

diff --git a/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll b/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
index b2ca2c0a273f..f558f467434c 100644
--- a/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
+++ b/llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll
@@ -38,7 +38,7 @@ define amdgpu_ps float @_amdgpu_ps_main() #0 {
 ; GCN-NEXT:    v_sub_f32_e32 v8, s0, v1
 ; GCN-NEXT:    v_fma_f32 v7, -s2, v6, s6
 ; GCN-NEXT:    v_fma_f32 v5, v6, v5, 1.0
-; GCN-NEXT:    v_mad_f32 v10, s2, v6, v2
+; GCN-NEXT:    v_fma_f32 v10, s2, v6, v2
 ; GCN-NEXT:    s_mov_b32 s0, 0x3c23d70a
 ; GCN-NEXT:    v_fmac_f32_e32 v1, v6, v8
 ; GCN-NEXT:    v_fmac_f32_e32 v10, v7, v6

Do any folks here have some idea of what could be going wrong here? Or how I could debug this further?

I also haven’t committed to whether addrspace(8) pointers are integral or not - the branch currently says yes, but that’s mostly because we’ve declared that you can’t gep/load/store/… them. (I’m wondering if there’s a way to make that a hard error when the target triple is known…) Without those operations, the only integer-related optimizations that might come up are things like folding inttoptr/ptrtoint pairs, which are correct for buffer resources.
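
For example, the round trip

%int = ptrtoint ptr addrspace(8) %rsrc to i128
%cast = inttoptr i128 %int to ptr addrspace(8)

can be folded so %cast is replaced by %rsrc, which is sound here because a buffer resource is exactly its 128 descriptor bits.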

How do you implement this auto-upgrade? Do you insert something that trivially converts from <4 x i32> to ptr addrspace(8)?

I’m concerned that changing the existing intrinsics like this will make life painful for LLPC and perhaps other clients like Mesa, unless there is a really simple drop-in replacement that preserves the current behaviour.

The auto-upgrade is

call @llvm.amdgcn.*buffer*(..., <4 x i32> %rsrc, ...)

goes to

%rsrc.int = bitcast <4 x i32> %rsrc to i128
%rsrc.ptr = inttoptr i128 %rsrc.int to ptr addrspace(8)
call @llvm.amdgcn.*buffer*(..., ptr addrspace(8) %rsrc.ptr, ...)
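
To pick a concrete instance (using the raw load; the same pattern applies across the whole family, with each intrinsic keeping its name but taking the pointer form of the resource argument):

%v = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %rsrc, i32 %voff, i32 %soff, i32 0)

becomes

%rsrc.int = bitcast <4 x i32> %rsrc to i128
%rsrc.ptr = inttoptr i128 %rsrc.int to ptr addrspace(8)
%v = call float @llvm.amdgcn.raw.buffer.load.f32(ptr addrspace(8) %rsrc.ptr, i32 %voff, i32 %soff, i32 0)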

This feels like a reasonable auto-upgrade.

It’s not really related to your patch. It’s a side effect of running DAG combines again on a DAG which has not changed since the last time – i.e. the DAG combine phase is not idempotent. I don’t know if that counts as a bug or not.

You can see exactly the same effect if you just make this change on top of main:

diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
index 8d1cc04607bf..df7dd758d041 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
@@ -782,6 +782,7 @@ void SelectionDAGISel::CodeGenAndEmitDAG() {
     NamedRegionTimer T("combine1", "DAG Combining 1", GroupName,
                        GroupDescription, TimePassesIsEnabled);
     CurDAG->Combine(BeforeLegalizeTypes, AA, OptLevel);
+    CurDAG->Combine(BeforeLegalizeTypes, AA, OptLevel);
   }
 
   LLVM_DEBUG(dbgs() << "Optimized lowered selection DAG: "