Representing buffer descriptors in the AMDGPU target - call for suggestions

tl;dr: I’m trying to design a better pointer representation for AMDGPU’s buffer descriptors, my design seems like it might be too complicated, and I’m looking for feedback.

Context

On AMD GPUs (GCN onward, at the very least), there is a data structure known as a buffer descriptor or buffer resource (or V#) that has special hardware support. A buffer descriptor is a fat pointer that has, approximately, the following form

struct Buffer {
   address : 48,
   stride : 14,
   swizzleFlags : 2,
   numRecords : 32,
   flags : 32
}
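For concreteness, here is a sketch (in Python, since this is just bit packing) of how those fields would pack into the four dwords that today’s `<4 x i32>` intrinsic argument carries. The bit positions follow the struct above; the actual flag encodings vary between GPU generations, so treat this as illustrative only.

```python
# Sketch: pack the buffer-descriptor fields above into four 32-bit dwords,
# matching the <4 x i32> the current intrinsics take. Bit positions follow
# the struct sketch; real flag encodings differ between GPU generations.
def pack_buffer_descriptor(address, stride, swizzle, num_records, flags):
    assert address < (1 << 48) and stride < (1 << 14) and swizzle < (1 << 2)
    word0 = address & 0xFFFFFFFF              # low 32 bits of the base address
    word1 = ((address >> 32) & 0xFFFF) \
            | ((stride & 0x3FFF) << 16) \
            | ((swizzle & 0x3) << 30)         # address high bits / stride / swizzle
    word2 = num_records & 0xFFFFFFFF          # byte length, for raw buffers
    word3 = flags & 0xFFFFFFFF                # format and out-of-bounds-mode flags
    return [word0, word1, word2, word3]
```

For a raw buffer, stride and swizzle are zero, so e.g. `pack_buffer_descriptor(0x1000, 0, 0, 256, 0)` gives `[0x1000, 0, 256, 0]`.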

These descriptors can be passed to instructions such as BUFFER_LOAD_[size], BUFFER_STORE_[size], and various atomic operations. These instructions take a buffer resource (which must be located in scalar/lane-uniform registers) and one or two offset arguments, which may vary between threads.

When working with buffer descriptors, it is useful to distinguish between raw and structured buffers, as the current AMDGPU buffer intrinsics do. A raw buffer is one where the stride field is 0 (as are the swizzle flags), and numRecords is interpreted as a byte length encoding the extent of the buffer. A structured buffer, on the other hand, supports a complex 2D indexing scheme where memory operations take both an index (which moves in units of stride) and an offset within the record selected by the index. (In addition, the buffer descriptor can specify advanced swizzling.)
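Ignoring swizzling, the effective byte address of a structured access can be sketched as follows (my own illustration of the indexing scheme, not compiler code):

```python
# Sketch of structured-buffer addressing (swizzling ignored): a memory
# operation supplies an index (in units of stride) and a byte offset
# within the record selected by that index.
def structured_address(base, stride, index, offset):
    return base + index * stride + offset
```

For example, with a stride of 16, index 2 and offset 4 land 36 bytes past the base — an addressing mode that GEP, which works in units of a single element type, has no way to express.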

Structured buffers can be used to compactly represent data access patterns often seen in graphics workloads, but have an indexing system that is not compatible with LLVM’s GEP system.

On the other hand, a raw buffer is a pointer combined with metadata. The most significant use of this metadata is the availability of implicit bounds checking. Loading from a buffer whose numRecords is N with an offset of k >= N bytes returns 0 instead of a page fault, and many applications rely on this behavior. (Similarly, out-of-bounds writes or atomics are silently dropped by the hardware).
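This bounds-checking behavior can be modeled directly (a byte-granularity simulation of the semantics described above, with memory as a plain dict; not any particular hardware encoding):

```python
# Model of raw-buffer bounds checking: loads at offset >= num_records
# return 0, and stores at offset >= num_records are silently dropped.
def raw_buffer_load(memory, base, num_records, offset):
    if offset >= num_records:
        return 0                      # OOB load reads as zero, no fault
    return memory[base + offset]

def raw_buffer_store(memory, base, num_records, offset, value):
    if offset >= num_records:
        return                        # OOB store is silently dropped
    memory[base + offset] = value

mem = {0: 7}                          # one mapped byte at address 0
assert raw_buffer_load(mem, 0, 1, 0) == 7
assert raw_buffer_load(mem, 0, 1, 5) == 0    # out of bounds: reads 0
raw_buffer_store(mem, 0, 1, 5, 9)            # out of bounds: dropped
assert 5 not in mem
```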

The state of the compiler

The only form in which these instructions are exposed to programmers is through the AMDGPU buffer intrinsics. These intrinsics come in struct and raw variants, with the struct forms containing an additional argument to enable specifying both the index and the offset arguments.

The main problem with these intrinsics is that they take the buffer resource argument as a <4 x i32>, which, not being a pointer, means these operations cannot be analyzed as the memory operations they are. Useful analyses such as alias analysis and dead-store elimination cannot operate on buffer operations.

I am attempting to develop a more principled solution to the representation of buffer descriptors in the backend.

Existing work: LLPC

The LLPC compiler is, among other things, responsible for translating Vulkan code into LLVM IR so it can be compiled for the AMDGPU target and executed. In this compiler frontend, buffer descriptors are represented using address space 7, a non-integral address space that has 160-bit pointers: 128 bits of buffer descriptor and 32 bits of offset.

Vector <4 x i32> values can be wrapped into a ptr addrspace(7) by an intrinsic that gives them an initial offset of 0. Then, LLVM operations like getelementptr modify the offset while leaving the buffer descriptor itself alone. Finally, late in LLPC’s pipeline, the address space 7 values are split into the buffer resource part (a <4 x i32>) and a 32-bit offset part (which starts as the null pointer of a 32-bit address space), and LLVM operations like load and store are rewritten into raw.buffer.load and raw.buffer.store, with the buffer descriptor and the offset passed in as separate arguments.

This system is correct, but, because it doesn’t reach into the main compiler, other consumers of the AMDGPU backend can’t use it. In addition, it limits opportunities for optimization later in the compilation process.

Proposal

Address space 8

I propose defining a new address space 8, with 128-bit pointers. Address space 8 can represent an arbitrary buffer resource, whether raw or structured, and, as such, would not support address computation with getelementptr and similar operations. Values in this address space could be created (via inttoptr and the like) and passed around (e.g., stored into memory), but couldn’t be the arguments to LLVM’s memory operations.

Instead, they’d only be acceptable arguments to the buffer intrinsics, which would be auto-upgraded from taking <4 x i32> arguments to ptr addrspace(8) ones.

I’m not 100% sure that “a pointer that can’t be non-trivially GEP’d” is something permitted by LLVM, but, if it is, it solves the problem that the buffer intrinsics should take some sort of pointer as an argument even though those pointers have strange indexing semantics.

One other advantage to introducing a pointer type for buffer descriptors is that the rewrite for fat buffer descriptors proposed below won’t destroy alias information.

Address space 7

We’d also have address space 7, mirroring LLPC’s solution, which would be a 160-bit pointer with 32-bit indexing.

This pointer would semantically be

packed struct FatBuffer {
  ptr addrspace(8) buffer,
  i32 offset
}

where buffer needs to be a raw buffer resource on pain of undefined behavior.

(These values would need a 256-bit alignment, which hopefully won’t cause issues.)

Unlike address space 8, pointers in address space 7 can and should be used with standard LLVM operations such as GEP, load, and so on. They are, however, non-integral, so that people don’t try to do integer arithmetic on the complex data structure packed into an i160.

Lowering address space 7

Late in the compilation pipeline, before machine IR creation, we’d lower address space 7 operations to operations on address space 8.

Loads, stores, etc. would be replaced by the corresponding raw buffer intrinsic, with the buffer part and the offset part passed in as separate arguments. GEPs would be replaced by computations on the offset part, and so on.

I can’t tell whether it would make sense to replace each ptr addrspace(7) with two SSA values (one for the buffer resource and one for the offset), with a struct {ptr addrspace(8), i32}, or whether a combination of the two approaches makes sense.
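Either way, the rewrite computes the same thing; here is a small Python model of its semantics (FatPtr, gep, and lower_load are illustrative names, not actual compiler code):

```python
# Hypothetical model of the proposed lowering: a ptr addrspace(7) value is
# a (descriptor, offset) pair. GEPs become 32-bit arithmetic on the offset
# part, and a load becomes a raw-buffer intrinsic taking the parts
# separately.
class FatPtr:
    def __init__(self, rsrc, offset=0):
        self.rsrc = rsrc              # stands in for the ptr addrspace(8) part
        self.offset = offset          # the 32-bit byte offset

    def gep(self, delta):             # models getelementptr i8, ..., i32 delta
        return FatPtr(self.rsrc, (self.offset + delta) & 0xFFFFFFFF)

def lower_load(fat):
    # would become a call like llvm.amdgcn.raw.buffer.load(%rsrc, %offset, ...)
    return ("raw.buffer.load", fat.rsrc, fat.offset)
```

Whether the pair lives as two SSA values or as one struct value doesn’t change what is modeled here — only which form is friendlier to the passes that run after the rewrite.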

Why not…?

Have ptr addrspace(7) be a 128-bit value

As far as I can tell, there is no way to implement GEP on buffer resources directly in the presence of negative offsets.

Consider the raw buffer descriptor ptr addrspace(7) %x, with %x.numRecords = N.

Then, the following code

%y1 = getelementptr i8, ptr addrspace(7) %x, i32 N + 1
store ptr addrspace(7) %y1, ptr %somewhere
%y2 = load ptr addrspace(7), ptr %somewhere
%y = getelementptr i8, ptr addrspace(7) %y2, i32 -1
%v = load i8, ptr addrspace(7) %y ; page fault

would have different results than the equivalent code

%y = getelementptr i8, ptr addrspace(7) %x, i32 N
%v = load i8, ptr addrspace(7) %y ; %v = 0

It also needs to be the case that

%yLong = getelementptr i8, ptr addrspace(7) %x, i32 N + 1
%v = load i8, ptr addrspace(7) %yLong ; %v = 0

Therefore, unless I’ve missed something, the offset on a buffer descriptor needs to be tracked separately.
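To make the argument concrete, here is a small Python model (my own sketch, not backend code) contrasting the two representations: folding GEPs into the 48-bit base of a 128-bit descriptor versus tracking a separate 32-bit offset. A KeyError stands in for a page fault:

```python
# Contrast the two designs. Assumption: folding a GEP into a 128-bit
# descriptor can only move the base address, so the hardware bounds check
# always sees offset 0; the split design checks the accumulated offset.
def load_folded(memory, base, num_records):
    if 0 >= num_records:
        return 0
    return memory[base]               # a missing key models a page fault

def load_split(memory, base, num_records, offset):
    if offset >= num_records:
        return 0                      # hardware OOB check: reads as zero
    return memory[base + offset]

N = 4
memory = {i: 10 + i for i in range(N)}   # bytes [0, N) mapped, base = 0

# Split representation: x + (N+1) - 1 carries offset N, which the bounds
# check catches -- same result (0) as GEPing straight to x + N.
assert load_split(memory, 0, N, N + 1 - 1) == 0

# Folded representation: the same GEPs move the base to N, the bounds
# check passes (0 < N), and the access faults on unmapped memory.
try:
    load_folded(memory, 0 + N + 1 - 1, N)
    faulted = False
except KeyError:
    faulted = True
assert faulted
```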

Pass addrspace(7) pointers into the backend

Even if we restrict ourselves to GlobalISel, actually getting a 160-bit pointer into the backend would require extending MVT with an i160 type, which seems silly given that we’d need to do the rewrite I proposed above at the MIR level anyway.

Use an opaque type for buffer resources

We’d lose all the pointer-based analyses we’re trying to get access to.

What if you kept an n-bit storage representation of the buffer resource, either as a vector or as a pointer in some arbitrary address space, and then added an intrinsic to extract the 64-bit pointer value from the resource that could then be passed to normal loads/stores? The intrinsic could handle the address calculation using the fields of the buffer resource and return the actual raw pointer that the hardware would end up using.

The thing is, the hardware itself takes these 128-bit resources as arguments to instructions, so extracting out just the address part and using it for a load/store could well be a miscompile.

Could you add a pattern match in the instruction selector that looks back from the load/store up through the intrinsic to access the original buffer resource?

This doesn’t say you can’t lower; this example only shows that you can’t reassociate. You already aren’t supposed to reassociate non-integral pointer arithmetic, so I don’t see the problem here. I do think we should probably have a different pointer-offsetting instruction for non-integral address spaces that is assumed non-associative.

I don’t think it’s true that you can’t reassociate in non-integral address spaces. Could you point me to where that’s part of the semantics? As far as I can tell, all that makes an address space non-integral is that ptrtoint/inttoptr are potentially non-deterministic or otherwise underdefined.

@tstellar I’m having trouble understanding your suggestion. Can you expand on it please?

This was always my interpretation. There is definitely code preventing reassociating addrspacecast around them.

Plus, we’re filling in the semantics here; we can make what we want happen.

I’m not seeing any code that bars reassociation from a quick grep for NonIntegral in LLVM. And I don’t think we can go around imposing semantics like that - or at least not with just this thread in our corner.

For example:

%buffer_ptr = call ptr addrspace(1) @llvm.amdgcn.extract.resource.ptr(<4 x i32> %buffer_rsrc)
%val = load i32, ptr addrspace(1) %buffer_ptr

Could be selected by

def : Pat <
  (load (llvm_amdgcn_extract_resource (v4i32 $buffer_rsrc))),
  (BUFFER_LOAD_I32 $buffer_rsrc, 0, 0, 0)
>;

On that note, @preames @efriedma-quic @sanjoyd for the discussion of non-integral address spaces above. (since I found y’all’s names from git blame-ing the language reference)

I don’t think this solves our problem: we want, for example, to be able to say that buffer resource P doesn’t alias buffer resource Q, and therefore (given the address space 7 proposed above) to allow code like

%b0 = load float, ptr addrspace(7) %p ; doesn't alias %q
%c0 = fadd float %a0, %b0
store float %c0, ptr addrspace(7) %q ; doesn't alias %p
%p1 = getelementptr float, ptr addrspace(7) %p, i32 1
%q1 = getelementptr float, ptr addrspace(7) %q, i32 1
%b1 = load float, ptr addrspace(7) %p1
%c1 = fadd float %a1, %b1
store float %c1, ptr addrspace(7) %q1
...

to be reorderable to

%b0 = load float, ptr addrspace(7) %p ; doesn't alias %q
%p1 = getelementptr float, ptr addrspace(7) %p, i32 1
%b1 = load float, ptr addrspace(7) %p1
...
%c0 = fadd float %a0, %b0
%c1 = fadd float %a1, %b1
...
store float %c0, ptr addrspace(7) %q ; doesn't alias %p
%q1 = getelementptr float, ptr addrspace(7) %q, i32 1
store float %c1, ptr addrspace(7) %q1
...

(up to register pressure)

As noted in the LangRef, these are a work in progress. The whole point is to disable pointer optimizations. We can clarify and extend as necessary. Part of this whole endeavor is really refining the non-integral pointer semantics.

I’m not convinced that their purpose is to disable pointer optimizations in general. My sense is that they’re meant to disable, specifically, optimizations that rely on pointers having a coherent address that lies in [0, i[ptrWidth]::max] and that respects ordinary arithmetic. (For example, CHERI pointers don’t have this property because there’s an entirely separate “capability” register that comes along with a pointer and doesn’t survive a ptrtoint)

However, I don’t see any reason why a non-integral pointer wouldn’t respect pointer arithmetic rules. That is, even for some out-there native GC machine, I’d still expect (ptr + N + 1) - 1 to point to the same object as ptr + N, and I’m not convinced imposing the rule that this is not necessarily true is either required for our usecase or useful. (And it’ll drift NI pointers further away from other potential users, like CHERI)


On the other hand, I’d at least be willing to argue that GEPing to an invalid address in a non-integral address space (e.g., going past the end of an allocated range) could result in undefined behavior. That would get rid of the negative-offsets trouble above, but it still feels like weird semantics to have.

A couple of added notes about the AMDGPU ISA, drawn from our experience with LLPC.

Buffer access instructions are a bit more complex than the initial comment describes in that they take a 128-bit buffer descriptor and a 32-bit offset, and out of bounds accesses are not always undefined behavior.

For most purposes, the 128-bit buffer descriptor should not be thought of as a normal pointer since, as Krzysztof explained, you can’t do pointer arithmetic with it, and you shouldn’t do pointer arithmetic with it.

  • You can’t do pointer arithmetic with it primarily due to the out of bounds behavior.
  • You shouldn’t do pointer arithmetic with it because in practice, the 128-bit buffer descriptors are scalar (uniform) values while the 32-bit offsets are vector (divergent) values. Pointer arithmetic on buffer descriptors would in most cases turn them into vectors (divergent values), which would completely and irretrievably destroy performance.

This is why addrspace(7) exists in the first place as a conceptually 160-bit pointer space.

We’ve lived for many years with buffer access intrinsics that don’t use pointers at all. There are two reasons for changing that:

  • Exposing buffer pointers as a high-level language concept
  • Improving the alias information available to the backend

I’m going to focus on the alias info point since that’s what I know and care more about. For alias info to survive into MIR, we need to produce suitable MachineMemOperands that refer back to a pointer in the original LLVM IR. For full fidelity, we inherently need the descriptor+offset representation, i.e. addrspace(7).

There are two difficulties with bringing addrspace(7) into MIR directly:

  1. Dealing with 160-bit quantities will likely cause some challenges especially in SelectionDAG, e.g. we may have to change MVT.
  2. As part of the lowering, the 160-bit quantity needs to be split into a 128-bit and a 32-bit part, and these need to be treated separately for divergence analysis, which SelectionDAG cannot do.

I see the addrspace(8) proposal primarily as a way of solving these technical challenges. If we give up (for now) on carrying addrspace(7) into MIR, we can at least change the buffer intrinsics to take an addrspace(8) pointer and carry that into MIR. This has less alias info fidelity, but it would allow us to maintain relevant alias info about aliasing between different buffers, which is 90% of what we need in practical applications.

You may have noticed that the difficulties with addrspace(7) are mostly SelectionDAG difficulties. I do think that once we no longer need a GlobalISel-to-SelectionDAG fallback, we should consider preserving addrspace(7) into MIR on the GlobalISel path. But that is likely many years into the future, and so the addrspace(8) approach feels like a good compromise for today.

The simple pattern you showed works of course, but you can’t cover all the necessary cases in this way:

  • Real code needs the addrspace(7) capabilities with GEPs
  • We need to look across basic blocks

If we were in a GlobalISel-only world, I think what we should do is

  • Use native load / store / etc. instructions on addrspace(7)
  • This gets translated into G-MIR in the standard way, with MachineMemOperands referring to the addrspace(7) pointers and therefore preserving alias info perfectly
  • As part of legalization, split the ptr addrspace(7)-valued virtual registers into their 128-bit descriptor and 32-bit offset parts.
    • Translate G_LOAD into G_AMDGPU_RAW_BUFFER_LOAD
    • Translate G_PTRADD into G_ADD on the 32-bit parts
    • etc.
  • MIR-level divergence analysis and register bank selection runs after that

This is equivalent to the addrspace(7) lowering that currently lives in LLPC, with which we have had good experience. Unfortunately, we’re still quite far from a GlobalISel-only world :frowning:

(Also, nitpick: it’s really not “extract resource ptr” but more of a “convert to resource ptr”. In LLPC, we have

%ptr = call ptr addrspace(7) @lgc.buffer.desc.from.ptr(<4 x i32> %desc)

)

That would break support for graphics APIs.

Re GlobalISel, the main tricky thing is that getting addrspace(7) in there runs into virtual MVT getPointerTy(const DataLayout &DL, uint32_t AS = 0) const, which there doesn’t seem to be a good way around.

Re alias analysis on address space 8 in MIR

  1. If we have a raw buffer access, we could probably work out the offset by looking at various operands to the load instructions
  2. I don’t think that, given an arbitrary buffer descriptor, we can easily determine whether two instructions alias. For instance, if I have a load from a non-raw buffer with index=0 and offset=5, does that alias index=1, offset=1? We don’t know: if the stride of the buffer is 4 then yes, otherwise no. And since the buffer in this case is an unknown value … yeah.
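To make the stride dependence concrete (using the index * stride + offset address computation and ignoring swizzling; structured_byte_addr is an illustrative helper, not an existing API):

```python
# The byte address of a structured access depends on the (unknown) stride,
# so whether two accesses alias cannot be decided from index and offset
# alone.
def structured_byte_addr(stride, index, offset):
    return index * stride + offset

# With stride 4, (index=0, offset=5) and (index=1, offset=1) collide...
assert structured_byte_addr(4, 0, 5) == structured_byte_addr(4, 1, 1) == 5
# ...but with stride 8 they do not.
assert structured_byte_addr(8, 0, 5) != structured_byte_addr(8, 1, 1)
```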

Re GlobalISel, the main tricky thing is that getting addrspace(7) in there runs into virtual MVT getPointerTy(const DataLayout &DL, uint32_t AS = 0) const, which there doesn’t seem to be a good way around.

The IR translator produces G-MIR like

%9:_(s32) = G_LOAD %10:_(p3) :: (dereferenceable load (s32) from ...)

In other words, the IR pointers are translated to virtual registers with pointer VTs.

I have a hard time finding at a glance where this method you point to is called, but it looks like it’s totally possible for us to insert our own custom legalization that prevents this method from ever being called for addrspace(7). I don’t know how to best do that, but it’s clear that in the worst case we would be able to insert our own custom legalization pass dedicated to addrspace(7). (I doubt that’s the best solution, but it is possible.)

Yes, we can. The question is how feasible it is to feed that into the alias analysis, since it goes against the grain of what that is built to do.

It actually depends. There are out-of-bound check modes in which you’re not allowed to use a (voffset + inst_offset) that pushes you beyond the stride. So in that OOB check mode, we know that different indices can’t alias – at least as long as soffset is equal. So it’s complicated, but there are cases where alias results can be proved for structured buffer accesses. That feels like an even longer-term project though :slight_smile:
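A sketch of why that constraint helps: if every access keeps offset + size within the stride, each index owns a disjoint byte range, so different indices provably can’t alias (record_range is an illustrative helper, not an existing API, and soffset is assumed equal on both sides):

```python
# Under an OOB-check mode where offset + size must stay below the stride,
# each index's accesses fall in a disjoint byte range, so accesses with
# different indices cannot alias.
def record_range(stride, index, offset, size):
    assert offset + size <= stride    # the OOB-check-mode constraint
    start = index * stride + offset
    return (start, start + size)

a = record_range(16, 0, 5, 4)         # an access in record 0
b = record_range(16, 1, 1, 4)         # an access in record 1
assert a[1] <= b[0] or b[1] <= a[0]   # the two byte ranges are disjoint
```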

Re machine alias analysis: I’d need to look into how it works, but I’m curious how, for example, X86 addressing modes get alias-analyzed.

With G-MIR, the problem is that, when you’re computing how many registers you need for p3 (or, in our case, p7), you need to get the storage type of the pointer, which calls getPointerTy(), which returns an MVT, which doesn’t have an i160 case. There might be some methods we could make virtual to work around this, but that definitely won’t work for regular ISel, so it’s entirely possible that the addrspace 7 → addrspace 8 rewrite before ISel may be the least bad option right now.

And even if we can’t get facts like “buffer[0] doesn’t alias buffer[1]” surfaced at the machine IR level, we can still get “buffer1 doesn’t alias buffer2”, which is still an improvement.