tl;dr: I’m trying to design a better pointer representation for AMDGPU’s buffer descriptors, my design seems like it might be too complicated, and I’m looking for feedback.
Context
On AMD GPUs (GCN onward, at the very least), there is a data structure known as a buffer descriptor or buffer resource (or V#) that has special hardware support. A buffer descriptor is a fat pointer that has, approximately, the following form:
```
struct Buffer {
  address : 48,
  stride : 14,
  swizzleFlags : 2,
  numRecords : 32,
  flags : 32
}
```
These descriptors can be passed to instructions such as BUFFER_LOAD_[size], BUFFER_STORE_[size], and various atomic operations. These instructions take a buffer resource (which must be located in scalar/lane-uniform registers) and one or two offset arguments, which may vary between threads.
When working with buffer descriptors, it is useful to distinguish between raw and structured buffers, as the current AMDGPU buffer intrinsics do. A raw buffer is one where the stride field is 0 (as are the swizzling flags) and numRecords is interpreted as a byte length encoding the extent of the buffer. A structured buffer, on the other hand, supports a complex 2D indexing scheme where memory operations take both an index (which moves in units of stride) and an offset within the record selected by that index. (In addition, the buffer descriptor can specify advanced swizzling.)
Structured buffers can be used to compactly represent data access patterns often seen in graphics workloads, but have an indexing system that is not compatible with LLVM’s GEP system.
On the other hand, a raw buffer is a pointer combined with metadata. The most significant use of this metadata is implicit bounds checking: loading from a buffer whose numRecords is N at an offset of k >= N bytes returns 0 instead of causing a page fault, and many applications rely on this behavior. (Similarly, out-of-bounds writes and atomics are silently dropped by the hardware.)
The state of the compiler
The only form in which these instructions are exposed to programmers is through the AMDGPU buffer intrinsics. These intrinsics come in struct and raw variants, with the struct forms taking an additional argument so that both the index and the offset can be specified.
The main problem with these intrinsics is that they take the buffer resource argument as a <4 x i32>. Because that argument is not a pointer, these operations cannot be analyzed as the memory operations they are: useful analyses such as alias analysis and dead store elimination cannot operate on buffer operations.
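For concreteness, a raw buffer load as exposed today looks roughly like this (the signature follows the existing llvm.amdgcn.raw.buffer.load intrinsic; the last two arguments are the scalar offset and the cache-policy bits):

```llvm
declare float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32>, i32, i32, i32 immarg)

; %rsrc is an opaque <4 x i32>, so the optimizer cannot see that this
; call reads memory addressed by a pointer-like value.
%v = call float @llvm.amdgcn.raw.buffer.load.f32(
        <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
```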
I am attempting to develop a more principled solution to the representation of buffer descriptors in the backend.
Existing work: LPC
The LPC compiler is, among other things, responsible for translating Vulkan code into LLVM IR so it can be compiled for the AMDGPU target and executed. In this compiler frontend, buffer descriptors are represented using address space 7, a non-integral address space that has 160-bit pointers: 128 bits of buffer descriptor and 32 bits of offset.
Vector <4 x i32> values can be wrapped into a ptr addrspace(7) by an intrinsic that gives them an initial offset of 0. Then, LLVM operations like getelementptr modify the offset while leaving the buffer descriptor itself alone. Finally, late in LPC's pipeline, the address space 7 values are split into the buffer resource part (a <4 x i32>) and a 32-bit offset part (which starts as the null pointer of a 32-bit address space), and LLVM operations like load and store are rewritten into raw.buffer.load and raw.buffer.store with the buffer descriptor and the offset passed in as separate arguments.
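As a sketch (exact intrinsic spellings aside, and with %desc and %off standing in for the two halves of the fat pointer), the rewrite turns IR like this:

```llvm
; Before: an ordinary load through the fat pointer.
%p = getelementptr i8, ptr addrspace(7) %fat, i32 16
%v = load float, ptr addrspace(7) %p

; After: the descriptor and offset are split out, and the load becomes
; a raw buffer intrinsic call with both passed separately.
%off.16 = add i32 %off, 16
%v2 = call float @llvm.amdgcn.raw.buffer.load.f32(
         <4 x i32> %desc, i32 %off.16, i32 0, i32 0)
```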
This system is correct, but, because it lives outside the main compiler, other consumers of the AMDGPU backend can't use it. In addition, it limits opportunities for optimization later in the compilation process.
Proposal
Address space 8
I propose defining a new address space 8 with 128-bit pointers. Address space 8 can represent an arbitrary buffer resource, whether raw or structured, and, as such, would not support address computation with getelementptr and similar operations. Values in this address space could be created (via inttoptr and the like) and passed around (e.g. stored into memory), but couldn't be the arguments to LLVM's memory operations. Instead, they'd only be acceptable arguments to the buffer intrinsics, which would be auto-upgraded from taking <4 x i32> arguments to ptr addrspace(8) ones.
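The auto-upgrade would, for example, rewrite declarations along these lines (a sketch; the final spelling of the upgraded intrinsics is an open detail):

```llvm
; Today: the resource is an opaque vector.
declare float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32>, i32, i32, i32)

; Proposed: the resource is a pointer, visible to pointer-based analyses.
declare float @llvm.amdgcn.raw.buffer.load.f32(ptr addrspace(8), i32, i32, i32)
```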
I’m not 100% sure that “a pointer you can’t non-trivially GEP” is something LLVM permits, but, if it is, it solves the problem that the buffer intrinsics should take some sort of pointer as an argument even though those pointers have strange indexing semantics.
One other advantage to introducing a pointer type for buffer descriptors is that the rewrite for fat buffer descriptors proposed below won’t destroy alias information.
Address space 7
We’d also have address space 7, mirroring LPC’s solution, which would be a 160-bit pointer with 32-bit indexing.
This pointer would semantically be

```
packed struct FatBuffer {
  ptr addrspace(8) buffer,
  i32 offset
}
```

where buffer must be a raw buffer resource, on pain of undefined behavior.
(These pointers would need a 256-bit alignment, which hopefully won’t cause issues.)
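In datalayout terms, this could be spelled something like the following (fields are size:abi-align:pref-align:index-width; the exact string, and the surrounding elided portions, are illustrative):

```llvm
target datalayout = "...-p7:160:256:256:32-p8:128:128-..."
```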
Unlike address space 8, pointers in address space 7 can and should be used with standard LLVM operations such as getelementptr, load, and so on. They are, however, non-integral, so that people don’t try to do integer arithmetic on the complex data structure packed into an i160.
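That is, ordinary pointer-flavored IR like the following would be legal, with GEPs adjusting only the 32-bit offset field (a sketch; %p is assumed to already be a ptr addrspace(7)):

```llvm
%q = getelementptr i32, ptr addrspace(7) %p, i32 4  ; bumps the offset by 16 bytes
%v = load i32, ptr addrspace(7) %q                  ; bounds-checked by the hardware
```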
Lowering address space 7
Late in the compilation pipeline, before machine IR creation, we’d lower address space 7 operations to operations on address space 8.
Loads, stores, etc. would be replaced by the corresponding raw buffer intrinsic, with the buffer part and the offset part passed in as separate arguments. GEPs would be replaced by computations on the offset part, and so on.
I can’t tell whether it would make sense to replace each ptr addrspace(7) with two SSA values (one for the buffer descriptor and one for the offset), with a single struct {ptr addrspace(8), i32} value, or with a combination of the two approaches.
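Under the two-SSA-values option, the lowering of a load might look like this (names hypothetical, and assuming the intrinsics have already been upgraded to take ptr addrspace(8)):

```llvm
; %p has been split into %p.rsrc : ptr addrspace(8) and %p.off : i32.
%q.off = add i32 %p.off, 16
%v = call float @llvm.amdgcn.raw.buffer.load.f32(
        ptr addrspace(8) %p.rsrc, i32 %q.off, i32 0, i32 0)
```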
Why not…?
Have ptr addrspace(7) be a 128-bit value
As far as I can tell, there is no way to implement GEP directly on buffer resources in the presence of negative offsets.
Consider the raw buffer descriptor ptr addrspace(7) %x, with %x.numRecords = N. Then, the following code
```llvm
%y1 = getelementptr i8, ptr addrspace(7) %x, i32 N + 1
store ptr addrspace(7) %y1, ptr %somewhere
%y2 = load ptr addrspace(7), ptr %somewhere
%y = getelementptr i8, ptr addrspace(7) %y2, i32 -1
%v = load i8, ptr addrspace(7) %y ; page fault
```
would have different results than the equivalent code
```llvm
%y = getelementptr i8, ptr addrspace(7) %x, i32 N
%v = load i8, ptr addrspace(7) %y ; %v = 0
```
It also needs to be the case that
```llvm
%yLong = getelementptr i8, ptr addrspace(7) %x, i32 N + 1
%v = load i8, ptr addrspace(7) %yLong ; %v = 0
```
Therefore, unless I’ve missed something, the offset on a buffer descriptor needs to be tracked separately.
Pass addrspace(7) pointers into the backend
Even if we restrict ourselves to GlobalISel, actually getting a 160-bit pointer into the backend would require extending MVT with an i160 type, which seems silly given that we’d need to do the rewrite proposed above at the MIR level anyway.
Use an opaque type for buffer resources
We’d lose all the pointer-based analyses we’re trying to get access to.