[RFC] Simplifying Dynamic Shared Memory Access in GPU

jungpark · November 3, 2023, 3:52pm

It’s not the matter of GPU vendor, but restricted/supported by the compute API, e.g., AMD OpenCL should support it.

qcolombet · November 3, 2023, 4:17pm

The global is an implementation detail: we could materialize it from the argument in the lowering to nvgpu or other. And the other way around, we could “promote” the global to a function argument from what is currently produced by existing frontends.

That being said, that’s probably a big leap (and potentially an unwanted one?) and improving the way we get the global make sense to me.

Haven’t read @grypp’s updates yet.

fabianmc · November 3, 2023, 4:31pm

Yeah, I agree it’s an implementation detail to an extent. If you check the link I posted, it shows Intel’s SYCL lowering to CUDA, where the kernel gets rewritten so that the shared ptr arg gets transformed into a int offset and a read from the global shared variable.

However, I’m a strong -1 on using a kernel parameter for that, as kernel parameters are a scarce resource.

mehdi_amini · November 3, 2023, 5:55pm

I’m not sure I understand what kind of verifier you have in mind, are you sure that wouldn’t be against: Developer Guide - MLIR ?

qcolombet · November 3, 2023, 6:13pm

I don’t understand the argument around the kernel argument resource being scarce.

The argument would only be a transient representation during mlir lowering. When you get to lower levels, it would actually be a global variable and the argument would be eliminated, I.e., no change is argument passing.

What am I missing?

Essentially what I’m saying is the goal is to have the same final codegen, no additional arguments, no nothing. The arguments are just an abstraction (or proposed abstraction more precisely) to have a “saner” model.

fabianmc · November 3, 2023, 7:10pm

I assumed your idea was having something similar to OpenCL, which would have at least required passing offsets as arguments, hence my argument, my mistake.

Assuming we don’t add extra arguments. We still have the issue that:

int* __device__ getMem() {
  extern __shared__ int tmp[];
  return tmp;
}

Wouldn’t have a valid representation in gpu. The semantically correct way to represent the code would be through the op, not with arguments.

We may end up saying we don’t want to represent the above code, in which case a kernel argument would still be not ideal; a block argument in gpu.func, gpu.launch would be preferable in that case in my opinion.

jungpark · November 3, 2023, 8:23pm

No, we’re not discussing the design of the compute API here. What we can use for the kernel arguments are already defined by the compute APIs.

I believe having shared ptr in the kernel argument like OpenCL models the dynamic allocation in the higher-level. And I don’t agree it’s less efficient. It still can do the same implementation as the single dynamic allocation with passing a single pointer and can embed all static offsets in the kernel. On the contrary, in order to have multiple different sized shared memory instances in CUDA, you also need to pass the offsets via kernel argument.

Lets say you want to dynamically allocate 3 shared memory instances,
shared int a[L], b[M], c[N];
Where L, M, N are all determined by host at launch time, you can only set total_size=L+M+N as a launch parameter but also need to pass L, M as kernel arguments for the offsets of b, c. - I’m not 100% sure this is true, my cuda knowledge is actually scarce .

My opinion is, to have the new ops in gpu.func op when kernel arguments get available, not in gpu.launch op. In that way, we can satisfy both programming model.
And considering the topology of gpu.launch op, it still has visibility to the parent region, suppose it’s not yet confirmed what’s accessible from the gpu and what’s not. The new ops are to define a interface between host and gpu in the low level, I think it’s best to be handled between gpu.launch and gpu.func. It’s totally fair gpu.func has option to include the new ops.

mehdi_amini · November 3, 2023, 11:58pm

I still don’t quite get how this can work actually.

In this model the kernel receives multiple pointers to shared memory:

gpu.func @test_kernel(
    %arg0: memref<?xf32, #gpu.address_space<workgroup>>,
    %arg1: memref<?xf32, #gpu.address_space<workgroup>>,
    %arg2: memref<?xf32, #gpu.address_space<workgroup>>) {
  ...

These are separately allocated on the host and passed in, so looking as a launch it will look like:

    gpu.launch_func  @test_kernel::@test_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    args(%alloc0 : memref<?xf32, #gpu.address_space<workgroup>>,
             %alloc1 : memref<?xf32, #gpu.address_space<workgroup>>,
             %alloc2 : memref<?xf32, #gpu.address_space<workgroup>>)

First question is are we only allowing dense 1D memref arguments here or any kind of memref? (This impacts the lowering below)

Lowering this call to a cuda-compatible launch requires us to do something like the following I think:

    // Compute shared memory allocation size
    %dim_shared0 = %alloc0.dim(0) : index
    %dim_shared1 = %alloc1.dim(0) : index
    %dim_shared2 = %alloc2.dim(0) : index
    %total_alloc01 = arith.addi %dim_shared0, %dim_shared1 : index
    %total_alloc = arith.addi %total_alloc01, %dim_shared2 : index

    // Compute shared memory offsets (let's guarantee the first argument is always offset0
    %offset0 = arith.constant 0 : index
    %offset1 = arith.addi %dim_shared0, %offset0
    %offset2 = arith.addi %dim_shared1, %offset1

    // Launch with shared memory size and offsets.
    gpu.launch_func  @test_kernel::@test_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
   dynamic_shared_memory_size(%total_alloc) // <- wasn't there before
     // replaced with indices, first one can be elided and always implicitly 0.
    args(%offset1 : index, %offset2 : index)

Now doing this transformation is doable, however that raises a few questions for me:

If we do this on gpu.launch_func, it is now having two modes: one with dynamic_shared_memory_size explicitly given and the other where it is implicit from the various memref argument (in the workgroup space).
It’s not clear to me why this shouldn’t be done by the lowering above, that is: why don’t we let the code emitting the GPU launch itself to do this transformation? Isn’t this just higher-level kind of lowering?
This requires to actually construct memref descriptors on the host just to carry the size. This comes from the fact that we have these “fake” memref.alloc() for these shared memory blocks which are actually only intended to carry a size here.
Can’t easily propagate static offset anymore into the kernel at this point, we have an “ABI” where we passed N-1 offsets.
I don’t see how to manage nicely non-overlapping uses of shared memory with this model (without reverting to the pattern of a single global alloc), things like that in pseudo-code:

   // Assume 128B shared memory.

   // [0:63]
   %alloc0 = gpu.dynamic.shared.memory offset = 0 : memref<8x8xi8, 3>
   // [64:127]
   %alloc1 = gpu.dynamic.shared.memory offset = 64 : memref<8x8xi8, 3>
   
   // do stuff...

   // done with alloc0 and alloc1
   syncthread();

   // [0:96] Overlap alloc0 and alloc1, reusing memory.
   %alloc2 = gpu.dynamic.shared.memory offset = 0 : memref<96xi8, 3>

   // do something else with %alloc2...

Here you can’t just pass 3 allocs to the kernels (actually you can, but you will use more memory than needed)

grypp · November 6, 2023, 8:04am

@mehdi_amini - I can rewrite your example like below with the updated proposal (see Update section), here I use dynamic SSA or constant values to construct a getelementpointer, we can rewrite the example provided by as follows:

// [0:63]
%i0 = arith.constant 0 : index
%alloc0 = gpu.dynamic.shared.memory [%i0,0,0] : memref<8x8xi8, 3>
// [64:127]
%i1 = arith.addi %i0, %c1 : index
%alloc1 = gpu.dynamic.shared.memory [%i1,0,0] : memref<8x8xi8, 3>

// do stuff...

// done with alloc0 and alloc1
syncthread();

// [0:96] Overlap alloc0 and alloc1, reusing memory.
%alloc2 = gpu.dynamic.shared.memory [0] : memref<96xi8, 3>

I think @jungpark’s idea is quite intuitive in the beginning, and IREE also implements the concept. As I mentioned or @mehdi_amini demonstrated with examples, the reuse of shared memory can quickly become a problem. Exactly for this very reason, having a canonical way to utilize dynamic shared memory is essential.

For the following IR, I had considered the possibility of checking %0 >= %1 * 32 * 64 * sizeof(f32) when SSA values are compile-time constant (or at least %0 in this case). My approach doesn’t rely on use-def chains, It is local to GPU kernel, but not the IR. I’m unsure if my approach aligns with the guidelines.

gpu.launch dynamic_shared_memory_size %0 {
 %2 = gpu.dynamic.shared.memory [%1,0,0] : memref<32x64xf32,3>

qcolombet · November 6, 2023, 8:30am

int* __device__ getMem() {
  extern __shared__ int tmp[];
  return tmp;
}

What is the semantic of this snippet?

grypp · November 6, 2023, 8:50am

It is how to use dynamic shared memory in a device function (not kernel). Dynamic shared memory is a global symbol (0-sized), one can directly access it from any function (see godbolt for llvm ir)

zero9178 · November 6, 2023, 10:40am

This approach is actually what the guideline is warning against. It does traverse the use-def chain as checking whether %0 and %1 are constant requires checking what Op defined it and whether it has the ConstantLike trait.
Just like the example in the guide, an arith.addi that gets constant folded could cause a verification failure, even if the gpu.launch is potentially in dead code.

jungpark · November 6, 2023, 11:05am

Thanks @mehdi_amini, good questions. My claim wasn’t with very clear design across the lowering and it helps broadening my view.

It looks easier to limit dense 1D memref only, honestly I don’t know whether/where it could go wrong with the multi-dimensional memref however local ptr support in kernel argument is a special API on the compute API, there might be unknown conflict between the assumption in MLIR and the ABI with such support.

Yes, this is totally valid claim which I’ve been trying to find better justification.

My biggest concerning is, if earlier lowerings only lower single allocation+offset to the gpu dialect, which is still legal, but multiple allocation option will be practically ignored. Some use cases could be much smoothly lowered using multiple allocation.
It’s just my proposal this time to introduce explicit topological border between ‘gpu.launch’ and ‘gpu.func’ op for deciding the option whether it uses single or multiple dynamic allocation.
Major implementation difference between those two programming models are in gpu launching interface including kernel arguments which is first introduced from gpu.launch_func+gpu.func, not in gpu.launch op. So, it makes sense to diverge from that point.
I’m trying to justify this but also it could be failed if it turns out not feasible. Sorry to claim partially different idea without a concrete design, I’m still trying.

I think this is fine, we don’t always lower alloc op to the real function. For example, memref.alloc of the static shared memory in gpu kernel or gpu private space alloc only holds size within MLIR compilation.
Host side ‘fake’ memref.alloc will be holding the sizes and a link to the kernel arguments when we call host API to set it. i.e., clSetKernelArg

Sorry, I’m not sure I understand this correctly, trying my best.
I think this is only required by single&default shared memory model? which is still supported option from gpu.func op with my proposal.

Merging two separate allocation is mostly not possible and I believe this is assuming the single allocation from the beginning. So, I think %alloc0 and %alloc1 shouldn’t be separate allocations but just view/subview from the same allocation and %alloc2 as well.
I might be wrong but I suppose this kind of code can be only generate by a special algorithm and at least first introduced from gpu.dialect stage, not lowered from any equivalent operations. - I’m trying to understand if there is an unavoidable ambiguity in lowering to determine this requires single allocation or multiple allocations.

Let me try to make an examples,
Now, it has a memref.alloc with dynamic operand in the gpu.launch op. This is different from my previous post, sorry for the confusion. It’s still valid to have the dynamic allocation in gpu.launch op since gpu.launc op still has a visibility to the host part of the variables and the border between host and gpu is not clearly set.

  func.func @test(..., %arg4: index) {
    // %arg4 = const index 128
    gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %arg4 {
      %alloc = memref.alloc(%arg4) : memref<?xi8, #gpu.address_space<workgroup>>
      // Assume 128B shared memory.
      // [0:63]
      %alloc0 = memref.subview %alloc[0][64][1] : memref<?xi8, 3> to memref<64xi8, 3>
      // [64:127]
      %alloc1 = memref.subview %alloc[64][64][1] : memref<?xi8, 3> to memref<64xi8, 3>
      // done with alloc0 and alloc1
      syncthread();
      // [0:96] Overlap alloc0 and alloc1, reusing memory.
      %alloc2 = memref.subview %alloc[0][96][1] : memref<?xi8, 3> to memref<96xi8, 3>

As dynamic_shared_memory_size is given, it should have a single dynamic shared allocation with the same size.
Assuming using @grypp 's proposal, outlining should remove the alloc op and fold subview to new ops.

  func.func @test(..., %arg4: index) {
    gpu.launch_func  @test_kernel::@test_kernel blocks in (...) threads in (...)  dynamic_shared_memory_size %arg4 args()
    return
  }
  gpu.module @test_kernel {
    gpu.func @test_kernel() kernel attributes {...} {
      %alloc0 = gpu.dynamic.shared.memory offset = 0 : memref<64xi8, 3>
      %alloc1 = gpu.dynamic.shared.memory offset = 64 : memref<64xi8, 3>
      syncthread();
      %alloc2 = gpu.dynamic.shared.memory offset = 0 : memref<96xi8, 3>

If dynamic_shared_memory_size is not given while shared memory is allocated with dynamic variable, outlining hoist the alloc out of the gpu op and pass it through the kernel argument.

  func.func @test(..., %arg4: index) {
    %alloc = memref.alloc(%arg4) : memref<?xi8, #gpu.address_space<workgroup>>
    gpu.launch_func  @test_kernel::@test_kernel blocks in (...) threads in (...) args(%alloc : memref<?xi8, #gpu.address_space<workgroup>>)
    gpu.func @test_kernel(%alloc : memref<?xi8, #gpu.address_space<workgroup>>) kernel attributes {...} {
      %alloc0 = memref.subview %alloc[0][64][1] : memref<?xi8, 3> to memref<64xi8, 3>
      %alloc1 = memref.subview %alloc[64][64][1] : memref<?xi8, 3> to memref<64xi8, 3>
      syncthread();
      %alloc2 = memref.subview %alloc[0][96][1] : memref<?xi8, 3> to memref<96xi8, 3>

Basically, supporting multiple shared memory conceptually includes single shared memory model and I believe there is no problem lowering from the gpu.func with either model. But I’m still not sure if there’s any use cases code generation of the gpu.launch op is against this.

grypp · November 6, 2023, 11:24am

True but after constant propogation/folding it is still okay to verify 100 >= 1 * 32 * 64 * sizeof(f32) for the following IR?

gpu.launch dynamic_shared_memory_size 100 {
 %2 = gpu.dynamic.shared.memory [1,0,0] : memref<32x64xf32,3>

zero9178 · November 6, 2023, 2:39pm

This will depend on how you define the semantics of having SSA-values vs using attributes for the operands. If you define them as static attributes being semantically equivalent to having constant SSA values, then no, as (I assume) you might then turn the constant SSA values to attributes and run into the same problem as before.

You can define attributes as additionally verifying that property, but this has the implications that:

%100 = arith.constant 100
gpu.launch dynamic_shared_memory_size %100 {
  %2 = gpu.dynamic.shared.memory [1, 0, 0] : memref<32x64xf32,3>

cannot be constant folded to gpu.launch dynamic_shared_memory_size 100 as this creates an ill-formed structure out of something that was previously not.
Whether the conceptual overhead of being able to use both Attributes and SSA-Values as operands is worth it I have no clue as I am too unfamiliar with memref, bufferization etc. where I believe this is more commonly used (and I don’t know why).

For a previous discussion see also [RFC] More OpFoldResult and "mixed indices" in ops that deal with Shaped Values

nicolasvasilache · November 7, 2023, 11:34am

grypp:

Update

I have updated the proposal. The offsets can be multi dimensional and be constant or dynamic SSA values and mixed.

Let’s write the program above with that:

func.func @main() {
    gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 {
    	%i = arith.constant 18 : index
        // Step 1: Obtain shared memory directly
        %7 = gpu.dynamic.shared.memory [%i,0,0] : memref<64x64xf16, 3>
        %i2 = arith.addi %i, %c1
        %8 = gpu.dynamic.shared.memory [%i2,0,0] : memref<64x64xf16, 3>
        
        // Step 2: Utilize the shared memory
        "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index)
        "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index)
    }
}

Offsets are used generate the base pointer with llvm.getelementpointer:

This form still does not do it for me.
To make it short, you’ve been iterating on similar design criteria as for the memref.view op and you should get to the same conclusions (or to better conclusions that should serve to improve memref.view):

It is counter-intuitive to use multi-dimensional offsets without a multi-dimensional base memref. The only thing that is reasonable to use here is a single value offset (static or dynamic)
your op does not have space for dynamic sizes so you are artificially limiting the shared memory memrefs to be statically sized.

As I have been mentioning offline before this conversation started, you seem to reach out for something very similar to the memref.view op:

//    The "view" operation gives a structured indexing form to a flat 1-D buffer.
//    Unlike "subview" it can perform a type change. 
//    ...

    // Allocate a flat 1D/i8 memref.
    %0 = memref.alloc() : memref<2048xi8>

    // ViewOp with dynamic offset and static sizes.
    %1 = memref.view %0[%offset_1024][] : memref<2048xi8> to memref<64x4xf32>

    // ViewOp with dynamic offset and two dynamic size.
    %2 = memref.view %0[%offset_1024][%size0, %size1] :
      memref<2048xi8> to memref<?x4x?xf32>

The abstraction you are reaching out for here seems very similar to the above with the difference that %0 does not come from an alloc but is a global.
I suggest something resembling:

%0 = gpu.dynamic.shared.memory : memref<?xi8, 3>
// alternatively could return memref<32768xi8, 3> if you want to specialize this to e.g. 32KB smem.
%1 = memref.view %0[%offset_1024][] : memref<?xi8, 3> to memref<64x4xf32, 3>

The type of analyses you describe are what memref.view is designed for, you can just update the op / helpers / analyses if something is missing.

I don’t understand this.

grypp · November 7, 2023, 1:44pm

Thank you, everyone, for your valuable input. I’ve put the PR to materialize the idea, and it’s great to see the questions and discussions it has sparked.

In light of our discussions, it appears that: gpu.dynamic.shared.memory is indeed the kind of Op we need, but we want to use it with the standard memref.view. So we don’t want offset = [] attribute. So we will have the following IR:

%0 = gpu.dynamic.shared.memory : memref<?xi8, 3>
%1 = memref.view %0[%offset_1024][] : memref<?xi8, 3> to memref<64x4xf32, 3>

Additionally, I will work on enabling memref.view to accept attributes like
memref.view %0[%c32768][] or memref.view %0[32768][]

grypp · November 8, 2023, 10:53am

The PR is now available for review.

grypp · November 8, 2023, 11:02am

Alternative Proposal

I had a chat with @qcolombet, and we improved his idea of declaration of dynamic shared memory as a part of the kernel. IMHO, this alternative idea holds promise, it’s is a bit more invasive than the current proposal. I will put the idea here for posterity.

Example:

Here I want to illustrate

Using dynamic_workgroup as new memory attribute
Reusing of same shared memory region
Using dynamic shared memory in a device function

gpu.func @kernel(%arg0: memref<100xf32>) {
      dynamic_workgroup(
      	%arg1: memref<5xf64, offset = 8192, 3>,
      	%arg2: memref<10xf32, offset = 8192, 3>)

      write %arg1: memref<5xf64, 3>
      call @foo(%arg1: memref<5xf64, offset = 8192, 3>,
      			%arg2: memref<10xf32, offset = 8192, 3>)
}
@foo(%arg1: memref<5xf64, offset = 8192, 3>,
	 %arg2: memref<10xf32, offset = 8192, 3>) 
{
	use %arg1 : memref<5xf64, offset = 8192, 3>
	use %arg2 : memref<10xf32, offset = 8192, 3>
}

gpu and memref lowering needs to tackle with following things:

Calculate the the size and set dynamic_shared_memory_size
Need to know target info to reason about address space
Modify the function’s ABI if it includes memref<...,3>.
Generate gpu.dynamic_shared_memory for each memref<...,3>

gpu.func @kernel(%arg0: memref<100xf32>) 
	dynamic_shared_memory_size max(8192+sizeof(memref<5xf64)) 
{
	%shmem = base pointer of shmem
	%arg1 = memref.view[8192][] :memref<5xf64, 3>
 	
 	write %arg1: memref<5xf64, 3>

	call @foo()
}

@foo() {
	%shmem = base pointer of shmem
	%arg1 = memref.view[8192][] :memref<5xf64, 3>
	%arg2 = memref.view[8192][] :memref<10xf32, 3>

	use %arg1 : memref<5xf64, offset = 8192, 3>
	use %arg2 : memref<10xf32, offset = 8192, 3>
}

MaheshRavishankar · November 14, 2023, 2:13am

I am generally -1 on adding any new abstraction for representing memory in the GPU dialect. I think memref.alloca is already a good enough abstraction for representing shared memory. Just like a function does not really have to care about who “allocates” stack memory, you wouldnt need to care about who “allocates” shared memory. It is there for you to use. The mechanism of how this lowers to NVVM is just based on a particular implementation choice of how this is handled in NVVM.
Adding new operations means you need either transformations to be aware of this special representation for memory to be able to target it. I am not sure there is a representational challenge here that needs solving, to motivate creating a new operation.

Topic		Replies	Views
MLIR: dynamic shared memory MLIR	1	225	April 12, 2024
How to allocate memory inside gpu kernel function MLIR gpu	23	1881	November 16, 2023
Allocating workgroup memory for use by mlir-cuda-runner MLIR	15	1088	October 20, 2020
I'm having some problems with share memory in GPU dialect MLIR mlir	2	132	November 19, 2024
Memref.alloca in AMD GPU kernels seem to lower to llvm.alloca with an incorrect address space MLIR gpu	24	1134	January 4, 2023

[RFC] Simplifying Dynamic Shared Memory Access in GPU

Related topics