Modeling GPU async copy (Ampere feature)

I’m working on adding async copy from global to shared memory in the GPU dialect. This is not a fully baked RFC yet, but I wanted to share what I have so far and get input if anybody has thoughts about it.

For context, async copy is a new feature in Nvidia Ampere GPUs, but other GPUs are likely to support a similar feature in the future. It works like a DMA operation that copies data from global to shared memory without going through registers and without blocking the execution units. The big advantage is that it saves register usage and allows aggressive software pipelining to hide latency. Here is a blog post about async copies, although it doesn’t really explain how this works at the low level; the PTX spec has more low-level details.

This feature is meant to be used along with software pipelining. I added some software pipelining transformations for scf.for ops some time back here. Ideally those two features should play well together; however, the pipelining design can still be changed if needed.

There are two modes in Ampere, one blocking and one non-blocking. As I’m mostly interested in the blocking mode, this is what I’m concentrating on right now.

What I’m planning to add to the GPU dialect are the following 3 operations:

  • cp_async taking a source and destination memref with indices and the size to copy
  • cp_async_group_commit commits all the pending cp_async ops into a group. This op would also return an index or opaque type representing the group.
  • cp_wait_group would take the group index/opaque type to know which group it is supposed to synchronize with.

Those ops would have side-effects to prevent any re-ordering.
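To make the proposal concrete, here is a rough sketch of what the three ops could look like. The syntax, operand order, and the group type (shown as a made-up !gpu.async.group) are all placeholders, not final designs:

```mlir
// Hypothetical syntax; op and type names are placeholders.
// Copy elements from global memory into workgroup (shared) memory.
gpu.cp_async %dst[%c0, %c0], %src[%i, %j], 4
    : memref<32x32xf32, 3>, memref<1024x1024xf32>
gpu.cp_async %dst[%c1, %c0], %src[%k, %j], 4
    : memref<32x32xf32, 3>, memref<1024x1024xf32>
// Commit the two pending copies into one group and get a handle to it.
%g = gpu.cp_async_group_commit : !gpu.async.group
// ... independent work here to hide the copy latency ...
// Block until the copies committed in %g have completed.
gpu.cp_wait_group %g : !gpu.async.group
```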

This representation differs from how PTX exposes this functionality, as the group doesn’t have an explicit id there. Instead, the wait_group PTX instruction takes an immediate parameter giving the maximum number of groups that can still be in flight for the wait to unblock. This is not a great representation for the compiler, as it makes it difficult to know which group the wait is referencing, and it means we would need some intrusive hook into the software pipelining to make it work correctly.
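A small example may help illustrate the mismatch between the two representations (op and type names here are hypothetical sketches, not final):

```mlir
// MLIR form: each group is identified by an SSA value.
%g0 = gpu.cp_async_group_commit : !gpu.async.group   // 1st group
%g1 = gpu.cp_async_group_commit : !gpu.async.group   // 2nd group
gpu.cp_wait_group %g0 : !gpu.async.group             // wait on %g0 only
// PTX form: no group id; the immediate is the number of most recently
// committed groups allowed to remain in flight. Waiting on %g0 while
// %g1 may stay pending lowers to:
//   cp.async.wait_group 1;
```

Recovering that immediate requires knowing, at the wait, how many younger groups may legitimately remain outstanding, which is exactly the analysis discussed below.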

The downside is that we will need some analysis to convert to the LLVM/PTX representation, and we may have to fall back to a dynamic, sub-optimal case (the CUDA compiler seems to create a dynamic switch when it cannot resolve the count statically).

Let me know if anybody has any thoughts. I’m hoping to send an early patch next week.

FYI @herhut @MaheshRavishankar @nicolasvasilache

Thanks, @ThomasRaoux. Support for this would certainly be great to have. A few comments/questions.

  1. I am wondering which operations you would actually be adding to NVVM to support this?

  2. The “index or opaque type” appears unclear. Is it one or the other?

  3. “commit to a group” and other names sound confusing (because the transfer isn’t done yet) – this is all tied to the terminology used in the PTX spec. We could actually abstract out the group and only create/introduce it during lowering, i.e., one could have: copy_async_start (returns a tag) and copy_async_wait_all (takes a list of tags). You can create the corresponding groups and map to copy_async, copy_async_add_to_group and copy_async_wait_group while lowering. Does this lead to any loss of representational power? You just abstract away the explicit notion of a group, and with fewer ops.
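For illustration, this alternative could look roughly like the following (op names and the !gpu.async.tag type are made up for the sketch):

```mlir
// Each copy returns its own tag; no explicit group in the IR.
%t0 = gpu.copy_async_start %dst0[...], %src0[...] : ... -> !gpu.async.tag
%t1 = gpu.copy_async_start %dst1[...], %src1[...] : ... -> !gpu.async.tag
// ... other work ...
// Waits on an explicit list of tags; groups only appear during lowering,
// e.g. both starts above could be committed into a single group here.
gpu.copy_async_wait_all %t0, %t1 : !gpu.async.tag, !gpu.async.tag
```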

This support would be great to have. I have some thoughts on conversion to the cp.async.wait_group PTX instruction.

If cp_wait_group takes the group ids as arguments, we could count the total number of cp.commit operations in the block before the wait operation, take the difference with the number of arguments, and use that as N for the cp.async.wait_group instruction. Do you see this working?
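As a worked sketch of this counting scheme (hypothetical op names): with three commits before the wait and two groups passed as arguments, the difference gives the PTX immediate:

```mlir
%g0 = gpu.cp_async_group_commit : !gpu.async.group
%g1 = gpu.cp_async_group_commit : !gpu.async.group
%g2 = gpu.cp_async_group_commit : !gpu.async.group
gpu.cp_wait_group %g0, %g1
// 3 commits precede the wait and 2 groups are waited on, so at most
// 3 - 2 = 1 group (%g2) may remain in flight:
//   cp.async.wait_group 1;
```

Note this relies on the waited-on groups being the oldest outstanding ones; waiting on a newer group while letting an older one stay pending cannot be expressed exactly by the immediate and would over-synchronize.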

Here are the NVVM intrinsics it would map to:

I think it should be an opaque type as this doesn’t really map to an index.

I agree it would be good to find a higher-level representation. The downside is that breaking this representation up into groups may not be trivial if the copy_async_start ops consumed by different copy_async_wait_all ops are interleaved. Having the explicit grouping represented in the original IR would allow the user to directly control this and not rely on an analysis. That being said, representing this as a hidden side effect is not ideal. What do you think?

It would work in simple cases; however, it may not be possible when there is control flow and we cannot statically count the number of cp.commit operations. I don’t have a better solution at this point though. I think we will have to handle the static cases like that and fall back to emitting a switch when the count cannot be resolved.

Thanks for the RFC. As you mention, the design of these at the PTX level makes it really hard to generate code for. So my take here is that we don’t need the full generality of all possible sequences of copy/commit/wait operations, but rather should narrow down on a smaller subset that is still useful. In that respect, I have the same concerns as Uday. The cp_async_group_commit is problematic to represent. It has the same pitfalls as syncthreads in that sense: it depends on the textual order of things. I am also partial towards a copy_async_start that returns a token and a copy_async_wait_all that takes a list of tokens to wait on. The load-bearing part then becomes the transformation that converts these pairs into sequences of copy_async, copy_async_add_to_group and copy_async_wait_group. Given that we need to make this transformation at NVVM lowering, it would then make sense to have both paradigms:

  1. The copy_async_start and copy_async_wait_all paradigm that can be targeted by different lowerings.
  2. The copy_async, copy_async_commit_group (in lieu of copy_async_add_to_group) and copy_async_wait_group ops that mirror the NVVM intrinsics.

Before lowering to NVVM, there needs to be a transformation from paradigm (1) to paradigm (2). If we don’t support paradigm (1) in IR, then what will end up happening is that some lowering targeting paradigm (2) directly is effectively going to have a representation similar to paradigm (1), but instead of it being IR, it will be in the C++ code that is doing the lowering, making the lowering bloated and non-compositional. (Yet another case where carrying the logic in IR is better than having it in C++ code.)

We can start with a simplistic transformation from paradigm (1) to paradigm (2) and evolve it as we go along. The transformation can happen in several steps if need be as we hit more complex cases. My claim here is that anything in MLIR that is trying to directly target paradigm (2) will also be able to target paradigm (1), i.e. there is no loss of generality.
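For a simple straight-line case, the paradigm (1) to paradigm (2) transformation could look like this (hypothetical syntax; types elided):

```mlir
// Paradigm (1), token based:
%t0 = gpu.copy_async_start %dstA[...], %srcA[...] : ... -> !gpu.async.tag
%t1 = gpu.copy_async_start %dstB[...], %srcB[...] : ... -> !gpu.async.tag
gpu.copy_async_wait_all %t0, %t1

// Paradigm (2), after the transformation; both copies end up in one group:
gpu.copy_async %dstA[...], %srcA[...] : ...
gpu.copy_async %dstB[...], %srcB[...] : ...
gpu.copy_async_commit_group
gpu.copy_async_wait_group 0   // 0 groups may remain in flight
```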


Thanks for pushing on this @ThomasRaoux .

Any reason you can’t reuse/extend the async dialect ('async' Dialect - MLIR)?
I’d be interested to play with this from the tensor domain, with futures and the benefits related to def-use information. I would see the lowering to NVVM as something specific but the rest would generally be retargetable.

Also @ezhulenev @mehdi_amini here as I expect they’d have good insights.

Are you suggesting having those 2 levels of abstraction in the GPU dialect? I was mostly looking at the lower-level part, but I agree that we will need a higher level of abstraction on top of it.

Good point, I’m not familiar with this dialect but it does seem to expose functionality very close to what we need. The GPU dialect part would most likely only work on memrefs indeed. My understanding is that the async dialect is meant to be used for host code generation. Does it make sense to use it for device-side code? I assume we would need a lowering of those async ops significantly different from what exists today?

Yes. That’s what I was saying. Have 2 levels. Any dialect targeting the GPU dialect will have to use the higher-level abstraction.

I am wary of this working out of the box. It is meant for host-side async, while this is async within device code for global → shared copies. So both are async, but without knowing the details of the implementation, this seems like it would be conflating use cases.


I don’t think async has to be for the “host side” necessarily, but I agree that it is a paradigm of promise/future relationships that does not necessarily map in a straightforward way to DMA start/wait: while you could do it, it is also overly generic, so you would have to pattern-match very specifically for lowering, with the chance of just failing and rejecting the code.

I quite like Mahesh’s views on the topic here right now!

Generally +1 to @MaheshRavishankar suggestion of modelling this at a higher level in the GPU dialect and then lowering it to the specific form that is used in PTX. Below are some thoughts from thinking about this a little.

Even going with a format of copy_async_start and copy_async_wait_all does not remove the dependency on the textual order. How do we intend to handle control flow (if at all)? Is it allowed to use copy_async_start in a function and the copy_async_wait_all in the caller? We could require that the operations have to be in the same block.

I also considered, as an alternative, grouping copy_async ops in a region that returns an explicit token to wait on. That still does not remove the dependency on textual order but provides some scoping. I don’t think the added restrictions are worthwhile.

We also do not have a very precise way to model the effect of a copy_async_start operation. We can mark it as having read/write effects on its operands. The copy_async_wait_all would need to have much coarser memory effects and would effectively serve as a memory barrier. Or we could equip the copy_async_wait_all with operands that explicitly list the memory that is impacted. Verification would then need to ensure that the operands cover what is required. Not sure this is worthwhile and whether we need this precision.

Lastly, I have come back to the async modelling with futures. If we were to use an approach where copy_async_start returns a future/token and copy_async_wait_all waits for these explicitly, we capture more of the dependency information. And while this format allows interleaving of operations that belong to different waits, it should always be legal to push the operations down to their corresponding waits. Assuming we require that copies dominate their syncs.
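To illustrate the normalization this allows (hypothetical op names): interleaved starts can be sunk toward their own waits, as long as each copy still dominates its wait, recovering the contiguous grouping a PTX lowering needs:

```mlir
// Interleaved form:
%a = gpu.copy_async_start ...   // waited on by the second wait
%b = gpu.copy_async_start ...   // waited on by the first wait
gpu.copy_async_wait_all %b
gpu.copy_async_wait_all %a

// After sinking %a's start down to just before its wait (legal, since the
// copy still dominates the wait; it merely starts later and overlaps less):
%b = gpu.copy_async_start ...
gpu.copy_async_wait_all %b
%a = gpu.copy_async_start ...
gpu.copy_async_wait_all %a
```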

I don’t have cycles to think this through, so don’t take these thoughts as blockers for the design. Just putting these out here.

Is this comment conflating the generic async-await-future mechanism with the async.coro ops (that are implementation details for CPU and happen to live in the same dialect)? There is nothing host-specific to the “async-await-future” programming model; it is ubiquitous today from large scale datacenters to embedded devices…

Not sure why, take a high-level piece of IR that resembles:

```mlir
%group = async.create_group ... : !async.group
%token, %result = async.execute () -> !async.value<tensor<8x16x32xf32>> {
  %1 = ... : tensor<8x16x32xf32>
  async.yield %1 : tensor<8x16x32xf32>
}
async.add_to_group %token, %group : !async.token

%v = async.await_all %group : !async.group
// Note the following would work too and can be made arbitrarily fine-grained:
//   %v = async.await %result : !async.value<tensor<8x16x32xf32>>
```

The 3-D copy can lower to whatever unrolled form the hardware can support (pointwise, 1-D with/without strides, 2-D with/without strides, etc.). Then, after bufferization, an async.execute and the copies it contained will be lowered to some NVVM form of interest. I view this as a variation of the virtual-vector / hw-specific-vector dialects thinking, but on tensors.

I think these may be symptoms of trying to define new ops that operate on buffers?
It seems easy to extract a slice of copies on tensors and rewrite them in the async form above, wait for lowering to smaller copies / bufferization to happen and then rewrite to the proper NVVM / DMA form.

As often, the use of proper ops with regions and progressive lowering seems more composable and retargetable to me than having to design new intermediate abstractions that will likely want their own offset/size/stride representations and all the foldings and canonicalizations that go with them.

IMO gpu-dialect-specific copy_async_start etc. may conflate 2 things:

  1. copy ops that we already have and should reuse and extend as appropriate
  2. async ops that we already have and should reuse and extend as appropriate

The NVVM / HW-specific level abstractions are of course needed.

Indeed it is. If the async dialect has ops that serve the same purpose, there is no need to reinvent this in the GPU dialect. So I am fine with using this for the token-based abstractions.

I think these ops are referring to those that are needed for targeting the NVVM intrinsics. I am wary of the complexity of lowering from the async ops to the NVVM intrinsics living purely in the LLVM/NVVM lowering. I would rather it be exposed in MLIR and then lowered in a straightforward way to the NVVM intrinsics. Based on your statement

I think you agree as well. These NVVM/HW-specific abstractions are post bufferization and side-effecting?

The reason I’m uneasy with this is that inside a GPU kernel you would support only a very specific form of this sequence. The region of the async.execute can contain only a specific op that lowers to the “copy start” (you also need bufferization to handle the “shared memory” allocation somehow).

Basically, my comment reflected my feeling that the async.execute model offers too many degrees of freedom compared to the target abstraction exposed by NVVM. I feel it’s fragile because we’d have to pattern-match a very specific form of this that can be mapped to the target and nothing else. That seems fragile in the same way as approaches that rely on “raising” (not saying that this is raising, though).

That is a tough requirement to make in MLIR: I don’t think it can be enforced on generic transformations right now. It reminds me of the (unsolved) discussion about “convergence” modeling.
(and I suspect that the more we try to use “advanced” GPU features, the more we’ll keep hitting this “convergence” issue over and over)

yes, no Q there.
