Mark gpu::LaunchFuncOp Async?

Hi all,

How can one mark a gpu::LaunchFuncOp as async? According to the documentation, the async keyword can be attached, and the LaunchFuncOp then returns an async token.

Most other GPU ops take an asyncToken in their build method, which allows attaching the async keyword, e.g., gpu::AllocOp:

static void build(..., /*optional*/::mlir::Type asyncToken, ::mlir::ValueRange asyncDependencies,...);

However, the only build method of LaunchFuncOp does not take such a token/Type, and the default build methods are not present because the op is marked skipDefaultBuilders in the tablegen file.

Is there any other way to attach the async keyword?

Thanks in advance,

This might well be an oversight. The async keyword in our flows is added by the transformation pass here, which uses clone to create the operation with the added flag. Hence, we never had the need for an explicit builder.

I’d welcome a patch that adds support for this, though.

@herhut Thanks for your reply!

I could try to add the builder, but it might take some time until I get around to doing it. Do you know why LaunchFuncOp was marked with skipDefaultBuilders?

I also realized that LaunchOp does not support async at all. Is there any specific reason why this is the case? My naive assumption was that an async LaunchOp could be transformed into an async LaunchFuncOp during GPUKernelOutlining.

I can only guess. The custom builder groups the different arguments to a launch into logical groups (threads, blocks, operands) vs. passing just a list of operands. Not having the default builder avoids exposing the underlying modelling.

Not supporting async on LaunchOp avoids the complexity of handling the async case for the launch operation. The region-based version is meant for optimizations like moving code in and out of launches, and async makes this more difficult.

Typically, we first go to an outlined version and then introduce async. Out of curiosity, why do you want to do it the other way round?

That makes sense, adding async here would indeed make reasoning about the contents of the launch much more difficult. Thanks for the explanation!

In my case, I partition a large computation into a set of smaller tasks, which are then lowered into gpu.launch. Some of these tasks may be independent of each other, so they could execute concurrently, which I could try to model with GPU async. For every task, I do know at compile time on which other tasks it depends, and if gpu.launch had async-support I could attach this information as depend-tokens.

On the other hand, in my case, it would probably also be possible to lower directly to a GPU function and gpu.launch_func and attach the depend-tokens there.
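To make that second option concrete, here is a rough sketch in plain C++ (no MLIR API involved) that turns a compile-time task graph into the textual form of async gpu.launch_func ops, one token per task. The Task struct, the emitAsyncLaunches helper, the @kernels module name, the kernel names, and the fixed %c1 grid/block sizes are all made up for illustration, and kernel arguments are left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical task description, not an MLIR data structure.
struct Task {
  std::string kernel;             // kernel symbol name (made up)
  std::vector<std::size_t> deps;  // indices of earlier tasks this one waits on
};

// Emits one textual `gpu.launch_func async` line per task.
// Task i defines token %t<i>; its dependency list names the tokens of the
// tasks it depends on. Tasks must be listed so that every dependency
// appears before its user, otherwise a token would be used before it is
// defined.
std::vector<std::string> emitAsyncLaunches(const std::vector<Task> &tasks) {
  std::vector<std::string> lines;
  for (std::size_t i = 0; i < tasks.size(); ++i) {
    std::string deps;
    for (std::size_t d : tasks[i].deps) {
      if (d >= i)
        throw std::invalid_argument("dependency must refer to an earlier task");
      if (!deps.empty())
        deps += ", ";
      deps += "%t" + std::to_string(d);
    }
    // Omit the bracketed list entirely when there are no dependencies.
    std::string depList = deps.empty() ? std::string() : " [" + deps + "]";
    lines.push_back("%t" + std::to_string(i) + " = gpu.launch_func async" +
                    depList + " @kernels::@" + tasks[i].kernel +
                    " blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)");
  }
  return lines;
}
```

With two independent tasks and a third that depends on both, the first two lines carry no dependency list and the third waits on [%t0, %t1], so the first two launches are free to run concurrently.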

What happens if I combine the async dialect with the gpu dialect? How does the lowering of a gpu.launch inside an async.execute work? Is the async attached directly to the launch (launch_func) if the execute does not contain any other operations?

That is the solution I would recommend if you want to do concurrency planning at a higher level, possibly even before lowering to gpu. See this pass for an example that takes an async.execute and turns the inner gpu invocations into async.

Different async regions then end up on separate streams when lowered to the runtime (see here), which encodes the concurrency at the CUDA level.

You can find examples in the tests and integration tests.

@csigg contributed the implementation of this.

@herhut: Thanks for the hints and the pointers, I will look into that!