TLDR;
This RFC proposes the introduction of the gpu.conditional_execution operation. This operation allows device and host code to coexist in the same operation. It serves a similar purpose to the CUDA idiom:
```mlir
__host__ __device__ void hostDevFn(...) {
#ifdef __CUDA_ARCH__
  // device code
#else
  // non-device code
#endif
}
```
The selection of whether the host or device code is ultimately executed is determined based on the context surrounding the operation.
This proposal is implemented in:
Why?
- When the gpu-kernel-outlining pass is applied, functions called from inside a gpu.launch are cloned into the gpu.module, while the original function is also left in the host module. This limits what the called functions can contain; for example, they can't contain GPU operations like gpu.thread_id, because those are not defined for the host. gpu.conditional_execution is useful here because it makes it possible to always have a well-defined function for this use case.
- gpu.conditional_execution is also useful for operations that could potentially run on either the host or the device, for example omp.target.
- AFAIK there's no way to express the GPU idiom listed in the TLDR with upstream dialects, thus this operation adds a classical GPU idiom.
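To illustrate the omp.target point above, a hedged sketch of how the operation could appear inside a target region (omp.target clauses are omitted for brevity, and the exact lowering is an assumption of this sketch):

```mlir
// Hypothetical sketch: the same region is valid whether the omp.target
// body is ultimately mapped to a GPU device or falls back to the host.
omp.target {
  %id = gpu.conditional_execution device {
    // Taken when the target region executes on a GPU.
    %tid = gpu.thread_id x
    gpu.yield %tid : index
  } host {
    // Taken when the target region executes on the host.
    %c0 = arith.constant 0 : index
    gpu.yield %c0 : index
  } -> index
  // ... use %id ...
  omp.terminator
}
```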
Proposal:
Add the gpu.conditional_execution operation. This operation executes a region of host or device code depending on the surrounding execution context of the operation. If the operation is inside a GPU module or launch operation, it executes the device region; otherwise, it executes the host region.
This operation can yield a variadic set of results. If the operation yields results, then both regions have to be present. However, if there are no results, then it's valid to implement only one of the regions.
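For instance, under the rules above, a result-less form with only one region would be valid; a minimal sketch (assuming the single-region syntax simply omits the other region, and that a bare gpu.yield terminates a result-less region):

```mlir
// No results are yielded, so only the device region is required.
// On the host, this would resolve to a no-op.
gpu.conditional_execution device {
  %tid = gpu.thread_id x
  // ... device-only side effects ...
  gpu.yield
}
```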
Add the --gpu-resolve-conditional-execution
pass. This pass resolves whether to execute the host or device region depending on the context.
Example:
Consider the following code, where the function thread_id returns 0 when called from the host, or gpu.thread_id x when called from the device.
```mlir
func.func @thread_id() -> index {
  %val = gpu.conditional_execution device {
    %id = gpu.thread_id x
    gpu.yield %id : index
  } host {
    %id = arith.constant 0 : index
    gpu.yield %id : index
  } -> index
  return %val : index
}

func.func @launch(%host: memref<index>, %dev: memref<index, 1>) {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1,
                                       %grid_z = %c1)
             threads(%tx, %ty, %tz) in (%block_x = %c1, %block_y = %c1,
                                        %block_z = %c1) {
    %id = func.call @thread_id() : () -> index
    memref.store %id, %dev[] : memref<index, 1>
    gpu.terminator
  }
  %id = func.call @thread_id() : () -> index
  memref.store %id, %host[] : memref<index>
  return
}
```
After applying mlir-opt --gpu-kernel-outlining --gpu-resolve-conditional-execution --inline, we obtain code in which the correct code section was resolved depending on the context:
```mlir
module attributes {gpu.container_module} {
  func.func @thread_id() -> index {
    %c0 = arith.constant 0 : index
    return %c0 : index
  }
  func.func @launch(%arg0: memref<index>, %arg1: memref<index, 1>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    gpu.launch_func @launch_kernel::@launch_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1) args(%arg1 : memref<index, 1>)
    memref.store %c0, %arg0[] : memref<index>
    return
  }
  gpu.module @launch_kernel {
    gpu.func @launch_kernel(%arg0: memref<index, 1>) kernel attributes {gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 1, 1, 1>} {
      %0 = func.call @thread_id() : () -> index
      memref.store %0, %arg0[] : memref<index, 1>
      gpu.return
    }
    func.func @thread_id() -> index {
      %0 = gpu.thread_id x
      return %0 : index
    }
  }
}
```