[RFC] Addition of GPU conditional execution op

TLDR;

This RFC proposes the introduction of the gpu.conditional_execution operation, which allows having device and host code in the same op. It serves a similar purpose to the CUDA idiom:

__host__ __device__ void hostDevFn(...) {
#ifdef __CUDA_ARCH__
// device code
#else
// non-device code
#endif
}

The selection of whether the host or device code is ultimately executed is determined based on the context surrounding the operation.

This proposal is implemented in:

Why?

  • When the gpu-kernel-outlining pass is applied, functions called from inside a gpu.launch are cloned into the gpu.module, while the original function is also left in the host module. This limits what the called functions can contain; for example, they can’t contain GPU operations like gpu.thread_id because those are not defined for the host. gpu.conditional_execution is useful here because it makes it possible to always have a well-defined function for this use case.

  • The gpu.conditional_execution op is also useful for operations that could potentially run on either host or device, for example omp.target; see the sketch after this list.

  • AFAIK there’s no way to express the GPU idiom listed in the TLDR with upstream dialects, so this operation fills that gap with a classical GPU idiom.
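For illustration, here is a minimal sketch of the omp.target case (assuming the proposed gpu.conditional_execution syntax and an omp.target with no map clauses): the same region resolves to the device path when offloaded and to the host path otherwise.

omp.target {
  %id = gpu.conditional_execution device {
    %tid = gpu.thread_id x
    gpu.yield %tid : index
  } host {
    %zero = arith.constant 0 : index
    gpu.yield %zero : index
  } -> index
  // ... use %id ...
  omp.terminator
}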

Proposal:

Add the gpu.conditional_execution operation. This operation executes a region of host or device code depending on the surrounding execution context of the operation. If the operation is inside a GPU module or launch operation, it executes the device region; otherwise, it runs the host region.

This operation can yield a variadic set of results. If the operation yields results, then both regions have to be present. However, if there are no results, then it’s valid to implement only one of the regions.
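For example, a minimal sketch of the result-less case (assuming the proposed syntax, with a bare gpu.yield terminating the region): only the device region is provided, and in a host context the op would simply be removed by the resolution pass.

gpu.conditional_execution device {
  gpu.barrier
  gpu.yield
}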

Add the --gpu-resolve-conditional-execution pass. This pass resolves whether to execute the host or device region depending on the context.

Example:

Consider the following code, where the function thread_id returns 0 if called from the host, or gpu.thread_id x if called from a device.

func.func @thread_id() -> index {
  %val = gpu.conditional_execution device {
    %id = gpu.thread_id x
    gpu.yield %id: index
  } host {
    %id = arith.constant 0 : index
    gpu.yield %id: index
  } -> index
  return %val : index
}
func.func @launch(%host: memref<index>, %dev: memref<index, 1>) {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1,
                                       %grid_z = %c1)
             threads(%tx, %ty, %tz) in (%block_x = %c1, %block_y = %c1,
                                        %block_z = %c1) {
    %id = func.call @thread_id() : () -> index
    memref.store %id, %dev[] : memref<index, 1>
    gpu.terminator
  }
  %id = func.call @thread_id() : () -> index
  memref.store %id, %host[] : memref<index>
  return
}

After applying mlir-opt --gpu-kernel-outlining --gpu-resolve-conditional-execution --inline, we obtain code where the correct region has been resolved depending on the context:

module attributes {gpu.container_module} {
  func.func @thread_id() -> index {
    %c0 = arith.constant 0 : index
    return %c0 : index
  }
  func.func @launch(%arg0: memref<index>, %arg1: memref<index, 1>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    gpu.launch_func  @launch_kernel::@launch_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)  args(%arg1 : memref<index, 1>)
    memref.store %c0, %arg0[] : memref<index>
    return
  }
  gpu.module @launch_kernel {
    gpu.func @launch_kernel(%arg0: memref<index, 1>) kernel attributes {gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 1, 1, 1>} {
      %0 = func.call @thread_id() : () -> index
      memref.store %0, %arg0[] : memref<index, 1>
      gpu.return
    }
    func.func @thread_id() -> index {
      %0 = gpu.thread_id  x
      return %0 : index
    }
  }
}

The use-case is not totally clear to me. You mentioned the example of a function using #ifdef __CUDA_ARCH__ but this is a frontend construct: in the IR this would be mangled in the function name and you’d have two functions already.

It seems that you’re trying to get to a state where gpu.launch could call either host or device code, but is this really necessary? The alternative (which you’re not mentioning?) is that the higher-level layers must mangle/specialize the device code earlier.

There are two use cases:

  • With gpu.launch.

As it currently stands, calling functions from inside a gpu.launch is not always well defined, or the called functions are limited to containing no GPU operations. If the frontend is expected to manage this, then the kernel-outlining pass should be fixed to avoid keeping those functions in the host module, or to disallow them.

  • With omp.target, acc.kernel, or any ops that can potentially run in both host and device contexts.

In this case the op provides a convenient way to express two alternative paths that are resolved later in the compilation chain.
For example, let’s say a pass detects that inside omp.target there’s an op that could be optimized with a GPU mma intrinsic; it could then rewrite the op with conditional_execution and resolve which path to compile at a later stage.

Actually, I’m trying to fix that, because gpu.launch is currently able to reference anything, whether it is well defined or not. For example, the following code is valid but not translatable:

func.func @thread_id(...) -> index {
  %id = gpu.thread_id x
  return %id : index
}
func.func @main(...) {
  ...
  gpu.launch ... {
    %tid = func.call @thread_id(...)
  }
  %tid = func.call @thread_id(...)
  ...
}

The op introduces a mechanism for well-defined calls from gpu.launch.

I didn’t think of that alternative. However, in that case gpu.launch and kernel-outlining should be fixed.

Further to @mehdi_amini’s original concerns, I’m also struggling to see how this would work in any reasonable way.

I don’t understand how non-GPU execution could have GPU-specific code. Are you planning to implement the CPU version of every GPU dialect operation? If not, then running on the CPU would be an error, which goes back to what you’re trying to fix.

That’s not a fair comparison at all. OpenMP and OpenACC are generic programming models, and in addition to CPU execution, they have GPU offloading.
CUDA/OpenCL are GPU programming models and don’t have to have CPU implementations.

The compiler would only “detect that an op can be optimized with” a GPU op if the target is a GPU. It makes no sense to do so on a CPU target.

If you want to decide which target to dispatch to at runtime, @mehdi_amini’s answer is still far simpler than what you’re trying to do:

This would be a simple scf.if dispatching to two different functions: one for the CPU, simply calling it, and another for the GPU, dispatching and offloading.

This sounds like a bug that needs to be fixed at the source, not by creating another incomplete implementation (i.e. a new conditional op with no CPU variants of GPU ops).

Moreover, when we want to target non-CPU/GPU accelerators (FPGA, ASIC), what do we do? Do we create one per device and chain the conditionals? This does not look like an idea that scales very well.

No, this op is not for that. It’s the responsibility of the user to provide what to do in each path; the thread_id code is just a simple example.

I’m jumping the gun, but today or tomorrow I’ll be posting a different RFC that allows OMP offload compilation within MLIR. Part of the appeal is the possibility of using GPU intrinsics within omp.target; this might include having them in the IR before outlining.

For the omp.target example, one might know that both targets are going to be compiled, and it might make sense to try to optimize both at the same time.

The op is not about dispatching one or the other, it’s about having well-defined dual paths.

How does one know which branch to execute? That branch resolution would have to happen in the front-end, at which point the scf.if would become redundant, because the front-end could have emitted the correct branch from the start. But that assumes a separate compilation model, and gpu.launch doesn’t belong to that model.

The op is not for creating variants of GPU operations, it’s to allow defining what to do in each path.

The op is limited to context resolution between GPU and host: if it’s inside a GPU context, the device region is taken; otherwise, the host region.

I don’t really know what this means…

Not if the decision is taken at run time. For example, the kernel implements dynamic shapes and there is a run-time check: if (dim > 4096) -> use GPU, else use CPU.
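A minimal sketch of what that runtime decision could look like, assuming hypothetical pre-specialized functions @kernel_gpu and @kernel_cpu, a dynamic size %dim, and a buffer %buf:

%c4096 = arith.constant 4096 : index
%use_gpu = arith.cmpi sgt, %dim, %c4096 : index
scf.if %use_gpu {
  func.call @kernel_gpu(%buf) : (memref<?xf32>) -> ()
} else {
  func.call @kernel_cpu(%buf) : (memref<?xf32>) -> ()
}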

It’s the otherwise host part that you’re overlooking. That’s not always true. It could be: if GPU, else if XPU, else if TPU, else if IPU, else CPU.

If we add the GPU conditional, what’s stopping all other non-CPU devices from adding their own, and if they do, how do you even compose them?

And you can’t say “it’s undefined”, because you’re proposing to add something that will make using other devices harder. That’s not a very good design.

I mean that it’s not a runtime or dynamic decision. It’s a compile-time decision, so instead of being an if (isDevice) it would be more akin to an if constexpr (isDevice), and both paths (then and else) are potentially possible until the isDevice constant gets resolved.

That’s a different problem, and that one can be solved dynamically with ifs.

Composition would be well defined; however, it wouldn’t be pretty. It would be like a chain of:

if constexpr (isX) {
  ...
} else {
  if constexpr (isY) {
    ...
  } else {
    ...
  }
}

I can see the benefit of conditional execution for OpenMP/ACC, because omp.target or acc.kernel can execute any op for host and device. But why don’t you do outlining for each target beforehand and specialize inside these functions? Is early outlining an issue?

I like the conditional execution as an idea, but the op differentiates only between a generic GPU and the CPU. There is drastic diversity even within one vendor’s GPUs; writing mma is significantly different between architectures of the same vendor. I think it would be better if we come up with a generic op (like OpenMP’s metadirective or OpenACC’s device_type constructs).

Your point about calling gpu.thread_id x is right, but IMHO the decision to generate this line should be made in earlier phases of the compiler.


I definitely agree with this; however, I cannot commit to doing actual work on this path right now. Maybe in a few months we can revisit this, if it hasn’t been implemented.

What I’m thinking is, let’s take flang: it might know whether it has to generate GPU & CPU code from the start. Then one of the HLFIR intrinsics could be conditionally lowered before outlining, instead of having to go and rewrite it with two different passes. However, I do admit flang can do the lowering separately; it’s just a convenience.

I partially agree, but then gpu.launch shouldn’t exist. The idea that all of this should be handled earlier corresponds to a separate compilation model, in which the frontend handles the GPU and host paths as separate units from the start, which is fine. However, gpu.launch doesn’t belong to the separate compilation model (it’s embedded in the host and can call host functions), hence the problem.