[RFC] Converting multi-threaded SPIR-V to LLVM dialect: overview

Hi all,

I have been working on the SPIR-V to LLVM dialect conversion over the summer, focusing on single-threaded code. I am now interested in investigating how multi-threaded SPIR-V can target CPUs. Some related links:

This post intends to give a high-level overview of the project ideas and to spark discussion. Feel free to make suggestions/corrections/comments - any contribution is very welcome! :blush: After gathering some feedback I can start prototyping and add more details on the implementation side.

Goals
Ultimately, the idea is to be able to convert a multi-threaded SPIR-V kernel with synchronisation constructs to LLVM IR. We want to preserve inter-thread and memory coherence in such a way that running LLVM’s optimisations does not violate GPU-specific constraints. The kernel can look like

spv.module @foo Logical GLSL450 ... {
  spv.globalVariable @id built_in("WorkgroupId")

  spv.func @bar() "None" {
    %id = spv.mlir.addressof @id : ...
    // Code #1
    spv.ControlBarrier "Workgroup", "Device", "Acquire|UniformMemory"
    // Code #2
  }
}

Mapping to CPU
The main question is how to model multi-threaded GPU execution on CPUs. For that, we need to map the GPU thread organisation hierarchy onto CPU constructs. I think of the following mapping (inspired by Intel’s OpenCL to CPU compilation):

  • Each workgroup can be considered as a single CPU thread
  • Subgroups and invocations can be packed into SIMD vectors, similar to a wavefront

This approach avoids launching hundreds of threads on the CPU, and uses SIMD to mimic SIMT concepts.
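To make the mapping concrete, here is a tiny arithmetic sketch with made-up sizes (16 workgroups of 64 invocations each, 8-wide SIMD; all numbers are hypothetical):

```python
# Hypothetical dispatch sizes, purely to illustrate the proposed mapping.
workgroup_size = 64          # invocations per workgroup (W)
num_workgroups = 16
simd_width = 8               # CPU SIMD lanes (S)

# One CPU thread per workgroup...
num_cpu_threads = num_workgroups

# ...and within each thread, invocations are packed into SIMD vectors,
# so one logical GPU instruction becomes W / S vector instructions.
vector_ops_per_thread = workgroup_size // simd_width

print(num_cpu_threads, vector_ops_per_thread)  # 16 8
```

So 1024 GPU invocations become 16 CPU threads, each issuing 8 vector instructions per logical kernel instruction.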

Synchronisation
One of the challenges is to model barriers in the LLVM dialect. While spv.MemoryBarrier is essentially LLVM’s fence instruction, spv.ControlBarrier requires extra handling. A naive mapping can be similar to a spin lock: blocking until a counter value has reached the number of threads.
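As a rough illustration of that naive mapping (a hand-written Python sketch, not the output of any pass; all names are my own), a counter-based spin barrier could look like:

```python
import threading

class SpinBarrier:
    """Naive counter-based barrier: block until all threads have arrived."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:          # atomically bump the arrival counter
            self.count += 1
        while self.count < self.num_threads:
            pass                 # spin until everyone has arrived

order = []  # append is atomic in CPython, so this records a global order

def worker(tid, barrier):
    order.append(("before", tid))   # Code #1
    barrier.wait()                  # spv.ControlBarrier analogue
    order.append(("after", tid))    # Code #2

barrier = SpinBarrier(4)
threads = [threading.Thread(target=worker, args=(i, barrier)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# Invariant: every "before" entry precedes every "after" entry.
```

A real lowering would use an atomic counter plus the appropriate memory-ordering semantics rather than a Python lock, but the shape of the construct is the same.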

Transformations
With this mapping, each multithreaded kernel would have to

  • undergo an “unrolling” (workgroup >> single CPU thread/SIMD ops) transformation
  • undergo an “analysis” transformation to handle convergence/synchronisation issues (like variables crossing the barriers)

This transformation can be written as a separate stand-alone SPIR-V dialect pass: e.g. -spirv-to-cpu.

Can you use the OpenMP dialect in MLIR?
Can you use the OpenMPIRBuilder in LLVM?

This is to handle a ControlBarrier across multiple workgroups, right? Since you mentioned that you would map a workgroup to a single CPU thread, you can’t really spin lock there. Instead it seems to me that you need to do something more complex if the number of invocations in the workgroup doesn’t match the SIMD width.

Hi George, sorry about the late reply; just went back from a long leave. :slight_smile: It’s great to see that you are still interested in pushing this forward after the GSoC project!

I’d say before jumping into decisions, it would be nice to do some preparation and define the steps ahead. I suspect many of the issues we are going to handle won’t be trivial; they will involve taking inspiration from existing solutions and discussing with the community to weigh trade-offs. So making sure we survey enough is important. I believe you have already been doing this (hence this thread!), just without writing all the stuff down explicitly. :slight_smile:

Specifically, I’d suggest to understand more around 1) CPU/GPU concurrency hierarchies and differences, 2) existing concurrency programming primitives/models for CPU (threads, coroutines, fibers, SIMD, OpenMP, etc.), 3) CPU emulation of GPU (OpenCL, OpenGL, Vulkan, etc.) kernel execution in existing projects, 4) etc. It would be a great contribution already to write them down somewhere, as it provides a background for discussion and also gets everybody interested onto the same page.

Then we can start biting the whole task off piece by piece. This could be driven by working through some concrete real-world kernel so that we have a clear target. We won’t be handling everything in one go because there are so many cases. For the cases where we already have a direct mapping we can just do it. For others we might need improvements that take longer to land. Having dedicated threads for discussing specific issues, to keep focused, is typically more productive. (This isn’t saying the overview thread isn’t useful; quite the opposite: we need it too.) It would be great if you could drive formalizing such steps and plans.

Sorry if this is being a bit vague; but I guess that’s what overviews should be. :slight_smile:

Thanks,
Lei

Sorry for late reply! I was also on a holiday :slightly_smiling_face:.

That’s right. spv.ControlBarrier has different scopes, particularly:

  1. Device: a device can have multiple workgroups that should all reach the barrier to proceed. Since we model a workgroup as a single thread, this is essentially a thread barrier (e.g. a spin lock)

  2. For other scopes smaller than a workgroup we pack instructions into SIMD vectors. Say the workgroup size is W and the SIMD width is S. I see the following options:

  • if W == S, then it is just a vector instruction.
  • if W < S, we can still use vectors but with predication, turning off the unused lanes. (Vector predication is currently a work in progress in LLVM as far as I know; I will look more into this.) Alternatively, we can assume that W must be a multiple of S and not consider this case.
  • if W > S, we now need to create multiple (vector) instructions and make sure that they are executed before the barrier. In particular, if we had
vector_op1.1
vector_op1.2
// barrier was here
vector_op2.1
vector_op2.2

we need to make sure that vector_op1.* are executed before vector_op2.*. A possible solution is to combine the instructions before/after the barrier into non-inlined functions. Then the compiler should not reorder the produced function calls.
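A small Python simulation of the W > S case (toy ops standing in for real vector instructions; the example is made up): a workgroup of 8 invocations on a SIMD width of 4 turns each phase between barriers into two vector operations, and all of the first phase’s operations must complete before the second phase starts:

```python
W, S = 8, 4  # workgroup size and SIMD width (W > S, so W // S = 2 chunks)

data = list(range(W))  # one element per invocation in the workgroup
trace = []             # records execution order of the vector operations

def vector_op(name, f, chunk_start):
    """Apply f to one SIMD-width chunk of the workgroup's data."""
    for lane in range(chunk_start, chunk_start + S):
        data[lane] = f(data[lane])
    trace.append((name, chunk_start))

# Phase before the barrier: one logical op becomes W // S vector instructions.
for start in range(0, W, S):
    vector_op("op1", lambda x: x + 1, start)

# --- barrier: every op1 chunk must finish before any op2 chunk runs ---

for start in range(0, W, S):
    vector_op("op2", lambda x: x * 2, start)

# Each element went through op1 then op2: (x + 1) * 2.
```

The ordering constraint is exactly what the non-inlined-function trick above is meant to preserve once the chunks become real instructions that an optimiser could otherwise reorder.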

I agree! Thanks for the overview. So I guess the plan is the following:

  1. Outline the differences between GPUs/CPUs (SPIR-V/LLVM) to identify what has to be mapped (e.g. barriers, etc.).
  2. Gather links from existing projects (I have mostly looked at OpenCL at the moment).
  3. Decide on the kernel/cases to map (i.e. what SPIR-V/GPU dialect kernel we can consider as MVP).
  4. Decide on the actual mappings on a case-by-case basis.

I have been doing some things from the list already, but I guess it is way better to summarise them all in one place before jumping to prototypes, etc. So I will start with that :blush:.

Do you assume you always know the exact values for W and S before doing the conversion? Otherwise the code has to be generated parametrically somehow.

It isn’t clear to me why we need to prevent reordering of side-effect-free operations (outside of convergent-specific operations)? Are there specificities associated with the GPU model here?

In the SPIR-V dialect, we can use LowerABIAttributesPass to set shader interface attributes, and particularly the workgroup size. I assume the values of S should be chosen optimally based on W? I know that LLVM has a ScalableVectorType, so some inspiration can be taken from there. (I will add more on this when I put together resources, the proposal, current solutions, etc., as mentioned by @antiagainst )

Here I meant instructions with side effects, as per the SPIR-V spec for spv.ControlBarrier:

All invocations of this module within Execution scope must reach this point of execution before any invocation will proceed beyond it.

If you want to convert to single-thread CPU code, why would you still need fences? Is the SSA semantics not ensuring the needed sequencing, if the operations of multiple SPV threads are correctly interleaved into the single CPU thread? I saw your answer to @mehdi_amini’s similar question, but it’s not clear to me why you want to preserve more than the functional semantics when converting into CPU code.

The programming model of GPUs relies on the concurrent execution of “threads” inside a workgroup, and it can be semantically impossible to just execute one at a time in sequence if there is a barrier or a convergent/uniform instruction.
Mapping this to a single thread is possible, but you still need to interleave each instruction for all the “threads” in between barriers. Some cooperative instructions also need specific handling.
I’d look into how Intel mapped OpenCL to CPU for inspiration or POCL? Here is one past presentation: https://llvm.org/devmtg/2012-04-12/Slides/Ralf_Karrenberg.pdf and a good description of the algorithm involved in POCL: Tampere University Research Portal
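For a kernel without control flow, the interleaving described above amounts to loop fission around the barrier, with values that are live across the barrier stored once per “thread”. A Python sketch of the idea on a toy kernel (loosely following the POCL-style work-item loops; the kernel itself is made up):

```python
def run_kernel(workgroup_size, inputs):
    """Toy kernel: t = inputs[id] + 1; barrier; out[id] = t + t of neighbour."""
    # Values live across the barrier get one slot per "thread".
    t = [0] * workgroup_size
    out = [0] * workgroup_size

    # Work-item loop for the code *before* the barrier.
    for local_id in range(workgroup_size):
        t[local_id] = inputs[local_id] + 1

    # spv.ControlBarrier: implicit, because the first loop has fully finished.

    # Work-item loop for the code *after* the barrier.
    for local_id in range(workgroup_size):
        neighbour = (local_id + 1) % workgroup_size
        out[local_id] = t[local_id] + t[neighbour]
    return out
```

Running the two halves as separate loops is what makes every invocation’s t visible to its neighbour after the barrier, which a naive one-thread-at-a-time execution of the whole kernel would get wrong.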

Regarding workgroup size: the workgroup size is specified in the SPIR-V module as constants. They can be normal constants or specialization constants. But specialization constants are just a mechanism for second-stage compilation in driver compilers. I’d view the SPIR-V to LLVM conversion as a “driver compiler”, so specialization constants will be “frozen” and constant folded first. It’s reasonable to me to assume that we know the exact values for it. I’m not entirely up to speed with scalable vectors in LLVM; it would be nice to learn more via this effort. :slight_smile:

Regarding control barriers with device execution scope, I’m not sure how useful it would be to support them. At least for Vulkan, the execution scope is required to be workgroup or subgroup (see VUID-StandaloneSpirv-None-04636). We can have more workgroups than GPU cores: some issued to GPU cores, some still waiting. It isn’t clear to me how one can sync across all these workgroups during kernel execution. Specific vendors might have extensions to support it in a constrained way, but I don’t think there is a standard cross-vendor solution for it. Again, to me, we won’t be supporting all features by going through all the details in the spec in one step (SPIR-V is for multiple APIs anyway), so having some real cases to guide the way would be nice.

Great reading! I have also found a talk for these slides from LLVMdev. Thanks a lot.

Indeed! I reformulated the proposal (below) so that it is more focused on a single initial feature - transforming a parallel kernel into a nearly-sequential function.

Goal and assumptions
The main goal is to set up conversion patterns and transformations that make the conversion of multi-threaded SPIR-V dialect code to the LLVM dialect possible.

The focus will be on simple kernels, with no function calls and no notion of control flow (yet). This is primarily because control flow operations may require complicated transformations for satisfactory performance, mostly due to control flow divergence. While this can be tackled by vector predication (essentially mapping GPU control flow masks), there are other approaches which can be more beneficial. These can be considered as future extensions to this project. It is also easier to build them once the core infrastructure has been set up.

Parallelism
Just to reiterate, the main idea is to map a single workgroup to a single CPU thread. In the ideal case this would transform subgroups to SIMD vectors and invocations to vector elements.

Short summary
I propose the following project outline (for the nearest future at least).

  1. Have a hello-world kernel conversion working. As the basic kernel we can consider one that uses workgroup/subgroup and invocation ids in computations but does not have any synchronisation (and no control flow, per the assumption above).

  2. Model SPIR-V barrier synchronisation: the cases here are spv.ControlBarrier and spv.MemoryBarrier.

This will help to tackle the GPU to CPU mapping challenges one by one, and will help to support the main features quickly. More details about step 1 can be found below. I also think that a separate thread to discuss barriers (step 2) would be more convenient. (Plus, step 1 can be prototyped at that point.)

1. Hello world kernel

An example of a simple kernel that can be supported is the following:

spv.module @hello_world Logical GLSL450 {
  spv.globalVariable @nw built_in("NumWorkgroups") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @lid built_in("LocalInvocationId") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @wid built_in("WorkgroupId") : !spv.ptr<vector<3xi32>, Input>
  spv.func @hello_world_kernel(/* some buffer arguments */) "None" {
      // SPIR-V code that represents something like:
      // buffer[wid][lid] = ...
      spv.Return
  }
  spv.EntryPoint "GLCompute" @hello_world_kernel, @wid, @lid, @nw
  spv.ExecutionMode @hello_world_kernel "LocalSize", 16, 1, 1
}

1.1 “Unrolling” the kernel
The idea is to make the code “single-threaded” for every workgroup id in dimensions x, y, z. I propose to introduce a kernel unrolling pass (arbitrary naming for now) that will essentially wrap the kernel into for loops (this approach nicely follows OpenCL to CPU compilation: see @mehdi_amini’s links above). The structure (with pseudocode for loops) will then be:

spv.module @hello_world Logical GLSL450 {
  spv.globalVariable @nw built_in("NumWorkgroups") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @lid built_in("LocalInvocationId") : !spv.ptr<vector<3xi32>, Input>
  spv.globalVariable @wid built_in("WorkgroupId") : !spv.ptr<vector<3xi32>, Input>
  spv.func @hello_world_kernel(/* some buffer arguments */) "None" {
      for every workgroup id from (0, 0, 0) to numWorkgroups
        for every subgroup from 0 to subgroup size // Optional
          for every local id from (0, 0, 0) to LocalSize
            // Old kernel code here
  }
  spv.EntryPoint "GLCompute" @hello_world_kernel, @wid, @lid, @nw
  spv.ExecutionMode @hello_world_kernel "LocalSize", 16, 1, 1
}

The number of workgroups and the number of subgroups (if present) are passed in as global variables. The number of local invocations per workgroup is extracted from spv.ExecutionMode.

Having done this, it now remains to parallelise the workgroup for loop to dispatch numWorkgroups threads, and to handle the scalar kernel separately.

1.2 Parallelising the workgroup loop
To create the 1D workgroup for loop I propose to reuse the lowering from the OpenMP dialect to the LLVM dialect. This can be modelled with omp.parallel to specify the number of threads, and omp.WSLoop to actually create the parallel loop.

To create 2D or 3D workgroups we can proceed in the same way, but also set the collapse value in omp.WSLoop so that the nested loops are handled together. I haven’t properly looked at this part yet, so I think that starting with 1D is reasonable.

The actual code inside the workgroup loops can be packed into a function that additionally takes all ids as parameters.
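As a rough CPU-side analogue of what this lowering would produce (a Python thread pool standing in for omp.parallel/omp.WSLoop, and a hypothetical outlined body function taking the workgroup id as a parameter; sizes are made up):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKGROUPS = 4
LOCAL_SIZE = 16  # would come from spv.ExecutionMode "LocalSize"

buffer = [[0] * LOCAL_SIZE for _ in range(NUM_WORKGROUPS)]

def outlined_kernel_body(workgroup_id):
    """Code from inside the workgroup loop, outlined with the ids as parameters."""
    for local_id in range(LOCAL_SIZE):
        buffer[workgroup_id][local_id] = workgroup_id * LOCAL_SIZE + local_id

# The 1D "workgroup loop", dispatched across CPU threads
# (what omp.parallel + omp.WSLoop would express in the real lowering).
with ThreadPoolExecutor(max_workers=NUM_WORKGROUPS) as pool:
    for _ in pool.map(outlined_kernel_body, range(NUM_WORKGROUPS)):
        pass  # drain the iterator so worker exceptions are not swallowed
```

Each thread runs one workgroup to completion, which is safe here because there is no synchronisation between workgroups in the hello-world kernel.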

1.3 Vectorizing the kernel
I propose to use a second pass after “unrolling” the kernel to pack the instructions into vectors. There are a number of points here:

  1. The vector width should be parametrised and passed as an argument to the pass. Otherwise, analysis is required to get the optimal SIMD width (like in the LLVM loop vectoriser), which is, again, an extra feature that can be considered separately from the initial goal.
  2. Assume LocalSize % width == 0 to avoid non-trivial cases in the beginning. The starting point will be a 1D LocalSize.
  3. All local ids should be used directly in the kernel to avoid gather/scatter ops.
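A Python sketch of points 2 and 3 together (lists standing in for SIMD vectors; the kernel and names are made up): with LocalSize % width == 0 and contiguous local ids, each chunk is a plain contiguous load/store with no gather/scatter:

```python
SIMD_WIDTH = 4       # passed in as a pass parameter (point 1)
LOCAL_SIZE = 8       # assumption: LOCAL_SIZE % SIMD_WIDTH == 0 (point 2)

def vadd(a, b):
    """Stand-in for a SIMD add over a vector of lanes."""
    return [x + y for x, y in zip(a, b)]

inputs = [10 * i for i in range(LOCAL_SIZE)]
out = [0] * LOCAL_SIZE

# Scalar kernel: out[lid] = inputs[lid] + lid.
# Vectorised: process SIMD_WIDTH contiguous local ids at once.
for base in range(0, LOCAL_SIZE, SIMD_WIDTH):
    lid_vec = list(range(base, base + SIMD_WIDTH))  # local ids used directly (point 3)
    in_vec = inputs[base : base + SIMD_WIDTH]       # contiguous load, no gather
    out[base : base + SIMD_WIDTH] = vadd(in_vec, lid_vec)  # contiguous store
```

Because the local id vector is just a contiguous lane range, every memory access stays a unit-stride load/store, which is the case the initial pass would target.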

Hopefully this makes things clearer!

I encourage you to look into convergent operations in general: it is possible that in your context it is OK to approximate them the way you handle barriers, but I’m not entirely sure either :slight_smile:

Otherwise it seems like a good plan for an initial prototype! I’d encourage you to try to ping the folks working on OpenCL on CPU (from POCL or clang?) and get feedback on what we’re doing here; they may be interested somehow.


Awesome, thanks George! This looks very clear and targeted. It should lay down a great foundation for other additional tasks!

I guess another nice deliverable of this progressive approach is that we can have very thorough checks in the end: we explicitly enable support piece by piece, so we are less likely to accept unsupported cases and thus miscompile in the future when things become more advanced. :slight_smile: Overall looks good to me. A few nits:

GPU to CPU distribution pass? CPU thread distribution pass? (Sorry, I’m not creative here. But to me “unrolling” hints that it’s still the same compute unit processing all the workload.)

In SPIR-V resources are passed in via global variables using bindings. So the entry function does not have arguments.

+1!