GPU Compute Basic Algorithms

Thanks Nicolas. That makes sense!

As promised, here is my take on the homogenization point from before. In my next post I want to talk more about composition and reusability.

Hierarchy Levels

I am not sure how to name these exactly, but they should definitely not be called thread, vector, lane, warp, wavefront, block, grid, etc. These terms are confusing as they can mean different things depending on the vendor, the technology, and on CPUs / GPUs.

Nonetheless, it should be possible to define an attribute enum for these 4 levels, which can be used inside the GPU and Vector dialects equally. This attribute would then be used to progressively lower the operations onto their own equivalent of the next lower level.

Single Element (Lowest Level)

  • terms (GPU): invocation (Vulkan), lane, thread
  • terms (CPU): vector element
  • Memory type: Register, temporary thus SSA-only
  • Dimensionless (always 1D)
  • Concurrency: Simultaneous / parallel and implicitly synchronized
  • Possible interactions: Bits inside the element

Single Processor

  • terms (GPU): subgroup (Vulkan), warp, wavefront
  • terms (CPU): SIMD vector, thread
  • Memory type: Register, temporary thus SSA-only
  • Dimensionless (always 1D)
  • Concurrency: Simultaneous / parallel and implicitly synchronized
  • Possible interactions: All elements in the same processor

Multi Processor

  • terms (GPU): workgroup (Vulkan), thread block
  • terms (CPU): (sliding) window
  • Memory type: Shared, temporary thus SSA-only
  • Dimensional (fixed to 3D on GPUs)
  • Concurrency: Simultaneous / parallel but requires explicit synchronization
  • Possible interactions: All elements which are currently being processed

Entire Array (Highest Level)

  • terms (GPU): global work (Vulkan), grid
  • terms (CPU): n-D vector (MLIR Vector Dialect), array
  • Memory type: Global / Buffer, persistable thus memref
  • Dimensional (fixed to 3D on GPUs)
  • Concurrency: Sequential / serial
  • Possible interactions: Only from past to future
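For illustration, the proposed level attribute could be sketched as a plain Python enum; the names below are just the section headings above, not an existing MLIR attribute:

```python
# Hypothetical sketch of the proposed 4-level attribute as a Python enum.
# The member names mirror the headings above; "lower_one_level" mimics the
# idea of progressively lowering onto the next lower level.

from enum import Enum

class HierarchyLevel(Enum):
    SINGLE_ELEMENT = 1    # invocation / lane / thread; vector element on CPU
    SINGLE_PROCESSOR = 2  # subgroup / warp / wavefront; SIMD vector on CPU
    MULTI_PROCESSOR = 3   # workgroup / thread block; (sliding) window on CPU
    ENTIRE_ARRAY = 4      # global work / grid; n-D vector / array on CPU

def lower_one_level(level):
    """Progressive lowering targets the next lower level, or stops at the bottom."""
    return HierarchyLevel(level.value - 1) if level.value > 1 else None
```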


The temporal / sequential axis can be lowered from the array level down to the element level in some cases (like a scan followed by some element-wise arithmetic): Instead of doing one pass for the scan kernel, storing and loading intermediate results to / from buffers, and then another pass for the arithmetic kernel, everything can be fused into one kernel and one pass. This lets the values stay in registers and avoids the slow access to global memory in the middle. But this also means that part of the temporal / sequential axis then happens inside each element.
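A minimal plain-Python sketch of this fusion, with scalar loops standing in for the kernels (the function names `two_pass` and `fused` are made up for illustration):

```python
# Sketch: fusing a scan kernel with a following element-wise kernel, so the
# intermediate values never round-trip through a buffer ("global memory").

def two_pass(data):
    # Pass 1: inclusive scan, result written to an intermediate buffer.
    scanned = []
    acc = 0
    for x in data:
        acc += x
        scanned.append(acc)          # store to global memory
    # Pass 2: element-wise arithmetic, re-loading the buffer.
    return [2 * y for y in scanned]  # load from global memory

def fused(data):
    # One pass: the running value stays "in a register" (a local variable),
    # and part of the sequential axis now happens inside each element.
    out = []
    acc = 0
    for x in data:
        acc += x
        out.append(2 * acc)
    return out

assert two_pass([1, 2, 3, 4]) == fused([1, 2, 3, 4]) == [2, 6, 12, 20]
```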

Fixed Dimensions in GPUs

The fixed dimensions for the two upper levels of the hierarchy are somewhat problematic:

  • They are fixed to exactly 3 axes, so they are a bad fit for generic n-D vectors / arrays.
  • The maximum number of elements along each axis can differ from the other two. This is especially problematic for transpose operations.
  • The y and z axes are sometimes almost unusable / ridiculously limited (on some mobile devices).

For these reasons it might be better to only use the x axis and simulate n-D instead.
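Simulating n-D on the x axis comes down to linearizing the index space and recovering the coordinates inside the kernel. A small row-major Python sketch (the helper names are made up):

```python
# Sketch: flattening an n-D index space onto the 1-D x axis and recovering
# the n-D coordinates inside the "kernel" (row-major linearization).

def linearize(coords, shape):
    idx = 0
    for c, s in zip(coords, shape):
        idx = idx * s + c
    return idx

def delinearize(idx, shape):
    coords = []
    for s in reversed(shape):
        coords.append(idx % s)
        idx //= s
    return tuple(reversed(coords))

shape = (4, 5, 6, 7)  # 4-D: more axes than the 3 a GPU dispatch offers
assert delinearize(linearize((3, 1, 2, 6), shape), shape) == (3, 1, 2, 6)
```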

The Multi Processor Level on CPUs

I could not find anything on multi-processor collaboration in the Vector dialect or anything else CPU related. Is it just not planned / implemented, have I overlooked something, or is it unfeasible?

Overall this looks good to me! Seems like it would be nice to have this as a doc somewhere (maybe here: 'gpu' Dialect - MLIR ?), I remember being confused by the Vulkan terminology coming from CUDA.

On the terminology, we should stay aligned with the GPU dialect. If we redefine the levels, we should do it there accordingly.
Your single-processor/multi-processor levels are somewhat of a misnomer to me: a thread block runs on a single multi-processor (in CUDA terminology) as far as I understand. Also, you mentioned for the multi-processor “Possible interactions: All elements which are currently being processed”; it isn’t clear if this refers to all elements in the entire array: it should be all elements in the block.

What about LockStep and Processor for these two levels?

I am not sure what you meant by that.
Wouldn’t it be best to find names that match the GPU and CPU (Vector dialect) equally?

Lockstep goes in the right direction, but it is more of a phenomenon than a property. While the second lowest level is definitely always in lockstep, the level above could also happen to be in lockstep, depending on the scheduling and synchronization. And yes, you are right that the term “processor” can mean totally different things depending on the vendor and implementation. So it is probably best not to use it at all.

I wanted to make it clear that the “block” / “work group” is not just a physical location but also has a temporal aspect. It logically traverses the array (at the highest level) but can stay at the same physical location. That is why I used the word “currently” in “All elements which are currently being processed”.

Maybe the perspective of scheduling (similar to your idea of lockstep) could be used for the naming. As far as I know, on GPUs the dynamic scheduling happens on the third (lowest) level. This would also be the same for the CPU side of things.

Trying to glob CPU/GPU together might not be the best way to implement the algorithms that the original post described. To get efficient performance, the algorithms on CPU and GPU could vary quite significantly. So at this level (close to the GPU dialect) I would trade off portability for performance. I did see that the consensus was to do it in both, and I will go with the consensus, but my overall sense from the post is that the algorithms being implemented are GPU first. Given that, I agree with Mehdi that it is better to keep the terminology closer to GPU/GPU dialect. Since the GPU dialect is to map to both the SPIR-V and LLVM dialects (NVVM actually), OpenCL/Vulkan terminology is preferable (though there is some CUDA terminology in the GPU dialect, like grid_dim and block_dim, which is legacy and no one has had the time to address it).

Would be better not to re-invent new terminology. Vulkan terminology is agreed upon by all vendors that support Vulkan (which includes NVIDIA). We could go with workitem/subgroup/workgroup, which map to thread/warp/block in CUDA.

+1. I remembered we used to have such tables in some internal docs but they were not migrated out together with the open sourcing. Agreed that it would be nice to use this opportunity to have such a terminology map. And @Lichtso already did it nicely! :slight_smile:

+1. We already have 4+ (AMD, NVIDIA, OpenCL, Vulkan, …) columns in the terminology map; it would be nice to avoid creating another one, at least in the GPU domain. Aiming to map to different lower-level IRs, settling on a set of terms following a standard would be preferable.

Ah I didn’t get this angle, I see what you mean. That said, does the CPU/Vector dialect already have naming for this hierarchy?
The trade-off seems like either introducing new terminology, or using the GPU one and doing the mental mapping when targeting CPU. I can see pros and cons for both.
I am fairly allergic to Vulkan terminology personally (invocation is a terrible term for a SIMD lane, as “invocation” is already used in many contexts, including “invoking a kernel” for example), but if we already use it extensively in the GPU dialect, it would lower the mental load for everyone to use it everywhere consistently, I think.

I have been prototyping these algorithms the last few days to get the composition right. Apart from the 4 primitives which were already discussed (element-wise compute kernel, reduction, exclusive / inclusive scan), I found 2 additional ones to be very helpful: gather and scatter.

Gather is a permutation and scatter is an inverse permutation; both are the inverse of each other. Additionally, they can also duplicate or discard elements, not just reorder them. For the gather this means that the index array can have fewer elements than the maximal index in it plus one, resulting in elements being discarded. Likewise, for the scatter a condition mask can be supplied to discard some elements.

They could be implemented as:

  • register moves on the lowest hierarchy level
  • shuffle on the second level
  • load from (gather) / store to (scatter) shared memory on the third level
  • load from (gather) / store to (scatter) buffer / global memory on the fourth level

(I think the vector.shuffle is already a gather as I described it)
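To make the gather / scatter semantics concrete, a plain-Python sketch (not MLIR; the function names are made up):

```python
# Sketch of the gather / scatter semantics described above.
# Gather reads src[indices[i]]; scatter writes src[i] to dst[indices[i]],
# with an optional mask to discard elements.

def gather(src, indices):
    # The index array may be shorter than max(indices) + 1, so some
    # source elements are simply never read (discarded); repeated
    # indices duplicate elements.
    return [src[i] for i in indices]

def scatter(src, indices, size, mask=None):
    dst = [None] * size
    for i, (x, j) in enumerate(zip(src, indices)):
        if mask is None or mask[i]:
            dst[j] = x
    return dst

data = [10, 20, 30, 40]
perm = [2, 0, 3, 1]
# With a pure permutation, scatter is the inverse of gather:
assert gather(scatter(data, perm, 4), perm) == data
assert gather(data, [1, 1, 3]) == [20, 20, 40]  # duplication + discarding
```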

Gather and scatter together allow defining the exchange of information between elements without using memref, purely in SSA style. Is that what you meant by “the side-effecting memory world” and “the value world”?

Thanks @Lichtso for digging deeper,

At this point I would be interested in seeing some of the prototypes you mention that you have in progress and see how things would connect, from some level (doesn’t have to be the top-level, prob. better to start with the lowest level), down to execution.

In particular, it seems in your proposal, the vector level is only at the highest level and that you have more specific entities at lower levels (new types?). Note that MLIR does not have any register guarantee: it has SSA values of different types, and memory. Any register promotion / allocation is done in LLVM, and one thing we can do from an MLIR perspective atm is to coarsen the granularity of these values (up to the point where LLVM would spill). A vector.contract that operates on 2-D vectors is an example: starting from 1x1 and coarsening yields better and better arithmetic intensity until it gets too big and starts spilling (e.g. 6x16x8xf32 on avx2 and 8x32x32 on avx512 when lowered to vector.outerproduct (i.e. vector.broadcast + vector.fma)).

I seem to understand that you are proposing ops with semantically bearing names + attributes that describe mapping to hardware as “the representation of these primitives”?

I’ll describe some of the things we have in flight to maybe help bridge the delta between what I think you propose and what we have today. As far as basic compute primitives go, I have been thinking in terms of target-independent ways of expressing the computation and how it breaks down in smaller pieces (both in some in-memory form and in SSA-values form). Then, mapping at some level of hierarchy would be obtained by applying transformations and targeting a primitive of appropriate granularity (i.e. loops + processor ID information, memory accesses, SSA values and special instructions). Such transformations could be driven by special mapping attributes (which atm do not exist in core but may be close to what you are proposing).

As an example, let’s take the WIP flow for linalg.matmul to SPIRV. It resembles:

  1. linalg.matmul → tile, promote (with padding) and assign loops to blocks → loops + blockIdx... + copies to shared memory + linalg.matmul (this is done in a few steps but I cram everything in step 1.)
  2. rewrite as vectors into vector.transfer + vector.contract
  3. unroll vector operations to some good sizes for the HW (e.g. what comes in step 4.)
  4. map vector.contract to cooperative matrix instruction (e.g. 1 warp)

As these steps occur, canonicalizations, memory->value forwarding and injection of static behavior happen (e.g. the padding to fixed size which enables vectorization even on boundary tiles).

In this scheme things are not yet connected in a generic fashion: mapping a vector.contract to threads has implications on how other dependent vector ops must be mapped (TL;DR, there seems to be a need to follow individual scalar SSA values within the vector all the way to memory). So the way this ends up materializing is that SPIRV has a special type that propagates into the load/stores, and at the memory level we can bitcast and “all is fine”. So what we have right now is a quite custom lowering to get started.

Note that `vector.contract` is more general because it has parallel/reduction iterators and a permutation to map vector dimensions to iterators. More generally, `vector.contract` could/should be augmented with a region, and then it becomes a "structured op on vectors", similar to linalg.generic, which is a "structured op" on buffers. There are representational caveats however, related to SSA values and reductions, that also pop up in the "structured ops on tensors".

However the general problem is by no means solved.
It seems that a “representation of mapping to resources in the type system” that informs transformations is something that would be needed and could match some of your proposal. Historically, using affine maps to represent such information has worked reasonably.

The flow described above is “top-down” and relies on a lower-level representation for vector.contract which carries the mapping to threads / warps (i.e. the SPIRV CooperativeMatrix). That type/op “knows about” the mapping, but does not describe it in a generic way.

If I understand the way you naturally described the primitives, it seems to take a “bottom-up” approach. This is great for building unsurprising abstractions that work, and it seems like it would fit the bill for building more general representations for scan/reduce/FFT/transpose/cooperative matrix type “mapped ops”. How the “mapping attribute” is represented, how it is used to connect to actual HW entities (I would think SSA values at some point in the process (e.g. thread/processor id)) and how it connects to the upper levels is still an open problem AFAIK. Do you already know of good representations that are known to work and compose?

In my experience iterations consisting of switching hats between “top-down” and “bottom-up” while trying to connect pieces that we understand well yields the most satisfying results by decomposing into smaller well-understood problems. Another way to view this is that top-down is transformation/decomposition-driven and bottom-up is maybe more metaprogramming/composition-driven. A lowering is something special that sits at the intersection (e.g. see how vector.contract lowers to vector.broadcast + vector.fma: it is applied as a transformation but it is really metaprogramming a contract with insert/extract/broadcast and fma).
The tradeoff seems to be that tiling, fusion, moving memory around seem more prone to generalizing in a transformation setting whereas mapping to some HW dimension, rewriting X as smaller Ys and data distribution “seem simpler” in a metaprogramming setting.

I’ll stop the digression here because it may be too disconnected from your proposal but I am looking forward to seeing the concrete examples that your lowest-level abstractions would take and how they connect to high-performance execution on HW.

Exciting stuff!

We definitely want gather and scatter, however I am afraid vector.shuffle is not it. This instruction closely models LLVM, and the constraint is that the pattern is fully static.

This is one of the tricky disconnects and does relate to value vs memory: values are generally indexed with static indices. The exception is (1-D)-flattened LLVM vectors, which can be insertelement/extractelement one scalar at a time. This relates to the deeper corners of the deep dive on the vector dialect.

I think the fundamental reason is that values boil down to registers for which one must know the number statically to emit assembly (e.g. %reg0, %reg1 and not reg[%reg0]) vs memory where load/store from a (sometimes aligned) pointer works.
You can see this materialize into spills in lmem in CUDA when “your implementation of blah” does not fully unroll. CUDA likes to give the impression of an array of registers per thread but will spill when register numbers can’t be determined statically.

The vast majority of gather/scatter semantics go through memory.
Something close is vector.transfer, which lowers to llvm.masked.load / llvm.masked.store intrinsics.
Similar instructions for gather / scatter that would lower to the proper llvm.masked.* intrinsics would be great to have.

This is for CPU where I don’t think there is such a thing as a portable in-register gather/scatter (different ISAs may have their special intrinsics though).

For GPU, I imagine you are thinking about cross-warp + ballot at the lowest level of granularity?
Note that despite what it says, I think warp-level synchronization is just an illusion and still goes through the memory unit (at least it was when I used warp shuffles for FFT back in 2014…).

Maybe it’s all about giving the illusion of flexible memory accesses + register performance to a user (but it seems to scale to 32 for the past ~10 years?).

The idea is to have one abstract type (n-D vector / multidimensional array) and have it lowered to different memory / value types depending on its size, the hierarchy level it occurs in and the target architecture.


  • a level 4 (grid) reduction operates on a memref array and decomposes into
  • a dynamic set of level 3 reductions which split the original array to operate on shared memory
  • a fixed set of level 2 reductions which split the shared memory to operate on SIMD-vectors
  • depending on the target architecture: a fixed set of level 1 reductions which split the SIMD-vectors to operate on registers, or just a single SIMD instruction.

All of these levels could still be talking about generic arrays, just different sizes of them to fit in their specific actual memory type. Though, that requires marking either the values, types or operations as being on a specific level in the hierarchy. That might mean that we have to use multiple or polymorphic types to represent this marking / the attributes.
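A rough Python sketch of this level-by-level decomposition for a reduction; the chunk sizes are illustrative stand-ins for workgroup, subgroup and SIMD width, not values from the post:

```python
# Sketch: each level splits its array into chunks sized for the next lower
# level, reduces each chunk (conceptually in parallel), and passes the
# partial results up.

import operator

def reduce_level(data, chunk, combine):
    # Split into chunks (one per lower-level unit) and reduce each chunk.
    partials = []
    for i in range(0, len(data), chunk):
        acc = data[i]
        for x in data[i + 1:i + chunk]:
            acc = combine(acc, x)
        partials.append(acc)
    return partials

def full_reduce(data, combine):
    # Illustrative sizes: workgroup of 256, subgroup of 32, SIMD width of 4.
    for chunk in (256, 32, 4):
        data = reduce_level(data, chunk, combine)
    # Final combine of the remaining partials.
    acc = data[0]
    for x in data[1:]:
        acc = combine(acc, x)
    return acc

assert full_reduce(list(range(1000)), operator.add) == sum(range(1000))
```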

Yes, the operation names describe what it does and the attributes describe how it is done.

The way I am currently prototyping is: I started with a bunch of common high level problems and implemented them with the proposed primitives to get a decomposition. This way I hope to get the interfaces of the primitives and their expected behavior right first, and only then think about their realization on the hardware.
The primitives are inspired by this paper: Single-pass Parallel Prefix Scan with Decoupled Look-back. Take a look at section 5.2; there you will also find some of the algorithms which can be built using these primitives.
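As one example of such a composition, stream compaction can be built from an element-wise predicate, an exclusive scan and a masked scatter. A plain-Python sketch (roughly following the idea of the compositions in section 5.2 of the paper; the names are made up):

```python
# Sketch: stream compaction composed from the proposed primitives:
# an element-wise kernel, an exclusive scan and a masked scatter.

def exclusive_scan(data):
    out, acc = [], 0
    for x in data:
        out.append(acc)
        acc += x
    return out

def compact(data, predicate):
    mask = [1 if predicate(x) else 0 for x in data]   # element-wise kernel
    offsets = exclusive_scan(mask)                    # exclusive scan
    out = [None] * sum(mask)
    for x, keep, off in zip(data, mask, offsets):     # masked scatter
        if keep:
            out[off] = x
    return out

assert compact([3, 1, 4, 1, 5, 9, 2, 6], lambda x: x % 2 == 0) == [4, 2, 6]
```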

Ah, good to know! Indeed, gather and scatter only make sense if they can handle dynamic indices.

Actually, this would be the second level (subgroups) of the hierarchy.

Yes, to some degree. Often the e.g. 64 lanes are actually 16 4-lane SIMD-vectors in sequence, or something like that. But ultimately it does not matter, as the ISA hides that away and lets them appear as happening in lockstep / at once.

I think what I want to find is a definition which enables the user to describe:

  • an abstract data flow
  • in a bundled manner (as multi dimensional arrays)
  • and express lane-changes / swaps / swizzles / shuffles of these elements
  • without explicitly referring to memory (functional style).

Yup naming is the #1 hard problem in computer science. :wink: OpenCL terminology is also a viable choice here, as invocation/thread/block/etc. are all very overloaded terms. I personally feel it’s clean and consistent. It’s also better established.

On the SPIR-V side we also have an unlimited number of virtual registers. The real register allocation happens in the GPU driver compilers. Generally we also don’t want to put too much pressure on VGPRs; otherwise it would hurt GPU occupancy and limit the GPU’s ability to leverage zero-cost context switches to hide memory latency.

Just wanted to say that this is not the only way to lower a tiled small matmul to SPIR-V at the subgroup level. As with many things in Vulkan/SPIR-V, cooperative matrix is an extension that requires special capabilities. It’s vendor specific right now. We’d like to use it whenever possible, but for the cases where we don’t have it, we still need to leverage what we have at hand as much as possible. Subgroup operations are core to Vulkan 1.1 and available on both desktop and mobile GPUs. Converting a tiled small matmul into these subgroup operations can potentially give us great performance. I believe these subgroup operations are what @Lichtso is trying to use. So this is an awesome path to build out in parallel to the cooperative matrix path.

Not sure I get this; but I might miss something. :slight_smile: My impression is that at the subgroup level (level #2) we are modelling subgroups, so it should stay with whatever native subgroup size the hardware provides? (The size itself can be a parameter according to the hardware, for sure. In SPIR-V we have constructs allowing querying such information, and it is used to drive pattern application. We can have similar fields for subgroup size, for example, and that information is exposed by the hardware via Vulkan API calls.) Using a multiple of the native subgroup size seems to be an upper level’s concern. And the subgroup operations do have hardware support; for example, see AMD’s. IIRC, these originated from needs in graphics: it’s quite common for graphics to do sampling that needs to access neighboring pixels. So being fast really matters there.

That is correct.

I recently helped implement the VkPhysicalDeviceSubgroupProperties structure in MoltenVK. It turns out that while you can get the hardware width, there can also be varying sizes per dispatch: VK_EXT_subgroup_size_control.

Do you know which GPUs can have a varying subgroup size within a single dispatch? AMD and Nvidia have fixed subgroup sizes as far as I know (respectively 64 and 32). Intel supports multiple subgroup sizes, but for compute shaders it is fixed per dispatch (it can vary for fragment shaders indeed). As far as I can tell, Mali and Adreno also have fixed-size subgroups.
Unless we find GPUs where the subgroup size can vary within a dispatch for compute shaders, I think we should probably not consider it in the design, as I suspect it will add significant complexity.

For GPUs like Intel that do have multiple subgroup sizes, I think we should force the subgroup size using VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT. In general there will be a tradeoff between register pressure and the amount of sharing possible (using larger subgroups means fewer registers available), and I think we will want to control this tradeoff rather than letting the driver decide.

I haven’t encountered any hardware with dynamic sizes yet.

However, the subgroup size can also be reduced by software: the dispatcher starts them with some lanes disabled, as if they had encountered masked control flow. There seem to be two use cases for this, which are implemented in Metal (not sure about Vulkan):

  1. For arrays which are not a multiple of the subgroup size, the driver puts the if(index >= length) discard; in for you, so you have to write less boilerplate to truncate the last subgroup.
  2. If the register pressure is too high, the driver disables a portion of the lanes in all subgroups until it fits again. (you already mentioned this)

Correct, that’s a good point. I think case 1. can be avoided by picking a workgroup size that is a multiple of the hardware’s subgroup size. Unlike Metal, Vulkan doesn’t support non-uniform workgroups (the grid is always aligned on the workgroup size), so picking a good workgroup size is enough.
As discussed, case 2. should only happen for drivers supporting multiple subgroup sizes, and we should use requiredSubgroupSize to control it. (again, I think we can pick a better register / subgroup-size tradeoff than the driver)
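The arithmetic of case 1. in a small Python sketch (illustrative numbers, not a Vulkan API call):

```python
# Sketch: picking a workgroup size that is a multiple of the subgroup size
# and computing a workgroup-aligned dispatch, so no subgroup is started
# partially filled.

def dispatch_counts(num_elements, subgroup_size=32, subgroups_per_group=4):
    workgroup_size = subgroup_size * subgroups_per_group
    # Vulkan grids are always a whole number of workgroups:
    num_workgroups = -(-num_elements // workgroup_size)  # ceiling division
    return workgroup_size, num_workgroups

wg, n = dispatch_counts(1000, subgroup_size=32, subgroups_per_group=4)
assert (wg, n) == (128, 8)  # 8 * 128 = 1024 >= 1000
```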

I almost forgot: the second iteration of prototyping was done two weeks ago.