GPU Compute Basic Algorithms

Hi, I am new to the MLIR project but have worked with LLVM in the past.

I am interested in implementing the following algorithms for GPUs in a way that utilizes subgroup- and workgroup-local features to maximize performance:

  • Scan & Reduce Arithmetic
  • Partition / Select
  • Aggregate / Binning / Histogram
  • (Radix) Sort
  • FFT
  • Convolution (using FFT)

My question: Is the GPU dialect the right place for this? E.g. there is a gpu.all_reduce already (a more fitting name might be gpu.workgroup_reduce). Or is this something that does not belong in MLIR but rather in an external library or compiler front-end?

Hi,

thanks for your proposal!

It might be helpful to reflect on this according to the guidelines on contributing new components. I can see most of what is proposed as additional Ops in the GPU dialect (or, eventually, in a separate dialect) that can be lowered to more primitive Ops. A lot of things in the broad “code generation” realm are need-driven; we had a need for allreduce, but haven’t yet needed other things. They may be useful for lowering from higher-level operations like those in Linalg or various frameworks, but one needs to ensure they are sufficiently general and useful for multiple projects (e.g., not specific to someone’s-favorite-ML-framework) in order to be suitable for core MLIR.

Ping @herhut, @antiagainst.

I tried to order my thoughts around this and came up with the following description of what I would like to do:

Operations

These are descriptions of what the operations should compute, not how they should be implemented (see the next section for that).

Reduction

Outputs a scalar containing the sum / product / minimum / maximum / bitwise AND / bitwise OR / bitwise XOR of all elements in the input array.
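
To make the subgroup level concrete, here is a minimal CUDA sketch (an illustration under my own assumptions, not something the GPU dialect emits today) of a warp-level sum reduction built on shuffle intrinsics:

```cuda
// Warp-level sum reduction: each step folds the upper half of the active
// lanes onto the lower half via register shuffles, so no shared memory or
// barriers are needed within the warp.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val; // lane 0 holds the full sum; an extra __shfl_sync from
                // lane 0 would broadcast it, matching gpu.all_reduce.
}
```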

Exclusive / Inclusive Scan

For each element, reduce all elements up to that element, including or excluding the element itself, respectively.
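
As an illustration (again a hedged CUDA sketch, assuming a 32-wide warp), an inclusive prefix sum at the subgroup level can be built from shuffle intrinsics; the exclusive variant is the same result shifted by one lane, with the identity (0 for addition) in lane 0:

```cuda
// Warp-level inclusive prefix sum (Hillis-Steele style over registers):
// after round k, each lane holds the sum of up to 2^(k+1) values ending at it.
__device__ float warpInclusiveScan(float val) {
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        float up = __shfl_up_sync(0xffffffffu, val, offset);
        if (lane >= offset) val += up; // lanes below 'offset' keep their value
    }
    return val;
}
```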

Aggregation

Partition (key-value-pair) elements by the provided unsigned integer key. Note: This is not a generic sort, as it is not comparison-based and cannot handle floats or other types of keys.

Selection

An aggregation with a 1-bit key where one of the two partitions is discarded. Thus, it selects elements by a boolean condition.
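
At the subgroup level this maps nicely onto ballot + popcount; the following CUDA sketch (the names and the `base` parameter are my own, for illustration) compacts the selected elements of one warp:

```cuda
// Warp-level selection (stream compaction): __ballot_sync gathers the
// per-lane predicate bits; counting the set bits below the current lane
// gives each kept element its dense output slot. 'base' would come from a
// scan or atomic over per-warp totals in a full kernel.
__device__ int warpCompact(bool keep, float val, float* out, int base) {
    int lane = threadIdx.x & 31;
    unsigned int ballot = __ballot_sync(0xffffffffu, keep);
    int slot = __popc(ballot & ((1u << lane) - 1)); // kept lanes before me
    if (keep) out[base + slot] = val;
    return __popc(ballot); // number of elements this warp kept
}
```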

Binning

First sorts using an aggregation, then reduces each partition to a bin. The number of distinct keys in the input thus equals the number of bins in the output (which is a histogram).
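
For comparison, a common workgroup-level alternative to the sort-then-reduce formulation is a privatized histogram with atomics; a hedged CUDA sketch (the bin count and key handling are assumptions for illustration):

```cuda
#define NUM_BINS 256 // assumed key range for this sketch

// Each workgroup accumulates into a shared-memory histogram, then flushes
// its private counts into the global result with one atomic per bin.
__global__ void histogram(const unsigned int* keys, int n, unsigned int* bins) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[keys[i] % NUM_BINS], 1u);
    __syncthreads();
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}
```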

Run Length Encoding

Assuming the input is already sorted (e.g. by an aggregation), counts the number of elements (which share the same key) in each partition.

FFT

Transforms a complex input to a complex output between the time and frequency domains, encoded as vec2 / 2D floats. The reverse direction can be obtained by conjugating the twiddle factors (or the input and output) and additionally multiplying by a 1/N scaling factor.
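
To make the radix-2 formulation concrete, here is one Cooley–Tukey stage as a hedged CUDA sketch (a textbook decimation-in-time butterfly, not a tuned implementation); a full transform bit-reverses the input once and then launches this for half = 1, 2, 4, ..., N/2:

```cuda
// One radix-2 decimation-in-time stage over n complex values (float2 is the
// vec2 encoding above, n a power of two). 'half' is the butterfly half-span.
__global__ void fftStage(float2* data, int n, int half) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per butterfly
    if (i >= n / 2) return;
    int k  = i % half;                   // position within the butterfly group
    int lo = (i / half) * 2 * half + k;  // "even" element
    int hi = lo + half;                  // "odd" element
    float ang = -2.0f * 3.14159265f * k / (2.0f * half); // forward twiddle;
    float2 w = make_float2(cosf(ang), sinf(ang));        // conjugate for inverse
    float2 a = data[lo], b = data[hi];
    float2 t = make_float2(w.x * b.x - w.y * b.y, w.x * b.y + w.y * b.x);
    data[lo] = make_float2(a.x + t.x, a.y + t.y);
    data[hi] = make_float2(a.x - t.x, a.y - t.y);
}
```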

Convolution

Uses one input array as a sliding window over another input array and outputs the dot product for each shift offset.
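
As a reference point for the definition above (not the FFT-based implementation discussed below), a naive CUDA sketch of the sliding-window dot product:

```cuda
// Naive 1D convolution/correlation: one thread per shift offset computes the
// dot product of the kernel with the window starting at that offset.
// Produces n - k + 1 "valid" outputs.
__global__ void slidingDot(const float* signal, int n,
                           const float* kernel, int k, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > n - k) return;
    float acc = 0.0f;
    for (int j = 0; j < k; ++j)
        acc += signal[i + j] * kernel[j];
    out[i] = acc;
}
```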

Possible Implementations

  • Reduction: Already exists up to workgroup level
  • Scan, Aggregation, Selection, Binning, Run Length Encoding: Single-pass Parallel Prefix Scan with Decoupled Look-back
  • FFT: Cooley–Tukey algorithm with power-of-two radices.
  • Convolution: Perform one FFT for each of the two inputs, multiply them element-wise, and run an inverse FFT on the product to obtain the output (see the pointwise-multiply sketch below).
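
The element-wise step of that pipeline is the only genuinely new kernel; a hedged CUDA sketch (the surrounding forward/inverse FFTs with 1/N scaling are assumed):

```cuda
// Point-wise complex multiplication of two spectra, using the same float2
// encoding for complex values as above.
__global__ void complexPointwiseMul(const float2* a, const float2* b,
                                    float2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 x = a[i], y = b[i];
    out[i] = make_float2(x.x * y.x - x.y * y.y,   // real part
                         x.x * y.y + x.y * y.x);  // imaginary part
}
```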

Hierarchy

Many of these operations can be built by composition from the lower levels upwards.
It might make sense to expose these lower-level versions as well.

  • Low level: Subgroup / warp / wavefront / SIMD-vector
  • Mid level: Workgroup / thread block
  • High level: Kernel dispatch / grid

Higher Dimensions

Scan, FFT and convolution can be done in arbitrary dimensions like this (a sketch follows the list):

  • Run 1D once for each row
  • Transpose
  • Run 1D once for each column
  • Transpose
  • etc …
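
A host-side CUDA sketch of this recipe for a 2D FFT; the two kernels it calls are hypothetical placeholders for the 1D building blocks, not a real API:

```cuda
// Assumed building blocks (hypothetical, for illustration only):
__global__ void fft1dRows(float2* data, int rowLength);  // one 1D FFT per row
__global__ void transposeC(const float2* in, float2* out, int rows, int cols);

// 2D FFT from 1D pieces: rows, transpose, former columns (now rows), transpose.
void fft2d(float2* data, float2* scratch, int rows, int cols) {
    dim3 block(32, 32);
    dim3 grid((cols + 31) / 32, (rows + 31) / 32);
    dim3 gridT((rows + 31) / 32, (cols + 31) / 32);
    fft1dRows<<<rows, 256>>>(data, cols);          // 1D pass over each row
    transposeC<<<grid, block>>>(data, scratch, rows, cols);
    fft1dRows<<<cols, 256>>>(scratch, rows);       // 1D pass over each column
    transposeC<<<gridT, block>>>(scratch, data, cols, rows);
}
```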

Plan

I would start out by implementing reduction for the other two levels, then do the scan-related operations and finally the FFT. I guess some would be a set of op-specific passes inside the GPU dialect, while others would need lowerings to spv, nvvm, etc.

Thanks for providing more details!

Like I mentioned above, I think it is helpful to have at least some of these as operations and implement the algorithms you want as conversions of those operations into some primitives. So we would have a gpu.scan and gpu.bin, for example. This might be more involved than just writing the IR by hand as one would for a library, but more impactful.

It would be interesting if we could have operations that are somehow agnostic to the hierarchy, e.g. gpu.allreduce "add" subgroup %0 ... and gpu.allreduce "add" workgroup %0 ..., so that we can switch between levels as we lower progressively. This was one of the ideas in the original design; I don’t know if it is actually feasible. (Note: “allreduce” refers to the fact that all workitems will have the reduction result available, as opposed to a “simple” reduce where only the first workitem would have the full result and the remaining workitems would have partial/undefined results).

More complex operations exist at a higher level; for example, the Linalg dialect has a convolution, and we may expect some high-level FFT as well (HLO has it, but it’s not in the core repository; the planned Tensor Compute Primitives dialect is also likely to have it). It is worth considering the trade-off of introducing a GPU-specific compute primitive for these vs lowering from some higher-level operation directly to lower-level GPU operations.

This idea is similar to the mid-level abstractions that are being developed in the vector dialect (so far, there is an n-D vector transfer and a contraction, but there will be more). There may be some interesting connection there, ping @nicolasvasilache @aartbik.

This also needs an efficient transpose operation. Having that would be valuable by itself!
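
Since it came up, here is the classic tiled shared-memory formulation as a hedged CUDA sketch (the usual bank-conflict padding trick; a lowering in MLIR would presumably generate something equivalent):

```cuda
#define TILE 32 // launch with blockDim = (TILE, TILE)

// Tiled transpose of a rows x cols matrix: coalesced reads into a padded
// shared-memory tile, barrier, then coalesced writes of the swapped tile.
__global__ void transpose(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1]; // +1 avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x; // swapped block offsets
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```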

One of the design goals in the GPU dialect was to try and abstract away some platform-specific instructions when possible, e.g. all platforms seem to have a workgroup barrier or a ballot operation. These make sense as low-level operations in the GPU dialect so as to decouple the “algorithmic” lowering that only needs a barrier from a target-specific lowering that converts the barrier into a specific instruction.

The only two operations that would be lowered explicitly are scan and reduce (SPIR-V has some built-ins for them). Everything else would then use them as primitives and be a conversion inside the GPU dialect.

Exactly, that is what I meant by exposing these lower level versions as well.
And while we are at it, it just occurred to me that gpu.subgroup_size, gpu.subgroup_id, gpu.block_dim, gpu.block_id, gpu.grid_dim, and gpu.thread_id could also be unified in that way.

True! I saw a transpose somewhere else (in the vector dialect I guess), but yes the GPU dialect needs one too.

What are the next steps?

  • Who should I talk to?
  • Do we need a document / repository for this proposal?
  • Should I try to refactor and experiment on my local copy already?

Would you be interested in discussing the proposal live during tomorrow’s open design meeting (link)?

This is very interesting, thanks @Lichtso! Having abstractions for these tasks and transformations for algorithms could be very helpful. There are lots of details to work through, but in general I agree with @ftynse that it would be nice to think about how this interacts with upper layers like Linalg/Vector and how it gets translated into lower layers like NVVM or SPIR-V. It would be nice to have reusable and configurable components that one can pick up depending on the target environment. Some target environments may have direct instructions for these tasks, in which case we can convert to them directly; otherwise, implementing these algorithms using lower-level primitives (essentially providing a “library”) is quite valuable. :slight_smile:

What time is that (in UTC)? I am living in Central Europe.

17:00 UTC (or 19:00 CEST)

Sure, I added the topic to the future topics list.

Cool, thanks much!

@mehdi_amini FYI

+1 looking forward to the discussion @Lichtso

As others have mentioned, the primitives you describe also make a lot of sense at HW-agnostic levels + lowerings to GPU, SPIR-V (ultimately through GPU), LLVM and future CISC ISAs.
For now, some of us are mostly working on connecting vector.transfers / transpose / contract / pointwise ops to LLVM + intrinsics and SPIR-V, so your work is very timely :slight_smile: (see this post).

Just adding my +1 here, as I won’t be able to attend today’s design meeting. I would also be interested to see this project evolve into creating building blocks in the gpu dialect, with corresponding lowerings to the targets, that would allow expressing the computations by composing hardware-independent blocks. Getting there is likely a longer-term effort, but keeping it in mind from the get-go seems useful.

Restating what was said above, this is a really timely and important contribution. Two things I would request:

  1. Would be good to consider how these things would get lowered to SPIR-V. @antiagainst, @ThomasRaoux, @hanchung and I can help with that. As mentioned by everyone, having blocks that can be composed to implement the above algorithms would be super useful.
  2. Within IREE specifically, we are looking at targeting SPIR-V from the Vector/Linalg dialects. For the kernel-side code generation we can go through the GPU dialect and then to SPIR-V too (which is in many ways preferable). So it would be really interesting to hook up these algorithms to the IREE codegen pipeline.

Cool! We have basic prototypes of Reduction and Convolution within IREE, and as Mahesh said, we’re happy to help with this. We can prototype the rest, and also get some improvements for convolution or reduction. We recently had a brief discussion regarding the improvement of reduction in IREE. We can also start from that tree-based algorithm, or see how to put these optimizations together.

Great!

One thing I hope we can discuss deeply is the tradeoffs that appear in the side-effecting memory world (memref on scalar) and the value world (n-D vector).
We will need both in the longer term, but note that n-D vector SSA values allow designing abstractions that map without surprises to efficient HW primitives + unroll-and-jam.
Think of it as llvm.matrix intrinsics on steroids: it becomes quite feasible to create building blocks that operate at peak, and compose, on various HW (not just GPUs :slight_smile: ).

First of all, thanks for the interest and the great discussion (you can find the slides here). I summarized the ideas and questions which arose:

  • Parametrization & Tuning: Can attributes be used to have generic implementations which support different trade-offs? This includes things like memory / register pressure vs. performance, or faster but redundant parallel execution vs. slower but more energy-efficient sequential execution. So far, this is the most uncertain point; it might even be a research topic.
  • Homogenization: Can the operations be defined in a uniform way which is agnostic to the hierarchy, maybe even across different dialects? Attributes should be able to do this; I will come back to this point in my next post.
  • Composition & Reusability: Many of the proposed operations depend on / are built from one another. Can we define them in a way that reflects this? This is to support higher-level optimization instead of having low-level-optimized blackbox implementations. It can probably be done by lowering these operations progressively inside the same dialect, but without an implementation it is hard to tell.
  • Hardware Diversity: Is this limited to GPUs, or can it be extended to CPUs and other types of hardware? The consensus was that it should be done in both the GPU and Vector dialects, because both seem to be possible, and also to avoid over-fitting the design.
  • Dialects & Libraries: How are these operations exported to the outside? They probably need to go through the entire stack, as they also require memory and buffer bindings as well as optimizations such as fusion. It is not clear yet whether they would fit into the existing dialects or require their own for the higher levels. However, this might become more obvious once the lower levels (GPU and Vector) are implemented.

Please correct me if there are any errors or misunderstandings in the summary.

Some questions.

  1. How does MLIR plan to implement these ops? Is it that, as part of progressive lowering, we “decompose” ops like “scan” into micro-ops in descendant dialects which would eventually realize the op?

  2. How do we target libraries in MLIR? Say I want to use a scan implementation from a specialized library. How would that work, if we intend to support it?

It doesn’t always have to be micro-ops in descendant dialects, but indeed, progressive lowering + transformations that break an op up into smaller ops seem to be a reasonable way to get started. You could look at how vector.contract gets progressively lowered to either:

  1. vector.matrix_multiply → llvm.matrix_multiply
  2. vector.outerproduct + vector.transpose → insert/extract/fma
  3. directly to insert/extract/fma

From any of these levels it would make sense to go to a HW-specific dialect e.g. GPU.
However there are also implications for chains of ops with sources / sinks to memory (e.g. load/store and vector.transfer); see e.g. @ThomasRaoux’s commits to IREE for a bit of the gradient here.

There is also some WIP that adds a lowering to vector.reduce.

A lot more work is needed to get a meaningful set of useful primitives such as described by @Lichtso.

For example, Linalg supports a primitive lowering of semantically named ops to library calls. This needs to be extended (e.g. as discussed here), but it shows that starting from high-level, semantically charged ops, we can mix transformations, codegen and library calls. This is also why I have been talking about ops whose semantics are captured by attributes: this allows building generic mechanisms to mix codegen + library calls.

I imagine similar mechanisms can be generalized and serve the same purpose. However, it is unclear at this point whether all/most of the ops discussed in the meeting have such attribute-based representations. In any case, the less the compiler / analyses and transformations need to know about, e.g., my_special_scan_and_conv_op, the better.

Thanks @Lichtso for this nice summary of the discussion!