Using attributes to specify workgroup configuration when lowering to GPU

Hello,

Right now, the passes converting loops to GPU ops use a bunch of command-line options to specify the resulting number of workgroups and the workgroup sizes.

I’m wondering whether we should switch to use GPU dialect specific attributes for the task. I see a few reasons why that’s preferable:

  • It makes the passes more general: we don’t need to hardcode the workgroup configuration at pass-creation time, so the same pass can handle different workgroup configurations.
  • We have fewer non-POD global static variables, which is generally good practice for libraries.
  • Tests become clearer: the configuration is conveyed in the IR instead of in command-line options.

Concretely, we could have gpu.num_blocks_to_match, gpu.num_threads_to_match, gpu.num_workgroups, and gpu.workgroup_size as replacements for the four command-line options.
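
For illustration, here is a hypothetical sketch of what this might look like (the attribute names are the ones proposed above; their placement on the loop and their exact types are just assumptions at this point):

// Hypothetical: ask for a 4x2x1 grid of workgroups of size 32x4x1 each when
// mapping this loop nest to gpu.launch.
loop.for %i = %lb to %ub step %step {
  ...
} { gpu.num_workgroups = [4, 2, 1], gpu.workgroup_size = [32, 4, 1] }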

WDYT?


@ftynse @herhut @nicolasvasilache @MaheshRavishankar FYI

This pass should have been called “TestLoopsToGPU” :slight_smile: The main thing in it is the utility function that does the rewrite. You are free to call that function in whatever way you want; the pass is only there to make sure it is tested, but it can be extended in different ways.

I’m not 100% convinced on how to communicate the “mapping” decisions, but I agree that we should decouple the decision from the implementation. Using attributes looks pragmatic. FWIW, @herhut is going in the same direction with loop.parallel.

This looks like the right thing to do. In addition to what you list, won’t you have situations where multiple ops each require a different workgroup configuration? Attributes would allow you to specify that, but passes with command-line options won’t. I was surprised not to see this in your itemized list, or perhaps I’m missing something.

You seem to be thinking ahead :slight_smile: Currently, this only maps a perfect loop nest, so only one configuration is necessary, if we disregard the test pass. But we may indeed want to select sizes even before we go to loops and keep carrying the attribute.

Originally, I was motivated by the small composable rewrite approach we are building in MLIR. We could have a rewrite that does parametric tiling to make sure outer loops have fixed sizes, then a loop interchange rewrite, and finally a simple loops-to-gpu.launch rewrite, which the current pass implements. This makes each step relatively easy to write and test, and we could reuse tiling and interchange outside of the GPU context. That being said, sizes and loop order are connected, so we need to know them in advance, which could be easier to implement with attributes that are progressively removed as transformations apply.

What I would like to warn against is building a big one-shot mapper configurable by a set of non-trivial attributes.

I started looking into this in the context of loop.parallel, as @ftynse mentioned. I see the current code using sequential loops as early experimentation and not something that would be part of the final pipeline. The mapping @ftynse implemented is incorrect in the presence of loop-carried dependencies, which the code does not check for.

https://reviews.llvm.org/D73893 has some code that shows my current thinking (which is still evolving). Here is an example from the tests:

func @parallel_loop(%arg0 : index, %arg1 : index, %arg2 : index,
                    %arg3 : index, %arg4 : index, 
                    %buf : memref<?x?xf32>,
                    %res : memref<?x?xf32>) {
  %step = constant 2 : index
  loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
                                          step (%arg4, %step)  {
    %val = load %buf[%i0, %i1] : memref<?x?xf32>
    store %val, %res[%i1, %i0] : memref<?x?xf32>
  } { mapping = [
        {processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
        {processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
    ] }
  return
}

The mapping attribute specifies, for each dimension of the loop.parallel, the hardware id it corresponds to. The processor field selects a block id (0, 1, 2), a thread id (3, 4, 5), or sequential execution (6). The map specifies how the hardware id is transformed into the induction variable of the loop, and the bound specifies how the extent of the hardware id is computed from the maximum number of iterations of the loop. In the example above, the first loop dimension maps to processor 1 (the second block id) and the second to processor 0 (the first block id), both with identity map and bound.

This is all very preliminary and especially the operands to the maps will need to evolve further, as one probably wants to get lower/upper bounds to do the computations. Instead of designing something complex to start with, I am currently exploring some examples.

I share @ftynse’s view that we should try to implement this as multiple transformation steps. However, these transformations are not independent; they will have to be orchestrated by a plan for what code should ultimately be emitted. For example, loop-tiling decisions will depend on what final mapping we want to achieve. I don’t think we necessarily need to express the full lowering strategy in attributes.

If I could narrow the discussion here: what I am really looking for from lowering to SPIR-V is that the workgroup size must be a compile-time constant (a SPIR-V/Vulkan spec restriction). For CUDA this is not needed, but it is possible we can use the information in these attributes to constant-fold and simplify the code. So you could have an attribute gpu_workgroup_size = [...] : tensor<3xi32> if the workgroup size is known to always be fixed. If set, the gpu.launch has to abide by this setting (and there can be a verification for that). If not set, then the values passed in through gpu.launch are the ones to use. This is similar to (but not the same as) __launch_bounds__ from CUDA, which is a compile-time assertion that can be used to optimize further.
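
To make this concrete, here is a minimal sketch of what that could look like on the kernel (the attribute name, its placement, and its form are assumptions on my part, not an existing design):

// Hypothetical: the kernel pins its workgroup size at compile time; a
// gpu.launch of this kernel would then have to use matching block sizes,
// which a verifier could enforce.
gpu.func @kernel() kernel
    attributes { gpu_workgroup_size = dense<[32, 4, 1]> : tensor<3xi32> } {
  gpu.return
}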

The input to this transformation does not have anything GPU-related, right?
Can you clarify where these attributes would be attached and where they would come from? I am not concerned about the internals of a testing pass, but rather about anchoring this in a “real” pipeline.

This is just a matter of replacing the global options with the pass options mechanism (which we should do for every pass by the way).

If this was a real concern, this would go beyond this simple pass though: every other pass works this way.

My TL;DR is that this type of mapping attribute is known to work and will (also) be used in Linalg in the future. What is less clear to me is at what point we plan to explicitly represent processor IDs with SSA values, and how portable that will be.

Note that I still plan to use the following API for this at some point in the future:

/// ...
/// Example:
/// Assuming a 2-d grid with processorIds = [blockIdx.x, threadIdx.x] and
/// numProcessors = [gridDim.x, blockDim.x], the loop:
///
/// ```
///    loop.for %i = %lb to %ub step %step {
///      ...
///    }
/// ```
///
/// is rewritten into a version resembling the following pseudo-IR:
///
/// ```
///    loop.for %i = %lb + %step * (threadIdx.x + blockIdx.x * blockDim.x)
///       to %ub step %gridDim.x * blockDim.x * %step {
///      ...
///    }
/// ```
void mapLoopToProcessorIds(loop::ForOp forOp, ArrayRef<Value> processorId,
                           ArrayRef<Value> numProcessors);

Acting on the above is still pushed pretty deep in my stack of priorities, but I am curious what you fine folks think about when and how the “attribute → SSA value PID materialization” will happen (I am talking long-term, not about pattern-matching the current perfectly nested loop implementation).

I think I can make my thinking more precise by asking:

  1. how big a timespan do we expect this annotation to survive as an attribute in the IR?
  2. how does it interact with the transformations that are expected to occur between the moment the attribute is introduced and the moment it is lowered away into SSA values for PIDs?

Thanks!

Actually, the following point was meant for that:

But maybe it is a bit too vague. :slight_smile:

The input to this transformation does not have anything GPU-related, right?
Can you clarify where these attributes would be attached and where they would come from? I am not concerned about the internals of a testing pass, but rather about anchoring this in a “real” pipeline.

I think the “inputs” involve two parts: the source IR, which consists of loop dialect ops, and the configuration for how to map those loops to GPU. Without either, we cannot perform the transformation. The former is not related to GPU but the latter is.

I think these attributes can be attached either to a specific loop op, or to the enclosing function to apply to all loop ops inside. Whoever wants to invoke the loop-to-GPU transformation should attach the attributes; it’s just that instead of providing this information at pass-construction time, one now provides it as IR attributes.
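
For instance, the function-level placement could hypothetically look like this (the attribute name follows the proposal at the top of the thread; everything else is illustrative):

// Hypothetical: one configuration covering every loop nest inside the function.
func @compute() attributes { gpu.workgroup_size = [32, 4, 1] } {
  ...
  return
}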

This is just a matter of replacing the global options with the pass options mechanism (which we should do for every pass by the way).

Yup. That’s one way to address this particular point, but note that it is not the main reason I think we should consider using attributes; it’s just a nice by-product.

If this was a real concern, this would go beyond this simple pass though: every other pass works this way.

I don’t think it is reasonable to change all pass-construction parameters to IR attributes. For example, having a pass-construction parameter to control whether we want to apply conversions partially or fully makes total sense to me. But for parameters that do not control the pattern selection and flow of the pass itself, I think we should probably consider putting them in the IR directly. The mapping here is a good example: it can be quite complex and take effect on a per-op basis. Specifying it as a pass-construction parameter is quite cumbersome, if possible at all; it splits the information needed by patterns into two worlds (CLI and IR) and makes passes overfit to a specific configuration.

@ftynse @herhut:

Agreed that we want small and composable patterns so that we can have better reusability and testability. But I think that’s a bit orthogonal to how “configurations” are passed in, which is my main question in this thread. Each pattern can require some IR attribute for its configuration, or it can expose CLI options for that.

  1. how big a timespan do we expect this annotation to survive as an attribute in the IR?
  2. how does it interact with the transformations that are expected to occur between the moment the attribute is introduced and the moment it is lowered away into SSA values for PIDs?

Good questions. My take is that attributes are largely free-form information: each pattern/pass can define its own attributes to convey the additional configuration it requires. A pattern/pass doesn’t care about other patterns’/passes’ configuration attributes and leaves them untouched; it only consumes the ones it knows. To tame this free-form-ness a bit, we can prefix the attributes and put them in a dialect so we can perform verification. And it is the responsibility of whoever drives the patterns/passes to make sure such configuration attributes are present and properly wired up if necessary, just like we provide all the necessary arguments when running a set of patterns/passes via the CLI.

This would mean that we cannot propagate constant workgroup sizes from a launch into the kernel body, correct? Just capturing this requirement here. We would need to change the semantics of gpu.launch accordingly. I think currently the assumption is that the launch bounds are exactly what gets passed to the launch.

If you need static bounds, wouldn’t that in practice mean that you have to tile accordingly to begin with? Then the lowering to SPIR-V would require that all parallel loops mapped to hardware ids have static bounds (probably with some sequential loops with dynamic bounds inside).
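
For instance, a minimal sketch of what such tiling could produce (everything here is illustrative; a static tile size of 4 stands in for the workgroup size, and %ub is the original dynamic upper bound):

%c0 = constant 0 : index
%c1 = constant 1 : index
%c4 = constant 4 : index
// The outer loop over tiles keeps the dynamic bound %ub; the inner loop has
// the static extent 4 and could be mapped to hardware ids.
loop.parallel (%tile) = (%c0) to (%ub) step (%c4) {
  loop.parallel (%i) = (%c0) to (%c4) step (%c1) {
    ...
  }
}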

If one builds this as multiple composable transformations, then the question is how they communicate. Attributes make the communication visible in the IR and make experimentation/testing/… easier. If we go with pass options, the communication is implicit via C++ glue code that passes the right parameters to transformations.

For this reason, I would prefer attributes.

What I am really looking for from lowering to SPIR-V is that the workgroup size must be a compile-time constant (a SPIR-V/Vulkan spec restriction).

Just wanted to provide more details here. In SPIR-V, constants can be normal constants or specialization constants (spec constants). This fits SPIR-V’s nature as an intermediate language: SPIR-V is further compiled down to ISA by the hardware driver. A spec constant is a value that is unknown when compiling from a high-level language down to SPIR-V, but becomes a known constant when the SPIR-V is sent to the driver for further compilation (so the driver compiler can do further constant folding, etc.); the real value of a spec constant is provided at pipeline-construction time. A spec constant has a SpecId that one can use to locate it and provide the real value. At the SPIR-V dialect level, a spec constant is called a constant but really behaves like a variable, since we cannot unique it, constant-fold it, etc.; the exception is that it does not occupy memory.

The workgroup size can be specified as a normal constant or a specialization constant. If the input GPU dialect IR has a non-constant launch workgroup size, we’ll need spec constants to represent it, and we’ll then need a way to feed in their SpecIds.
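
As a rough sketch in the SPIR-V dialect, the three workgroup-size components could each become a spec constant along these lines (the spec_id values and default values are placeholders; the actual ids would come from the contract mentioned above):

// Hypothetical: workgroup sizes left open as specialization constants, to be
// filled in by the runtime at pipeline-construction time via their SpecIds.
spv.specConstant @wg_size_x spec_id(0) = 1 : i32
spv.specConstant @wg_size_y spec_id(1) = 1 : i32
spv.specConstant @wg_size_z spec_id(2) = 1 : i32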

This would mean that we cannot propagate constant workgroup sizes from a launch into the kernel body, correct? Just capturing this requirement here. We would need to change the semantics of gpu.launch accordingly. I think currently the assumption is that the launch bounds are exactly what gets passed to the launch.

I’m not sure I understand what you mean by “propagate constant workgroup sizes”; but if it means constant-folding gpu.block_dim according to the constant workgroup size, then yes, we can. I think Mahesh meant this too.
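
For example, a minimal before/after sketch of that folding, assuming a known workgroup size of [32, 4, 1] (generic op form of the time):

// Before folding: the block dimension is queried at runtime.
%dim_x = "gpu.block_dim"() {dimension = "x"} : () -> index
// After folding against the known x size of 32:
%dim_x = constant 32 : index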

It seems to me that there are fundamentally different kinds of “configuration”, and that we can’t generalize such answers. Attributes can be part of the solution, but they have a very specific role: associating some information with a specific operation in the IR. Pass options, on the other hand, control generic heuristic parameters that you feed to the algorithm when such parameters apply to the entire IR.

Beyond testing, the attribute situation isn’t great though. How do these attributes get placed on a particular IR construct? This can be fragile across the pipeline, so if this becomes the pass configuration, you need to insert the attributes right before the pass. But then how do they get inserted? If there is a magic way of determining this information (we have an Oracle), then why do we need to materialize it as an attribute? The Oracle can be an interface that the pass queries…

My understanding is that you cannot define workgroup sizes at the launch at all. The workgroup size you pass to the launch is ignored (though a runtime check could be inserted to assert that it matches), and instead the kernel defines the only possible workgroup size.

The semantic change could simply be that “if a kernel defines a constraint on the launch configuration, the behavior is implementation-specific in case of mismatch”. Do you see anything more complex here?

That is my understanding as well: the parallel loops that are mapped to the workgroup dimensions must have static bounds. This becomes a property of the target when targeting SPIR-V, and the lowering flow should be able to take this constraint into account like any other.
(I don’t think we have a good infrastructure for modeling “targets” in general, so it has to be a bit ad hoc, possibly with pass options here.)

Not responding to any specific points above, but generally:

  1. Attributes should live in the dialect they make sense for. For example, gpu.workgroup_size would belong only to the GPU dialect. During lowering, the dialect “upstream” of it (i.e., the one being lowered to the GPU dialect) can decide what this value should be. A dialect downstream of the GPU dialect (one the GPU dialect is lowered to) can use the information in these attributes and translate it as needed into its own dialect. So when going out of the GPU dialect, the GPU-specific attributes should ideally be removed.
  2. There might be attributes that make sense across dialects. I don’t immediately have a use case for this, so that is not what I am asking for here.

For constant workgroup sizes in gpu.launch it’s easy, since we can just fold them into the generated SPIR-V module. For non-constant cases, we should still be able to define workgroup sizes at launch time via specialization constants. Seeing a gpu.launch with non-constant workgroup sizes, I think we should generate a SPIR-V module with specialization constants for the workgroup sizes. Then, before launch (vkCmdDispatch in Vulkan), the Vulkan runtime should feed in the concrete values for the specialization constants so that the hardware driver compiler can finally compile the SPIR-V module. This comes at a cost: we need to specialize and recompile the SPIR-V module for every gpu.launch with non-constant workgroup sizes.

For IREE specifically, because the runtime side lives in IREE and the CodeGen side lives in MLIR core, we need extra handshakes and contracts to make the above happen. (E.g., the CodeGen side needs to know which SpecIds to generate for the specialization constants.)