Understanding Parallel loops to GPU lowering

Starting a thread here since it might be of interest to broader community. I started looking at lowering of loop.parallel to gpu.launch . Wanted to get some clarifications on some aspects here.

  1. The lowering is driven by a DictionaryAttr containing three fields processor, map and bound. With patch D76165 I attempted to convert it to use a StructAttr. I am not sure I fully follow what map and bound are for. My current understanding is that
    map is meant to implement a mapping from iteration to processor ID, but most examples I see a mapping of the form affine_map<(d0) -> (d0)>. The bound too has the same affine map in example I saw. So some clarification would be great!

  2. There is in general no real requirement that the number of iterations matches the number of processors. The current implementation seems to make this assumption. The number of blocks for example is computed using affine_map<()[s0, s1, s2] -> ((s0 - s1) ceildiv s2)> where s0 is the upper-bound of the loop, s1 the lower bound, and s2 the step.I think this is derived from the map and bound in some way, but I cant follow how the map and bound are used for this. I started looking at this lowering to replace the use of the createLoopToGPUPass within IREE. The code that is generated on the device side works even if the number of blocks used eventually when running the kernel represented by the gpu.func (created by outlining from a gpu.launch operation) is different from what was used when the gpu.launch operation was created. It achieves this by adding a loop

for (i = lb + blockId * blockSize ; i < ub ; i+= numBlocks * blockSize) { .. }

This way if you use a different number of blocks, the kernel still works. That is a very valuable thing to have in IREE, where the kernel might be compiled using a fixed blockSize, but the runtime is free to chose the number of blocks to use at runtime. I would like to preserve the same behavior when lowering from loop.parallel to gpu.launch operation.

  1. The use of attribute to drive the lowering has a limitation that the affine map used to specify the map and bound cannot contain symbols. I am not sure whether that is an issue, cause I fully dont understand what those are for, but it seems like the map and bound maps might need to have symbols that represent number of blocks, block size etc. which might not be known statically. So this seems like a restriction imposed by the attribute driven mechanism.

What am I looking for is some guidance on these aspects. I can volunteer to adapt the existing lowering to address these concerns, but want to get a clear idea of the intent before I make any major changes here (started with just an NFC of using StructAttr with the patch above for now :slight_smile: )

@herhut, @ftynse, @antiagainst, @nicolasvasilache, @hanchung for visibility.

Thanks for bringing up this discussion and sorry for the lacking documentation. The current implementation was just some experimentation to see what it could look like.

The idea was to give a way to map hardware ids to multiple loops. So for instance, you could use a map like ceilDiv 512 and mod 512 with the same hardware id to map two dimensions of a loop to the same hardware id. Maybe this is not useful and we should remove this indirection again.

There are two affine maps in play for computing bounds. First, the number of iterations the loop will perform is estimated (or computed if possible). This is what affine_map<()[s0, s1, s2] -> ((s0 - s1) ceildiv s2)> computes. The result is then taken and fed into the bound map, if it exists. Coming back to the above example: If you map one processor to two loops, one of the mappings has to provide an upper bound. In the above example, the outer loop should have a bound map of affine_map<(d0) -> (d0 * 512)>.

So in essence you tile the loop by a dynamic tiling factor of blockSize and would then map the outer loop of the tiling sequentially. We do not have a way to associate that dynamic tiling factor with the blocksize, though. So the result would be a gpu.launch with an extra parameter for the actual for the tiling factor. I have not tried this with the actual implementation but conceptually it should work.

The main difference is that this tiling behavior was built into the translation to gpu.launch before.

There was a TODO in the code at some point that the maps should have access to block and grid sizes. I was not sure whether that unlocks all use cases and hence did not add it. If it helps with your use cases, feel free to add it. My idea was to pass those as symbols to the map map.

I have been considering that we should do loop transformations as such and keep the mapping relatively simple. These transformations can still be driven by the same set of annotations, but become easier to test and reuse. For example, if you want to map multiple loops to the same block/thread id, you can coalesce those loops into a single loop and map just it. Same for tiling with dynamic values, we can do it as a transformation and map the outer loop and keep the inner in the kernel, potentially canonicalizing it away if it is known to have a single iteration statically. We already have a mapLoopToProcessorIDs for non-parallel loops that does exactly that.

Thanks @herhut for all the info (I dont know how to respond in an inline manner on discourse, so just writing my reply, hopefully context is clear).

Based on the above description it seems like the mapping attribute is effectively implementing something like this at each processor level.

i = map(pid)*step + lb;
if (pid < min(ub, bound(i))) {

Using the above mapping, the number of processors is computable using the map affine_map<()[lb, ub, step] -> (ceildiv(ub - lb, step))>.

I would like to propose an alternative here. The current lowering makes sense for someone using both the host and the device side lowering for the GPU dialect. This is not the case for IREE. What I am thinking of should work for both use cases (but if it doesnt it is perfectly valid recommendation that IREE should implement its own lowering, but I would rather not do that).

For my use case it would be better to have the number of processors completely independent of the number of loop iterations and let the client of the lowering specify the number of processors as a Value and not try to compute it automatically. The lowering could be changed to use the map, bound and stepmap to generate the following loop

for (i = map(pid, lb, ub, step, numprocessors) ; i < bound(pid, lb, ub, step, numprocess); i += stepmap(pid, lb, ub, step, numprocessors) { .. }

Note that this is the inverse of the loop iteration to processor mapping, and is written this way to avoid doing the inversion ourselves. The existing usage would become

map : affine_map<(d0)[s0, s1, s2, s3] -> (d0 + s2 * s3)>
bound : affine_map<(d0)[s0, s1, s2, s3] -> (s1)>
step : affine_map<(d0)[s0, s1, s2, s3] -> (s3*s2)>

I am happy to implement this as a separate pattern if needed.

(FYI : realized that I had conflated two things ealier. The lowering that is used in IREE and that exists in GPU dialect for loop.for maps both the blocks and the threads simultaneously. We dont need to do that with loop.parallel, so apologies if that caused a confusion).

@ftynse : Overall I agree with you, but it seems like something the “client” of loop.parallel should do. For example, if lowering to loop.parallel from linalg you could tile in linalg and then you get the mapping to blocks and threads separately. Similarly folding parallel dimensions into one can be done at linalg level.

Gentle ping on this. @herhut, @ftynse . Any thoughts on the above proposal? I can adapt the existing lowering to implement this.

Sorry for the delay, I am a bit behind.

Where does this come from then? Would you specify it as a parameter to the lowering itself? We somehow need to encode the ssa value to use for this. Would the pass that annotates the mapping attributes also compute this? AFAIK we cannot have SSA values in attributes.

Is this an extra loop inside of the gpu kernel you want added? I do not see how an entire loop.parallel gets mapped to a gpu kernel and the extra loop you require. Could you do an entire example to make it easier to understand what the inputs to the maps are and how they are encoded?

For your use case, I had something in mind of the form

loop.parallel (%outer, %proc) = (0, 0) to (ceilDiv(%ub, %processors), %processors) step (%processors, 1) {
  %index = %outer * %processors + %proc
  // Computation here. 

where processors is an ssa value you use when tiling. This can then be mapped using existing attributes, mapping the outer to sequential and proc to some hardware id. Using this approach, the number of processors becomes configurable. Would this not work?

The main difference I see is that you have an extra conditional in the body to cater for the ragged shape if it cannot be tiled perfectly. Is that the issue you aim to solve?

You dont need to embed the SSA value in the attribute itself. The map as I defined earlier has 5 symbols, one each for pid, lb, ub, step and numProcessor. During lowering from loop.parallel to gpu.launch you have those values and they can be passed to affine.apply operation

I can give a code example, but this is the use case I have. Lets say your loop.parallel has 10 iterations, i.e. lb = 0, ub = 10, and step=1. In your view of this you are assuming that there are at least 10 processors. So each processor has a single iteration of the loop.parallel to perform. But if for whatever reason (some device constraint) you had only 2 processors, you would need to regenerate the device side code cause you will execute only 2 of the 10 iterations otherwise. Instead if you had a loop you can have a loop

loop.for %i = %lb + %processID, %ub, %num_processors

then in the above example, process 0 will execute the “even” iterations and 1 will execute the odd iterations and the generated kernel would execute all the loop iterations. I hope I am not missing something obvious and making an overkill

The standard term for this is a cyclic mapping. Basically, you have block, cyclic, and block-cyclic mappings (with a cycle size), and this pretty much covers a whole range of structured loop mappings. This also appears to cover everything discussed above. Instead of having users/clients messing with affine maps on the loop.parallel themselves, the right affine maps could be auto-populated if you model ‘block’, ‘cyclic’, ‘block-cyclic’ either as attributes on the loop.parallel op or have helper/utilities to set the affine maps for you given one of those mapping options.

Thanks Uday. This makes sense, and was one way I was thinking about it, but having two separate attributes to specify the mapping might get confusing and hard to gauge expected behavior. Best approach I feel is to have utility functions that use the same underlying mechanism, but are convenience functions to build the maps that implement those. Would it be OK to have cyclic mapping for now, and enhance it to cover block and block-cyclic when needed.

Having helpers to inform the translation was the idea from the beginning. I did not expect users to write the affine maps. Currently, we only have a greedy mapper, which essentially implements block mapping (to use your terminology) and @MaheshRavishankar wanted to add block-cyclic if I understand right.

Having utility functions to set maps suitably sounds good to me, but I haven’t followed the GPU dialect work closely - you’ll have a clearer picture! Starting with some mapping sounds fine since currently there isn’t lowering for any I assume? A cyclic mapping may give you good load balance, but won’t it go against locality in many cases? If you do block cyclic, you get both block and cyclic as special cases (former with block size = N / nprocs, and latter with block size = 1).

If I understand this correctly, the map used in the lowering is parameterized over lower bound, upper bound, step, processor id and num processors. The first three can be derived from the loop.parallel that is being transformed, the processor id comes from an annotation of which processor is being used ([block|grid].[x|y|z]). But where does the lowering get the last value from? I know what the value represents but the code that does the transformation needs to get this value from somewhere.

If I understand this right, you want as an end-state a kernel with an extra parameter and the IREE runtime somehow figures out which parameter is the number of processors? Alternatively, it could be one of the hardware sizes for [block|grid].[x|y|z]. But these are specified by the gpu.launch in the non-IREE case. The lowering would need a way to put a value there. And I wonder where that value comes from.

I get this. I was looking for an example what the annotations are looking like.

@MaheshRavishankar if you are planning to add these utility functions, please add these to LoopLikeOpInterface (instead of ForOp or ParallelOp) because that way it’d be reusable for AffineForOp as well. And for your purpose, it shouldn’t make a difference. On that note, I think loop.parallel should be implementing LoopLikeOpInterface as well.

Sorry for the delay in response (and thanks for the response above). I realize there is a difference in my mental model and what is implemented. It requires some explanation, which is below. But addressing a comment before that

I realize I wasnt fully clear here. You are right that of all the symbols needed, the only thing the lowering cannot know is the number of processors. This needs to be “input” to the pattern itself. So when the pattern is constructed, the Value for the number of processors need to provided. It could be optional, so if not provided it can fallback to implement what you suggested. I admit a little bit of a blind-side here cause I am only looking at the device code generation part of GPU dialect, and there the number of processors is obtained using gpu.grid_dim. The Value that I said needs to be passed into the pattern to lower to gpu.launch will only be used as an argument to the gpu.launch.

Some explanation about what context I am looking at (from IREE). The IREE GPU code-generation has the following steps (some of this WIP)

  1. A kernel is described as one or more sequence of Linalg operations (on buffers)
  2. These are tiled (and fused if there are more than 1). The tiling process generates a loop.parallel that is to be mapped to workgroups/blocks. For now assume that there is only one such “outer” loop.parallel within a kernel.
  3. Within the loop.parallel there will be linalg operations that work on tiles. These are lowered to loop.parallel. This set of loops are mapped to workitems/threads.

I think what is implemented is where a loop is directly mapped to workitems/threads in a block-cyclic fashion. I think there is value in breaking this up into two steps (as described above). This gives the ability to reason about computation at workgroup/block level. You can reason about synchronization points, and using collective operations in a more natural way. When I mentioned cyclic earlier, I was thinking of mapping the set of loop.parallel ops generated from tiling to workgroups/blocks in a cyclic manner (so that would be block-cyclic w.r.t to workitems/threads).

Hope that clarifies what I am trying to get to eventually.