Tile/LowerToLoops + Distribute to processors with Linalg ops

Currently Linalg tiling and lowering to loops results in generation of scf.parallel/scf.for operations. In IREE, for GPU code-generation, tiling is used as a way to map to the different GPU compute heirarchies. Specifically, the scf.parallel loops generated for inter-tile iteration can be distributed across processors at each level. (More info here). For IREE itself the explicit instantiation of the scf.parallel is unnecessary. It would be much better if the tiling and lowering to loops transformations in Linalg would also distribute the iterations across processors. This post is to highlight use cases we have in IREE, and collect thoughts on the best way to proceed with adding this feature to Linalg. Presumably this would simplify the transformations in IREE by quite a bit, and would also be useful for similar use cases (more on this later).

Use case 1 : Distributing iterations to processors in a cyclic manner (Most general case)

Starting with a scf.parallel operation of this form

scf.parallel (%arg0, %arg1) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1) {

can be distributed amongst a 2D grid of processors (%nx, %ny) in a cyclic manner to get

scf.for %arg0 = (%lb0 + %pidy * %s0) to %ub0 step (%ny * %s0) {
  scf.for %arg1 = (%lb1 + %pidx * %s1) to %ub1 step (%nx * %s1) {

Such a distribution does not assume any relationship between the number of iterations and the number of processors that each parallel loop is distributed to.

Use case 2 : Distributing iterations to processors in a cyclic manner with over-subscribed number of processors

If it is known that the number of processors (%nx, %ny) is greater than or equal to the number of iterations (floordiv(%ub0 - %lb0), %s0), floordiv(%ub1 - %lb1), %s1), then you can generate IR with much lesser overhead.

%n0 = floordiv(%ub0 - %lb0), %s0)
%n1 = floordiv(%ub1 - %lb1), %s1)
%0 = cmpi "slt", %pidy, %n0
%1 = cmpi "slt", %pidx, %n1
scf if (%0 && %1) {
  %arg0 = %lb0 + %pidy * %s0
  %arg1 = %lb1 + %pidx * %s1

Use case 3 : Distributing iterations to processors in a cyclic manner with matching number of processors

If it is known that the number of iterations exactly match the number of processors, this could be further simplified to

%arg0 = %lb0 + %pidy * %s0
%arg1 = %lb1 + %pidx * %s1

Note that all of these use cases rely on some information that is not deducible from the IR.

There are a couple of existing mechanism to map loops to processors

  1. The mapLoopsToProcessorIds method here. This can be used to implement use case 1, but relies on materializing the original scf.parallel operation.
  2. The SCFToGPU lowering has mapping attribute that is used to implement the distribution. This is very tied to the use of gpu.launch mechanism. This is really nice idea IMO and the fact that it controls the number of processors as well means that the transformation is correct by construction. But in uses cases which are similar to IREE, where the lowering is used only for device side code-generation this mechanism cant be used. In such situations it is much better to distribute the loop by construction.

(If there are other approaches, please let me know)

In terms of implementation, one option I can think of is to have an enum list that represents different distributions. The LinalgTilingOptions can be updated to include this to distribute the inter-tile loops. This is slightly painful cause it wont allow for custom distributions. Maybe the use of AffineMap to specify the distributions is an option, but its not clear to me how to encode the assumptions that allow for control-flow simplification. I am wondering if there is a way to allow the client of the tiling/lowering-to-loop transformations have a custom distribution (and not use some hard-coded strategies).?

Note that simplifying the control-flow that is generated this way will make some other transformations much easier since you might not need to move instructions across region boundaries

(fyi : @nicolasvasilache, @ThomasRaoux, @antiagainst, @herhut, @hanchung)

I am not sure what distribute the loop by construction refers to, here. The approach using mapping attributes was not meant to infer these attributes. Instead, the producer of the loops should already supply the mapping, as well. In the end, one does the tiling etc. with a specific mapping in mind, so it also makes sense to supply that mapping.

We currently have no producers for the mapping attribute except for a greedy mapper that was written mostly as a test pass. So if you want to add support to LinAlg to produce the right mapping annotations from the start, that would be awesome!

How about having a couple hard-wired distribution strategies that generate loops with mapping attributes right away. For custom use cases, also allow to do tiling without mapping attributes and then users can write a pass that only produces the mapping they desire. That way they get to reuse the tiling and transformation from LinAlg to loops and the transformation from loops to SIMT.

scf.parallel document what is possible rather than behavior. E.g., it just says it could be executed in parallel, need not be executed such but could be. At least until it is mapped, then you have more explicit semantics and can’t just lower to single, plain for loop.

+1 and this should be able to allow lowering to LinAlg to handle it.

@herhut, @jpienaar Thanks for the comments. I realize now that I wasnt clear that the distribution is meant to be optional. The default will be what is currently done, i.e. explicitly instantiate the scf.parallel. This is an additional option that allows you to directly generate the distributed loops.

I am working on a patch how this looks in Linalg. WIll send it out for review not as finished product, but something to anchor the discussion and evolve.

I implemented a prototype of what I had in mind.
Here is the patch for that : https://reviews.llvm.org/D85147

It works for context of Linalg (and what use case I listed above). The change above is the first draft and open to evolving it based on inputs. The difference from the comments above though is that it distributed the loop while generating it. So you can pass the distribution as an option to Linalg Tiling as follows

            .setTileSizes({8, 8, 4})

Effectively it allows generating parallel code from Linalg using a similar specification as with sequential code, but with an extra option.
I am not saying that the explicit instantiation of scf.parallel and distribution is never needed. Whether to use the approach above, or what exists with lowering scf.parallel → GPU depends on use case.

Thanks Mahesh for tackling mapping to SPMD loops and for making it play nicely with the patterns.

This is one of the important transformations that we have been discussing for some time but didn’t have the resources to attack yet. I am very excited to see progress on this!

This all looks great to me, my only comment would be re. “Use case 2:” and “Use case 3:”.
I think one of the longer term goals should be that we can express everything like “Use case 1:” + appropriate std.assume / std.assert / new special op carrying the relevant information + canonicalizations should kick in.

In practice, many canonicalizations are missing to get there (for instance scf.if doesn;t even know how to shed an empty else block) so it makes total sense to start with specific implementations for now.

What is the benefit of doing the above transformation within LinAlg directly? One could equally well transform out of LinAlg into parallel loops (maybe already with processor mapping annotations) and then do the transformations for the different distributions on that level. The advantage I see is that it would enable more reuse, as the distribution of those loops is not a LinAlg specific thing.

What information that you are having in LinAlg is missing at the (parallel-)loop nest level?

Independently, I would prefer if this materialized parallel loops first before going to a SIMT function. It would allow the reuse of this mapping in the flow that uses the GPU host side. As far as I understand the current patch, the transformation goes directly to SIMT functions without host-side.

@herhut it’s the same type of benefits that are described in the Rationale: Linalg is a higher-level representation that still has information about both the iteration space and the data type.

Off the bat I can list 4 concrete benefits:

  1. programmatic control: it’s easy to just integrate within a pattern or an “expert compiler” strategy as Mahesh has shown in the code snippet above. We spent a lot of effort making these patterns compose and canonicalize so we’re starting to reap the rewards of doing more things that compose at this level.
  2. once we have mapping to SPMD loops, the next step is to distribute the data at a higher-level (we’ll probably wait for Linalg on tensors to be finished for this though). This is expected to also be simple and not require complex analysis since Linalg ops “know about” control-flow and data. This is where interesting abstractions will be developed and where we’ll start getting into collectives, subview intersection and difference, data ownership etc.
  3. as usual with Linalg, with a little indirection and a new data type this is expected to extend to other structured data types (e.g. sparse, trees, …). Note that @aartbik has started pushing on this front with sparse vector operations.
  4. I also envision this will compose nicely with fusion. Once the tiled and mapped subviews use processor ids, it is possible some type of fusions “will just work”.

There is still much work to be done but taking the first step like @MaheshRavishankar did here opens up a wide range of possibilities.

Also, as you know from prior discussions, I think starting this line of work is ~9 months overdue so I am doubly excited :slight_smile: !

Totally agree with this (and was thinking along the same line as well). But it seems like the std.assume/std.assert can be quite rich, and needs some design to make it really usable. As you mentioned, we can have hard-wired strategy enums and then make the choice of which one to use automatic in the future by using this information.

I think it depends on use case. Explicitly instantiating the loops required additional things like hoisting etc. to also kick in to get the generated code to the right form. This means additional patterns to do things, which we could avoid if we didnt instantiate the loops in the first place (which we can avoid doing since we have some information about the loop trip count that is not statically know, but a dynamically assertable property). THen there is a matter of making sure these patterns run in a particular order or splitting them in passes. So depending on use case one approach maybe better than the other. As @nicolasvasilache mentioned, this is something we have been discussing for a while, and with this option we could get the generated code to the form we want much easier.

It is not clear to me why the same approach cannot be taken when lowering to parallel loops. I am not arguing that the descision should not be made at the linalg level. What I am arguing for is to materialize the loops instead of directly creating a SIMT function

What is a SPMD loop in this context? Again, I am not arguing that LinAlg is the right level to make these decisions. I am only arguing about the output format.

Do you have an example of what you mean here?

I am no less excited about this to happen. The current implementation is very focussed on the use of LinAlg in the context of IREE. We would have to reimplement the same transformation again to make it work in the context of the gpu dialect.

What information do you have that cannot be encoded in loops + mapping annotations?

That is an argument why this is easier to do to get exactly what is needed in the context of IREE. My point was whether we can reformulate this so that it can also be shared with other use cases. Going to parallel loops is one way to do this.

If adding an extra layer of abstraction makes it impossible to get the ordering of pattern application right, then that is a structural problem we have in MLIR and IMHO the solution should not be to do everything at the LinAlg level.

If you insist to do this in one step, could we at least have a version that generates a corresponding gpu.launch so that this supports cases where you have multiple LinAlg operations within one function?

I am not sure introducing a gpu.launch abstraction here works. This mapping is not necessarily for GPUs. You could do this partitioning to CPUs, or some other architecture. Nothing in the use case mentioned above is specific to GPUs. Its just a distribution primitive. Even for GPUs it would mean that any GPU targeting necessarily has to go through the GPU dialect which models both the host and device side. It doesnt allow you to decouple the two. Cases like IREE which uses Linalg → … → SPIR-V only for kernel side code generation, and models the host side by itself would be hard to support. For example, IREE codegen already does this, but that code is too involved. This would have the double benefit of moving all of this into core so that other use cases similar to IREE can use it from MLIR (and not have to use IREEs code-generation), while also reducing the complexity of the codegen significantly.

I have similar question as Stephan here on separation of the decision vs materialization of the loop nest. The latter is important from a progressive lowering point of view and code-reuse.

Could you give a bit more detail here? Otherwise, I have exactly the same questions as @herhut. Whatever you want to do isn’t really tied to Linalg - it’s just that linalg is one of the users of this functionality and this is possible by attaching the right attributes and/or having the right API on scf.parallel like you briefly mention about in your original post. It finally comes down to the control flow simplification that you say can’t be encoded/specified in an integrated manner while doing such a tiling/distribution. This is the part that’s not clear to me and I believe it should all be possible.

Actually, block, cyclic, and block cyclic are by far the most common and pretty much the only things I’ve seen as far as structured static mappings go. Block cyclic in fact subsumes the other two, which are special cases of it. In addition, if you extend these to make them multi-level (multiple nested blocks), I think it is all that’s needed or used in practice. The only other structured mappings outside of these I’ve seen are multi-partitioning (used in the Rice dHPF compiler - it’s a skewed block cyclic mapping) and Sudoku mappings (which is similar in spirit to multipartitioning). But these are really specialized but emphasized then because they provided significant improvements for NAS SP and BT benchmarks.

All block cyclic mappings can be specified by affine maps if necessary - by in fact restricting them to be of a certain form and verifying that, but affine maps aren’t really necessary and an overkill if you want to specify a single level block cyclic with a block cycle size. However, for the full generality where you want multi-level block cyclic with perhaps even a customized permutation, using an affine map will provide a compact form without the need to invent too many custom attributes / strings and ways to compose them.

Thanks for the comments/questions. Maybe one specific use case that triggered this might be helpful. @ThomasRaoux is looking into mapping linalg.matmul to co-operative matrix multiply ops. So the way to do this would

  1. Tile once to get the loop that will get distributed across workgroups/blocks
  2. Tile once more to get the loop that will get distributed across subgroups/warps
  3. Lower the subgroup level linalg.matmul operation to spv.CooperativeMatrixMatrixMultiply operation.

So roughly the code is

scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%M, %K) step (%blockDim_y, blockDim_x) {
  scf.for %arg2 = %c0 to %K step %tileSize_k {
     %wgmem_A = alloc(%blockDim_y, %tileSize_k) : memref<?x?xf32,3> // workgroup memory
     %wgmemsv_A = subview %wgmem_A[0, 0][%blockDim_y, %tileSize_k]
     %subview_A = subview %A[%arg0, %arg1][%blockDim_y, %tileSize_k]
     linalg.copy(%subview_A, %wgmemsv_A)
     scf.parallel (%arg3, %arg4) = (%arg0, %arg1) to (%blockDim_y, blockDim_x) step (%sg_y, %sg_x) {
        scf.for (%arg5) = (%arg2) to (%tileSize_k) step %sg_z {

Some issue that came up here are

  1. The tiling using the LinalgTilingPattern (which overloads the PatternRewriter class) generates the scf.parallel loops. It gets fairly complicated (and ad-hoc) to later on have a pass that looks at nesting of scf.parallel ops and then maps it to different processor heirarchy. It seems much cleaner to distribute the loops while they are being generated if you are not doing any transformation at the loop level. (As a contrary example, if you want to do double buffering, then there is a definite advantage of explicitly instantiating the loop, for example).

  2. Using the same mapping mechanism that is used from scf.parallel to gpu.* lowering is possible, i.e. while tiling you can add the attributes that will later be used to distribute the loop. This has further issues

    • This only works for the case where gpu dialect is used in the lowering. Since GPU dialect is used for both host and device side modeling, for cases like IREE which is only using the Linalg → … → SPIR-V path for device side code-generation this approach doesnt work.
    • I have looked at the mapping attributes and it is fairly non-intuitive to me what mapping I need to use to map to workgroups, subgroups and workitems

    Had an offline conversation with @herhut and we will look into how to refactor the scf.parallel to gpu. lowering to not be tied to a gpu.launch operation, as well as helping me figure out what the mapping attribute should be for getting the lowering working