[RFC] Adding transform.loop.tile_indirection_using_forall

Motivation

During EuroLLVM 2024 I presented a proposal to automatically parallelize indirect memory writes in MLIR. This kind of memory access is present in many AI and HPC applications. I will summarize the motivation now, but you can find the slides attached at the bottom of this post if you want more information.

Indirect memory writes involve writing to an array where the access is based on indices stored in another array. For example:

%0 = scf.for %arg0 = %c0 to %1 step %c1 iter_args(%arg1 = %buff) -> (tensor<?xi32>) {
  %extracted = tensor.extract %indices[%arg0] : tensor<?xi32>
  %idx = arith.index_cast %extracted : i32 to index
  %4 = tensor.extract %arg1[%idx] : tensor<?xi32>
  %5 = arith.addi %4, %c1_i32 : i32
  %inserted = tensor.insert %5 into %arg1[%idx] : tensor<?xi32>
  ...
  scf.yield %inserted : tensor<?xi32>
}

Here you can see that the index used to access the buffer %buff (carried through the loop as %arg1) is given by %idx, which is itself loaded from another array, %indices.

To parallelize this in MLIR we would normally apply tile_using_forall, but here that is not possible: %idx can point to an arbitrary location, so race conditions can easily arise.
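
For reference, this is roughly how “traditional” tiling is requested today. The match pattern and handle names below are only an illustrative sketch; note that the op requires a payload implementing TilingInterface (e.g., a linalg op), which is exactly what an indirect write like the loop above cannot be expressed as:

// Sketch of the existing transform op applied to a tileable payload (not the loop above).
%matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
// Returns handles to the tiled op and to the generated scf.forall.
%tiled:2 = transform.structured.tile_using_forall %matmul num_threads [2] : (!transform.any_op) -> (!transform.any_op, !transform.any_op)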

Why at the loop level and not at the linalg level?

An indirect memory write (like the one shown above) cannot be represented in linalg, only at the loop level, so loops are the highest level at which this transformation can happen.

How it works

The idea is to generate a private buffer for each thread: each thread computes its data and stores it in its own private buffer, so we end up with n private buffers for n threads. After the loop, a reduction combines the partial outputs of the threads into the final output. The slides explain this in more detail (page 7 onwards).

At the moment this does not implement TilingInterface. However, after conversations with several people (e.g., @ftynse), I realized this is a major requirement, so I will adapt the code to implement it before upstreaming.

Why not split the tile_indirection_using_forall op into two ops?

A more generic approach would be to have loop tiling as a separate transform and then the buffer privatization as an option. This is how it is presented in the slides, and this would allow users to have tiling at the loop level and, if needed, buffer privatization. However, I discarded this possibility because “traditional” tiling and the tiling needed to privatize buffers are radically different.

In “traditional” tiling, each loop iteration reads, computes and writes into part of the output (e.g., using subviews). By contrast, tiling for buffer privatization gives each thread ownership of the whole output (no subviews are needed). Therefore, if this were implemented in two separate steps, with buffer privatization applied after tiling, the tiling would have to be altered (or undone) to make the privatization possible, which does not seem reasonable.

This is why having one single op, transform.loop.tile_indirection_using_forall, seems like the most elegant solution.
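
To make the intended usage concrete, here is a rough sketch of how the op could be driven from a transform script; the num_threads clause and the result handles are assumptions about the eventual interface, not a settled design:

%loop = transform.structured.match ops{["scf.for"]} in %arg1 : (!transform.any_op) -> !transform.any_op
// Hypothetical: tile the scf.for into an scf.forall over 2 threads, privatize the
// written tensor per thread, and emit the combining linalg.reduce.
%forall, %reduction = transform.loop.tile_indirection_using_forall %loop num_threads [2] : (!transform.any_op) -> (!transform.any_op, !transform.any_op)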

Example

Please consider the following code:

%0 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %arg1) -> (tensor<4xf32>) {
  %extracted = tensor.extract %arg0[%arg2] : tensor<4xf32>    
  %1 = arith.addf %extracted, %extracted : f32
  %inserted = tensor.insert %1 into %arg3[%arg2] : tensor<4xf32>
  scf.yield %inserted : tensor<4xf32>
}

a possible loop tiling using forall (2 threads) would be:

%0 = scf.forall (%arg2) in (%c2) shared_outs(%arg3 = %arg1) -> (tensor<4xf32>) {
  %1 = affine.apply #map(%arg2)[%c4]
  %2 = affine.min #map1(%arg2)[%c4]
  %extracted_slice = tensor.extract_slice %arg0[%1] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %extracted_slice_0 = tensor.extract_slice %arg3[%1] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %3 = scf.for %arg4 = %c0 to %c2 step %c1 iter_args(%arg5 = %extracted_slice_0) -> (tensor<2xf32>) {
    %extracted = tensor.extract %extracted_slice[%arg4] : tensor<2xf32>
    %4 = arith.addf %extracted, %extracted : f32
    %inserted = tensor.insert %4 into %arg5[%arg4] : tensor<2xf32>
    scf.yield %inserted : tensor<2xf32>
  }
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %3 into %arg3[%1] [2] [1] : tensor<2xf32> into tensor<4xf32>
  }
}

whereas a tile_indirection_using_forall would yield:

%1 = scf.forall (%arg2) in (%c2) shared_outs(%arg3 = %0) -> (tensor<2x4xf32>) {
  %2 = affine.apply #map(%arg2)[%c4]
  %3 = affine.min #map1(%arg2)[%c4]
  %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<2x4xf32> to tensor<4xf32>
  %cst = arith.constant 0.000000e+00 : f32
  %4 = linalg.fill ins(%cst : f32) outs(%extracted_slice : tensor<4xf32>) -> tensor<4xf32>
  %5 = scf.for %arg4 = %2 to %3 step %c1 iter_args(%arg5 = %4) -> (tensor<4xf32>) {
    %extracted = tensor.extract %arg0[%arg4] : tensor<4xf32>
    %6 = arith.addf %extracted, %extracted : f32
    %inserted = tensor.insert %6 into %arg5[%arg4] : tensor<4xf32>
    scf.yield %inserted : tensor<4xf32>
  }
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %5 into %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<4xf32> into tensor<2x4xf32>
  }
}
%reduced = linalg.reduce ins(%1 : tensor<2x4xf32>) outs(%arg1 : tensor<4xf32>) dimensions = [0]  (%in: f32, %init: f32) {
  %2 = arith.addf %in, %init : f32
  linalg.yield %2 : f32
}

Note that in the last case we cannot tile the output: the index is not known beforehand, so the tensor.insert can happen at any location, which is why each thread needs the whole (private) tensor.

Parallelizing_Applications_With_Indirect_Memory_Writes_in_MLIR.pdf (1.6 MB)


Could you please elaborate on what the input to your transformation is; specifically, what is the operation that gets tiled?

High-level remarks:

  • We should separate the discussion of the transformation itself from how it is exposed in the transform dialect. Even if we end up having a single transform op, the implementation should still reuse code at the C++ level.
  • It is unclear to me that summing up buffers from different threads is the behavior that is always desired. We may want to somehow generalize the reduction logic. In particular, I can see the caller making the assumption that indices never overlap and generating assertions as the reduction.
  • This generally goes in the direction of generalizing the relation between the iteration space and the input/output data spaces of a linalg op. Specifically, the indirect access cannot be encoded as an affine_map indexing map. Have you considered extending the possible indexing maps? And/or generalizing the tiling mechanism to accept “data tile” generators (the interface may already provide methods for this, though potentially insufficiently generic) instead of assuming an invertible affine map?

Could you please elaborate on what the input to your transformation is; specifically, what is the operation that gets tiled?

The input (which is the operation that gets tiled) is an scf::ForOp. The output is the tiled implementation of the scf::ForOp enclosed by an scf::ForallOp, followed by a reduction (implemented with linalg.reduce).

High-level remarks:

  • We should separate the discussion of the transformation itself from how it is exposed in the transform dialect. Even if we end up having a single transform op, the implementation should still reuse code at the C++ level.

I agree. Regarding how it is exposed in the transform dialect, I believe the cleanest way is as a single op, as I explained in the post. About the implementation, I think it is possible to reuse code at the C++ level: the driver I currently have for loop tiling is similar to the one used by tileToForallOpImpl. What cannot be reused is the buffer privatization, but for the tiling itself I believe reuse is possible.

  • It is unclear to me that summing up buffers from different threads is the behavior that is always desired. We may want to somehow generalize the reduction logic. In particular, I can see the caller making the assumption that indices never overlap and generating assertions as the reduction.

Good point! I didn’t make this clear in my post, but the idea is to allow the user to specify which reduction kind to use instead of hardcoding an add. The implementation would follow something like what createLinalgBodyCalculationForReduceOp does, basically generating the body of the reduction depending on a given input. This reduction kind could then be exposed at the transform dialect level.
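
For illustration only, the reduction kind could surface as an attribute on the proposed op; the spelling below is a placeholder, not a settled design:

// Hypothetical attribute selecting how the per-thread private buffers are combined
// (e.g., "add", "mul", "max").
%forall, %reduction = transform.loop.tile_indirection_using_forall %loop num_threads [4] { reduction_kind = "max" } : (!transform.any_op) -> (!transform.any_op, !transform.any_op)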

  • This generally goes in the direction of generalizing the relation between the iteration space and the input/output data spaces of a linalg op. Specifically, the indirect access cannot be encoded as an affine_map indexing map. Have you considered extending the possible indexing maps? And/or generalizing the tiling mechanism to accept “data tile” generators (the interface may already provide methods for this, though potentially insufficiently generic) instead of assuming an invertible affine map?

I haven’t considered either of those. As we discussed, this is an interesting path to pursue in the future, and I personally would love to see it happen, but at the moment I don’t have anything along those lines.


Let’s make sure to ask @nicolasvasilache and @MaheshRavishankar.

I can reply more next week (I am a bit sick this week and also going to be travelling, which is fun), but I think what you are looking for is essentially gather/scatter kinds of operations. I don’t think the solution is to start at the loop level and tile there. You are right that you cannot represent these operations in Linalg (you can for gather, but not so much for scatter), but you can define operations that implement scatter/gather semantics (like we have in the LinalgExt dialect) which implement the TilingInterface. The exact semantics of how you generate the intra-tile implementation is op-dependent, but if you implement the interface it effectively generates the inter-tile traversal code (and also enables fusion using tile-and-fuse).

Next week I can look deeper into the RFC and provide some better answers, but hopefully this gives some useful information.

Thanks for the suggestion. If I understand you correctly, you are proposing to define high-level operations that encapsulate the semantics of, for example (from my slides), triangle counting, betweenness centrality or unsorted segment sum, and then make those operations implement the TilingInterface (each op with its corresponding implementation). Because those ops cannot be represented with linalg, if lowered, they could only be lowered to scf directly. But yes, they could be tiled at a high level and, once lowered, land in the scf dialect.

If my understanding is correct, then I’m not sure this is the most effective way of doing it, because I assume we would need N different implementations of TilingInterface (where N is the number of ops with scatter/gather semantics), whereas tiling the loop-level representation of each of those ops can be accomplished with a single loop-tiling implementation. Besides, scatter/gather lives in the LinalgExt dialect in IREE, but I don’t see the same thing happening for the three ops I mentioned in upstream linalg. Happy to discuss any of these points further.

As a fly-by: there are ways to reuse interface implementations for multiple ops, starting with naively calling the same Impl function template.

Actually, looking through the proposal again, I don’t see what is “indirect” about this transformation. This is essentially tree reduction? This is what PartialReductionOpInterface and tileReductionUsingSCF are meant to handle? The additional step that is needed here is that, instead of directly mapping iterations to threads, this uses an indirect buffer to get which thread executes which iteration. This is something that could be added to the implementation of tileReductionUsingSCF?

Actually, looking through the proposal again, I don’t see what is “indirect” about this transformation.

Maybe the name is a bit confusing, but my intention with the name is to express that it tiles indirect memory writes, not that the transformation itself is indirect. You can see the indirection by inspecting the input loop, which contains a load from an array whose position depends on a load from another array, hence being indirect.

This is essentially tree reduction? This is what PartialReductionOpInterface and tileReductionUsingSCF are meant to handle? The additional step that is needed here is that, instead of directly mapping iterations to threads, this uses an indirect buffer to get which thread executes which iteration.

Thank you for bringing this idea into the conversation. When I started working on this, I also had the impression that this is similar to what tile_reduction_using_scf does. However, I see a critical difference.

The target (input) of tile_reduction_using_scf is a structured op with a reduction, whereas the target of tile_indirection_using_forall is a loop containing an indirect memory write. Assuming we could apply tile_reduction_using_scf to such a loop, it would not work, because the expected input is a reduction, not an indirect memory write.

IMO, the key here is that an indirect memory write and a reduction are different things. Linalg supports reductions but not indirect accesses, which I think says something about how different an indirection and a reduction are. On the other hand, I agree that the way in which we tile a reduction and an indirection is similar, the main difference being the one you outlined. I feel this distinction deserves a separate op.

Besides, tile_reduction_using_scf targets structured ops. My intention is that tile_indirection_using_forall targets loops, as indirect memory writes cannot currently be represented with structured ops.

This is something that could be added to the implementation of tileReductionUsingSCF?

In theory, it should be possible. However, it does not feel natural to me, since the name implies it should be tiling a reduction, not an indirect access, and tileReductionUsingSCF targets structured ops, not loops.

I think I need to step back a little bit more. The main problem I have is that you are starting with a “lowered representation” of a scatter-like operation.

The code above is essentially a scatter operation. To implement the tiling that you have in mind you have to do a lot of analysis:

  1. You need to look for the tensor.extract whose value comes from another tensor.extract.
  2. You then need to know that the second tensor.extract is coming from a loop.

This kind of “lifting” is fragile. One way to solve this is to have an operation that represents your original computation. This is basically a scatter_add. Leaving the op design aside for a moment, say you had a scatter_add operation, something like this:

%0 = scatter_add(%buff, %indices) : ...

that represents the loop computation you are trying to express (you can add regions to the op to represent more than just scatter-add, etc., but let’s leave that aside).
Now your op can implement the TilingInterface.

  1. It has a 1D iteration space of lb, ub, step = %c0, %1, %c1.
  2. The loop code that you have is essentially what the “lowerToLoops” of this operation would be.

Then, if you implement the TilingInterface on this operation, you should be able to tile it just like any other op (there is some subtlety here w.r.t. scatter which makes the iteration space either parallel or sequential, but that’s getting into the weeds a bit).
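
As a rough sketch only (reusing the hypothetical scatter_add from above, so none of this is existing IR), the tiled form produced through the TilingInterface might look something like:

%0 = scf.forall (%tid) in (%c2) shared_outs(%out = %buff) -> (tensor<?xi32>) {
  %offset = affine.apply #map(%tid)[%n]
  %size = affine.min #map1(%tid)[%n]
  // Only the iteration space over %indices is tiled; each thread still scatters
  // into the whole output, so the scatter dimension must either be declared
  // parallel (an assertion that indices do not overlap) or the partial results
  // must be combined afterwards.
  %idx_slice = tensor.extract_slice %indices[%offset] [%size] [1] : tensor<?xi32> to tensor<?xi32>
  %1 = scatter_add(%out, %idx_slice) : ...
  ...
}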

btw, this is not a scatter any more. This is just doing an elementwise add of %arg0 with itself, so that’s not the same as your original example. So I don’t understand what is happening in your example of subsequent tiling and tile_using_indirection. Are these examples what you intended?

I generally agree with the sentiment here: we can and should have a tileable scatter, and raising is fragile. However, not every input comes from a frontend where the information about something being a scatter is readily available. Therefore, I’m supportive of having properly designed loop-level transformations if needed.

We’d still need a way to combine the results produced by each tile, though. AFAIR, current tiling schemes just insert_slice/parallel_insert_slice them, assuming they don’t overlap. This is indeed akin to reduction tiling, which generates a different combinator, but it may still require a new transformation/scheme. It could be interesting to think about a generalization here, assuming we pursue the direction of higher-level ops, which could start with prototyping the transformation separately and then analyzing its overlap with the reduction case.

See [RFC] Structured Codegen Beyond Rectangular Arrays

If there is interest in contributing a proper nested tensor<tensor...> abstraction and its lowering to memref<memref...> I’ll happily review and help guide.

The way we did this in IREE was that the “dimension along which the scatter occurs” can be either sequential or parallel. If specified as “parallel”, then it is an assertion that the scatter can happen in parallel (based on the indices used for the scatter). We could get more fancy with the “lowering to loops” of the scatter to use atomics, etc., but we haven’t gone there.


Yeah, I’m thinking of a generalization that requires explicit re-combination, like reductions.

I’ve tried something in this direction, and I found a lot of broken glass / gave up. I think this is a large project to do right and needs design/analysis. Just setting expectations: I don’t think this is an incremental enhancement, and it might need a bit more of a thought-out push rather than just diving in.

btw, this is not a scatter any more. This is just doing an elementwise add of %arg0 with itself, so that’s not the same as your original example. So I don’t understand what is happening in your example of subsequent tiling and tile_using_indirection. Are these examples what you intended?

Sorry for the confusion. My first example shows the code I want to tile (what you suggest representing with a scatter_add), so your understanding and reasoning about that code is correct.

The second code example (the one that confused you) is a motivating example to show that the upstream tiling pass and the kind of tiling I’m proposing in this RFC are different. You can see these differences better by looking at the inner scf.for. But yes, you are right, the code itself is not an indirect memory write (i.e., it is not a scatter_add).

The code above is essentially a scatter operation. To implement the tiling that you have in mind you have to do a lot of analysis

Initially, I also thought that this would require a lot of analysis. However, I later realized that this is not the case. I think the example I just mentioned is a good way to show this.

If you look at the inner scf.for you will see that the extract:

%extracted = tensor.extract %arg0[%arg2] : tensor<4xf32> 

is transformed by tile_indirection_using_forall into:

%extracted = tensor.extract %arg0[%arg4] : tensor<4xf32>

So the only thing to do is to adapt the induction variable (easy). And the insert:

%inserted = tensor.insert %1 into %arg3[%arg2] : tensor<4xf32>

is also easily transformed into:

%inserted = tensor.insert %6 into %arg5[%arg4] : tensor<4xf32>

because %arg5 is just the empty tensor we automatically generate, so no analysis is required. I hope this example clarifies my point.

Thank you all for your comments. It’s always fun to have good discussions around MLIR.

I personally think that a high-level op such as scatter_add would indeed be helpful, and I also think that tiling at the loop level can potentially be fragile.

That said, I will now speak as an MLIR user: we cannot always count on having this high-level information. Sometimes I find myself at the loop level and, while I understand that most of these things should ideally happen in linalg or similar, as @ftynse also pointed out, I would find it useful to also have a well-designed loop-level transformation for the cases where we end up at the loop level, for whatever reason.

Because I think I might have an implementation that fits this description, and because I don’t have enough experience to judge whether it is fair to have this at the loop level or not, I opened this RFC. For me it is not reasonable to re-implement this at a higher level (e.g., as a scatter_add), so if the answer to that question is no (which, as I explained earlier, certainly makes sense to me), then I shall keep my implementation downstream.


FWIW, @wsmoses and friends have similar needs for loop transformations coming from Polygeist, which starts at C++ and not Linalg. The best they can do is affine loops, and not always. Given that there is more than one client for such a transformation, it’s reasonable to have it upstream, even if we eventually have a scatter at a higher level.