Motivation
During EuroLLVM 2024 I presented a proposal to automatically parallelize indirect memory writes in MLIR. This kind of memory access is present in many AI and HPC applications. I will summarize the motivation now, but you can find the slides attached at the bottom of this post if you want more information.
Indirect memory writes involve writing to an array where the access is based on indices stored in another array. For example:
%0 = scf.for %arg0 = %c0 to %1 step %c1 iter_args(%arg1 = %buff) -> (tensor<?xi32>) {
  %extracted = tensor.extract %indices[%arg0] : tensor<?xi32>
  %idx = arith.index_cast %extracted : i32 to index
  %4 = tensor.extract %arg1[%idx] : tensor<?xi32>
  %5 = arith.addi %4, %c1_i32 : i32
  %inserted = tensor.insert %5 into %arg1[%idx] : tensor<?xi32>
  ...
  scf.yield %inserted : tensor<?xi32>
}
Here you can see that the index used to access %buff (carried through the loop as %arg1) is given by %idx, which comes from another array, %indices.
To parallelize this in MLIR we would normally apply tile_using_forall, but here it cannot be done that way, because %idx can point to an arbitrary location, so race conditions may arise easily.
Why at the loop level and not at the linalg level?
An indirect memory write (like the one shown above) cannot be represented in linalg, only at the loop level, so the loop level is the highest level at which this transformation can happen.
How it works
The idea is to generate a private buffer for each thread, so that each thread will compute its data and store it in its private buffer. Then we would have n private buffers, where n is the number of threads. After the loop, a reduction takes place, which combines the partial outputs of each thread into a final output. The slides explain this in more detail (page 7 onwards).
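To make the shape of the result concrete, here is a minimal sketch of just the combine step, assuming 2 threads and a 4-element f32 output (the names and shapes are illustrative only; the full generated IR is shown in the Example section below). Each row of %partials is one thread's zero-initialized private copy of the output, and the reduction folds the rows into the original output tensor:

func.func @combine_partials(%partials: tensor<2x4xf32>, %out: tensor<4xf32>) -> tensor<4xf32> {
  // Row t of %partials holds thread t's private copy of the output.
  // Reducing over dimension 0 adds the per-thread partial results into %out.
  %reduced = linalg.reduce ins(%partials : tensor<2x4xf32>)
                           outs(%out : tensor<4xf32>) dimensions = [0]
    (%in: f32, %acc: f32) {
      %sum = arith.addf %in, %acc : f32
      linalg.yield %sum : f32
    }
  return %reduced : tensor<4xf32>
}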
At the moment this does not implement TilingInterface. However, after conversations with several people (e.g., @ftynse), I realized this is a major requirement, so I will adapt the code to implement it before upstreaming.
Why not split tile_indirection_using_forall into two ops?
A more generic approach would be to have loop tiling as a separate transform and then the buffer privatization as an option. This is how it is presented in the slides, and this would allow users to have tiling at the loop level and, if needed, buffer privatization. However, I discarded this possibility because “traditional” tiling and the tiling needed to privatize buffers are radically different.
In “traditional” tiling, each loop iteration reads, computes, and writes part of the output (e.g., using subviews). In contrast, tiling to achieve buffer privatization is implemented by giving each thread ownership of the whole output (i.e., no subviews are needed). Therefore, if this were implemented in two separate steps, with buffer privatization applied after tiling, the tiling would have to be altered (or undone) to implement the privatization, which does not seem reasonable.
This is why having one single op, transform.loop.tile_indirection_using_forall, seems like the most elegant solution.
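For illustration, a transform script driving the proposed op could look roughly like the following. This is only a sketch: the exact name, syntax, and attributes of the op are not final, and the num_threads [2] clause is an assumed interface that mirrors tile_using_forall.

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    // Match the scf.for loop containing the indirect writes.
    %loop = transform.structured.match ops{["scf.for"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // Hypothetical invocation of the proposed op: tile across 2 threads,
    // privatizing the output buffer and emitting the final reduction.
    %forall = transform.loop.tile_indirection_using_forall %loop num_threads [2]
        : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}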
Example
Please consider the following code:
%0 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %arg1) -> (tensor<4xf32>) {
  %extracted = tensor.extract %arg0[%arg2] : tensor<4xf32>
  %1 = arith.addf %extracted, %extracted : f32
  %inserted = tensor.insert %1 into %arg3[%arg2] : tensor<4xf32>
  scf.yield %inserted : tensor<4xf32>
}
A possible loop tiling using forall (2 threads) would be:
%0 = scf.forall (%arg2) in (%c2) shared_outs(%arg3 = %arg1) -> (tensor<4xf32>) {
  %1 = affine.apply #map(%arg2)[%c4]
  %2 = affine.min #map1(%arg2)[%c4]
  %extracted_slice = tensor.extract_slice %arg0[%1] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %extracted_slice_0 = tensor.extract_slice %arg3[%1] [2] [1] : tensor<4xf32> to tensor<2xf32>
  %3 = scf.for %arg4 = %c0 to %c2 step %c1 iter_args(%arg5 = %extracted_slice_0) -> (tensor<2xf32>) {
    %extracted = tensor.extract %extracted_slice[%arg4] : tensor<2xf32>
    %4 = arith.addf %extracted, %extracted : f32
    %inserted = tensor.insert %4 into %arg5[%arg4] : tensor<2xf32>
    scf.yield %inserted : tensor<2xf32>
  }
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %3 into %arg3[%1] [2] [1] : tensor<2xf32> into tensor<4xf32>
  }
}
whereas a tile_indirection_using_forall would yield:
%1 = scf.forall (%arg2) in (%c2) shared_outs(%arg3 = %0) -> (tensor<2x4xf32>) {
  %2 = affine.apply #map(%arg2)[%c4]
  %3 = affine.min #map1(%arg2)[%c4]
  %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<2x4xf32> to tensor<4xf32>
  %cst = arith.constant 0.000000e+00 : f32
  %4 = linalg.fill ins(%cst : f32) outs(%extracted_slice : tensor<4xf32>) -> tensor<4xf32>
  %5 = scf.for %arg4 = %2 to %3 step %c1 iter_args(%arg5 = %4) -> (tensor<4xf32>) {
    %extracted = tensor.extract %arg0[%arg4] : tensor<4xf32>
    %6 = arith.addf %extracted, %extracted : f32
    %inserted = tensor.insert %6 into %arg5[%arg4] : tensor<4xf32>
    scf.yield %inserted : tensor<4xf32>
  }
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %5 into %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<4xf32> into tensor<2x4xf32>
  }
}
%reduced = linalg.reduce ins(%1 : tensor<2x4xf32>) outs(%arg1 : tensor<4xf32>) dimensions = [0]
  (%in: f32, %init: f32) {
    %2 = arith.addf %in, %init : f32
    linalg.yield %2 : f32
  }
Note that in the last case we cannot tile the output, since we do not know the index beforehand: the tensor.insert can happen at any location, which motivates giving each thread the whole (private) tensor.
Parallelizing_Applications_With_Indirect_Memory_Writes_in_MLIR.pdf (1.6 MB)