# Representing tiling on tensors + parallelism

Tiling of Linalg ops on tensors uses `scf.for` with destructive updates to represent the tiled computation. For example, a simple 1D operation

``````#map = affine_map<(d0) -> (d0)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = linalg.generic {
indexing_maps = [#map, #map], iterator_types = ["parallel"]}
ins(%0 : tensor<?xf32>) outs(%1 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
``````

when tiled using tile size `%t`, is represented as follows:

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.for %iv = %c to %2 step %t iter_args(%arg0 = %1) {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = tensor.extract_slice %0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%6 = tensor.extract_slice %arg0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
%8 = tensor.insert_slice %7 into %arg0[%iv] [%4] [1] : tensor<?xf32> into tensor<?xf32>
scf.yield %8 : tensor<?xf32>
} -> tensor<?xf32>
``````

The `tensor.extract_slice %arg0` and the `tensor.insert_slice %... into %arg0` represent the destructive update pattern, i.e. each iteration of the loop computes a tile of the result, inserts it into the current value of the result tensor (`%arg0`), and yields the entire tensor. As represented, this computation loses the information that the `scf.for` is actually parallel. The serialization is not inherent to the computation; it is purely due to the inability to represent parallel computation using `scf.for` with `tensor`-typed operands and results.

The problem becomes even more pronounced with distribution. Currently in Linalg, distribution is part of the tiling transformation, i.e. for a loop that is parallel, block-cyclic distribution is used during tiling to distribute work across threads on a grid. For example, tile + distribute of the original op results in the following IR.

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
#map2 = affine_map<()[s0, s1] -> (s0 * s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%id = "proc_id"() : index
%np = "nprocs"() : index
%offset = affine.apply #map2()[%id, %t]
%step = affine.apply #map2()[%np, %t]
%3 = scf.for %iv = %offset to %2 step %step iter_args(%arg0 = %1) {
....
} -> tensor<?xf32>
``````

This leads to a whole host of problems. The main one is that when all shapes are statically known, all tile sizes are statically known, and the tile size divides the problem size, the code within the tiled loop can be canonicalized to use static shapes as well, which then vectorizes easily (without requiring peeling/padding, etc., given the constraints). Unfortunately, materializing the distributed loop makes canonicalizing away the `affine.min` operation very involved, and without this the sizes aren't propagated as static shapes.
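For illustration, here is a hedged sketch (the static sizes 128 and 4 are made up) of what the undistributed tiled loop can canonicalize to when the shapes are static and the tile size divides the problem size; the `affine.min` folds to the constant tile size, and all the slices get static shapes:

``````// min(4, 128 - %iv) == 4 for every %iv in {0, 4, ..., 124}, so the
// affine.min folds away and all slices have static shape tensor<4xf32>.
%1 = linalg.init_tensor [128] : tensor<128xf32>
%2 = scf.for %iv = %c0 to %c128 step %c4 iter_args(%arg0 = %1) {
%3 = tensor.extract_slice %0[%iv] [4] [1] : tensor<128xf32> to tensor<4xf32>
%4 = tensor.extract_slice %arg0[%iv] [4] [1] : tensor<128xf32> to tensor<4xf32>
%5 = linalg.generic {...} ins(%3 : tensor<4xf32>) outs(%4 : tensor<4xf32>) -> tensor<4xf32>
%6 = tensor.insert_slice %5 into %arg0[%iv] [4] [1] : tensor<4xf32> into tensor<128xf32>
scf.yield %6 : tensor<128xf32>
} -> tensor<128xf32>
``````

Once the loop is distributed, the lower bound and step become `%id * %t` and `%np * %t`, so proving `min(%t, %ub - %iv) == %t` requires reasoning about `%np`, which is what makes the canonicalization involved.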

There are many potential solutions, each with its own trade-offs. I am listing them here in order of my preference, but I am really looking for an end-to-end solution (that we can use in IREE effectively). I have a proposal below that is the cleanest IMO, but I am not tied to it; there are other potential solutions as well.

### Solution 1: Extend `scf.for` operations.

The distribution that is done during tiling is a really simple transformation: it is just a change to the lower-bound and step values using the SSA values for the number of processors and the processor ID. There is no need to actually materialize the distributed loop. Instead, one could change the `scf.for` semantics to take these SSA values as two additional operands, so the distributed loop above would be represented as

``````%3 = scf.for %iv = %c0 to %ub step %c1 iter_args(...) distributed (id = %id , np = %np) {
...
} -> tensor<?x?xf32>
``````

Here the `scf.for` accepts two additional optional operands `%id` and `%np` to specify the processor ID and the number of processors. The presence of these operands also indicates that the loop is parallel and distributable. This allows easy analysis of the loop while maintaining the semantics that the loop is distributable. For example, for CUDA codegen in IREE, a pretty heavy analysis is done to figure out whether a loop can be removed because it is a zero-trip or one-trip loop. A lot of this complication arises from the loop being distributed already. The above form of the loop makes this analysis fairly trivial. The actual distribution can happen pretty late, either just before lowering to the LLVM dialect (or the SPIR-V dialect).
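To make the late materialization concrete, here is a rough sketch (made up, mirroring the distributed IR earlier in the post) of what lowering the extended form could produce just before LLVM/SPIR-V conversion:

``````// Input: parallel loop carrying the distribution operands.
%3 = scf.for %iv = %c0 to %ub step %t iter_args(%arg0 = %1)
distributed (id = %id, np = %np) { ... } -> tensor<?xf32>

// Output: block-cyclic distribution folded into the bounds and step.
%offset = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%id, %t]
%stride = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%np, %t]
%3 = scf.for %iv = %offset to %ub step %stride iter_args(%arg0 = %1) { ... } -> tensor<?xf32>
``````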

### Solution 2: Making `scf.parallel` work on tensors.

AFAIK `scf.parallel` only works on `memref` types. It could be extended to work on tensors while still maintaining the parallel semantics. The downside I see is that it does not have a way to carry the SSA values to use for distribution as represented above. The semantics of `scf.parallel` could also be extended to use the `id` and `np`, but I am less convinced of the actual value of `scf.parallel` over `scf.for` if it is just to say that some loops are parallel; the same could be achieved with an attribute on `scf.for`, for example.

### Solution 3: `linalg.tiled_loop`

For the most part `linalg.tiled_loop` already has some of the requirements I am looking for. The way distribution is represented needs to be tweaked: instead of using attributes to represent how it is distributed (which is very CUDA-specific and an abstraction leak IMO), it should just use optional SSA values. Having followed the `linalg.tiled_loop` evolution and after discussions with a few folks, I am still not convinced that `linalg.tiled_loop` pays for its abstraction. I think effective extensions to `scf.for` give us everything we need to represent parallel semantics.

Looking for comments about how to proceed here. Happy to take up any tasks that help toward a better representation. I really hope we can go with the `scf.for` extensions; from experience using these in IREE, that would be the easiest to make work e2e.
cc @nicolasvasilache @albertcohen @ftynse @herhut


Thanks for the nice write-up @MaheshRavishankar!
Quick question: `scf.parallel` supports reductions as well; isn't your "destructive update" just something that can fit that model? You wrote that `scf.parallel` only works on `memref` types, but it isn't clear to me where this requirement comes from.
Isn't it possible to write something that looks roughly like:

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.parallel (%iv) = (%c) to (%2) step (%t) init(%1) -> tensor<?xf32> {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = tensor.extract_slice %0[0] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%6 = tensor.extract_slice %1[0] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
scf.reduce(%7) : tensor<?xf32> {
^bb0(%lhs : tensor<?xf32>, %rhs: tensor<?xf32>):
%8 = tensor.insert_slice %lhs into %rhs[0] [%4] [1] : tensor<?xf32> into tensor<?xf32>
scf.yield %8 : tensor<?xf32>
}
} -> tensor<?xf32>
``````

A large part of the `scf.parallel` design was about how we'd structure the "reduction" aspect. I didn't quite get how this would be handled in your proposal; you just wrote that the "presence of these operands also indicates that the loop is parallel and distributable", but this does not address how the result of such a loop is computed.

I must be mistaken here. It seems like this was a restriction a while ago, but it is no longer the case.

Your representation of the destructive update using `scf.parallel` is correct. Looking at this, I think it is also possible to teach the in-place bufferization that is central to Linalg to recognize the destructive update pattern in a more structured way (I can expand on that later, or maybe I am totally off; I need to check with Nicolas on that).
But I have a broader concern with modeling reductions this way. This might be based on how things work in IREE, specifically for GPU codegen, but this representation works only if you are generating code for both the host and the device, i.e. this `scf.parallel` is lowered in a context where everything downstream of this op generates both host-side and device-side code. The main reason is that you could need multiple kernels to implement the reduction. This makes `scf.parallel` restrictive in its use (it can't be used in IREE, for example).

In the approach I was describing, there is no context in the `scf.for` about what is being lowered; it just indicates that the loop is distributed. The fact that the body of the loop is performing one stage of, say, a tree-based reduction is then just represented in the region of the loop, i.e. whatever is generating this loop generates multiple such "distributed" `scf.for` operations to implement the tree-based reduction, or whatever other computation it needs. Essentially the `scf.for` is very straightforward in what it represents (which makes it easier to target).
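As a sketch of that structure (everything here is invented, following the extended `scf.for` form from the first post), a two-stage tree-based reduction could be two independent distributed loops:

``````// Stage 1: each of the %np workers reduces its own chunk into its
// slot of a tensor of partial results.
%partials = scf.for %iv = %c0 to %ub step %t iter_args(%arg0 = %init0)
distributed (id = %id, np = %np) {
... // reduce one tile, tensor.insert_slice the partial at slot %id
} -> tensor<?xf32>
// Stage 2: a second distributed loop combines the partial results.
// Neither scf.for needs to know about the overall reduction algorithm.
%result = scf.for %iv = %c0 to %np step %c1 iter_args(%arg1 = %init1)
distributed (id = %id, np = %np) {
...
} -> tensor<?xf32>
``````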

Yes, all these points have been discussed at length previously; thanks for bringing them together in a single post.

As you know, I'm interested in exploring extending the `scf.for` semantics, so having it prototyped and connected end-to-end would go a long way towards understanding all the pieces better.

This would also help answer Mehdi's comments and other questions related to:

• multi- vs single-loop representation and tradeoffs (there are quite a few related to canonicalization, writing transformations, parallel semantics in tensor vs buffer land, etc.)
• reduction representation: this is currently only a terminator in `scf.parallel` and should be an independent op reusable in other places, e.g. the vector dialect.

My position is that we should be in the prototyping and proof-of-concept phase to build up confidence. There are quite a few caveats to making everything compose properly and have a chance at being future-proof (e.g. for ragged and sparse tensors). Without a PoC implementation and data to back this up, I fear this will get sidetracked into lengthy discussions for which we simply don't know enough to conclude yet.

Sorry to chime in with a question on the semantics of the destructive update from tiled reduction. I came to this question when I was playing with Linalg tiling, so I want to take this opportunity to understand it better. Does the `tensor.insert_slice` operation in @MaheshRavishankar's example have the right semantics for tiling a reduction? For a reduction, inserting the tiled result back into the full tensor seems to me to be missing the accumulation part. I was expecting to see something more like a "tensor.reduce_slice_on_some_specific_reduction_type" that describes the reduction semantics more precisely.

Besides my question above on the semantics of the serialized tiled reduction, I'm also interested in keeping the parallelism of tiled Linalg operations expressed in some way. Besides the approaches listed by @MaheshRavishankar, is expressing the tiled loop in `linalg.generic`'s implicit looping structure something too wild for consideration? I'm new to Linalg, so apologies if my question is too naïve. This helps me understand what loop transformations could be driven by `linalg.generic` ops in the long run. Thanks.

Ah, I see what you mean. I would rephrase it differently though (I don't think we need to bother with one-kernel-vs-multiple-kernel kind of thinking here; we can express this at a high level of semantics):

while this modeling is "correct", it is also overly conservative because it forces applying a "reduction algorithm" and does not carry the fact that the "destructive updates" are entirely independent.

Makes sense?

That said, I still don't quite get how you're working around this, TBH; there is a weird contract with your `scf.for` that I can't wrap my head around. I think the issue is that `iter_args` fundamentally implies a value that flows from one iteration to the next (and the last iteration produces the returned value of the `scf.for`), but by implying that this is parallel we get into a weird situation (which is why the reduction was made explicit in `scf.parallel`).

The slice is first extracted, updated, and then reinserted. This isn't atomic though, and the assumption here is that each iteration gets its own tile and there is no overlap in the updates (as far as I understand).


Thinking a bit more: maybe it could fall into the category of "atomic updates" that make such a "reduction" fully parallel/independent again?

Well, you don't need such semantics (`linalg.tiled_loop` does have this now, FWIW). Leaving aside the parallelism question for now: the destructive update is equivalent to doing a read-modify-write of the tensor within the loop body, i.e. it reads a slice of `%arg0` using `tensor.extract_slice`, computes a new slice to use in its place, inserts it into `%arg0`, and yields the new tensor created. The body of the loop could use a reduction update to compute the new slice to be inserted.
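A hedged sketch of what that could look like for a tiled 1-D sum (the names and the rank-reducing scalar accumulator layout are made up): the body reads the accumulator slice, combines the input tile into it, and re-inserts it, so the `iter_args` chain carries the running reduction:

``````%res = scf.for %iv = %c0 to %ub step %t iter_args(%arg0 = %init) {
%size = affine.min #map1(%iv)[%t, %ub]
%in_tile = tensor.extract_slice %in[%iv] [%size] [1] : tensor<?xf32> to tensor<?xf32>
// Read the current partial result...
%acc = tensor.extract_slice %arg0[0] [1] [1] : tensor<?xf32> to tensor<f32>
// ...modify it with a reduction over the tile...
%new_acc = linalg.generic {
indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>],
iterator_types = ["reduction"]}
ins(%in_tile : tensor<?xf32>) outs(%acc : tensor<f32>) {
^bb0(%a : f32, %b : f32):
%s = arith.addf %a, %b : f32
linalg.yield %s : f32
} -> tensor<f32>
// ...and write it back: a read-modify-write expressed on tensors.
%upd = tensor.insert_slice %new_acc into %arg0[0] [1] [1] : tensor<f32> into tensor<?xf32>
scf.yield %upd : tensor<?xf32>
} -> tensor<?xf32>
``````

Note this loop is inherently sequential (each iteration reads the previous accumulator), which matches the read-modify-write framing above.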

FWIW, if you do tiling on `memref` type, you can generate the tiled loops as `scf.parallel` ops and that is the way to preserve the parallelism today.

Agreed that this is part of the weirdness here. It's a representation issue: when working with SSA values, there doesn't seem to be a way around it. This is a way to side-step the inherent representation issue, with the hope that later on in the compilation, past bufferization, the representation issue is resolved. The challenge is to carry the parallel semantics up to that level.

Yeah, I think I'd agree with that. It is easier to think of it in terms of IREE's representation. The tiled computation in IREE is represented as

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
scf.for %iv = %c to %2 step %t {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = flow.dispatch.tensor.load %input[0] [%4] [1] : !flow.dispatch.tensor<?xf32> -> tensor<?xf32>
%6 = flow.dispatch.tensor.load %init[0] [%4] [1] : !flow.dispatch.tensor<?xf32> -> tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
flow.dispatch.tensor.store %7 into %init[0] [%4] [1] : tensor<?xf32> -> flow.dispatch.tensor<?xf32>
} -> tensor<?xf32>
``````

The `!flow.dispatch.tensor` values are "ABI types", i.e. they allow reads/writes. The `flow.dispatch.tensor.load/store` ops let you read/write them, but the modeling here is that "somehow" these ops extract/insert the requested slice from/into the `flow.dispatch.tensor` value. These are atomic, and any overlap is undefined behavior. The trick is that after bufferization you can just treat the `flow.dispatch.tensor` values as regular `memref`s. In my mind I always take the pure `scf.for` version and reason about it in terms of how it is represented above, because I don't think there is a way to model parallelism at the tensor level apart from what is done in `scf.parallel`, and that has different trade-offs.


For folks tagging along, @MaheshRavishankar agreed to brainstorm this topic in the public meeting tomorrow. No presentation/slides; we'll take it from the code samples and try to play with this!


Following up on this morning, and extending a bit on @albertcohen's mention of "concat is parallel": why don't we embed the "reduction" in the terminator?

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.parallel (%iv) = (%c0) to (%2) step (%t) init(%1) -> tensor<?xf32> {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = tensor.extract_slice %0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%6 = tensor.extract_slice %1[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
scf.reduce_insert_slice %7 at [%iv] [%4] [1] : tensor<?xf32>
}
``````

I made up this `scf.reduce_insert_slice` (maybe it does not belong in scf and we should use another loop than `scf.parallel`, but let's leave that for later): the semantics is that each iteration produces a slice of the output tensor and expresses where to insert it in the result. This is like `scf.reduce`, except that since the reduction is a "concat" it can be fully parallel.
It seems like it addresses your concern, @MaheshRavishankar, in that we don't materialize the full tensor at each iteration (or for each reduction step)?

I think this also closes the loop and brings us back to where we arrived 3 months ago.

Here: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #9 by mehdi_amini
I wrote at the time:

``````...
%tile_range = tensor.tile_range [%j, %i][%c20, %c10][%c1, %c1] : !tensor.tile_range
%out_sub = tensor.extract_slice %out(%tile_range) // only if %out_sub is used as input to the computation!
...
linalg.tiled_yield %transpose_sub at %tile_range : tensor<10x10xf32>
}
``````

And @_sean_silva thought of a variant: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #35 by _sean_silva

``````  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c20][%c1, %c1]

%transpose_sub = ...

linalg.tiled_loop_terminator {
tiled_yield %transpose_sub at [%j, %i][%c20, %c10][%c1, %c1]
}
}
``````

(grouping the "yield"s in a region might help to better represent the possible multiple results of the loop)

I don't know if @pifon2a made progress on their idea about something more flexible/expressive though: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #27 by pifon2a

These points have been indeed discussed and some even prototyped in the past months.

The current design has settled on 1-d control flow + n-d data representation for multiple intersecting reasons: as usual, it's tradeoffs everywhere.

One particular reason we currently do "1-d/n-d" is that (I believe) n-d control flow + an n-d terminator makes it tricky to represent imperfect nestedness in a stable form at this point in the transformation pipeline. In particular, the big problem I see in forcing the control flow and data into the same abstraction (i.e. a containing op and its terminator) is that "the value `n` ends up being the same" for control flow and data.

I have heard a few times something along the lines of: "if you don't want n-d scf.parallel you can just break them down into 1-d scf.parallel". Well, let's unpack this on the simplest imperfectly nested example possible:

``````%t = scf.parallel %i ... {
//
// I want to do something special here (prefetch, pack, compute)
//
%tt = scf.parallel %j ... {
// compute 2-D tile
terminator1?
}
terminator2?
}
``````

What possibilities do we have for terminator1 and terminator2?

# Possibility 1: Symmetric terminators

If terminator1 and terminator2 are "symmetric", both can only insert 1-D slices: `%j` does not dominate `terminator2`. This breaks the ability to perform n-D block insertions and is a non-starter for this body of work (it may be perfectly valid as an abstraction, but we don't want to use it in this instance).

# Possibility 2: Asymmetric terminators

Alternatively, `terminator1` can be the special `reduce` terminator and `terminator2` a simple `scf.yield`. While this could technically work, it is still very brittle and there are numerous implications.

Let's make the example a little more interesting: I claim there is such a thing as fusing 1-D and 2-D loop nests.

``````%t0, %t1 = scf.parallel %i ... {
//
// I actually do something here.
//
// compute 1-D tile
%t = some_pointwise_op_1d()

%tt = scf.parallel %j ... {
// compute 2-D tile
fancy_terminator?
}

terminator2?
}
``````

What should terminator2 look like? I am not sure.
How does this generalize to more results and different types? I'm afraid to ask.
What are the implications for analysis and transformations? Oh my…

# Possibility 3: Reduce is not a terminator

Separately, we have been discussing with @ftynse and @dcaballe the fact that a generic `reduce` is needed to capture more advanced combinators. This would be generally useful everywhere if it were decoupled from the semantics of an enclosing op. In particular, `reduce` on vectors is something we consider needed to increase the applicability of generic n-D vectorization while keeping it short.

``````%t0, %t1 = scf.parallel %i ... {
%t = some_pointwise_op_1d()

%tt = scf.parallel %j ... {
// compute 2-D tile
%ttt = reduce (%small_tile, %big_tile) @some_init {some_body} -> type(%big_tile)
scf.yield %ttt: %big_tile
}

scf.yield %tt, %t
}
``````

I believe something like this was previously proposed, but it had some caveats re `@some_init`, which was represented as a function and required all passes to become module passes; this should be easily fixable.

# The real problems

Given the multiple interactions I've had over the past few months, I'd venture that this is just a symptom of some difficult tradeoffs, which I will list here.

## IR design vs analysis for transformations

`linalg.generic` is perfectly nested by design. This is a conscious and carefully weighed tradeoff between expressiveness and the 3 harbingers of transformations (legality, applicability and profitability). I won't repeat the points of the Rationale, but you'll note there is a section on closedness under transformations.

Once tiling and fusion are performed, we are in imperfectly-nested-land on the tiled loops, but still perfectly nested within the `linalg.generic` inside the tiled loops. At this point there is no escaping our imperfect-nestedness condition. The positive is that the transformations usually needed on tiled loops are a much simpler bunch. Still, coming back to the imperfect-nestedness:

1. we can push forward on the n-D abstractions with implicit load/store, in which case we should do the hard work of introducing `tensor<... x tensor< ...>>` and codegen with it as a first-class citizen. This is also very desirable as our way into richer semantics that can easily be codegen'ed (ragged, jagged, sparse, trees, databases, etc.)

2. we can use traditional loops and write the analyses on these explicit subset load/store and gradually introduce better abstractions and interfaces to simplify and generalize. However much I dislike unnecessary complexity, we do have to bite the bullet and do some analysis. Luckily we are in SSA-land and a lot of this is implemented and functional in the sequential case. This is significantly simpler than memory-based analysis and extends to more `subset/subset_insert` operation pairs and `data_types`. Trying to cut corners here is a very bad idea.

(spoiler: we should do both).

`scf.parallel`, `linalg.tiled_loop` and "smart" terminators simply are not up to the job we need them for: they sit at an abstraction level that is half-explicit/half-implicit, and that weird symmetry does not compose or generalize well in an imperfectly nested world on tensors.

## Big tile, small tile, no tile

You will note here that we seem to have given up on "yielding the small tile" and that the various terminators proposed "yield the big tile".

It seems this is a trigger point for people, but it is the result of applying the first principles outlined above related to implicit vs explicit load/store (and, more importantly, of implementing them in practice and getting a fully functional system up and running).

This is also fundamentally what triggers the race condition if we try to materialize the parallelism in the IR.

In other words, the problem comes from trying to write something like this on tensors:

``````// Boo races here, races there, races everywhere.
%t = looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
...
whatever_terminator %foo : full_tensor
} -> full_tensor
``````

The simple thing to remark is that one should not try to do that, and should instead carry enough information in the IR to materialize parallelism after bufferization:

``````looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
// yay side-effecting computations on buffers
}
``````

You will note that this is exactly what `scf.parallel` and `linalg.tiled_loop` do but while additionally introducing other undesirable things mentioned above.

You will also note that IREE has decided to introduce side effects early, with a mixed model of "buffers in the ether accessed by name" at the boundary and tensors inside. This also comes with its many implications and tradeoffs, which are really out of scope here.

Bottom line: while we should also push for `tensor<... x tensor< ...>>`, we also need something like:

``````%t = looping_control_flow_with_parallel_semantics %i = %lb, %ub, %step
{attributes_and_operands_to_specify_how_loop_is_distributed}
{
...
scf.yield %foo
} -> full_tensor
``````

While we should have an `scf.reduce`, it is important IMO that the terminator not do any magic, given the first principles above.

At which point the jump to `scf.for` seems pretty natural to me:

``````%t = scf.for %i = %lb, %ub, %step
{attributes_and_operands_to_specify_how_loop_is_distributed}
{
...
scf.yield %foo
} -> full_tensor
``````

We could put lipstick on it and call it something else, fine.
I tend to want to avoid duplicating a lot of the canonicalization and folding transformations on `scf.for`, but, as often, maybe this requires new interfaces.

## The real fundamental problem

Instead of going down another deep path and starting to talk about transformations, I'll try to step back and remark that the fundamental problem is much simpler and has not been addressed IMO.

I claim we do not have a good representation for a simple async construct with regions that operates solely on values; for instance, some notional:

``````%0 = struct.undef : struct<i32, i32>

%r = parallel_region %thread1 (%0) {
%i = constant 0: i32
%1 = struct.insertvalue %i, %0[0]: struct<i32, i32>
yield %1: struct<i32, i32>
%i = constant 1: i32
%1 = struct.insertvalue %i, %0[1]: struct<i32, i32>
yield %1: struct<i32, i32>
} : struct<i32, i32>
``````

Going through memory to represent this is the only thing that comes to my feeble mind.

Do people know how to express this in SSA-land without going through side effects?
If we have good ways, I'd expect we'd want to generalize them to work with looping constructs.

# Epilogue

In the end, I'd love to see prototypes that run end-to-end for the particular solutions proposed, and how they handle imperfect-nestedness: it is easy to overlook "unknown unknowns" that only surface when e2e prototypes are built.

Thanks for reading this wall of text!

Being able to accumulate into a single SSA value in the terminator captures the information we need without breaking SSA and without involving implicit copies/materialization of the full tensor.
A remaining concern would be that you generally have more than one value to accumulate in a loop… a single terminator does not suit that purpose.

Quick heads-up: we had a deep dive with Adam Paszke on this and related topics, and it is likely that this line of work has reached the point where we are ready to concretely engage with our functional friends.

In the next few days we expect to share a proposal and start some prototyping.
This should help move forward on multiple fronts that I am very excited about.

Stay tuned!

Great news, Nicolas.
Adam's Dex work does have a bunch of good ideas about representing iterators over arrays or tensors, which could be helpful here.

OK, I'll wait for the proposal, but I also want to re-iterate some requirements that would be needed for this to be usable in IREE (of course that is not a blocking requirement, but if these aren't satisfied, we just won't be able to use it). Without going into details, the main practical requirements are

1. The modeling should be usable at different stages of the compilation. For example, the same construct should be able to represent distribution to blocks (when targeting CUDA, say) and distribution to threads. In other words, it should not force an end-to-end solution to reason about both the "host" side and the "device" side to be workable. This is vague, so I am happy to take a look at the proposal and provide feedback in more concrete terms.
1. The distributed loop should allow us to analyze the full iteration space without complex pattern matching (or heavyweight machinery).

This might be moot now, but one thing I was thinking about after yesterday's ODM is that `scf.parallel` is actually not a `map` operation (it might be more of a `map-reduce`). In a map operation you take a lambda and apply it to data; the shape of the data determines the shape of the result. The analogous representation here would be that, even without a reduction, the region of `scf.parallel` yields a value.

``````%3 = scf.parallel (%iv) = .... {
scf.yield %t : tensor<?xf32>
} -> tensor<?x?xf32>
``````

i.e. if the `yield` is a tensor of `M` dimensions, and the `scf.parallel` has `K` induction variables, then the result has type `tensor<MxKxf32>`: basically a concat of all the results (like Albert suggested). Then an extra `reduce` region informs how to collapse the extra dimensions of the result.
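Spelled out with invented syntax on the 1-D example, the map-like form could look like this; without the `reduce` region, the result would keep the extra induction-variable dimension:

``````// Each iteration yields its tile; the loop result is conceptually the
// concat of all yielded tiles along an extra dimension.
%3 = scf.parallel (%iv) = (%c0) to (%n) step (%c1) {
%tile = ... : tensor<4xf32>
scf.yield %tile : tensor<4xf32>
} reduce {
// Optional region saying how to collapse the concat dimension,
// e.g. an elementwise add for a sum reduction.
^bb0(%lhs : tensor<4xf32>, %rhs : tensor<4xf32>):
%s = ... : tensor<4xf32>
scf.yield %s : tensor<4xf32>
} -> tensor<4xf32>
``````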

Just wanted to upvote this. This would solve a lot of issues, and is what I was going for with "Solution 1" all the way back in the first post.

Looking forward to getting input from folks with more "functional programming" background! This is exciting.

What is the type of %foo? Is this the full tensor?

My problem here is that I don't understand how to avoid this magic. In your example of "yielding the full tensor" in an `scf.for`, what is the terminator doing, and how do we get from "each iteration yields a full tensor" to "the loop produces a single tensor"?
"Something" has to merge/concat the tensors produced by each iteration, and papering over this seems exactly like "there is a magic terminator" (or "magic in between the terminator and the op").

Right, this is a variant on the terminator/loop contract to produce the final value. Instead of providing the location for the slice, it becomes implicit: it essentially makes it a `concat`. Looks more like what Maksim (I think?) was mentioning during the meeting (they mentioned returning `list[tensor]`, which I think looks like what you're doing?)

I suspect the limitation here is that because you can't control the exact location / n-D insertion of the slice, and you're limited to a simple `concat`, you may need non-trivial data reshuffling after the loop.

Edit: after talking offline to Mahesh, this can be seen as a question of "fusion". You may either have the shuffle fused into the loop (through the terminator I showed above, for example) or have it as a separate step after the fact.
It will be interesting to see how the unfused version plays with bufferization and later codegen stages if we go with it!

@mehdi_amini I am trying to formulate everything in terms of subset operations, so that there is an "evolved" version of `tiled_loop` that has two regions. One region yields subsets of tensors/memrefs, and the other contains computations on the tensor/memref types of these subsets. That way the second region actually yields a subset/tile instead of the full tensor.

2D row reduction:

``````%sum = st.loop (%i, %j) = (%c0, %c0) to (%c80, %c60) step (%c4, %c4)
ins (%in_ = %in: tensor<80x60xf32>, %cst_ = %cst: f32)
outs (%out_ = %out: tensor<80xf32>)
iterators["parallel", "sequential"]
subsets {
%in_sub = subset.tile %in_[%i, %j] [4, 4] [1, 1]
: tensor<80x60xf32> to !st.tile<4x4xf32>
%out_sub = subset.tile %out_[%i] [4] [1]
: tensor<80xf32> to !st.tile<4xf32>
subset.yield ins (%in_sub: !st.tile<4x4xf32>) outs (%out_sub: !st.tile<4xf32>)
}
compute ins (%in_sub: tensor<4x4xf32>) outs(%out_sub: tensor<4xf32>) {
%reduction = linalg.generic {
indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
affine_map<(d0, d1) -> (d0)>],
iterator_types = ["parallel", "reduction"]}
ins(%in_sub : tensor<4x4xf32>)
outs(%out_sub : tensor<4xf32>) {
^bb0(%a: f32, %b: f32):
%0 = arith.addf %a, %b : f32
linalg.yield %0 : f32
} -> tensor<4xf32>
st.yield %reduction : tensor<4xf32>
}
``````

1D cwise

``````st.loop (%i) = (%c0) to (%dim) step (%c10)
ins (%in_ = %in: tensor<100xf32>)
outs (%out_ = %out: tensor<100xf32>)
subsets {
%in_sub = subset.tile %in[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
%out_sub = subset.tile %out[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
subset.yield ins (%in_sub: !st.tile<10xf32>) outs (%out_sub: !st.tile<10xf32>)
}
compute ins (%in_sub: tensor<10xf32>) outs (%out_sub: tensor<10xf32>) {
%abs = linalg.generic {
indexing_maps =  [#identity_1D, #identity_1D],
iterator_types = ["parallel"]}
ins(%in_sub: tensor<10xf32>)
outs(%out_sub: tensor<10xf32>) {
^bb(%a: f32, %b: f32) :
%0 = math.abs %a : f32
linalg.yield %0 : f32
} -> tensor<10xf32>
st.yield %abs : tensor<10xf32>
}
}
``````

1D cwise bufferized

``````%dim = memref.dim %in, %c0: memref<100xf32>
st.loop (%i) = (%c0) to (%dim) step (%c10)
ins (%in_ = %in: memref<100xf32>)
outs (%out_ = %out: memref<100xf32>)
subsets {
%in_sub = subset.tile %in[%i][10][1] : memref<100xf32> to !st.tile<10xf32>
%out_sub = subset.tile %out[%i][10][1] : memref<100xf32> to !st.tile<10xf32>
subset.yield ins (%in_sub: !st.tile<10xf32>) outs (%out_sub: !st.tile<10xf32>)
}
compute ins (%in_sub: memref<10xf32, #map>) outs (%out_sub: memref<10xf32, #map>) {
%abs = linalg.generic {
indexing_maps =  [#identity_1D, #identity_1D],
iterator_types = ["parallel"]}
ins(%in_sub: memref<10xf32, #map>)
outs(%out_sub: memref<10xf32, #map>) {
^bb(%a: f32, %b: f32) :
%0 = math.abs %a : f32
linalg.yield %0 : f32
} -> memref<10xf32>
st.yield %abs : memref<10xf32>
}
}
``````

In that case the fusion into `st.loop` becomes mostly about constructing a composition of subsets. Subset types can be extended, etc.

1 Like