# Representing tiling on tensors + parallelism

Tiling of Linalg ops on tensors uses `scf.for` with destructive updates to represent the tiled computation. For example, a simple 1D operation

``````#map = affine_map<(d0) -> (d0)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = linalg.generic {
indexing_maps = [#map, #map], iterator_types = ["parallel"]}
ins(%0 : tensor<?xf32>) outs(%1 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
``````

when tiled using tile size `%t`, is represented as follows:

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.for %iv = %c to %2 step %t iter_args(%arg0 = %1) {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = tensor.extract_slice %0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%6 = tensor.extract_slice %arg0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
%8 = tensor.insert_slice %7 into %arg0[%iv] [%4] [1] : tensor<?xf32> into tensor<?xf32>
scf.yield %8 : tensor<?xf32>
} -> tensor<?xf32>
``````

The `tensor.extract_slice %arg0` and the `tensor.insert_slice %... into %arg0` represent the destructive update pattern, i.e. each iteration of the loop computes a tile of the result, inserts it into the current value of the result tensor (`%arg0`), and yields the entire tensor. As represented, this computation loses the information that the `scf.for` is actually parallel. The serialization is not inherent to the computation; it is purely due to the inability to represent parallel computation using `scf.for` with `tensor`-typed operands and results.

The problem becomes even more pronounced with distribution. Currently in Linalg, distribution is part of the tiling transformation, i.e. for a loop that is parallel, block-cyclic distribution is used during tiling to distribute work across threads on a grid. For example, tile + distribute of the original op results in the following IR.

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
#map2 = affine_map<()[s0, s1] -> (s0 * s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%id = "proc_id"() : index
%np = "nprocs"() : index
%offset = affine.apply #map2()[%id, %t]
%step = affine.apply #map2()[%np, %t]
%3 = scf.for %iv = %offset to %2 step %step iter_args(%arg0 = %1) {
....
} -> tensor<?xf32>
``````

This leads to a whole host of problems. The main one is that when all shapes are statically known, all tile sizes are statically known, and the tile size divides the problem size, the code within the tiled loop can be canonicalized to use static shapes as well, which then vectorizes easily (without requiring peeling/padding, etc., given the constraints). Unfortunately, materializing the distributed loop makes canonicalizing away the `affine.min` operation very involved, and without this the sizes aren't propagated as static shapes.
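For illustration, here is a hedged sketch (the static sizes 128 and 4 are made up) of what the undistributed tiled loop can canonicalize to when the shapes are static and the tile size divides the problem size; the `affine.min` folds to the constant tile size, and all the slices get static shapes:

``````// min(4, 128 - %iv) == 4 for every %iv in {0, 4, ..., 124}, so the
// affine.min folds away and all slices have static shape tensor<4xf32>.
%1 = linalg.init_tensor [128] : tensor<128xf32>
%2 = scf.for %iv = %c0 to %c128 step %c4 iter_args(%arg0 = %1) {
%3 = tensor.extract_slice %0[%iv] [4] [1] : tensor<128xf32> to tensor<4xf32>
%4 = tensor.extract_slice %arg0[%iv] [4] [1] : tensor<128xf32> to tensor<4xf32>
%5 = linalg.generic {...} ins(%3 : tensor<4xf32>) outs(%4 : tensor<4xf32>) -> tensor<4xf32>
%6 = tensor.insert_slice %5 into %arg0[%iv] [4] [1] : tensor<4xf32> into tensor<128xf32>
scf.yield %6 : tensor<128xf32>
} -> tensor<128xf32>
``````

Once the loop is distributed, the lower bound and step become `%id * %t` and `%np * %t`, so proving `min(%t, %ub - %iv) == %t` requires reasoning about `%np`, which is what makes the canonicalization involved.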

There are many potential solutions, each with its own trade-offs. I am listing them here in order of my preference, but I am really looking for an end-to-end solution (that we can use in IREE effectively). I have a proposal below that is the cleanest IMO, but I am not tied to it; there are other potential solutions as well.

### Solution 1: Extend `scf.for` operations.

The distribution that is done during tiling is a really simple transformation: it is just a change to the lower-bound and step values using the SSA values for the number of processors and the processor ID. There is no need to actually materialize the distributed loop. Instead, one could change the `scf.for` semantics to take these SSA values as two additional operands, so the distributed loop above would be represented as

``````%3 = scf.for %iv = %c0 to %ub step %c1 iter_args(...) distributed (id = %id , np = %np) {
...
} -> tensor<?x?xf32>
``````

Here the `scf.for` accepts two additional optional operands `%id` and `%np` to specify the processor ID and the number of processors. The presence of these operands also indicates that the loop is parallel and distributable. This allows easy analysis of the loop while maintaining the semantics that the loop is distributable. For example, for CUDA codegen in IREE, a pretty heavy analysis is done to figure out whether a loop can be removed because it is a zero-trip or one-trip loop. A lot of this complication arises from the loop being distributed already. The above form of the loop makes this analysis fairly trivial. The actual distribution can happen pretty late, either just before lowering to the LLVM dialect (or the SPIR-V dialect).
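To make the late materialization concrete, here is a rough sketch (made up, mirroring the distributed IR earlier in the post) of what lowering the extended form could produce just before LLVM/SPIR-V conversion:

``````// Input: parallel loop carrying the distribution operands.
%3 = scf.for %iv = %c0 to %ub step %t iter_args(%arg0 = %1)
distributed (id = %id, np = %np) { ... } -> tensor<?xf32>

// Output: block-cyclic distribution folded into the bounds and step.
%offset = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%id, %t]
%stride = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%np, %t]
%3 = scf.for %iv = %offset to %ub step %stride iter_args(%arg0 = %1) { ... } -> tensor<?xf32>
``````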

### Solution 2: Making `scf.parallel` work on tensors.

AFAIK `scf.parallel` only works on `memref` types. It could be extended to work on tensors while still maintaining the parallel semantics. The downside I see is that it does not have a way to carry the SSA values to use for distribution as represented above. The semantics of `scf.parallel` could also be extended to use the `id` and `np`, but I am less convinced of the actual value of `scf.parallel` over `scf.for` if it is just to say that some loops are parallel; the same could be achieved with an attribute on `scf.for`, for example.

### Solution 3: `linalg.tiled_loop`

For the most part `linalg.tiled_loop` already has some of the requirements I am looking for. The way distribution is represented needs to be tweaked: instead of using attributes to represent how it is distributed (which is very CUDA-specific and an abstraction leak IMO), it should just use optional SSA values. Having followed the `linalg.tiled_loop` evolution and after discussions with a few folks, I am still not convinced that `linalg.tiled_loop` pays for its abstraction. I think effective extensions to `scf.for` give us everything we need to represent parallel semantics.

Looking for comments about how to proceed here. Happy to take up any tasks that help toward a better representation. I really hope we can go with the `scf.for` extensions; from experience using these in IREE, that would be the easiest to make work e2e.
cc @nicolasvasilache @albertcohen @ftynse @herhut


Thanks for the nice write-up @MaheshRavishankar!
Quick question: `scf.parallel` supports reductions as well; isn't your "destructive update" just something that can fit that model? You wrote that `scf.parallel` only works on `memref` types, but it isn't clear to me where this requirement comes from.
Isn't it possible to write something that looks roughly like:

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.parallel (%iv) = (%c) to (%2) step (%t) init(%1) -> tensor<?xf32> {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = tensor.extract_slice %0[0] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%6 = tensor.extract_slice %1[0] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
scf.reduce(%7) : tensor<?xf32> {
^bb0(%lhs : tensor<?xf32>, %rhs: tensor<?xf32>):
%8 = tensor.insert_slice %lhs into %rhs[0] [%4] [1] : tensor<?xf32> into tensor<?xf32>
scf.yield %8 : tensor<?xf32>
}
} -> tensor<?xf32>
``````

A large part of the `scf.parallel` design was about how we'd structure the "reduction" aspect. I didn't quite get how this would be handled in your proposal; you just wrote that the "presence of these operands also indicates that the loop is parallel and distributable", but this does not address how the result of such a loop is computed.

I must be mistaken here. It seems like this was a restriction a while ago, but it is no longer the case.

Your representation of the destructive update using `scf.parallel` is correct. Looking at this, I think it is also possible to teach the in-place bufferization that is central to Linalg to recognize the destructive update pattern in a more structured way (I can expand on that later, or maybe I am totally off; I need to check with Nicolas on that).
But I have a broader concern with modeling reductions this way. This might be based on how things work in IREE, specifically for GPU codegen, but this representation works only if you are generating code for both the host and the device, i.e. this `scf.parallel` is lowered in a context where everything downstream of this op generates both host-side and device-side code. The main reason is that you could need multiple kernels to implement the reduction. This makes `scf.parallel` restrictive in its use (it can't be used in IREE, for example).

In the approach I was describing, there is no context in the `scf.for` about what is being lowered; it just indicates that the loop is distributed. The fact that the body of the loop is performing one stage of, say, a tree-based reduction is then just represented in the region of the loop, i.e. whatever is generating this loop generates multiple such "distributed" `scf.for` operations to implement the tree-based reduction, or whatever other computation it needs. Essentially the `scf.for` is very straightforward in what it represents (which makes it easier to target).
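As a sketch of that structure (everything here is invented, following the extended `scf.for` form from the first post), a two-stage tree-based reduction could be two independent distributed loops:

``````// Stage 1: each of the %np workers reduces its own chunk into its
// slot of a tensor of partial results.
%partials = scf.for %iv = %c0 to %ub step %t iter_args(%arg0 = %init0)
distributed (id = %id, np = %np) {
... // reduce one tile, tensor.insert_slice the partial at slot %id
} -> tensor<?xf32>
// Stage 2: a second distributed loop combines the partial results.
// Neither scf.for needs to know about the overall reduction algorithm.
%result = scf.for %iv = %c0 to %np step %c1 iter_args(%arg1 = %init1)
distributed (id = %id, np = %np) {
...
} -> tensor<?xf32>
``````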

Yes, all these points have been discussed at length previously; thanks for bringing them together in a single post.

As you know, I'm interested in exploring extending the `scf.for` semantics, so having it prototyped and connected end-to-end would go a long way towards understanding all the pieces better.

This would also help answer Mehdi's comments and other questions related to:

• multi- vs single-loop representation and tradeoffs (there are quite a few related to canonicalization, writing transformations, parallel semantics in tensor vs buffer land, etc.)
• reduction representation: this is currently only a terminator in `scf.parallel` and should be an independent op reusable in other places, e.g. the vector dialect.

My position is that we should be in the prototyping and proof-of-concept phase to build up confidence. There are quite a few caveats to making everything compose properly and have a chance at being future-proof (e.g. for ragged and sparse tensors). Without a PoC implementation and data to back this up, I fear this will get sidetracked into lengthy discussions for which we simply don't know enough to conclude yet.

Sorry to chime in with a question on the semantics of the destructive update from tiled reduction. I came to this question when I was playing with Linalg tiling, so I want to take this opportunity to understand it better. Does the `tensor.insert_slice` operation in @MaheshRavishankar's example have the right semantics for tiling a reduction? For a reduction, inserting the tiled result back into the full tensor seems to me to be missing the accumulation part. I was expecting to see something more like a "tensor.reduce_slice_on_some_specific_reduction_type" that describes the reduction semantics more precisely.

Besides my question above on the semantics of the serialized tiled reduction, I'm also interested in keeping the parallelism of tiled Linalg operations expressed in some way. Besides the approaches listed by @MaheshRavishankar, is expressing the tiled loop in `linalg.generic`'s implicit looping structure something too wild for consideration? I'm new to Linalg, so apologies if my question is too naïve. This helps me understand what loop transformations could be driven by `linalg.generic` ops in the long run. Thanks.

Ah, I see what you mean. I would rephrase it differently though (I don't think we need to bother with one-kernel-vs-multiple-kernel kind of thinking here; we can express this at a high level of semantics):

while this modeling is "correct", it is also overly conservative because it forces applying a "reduction algorithm" and does not carry the fact that the "destructive updates" are entirely independent.

Makes sense?

That said, I still don't quite get how you're working around this, TBH; there is a weird contract with your `scf.for` that I can't wrap my head around. I think the issue is that `iter_args` fundamentally implies a value that flows from one iteration to the next (and the last iteration produces the returned value of the `scf.for`), but by implying that this is parallel we get into a weird situation (which is why the reduction was made explicit in `scf.parallel`).

The slice is first extracted, updated, and then reinserted. This isn't atomic though, and the assumption here is that each iteration gets its own tile and there is no overlap in the updates (as far as I understand).


Thinking a bit more: maybe it could fall into the category of "atomic updates" that make such a "reduction" fully parallel/independent again?

Well, you don't need such semantics (`linalg.tiled_loop` does have this now, FWIW). Leaving aside the parallelism question for now: the destructive update is equivalent to doing a read-modify-write of the tensor within the loop body, i.e. it reads a slice of `%arg0` using `tensor.extract_slice`, computes a new slice to use in its place, inserts it into `%arg0`, and yields the new tensor created. The body of the loop could use a reduction update to compute the new slice to be inserted.
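A hedged sketch of what that could look like for a tiled 1-D sum (the names and the rank-reducing scalar accumulator layout are made up): the body reads the accumulator slice, combines the input tile into it, and re-inserts it, so the `iter_args` chain carries the running reduction:

``````%res = scf.for %iv = %c0 to %ub step %t iter_args(%arg0 = %init) {
%size = affine.min #map1(%iv)[%t, %ub]
%in_tile = tensor.extract_slice %in[%iv] [%size] [1] : tensor<?xf32> to tensor<?xf32>
// Read the current partial result...
%acc = tensor.extract_slice %arg0[0] [1] [1] : tensor<?xf32> to tensor<f32>
// ...modify it with a reduction over the tile...
%new_acc = linalg.generic {
indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>],
iterator_types = ["reduction"]}
ins(%in_tile : tensor<?xf32>) outs(%acc : tensor<f32>) {
^bb0(%a : f32, %b : f32):
%s = arith.addf %a, %b : f32
linalg.yield %s : f32
} -> tensor<f32>
// ...and write it back: a read-modify-write expressed on tensors.
%upd = tensor.insert_slice %new_acc into %arg0[0] [1] [1] : tensor<f32> into tensor<?xf32>
scf.yield %upd : tensor<?xf32>
} -> tensor<?xf32>
``````

Note this loop is inherently sequential (each iteration reads the previous accumulator), which matches the read-modify-write framing above.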

FWIW, if you do tiling on `memref` type, you can generate the tiled loops as `scf.parallel` ops and that is the way to preserve the parallelism today.

Agreed that this is part of the weirdness here. It's a representation issue: when working with SSA values, there doesn't seem to be a way around it. This is a way to side-step the inherent representation issue, with the hope that later on in the compilation, past bufferization, the representation issue is resolved. The challenge is to carry the parallel semantics up to that level.

Yeah, I think I'd agree with that. It is easier to think of it in terms of IREE's representation. The tiled computation in IREE is represented as

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
scf.for %iv = %c to %2 step %t {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = flow.dispatch.tensor.load %input[0] [%4] [1] : !flow.dispatch.tensor<?xf32> -> tensor<?xf32>
%6 = flow.dispatch.tensor.load %init[0] [%4] [1] : !flow.dispatch.tensor<?xf32> -> tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
flow.dispatch.tensor.store %7 into %init[0] [%4] [1] : tensor<?xf32> -> flow.dispatch.tensor<?xf32>
} -> tensor<?xf32>
``````

The `!flow.dispatch.tensor` values are "ABI types", i.e. they allow reads/writes. The `flow.dispatch.tensor.load/store` ops let you read/write them, but the modeling here is that "somehow" these ops extract/insert the requested slice from/into the `flow.dispatch.tensor` value. These are atomic, and any overlap is undefined behavior. The trick is that after bufferization you can just treat the `flow.dispatch.tensor` values as regular `memref`s. In my mind I always take the pure `scf.for` version and reason about it in terms of how it is represented above, because I don't think there is a way to model parallelism at the tensor level apart from what is done in `scf.parallel`, and that has different trade-offs.


For folks tagging along, @MaheshRavishankar agreed to brainstorm this topic in the public meeting tomorrow. No presentation/slides; we'll take it from the code samples and try to play with this!


Following up on this morning, and extending a bit on @albertcohen's mention of "concat is parallel": why don't we embed the "reduction" in the terminator?

``````#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.parallel (%iv) = (%c0) to (%2) step (%t) init(%1) -> tensor<?xf32> {
%4 = affine_min #map1(%iv)[%t, %2]
%5 = tensor.extract_slice %0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%6 = tensor.extract_slice %1[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
%7 = linalg.generic {
indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) :
^bb0(%arg0 : f32, %arg1 : f32) {
%3 = sqrt %arg0 : f32
linalg.yield %3 : f32
} -> tensor<?xf32>
scf.reduce_insert_slice %7 at [%iv] [%4] [1] : tensor<?xf32>
}
``````

I made up this `scf.reduce_insert_slice` (maybe it does not belong in scf and we should use another loop than `scf.parallel`, but let's leave that for later): the semantics is that each iteration produces a slice of the output tensor and expresses where to insert it in the result. This is like `scf.reduce`, except that since the reduction is a "concat" it can be fully parallel.
It seems like it addresses your concern, @MaheshRavishankar, in that we don't materialize the full tensor at each iteration (or for each reduction step)?

I think this also closes the loop and brings us back to where we arrived 3 months ago.

Here: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #9 by mehdi_amini
I wrote at the time:

``````...
%tile_range = tensor.tile_range [%j, %i][%c20, %c10][%c1, %c1] : !tensor.tile_range
%out_sub = tensor.extract_slice %out(%tile_range) // only if %out_sub is used as input to the computation!
...
linalg.tiled_yield %transpose_sub at %tile_range : tensor<10x10xf32>
}
``````

And @_sean_silva thought of a variant: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #35 by _sean_silva

``````  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c20][%c1, %c1]

%transpose_sub = ...

linalg.tiled_loop_terminator {
tiled_yield %transpose_sub at [%j, %i][%c20, %c10][%c1, %c1]
}
}
``````

(grouping the "yield"s in a region might help to better represent the possible multiple results of the loop)

I don't know if @pifon2a made progress on their idea about something more flexible/expressive though: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #27 by pifon2a

These points have been indeed discussed and some even prototyped in the past months.

The current design has settled on 1-d control flow + n-d data representation for multiple intersecting reasons: as usual, it's tradeoffs everywhere.

One particular reason we currently do "1-d/n-d" is that (I believe) n-d control flow + an n-d terminator makes it tricky to represent imperfect nestedness in a stable form at this point in the transformation pipeline. In particular, the big problem I see in forcing the control flow and data into the same abstraction (i.e. a containing op and its terminator) is that "the value `n` ends up being the same" for control flow and data.

I have heard a few times something along the lines of: "if you don't want n-d scf.parallel you can just break them down into 1-d scf.parallel". Well, let's unpack this on the simplest imperfectly nested example possible:

``````%t = scf.parallel %i ... {
//
// I want to do something special here (prefetch, pack, compute)
//
%tt = scf.parallel %j ... {
// compute 2-D tile
terminator1?
}
terminator2?
}
``````

What possibilities do we have for terminator1 and terminator2?

# Possibility 1: Symmetric terminators

If terminator1 and terminator2 are "symmetric", both can only insert 1-D slices: `%j` does not dominate `terminator2`. This breaks the ability to perform n-D block insertions and is a non-starter for this body of work (it may be perfectly valid as an abstraction, but we don't want to use it in this instance).

# Possibility 2: Asymmetric terminators

Alternatively, `terminator1` can be the special `reduce` terminator and `terminator2` a simple `scf.yield`. While this could technically work, it is still very brittle and there are numerous implications.

Let's make the example a little more interesting: I claim there is such a thing as fusing 1-D and 2-D loop nests.

``````%t0, %t1 = scf.parallel %i ... {
//
// I actually do something here.
//
// compute 1-D tile
%t = some_pointwise_op_1d()

%tt = scf.parallel %j ... {
// compute 2-D tile
fancy_terminator?
}

terminator2?
}
``````

What should terminator2 look like? I am not sure.
How does this generalize to more results and different types? I'm afraid to ask.
What are the implications for analysis and transformations? Oh my…

# Possibility 3: Reduce is not a terminator

Separately, we have been discussing with @ftynse and @dcaballe the fact that a generic `reduce` is needed to capture more advanced combinators. This would be generally useful everywhere if it were decoupled from the semantics of an enclosing op. In particular, `reduce` on vectors is something we consider needed to increase the applicability of generic n-D vectorization while keeping it short.

``````%t0, %t1 = scf.parallel %i ... {
%t = some_pointwise_op_1d()

%tt = scf.parallel %j ... {
// compute 2-D tile
%ttt = reduce (%small_tile, %big_tile) @some_init {some_body} -> type(%big_tile)
scf.yield %ttt: %big_tile
}

scf.yield %tt, %t
}
``````

I believe something like this was previously proposed, but it had some caveats re `@some_init`, which was represented as a function and required all passes to become module passes; this should be easily fixable.

# The real problems

Given the multiple interactions I've had over the past few months, I'd venture that this is just a symptom of some difficult tradeoffs, which I will list here.

## IR design vs analysis for transformations

`linalg.generic` is perfectly nested by design. This is a conscious and carefully weighed tradeoff between expressiveness and the 3 harbingers of transformations (legality, applicability and profitability). I won't repeat the points of the Rationale, but you'll note there is a section on closedness under transformations.

Once tiling and fusion are performed, we are in imperfectly-nested-land on the tiled loops, but still perfectly nested within the `linalg.generic` inside the tiled loops. At this point there is no escaping our imperfect-nestedness condition. The positive is that the transformations usually needed on tiled loops are a much simpler bunch. Still, coming back to the imperfect-nestedness:

1. we can push forward on the n-D abstractions with implicit load/store, in which case we should do the hard work of introducing `tensor<... x tensor< ...>>` and codegen with it as a first-class citizen. This is also very desirable as our way into richer semantics that can easily be codegen'ed (ragged, jagged, sparse, trees, databases, etc.)

2. we can use traditional loops and write the analyses on these explicit subset load/store and gradually introduce better abstractions and interfaces to simplify and generalize. However much I dislike unnecessary complexity, we do have to bite the bullet and do some analysis. Luckily we are in SSA-land and a lot of this is implemented and functional in the sequential case. This is significantly simpler than memory-based analysis and extends to more `subset/subset_insert` operation pairs and `data_types`. Trying to cut corners here is a very bad idea.

(spoiler: we should do both).

`scf.parallel`, `linalg.tiled_loop` and "smart" terminators simply are not up to the job we need them for: they sit at an abstraction level that is half-explicit/half-implicit, and that weird symmetry does not compose or generalize well in an imperfectly nested world on tensors.

## Big tile, small tile, no tile

You will note here that we seem to have given up on "yielding the small tile" and that the various terminators proposed "yield the big tile".

It seems this is a trigger point for people, but it is the result of applying the first principles outlined above related to implicit vs explicit load/store (and, more importantly, of implementing them in practice and getting a fully functional system up and running).

This is also fundamentally what triggers the race condition if we try to materialize the parallelism in the IR.

In other words, the problem comes from trying to write something like this on tensors:

``````// Boo races here, races there, races everywhere.
%t = looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
...
whatever_terminator %foo : full_tensor
} -> full_tensor
``````

The simple thing to remark is that one should not try to do that, and should instead carry enough information in the IR to materialize parallelism after bufferization:

``````looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
// yay side-effecting computations on buffers
}
``````

You will note that this is exactly what `scf.parallel` and `linalg.tiled_loop` do but while additionally introducing other undesirable things mentioned above.

You will also note that IREE has decided to introduce side effects early, with a mixed model of "buffers in the ether accessed by name" at the boundary and tensors inside. This also comes with its many implications and tradeoffs, which are really out of scope here.

Bottom line: while we should also push for `tensor<... x tensor< ...>>`, we also need something like:

``````%t = looping_control_flow_with_parallel_semantics %i = %lb, %ub, %step
{attributes_and_operands_to_specify_how_loop_is_distributed}
{
...
scf.yield %foo
} -> full_tensor
``````

While we should have an `scf.reduce`, it is important IMO that the terminator not do any magic, given the first principles above.

At which point the jump to `scf.for` seems pretty natural to me:

``````%t = scf.for %i = %lb, %ub, %step
{attributes_and_operands_to_specify_how_loop_is_distributed}
{
...
scf.yield %foo
} -> full_tensor
``````

We could put lipstick on it and call it something else, fine.
I tend to want to avoid duplicating a lot of the canonicalization and folding transformations on `scf.for`, but, as often, maybe this requires new interfaces.

## The real fundamental problem

Instead of going down another deep path and starting to talk about transformations, I'll try to step back and remark that the fundamental problem is much simpler and has not been addressed IMO.

I claim we do not have a good representation for a simple async construct with regions that operates solely on values; for instance, some notional:

``````%0 = struct.undef : struct<i32, i32>

%r = parallel_region %thread1 (%0) {
%i = constant 0: i32
%1 = struct.insertvalue %i, %0[0]: struct<i32, i32>
yield %1: struct<i32, i32>
%i = constant 1: i32
%1 = struct.insertvalue %i, %0[1]: struct<i32, i32>
yield %1: struct<i32, i32>
} : struct<i32, i32>
``````

Going through memory to represent this is the only thing that comes to my feeble mind.

Do people know how to express this in SSA-land without going through side effects?
If we have good ways, I'd expect we'd want to generalize them to work with looping constructs.

# Epilogue

In the end, I'd love to see prototypes that run end-to-end for the particular solutions proposed, and how they handle imperfect-nestedness: it is easy to overlook "unknown unknowns" that only surface when e2e prototypes are built.

Thanks for reading this wall of text!

Being able to accumulate into a single SSA value in the terminator captures the information we need without breaking SSA and without involving implicit copies/materialization of the full tensor.
A remaining concern would be that you generally have more than one value to accumulate in a loop… a single terminator does not suit that purpose.

Quick heads-up: we had a deep dive with Adam Paszke on this and related topics, and it is likely that this line of work has reached the point where we are ready to concretely engage with our functional friends.

In the next few days we expect to share a proposal and start some prototyping.
This should help move forward on multiple fronts that I am very excited about.

Stay tuned!

Great news, Nicolas.
Adam's Dex work does have a bunch of good ideas about representing iterators over arrays or tensors, which could be helpful here.

OK, I'll wait for the proposal, but I also want to re-iterate some requirements that would be needed for this to be usable in IREE (of course that is not a blocking requirement, but if these aren't satisfied, we just won't be able to use it). Without going into details, the main practical requirements are

1. The modeling should be usable at different stages of the compilation. For example, the same construct should be able to represent distribution to blocks (when targeting CUDA, say) and distribution to threads. In other words, it should not force an end-to-end solution to reason about both the "host" side and the "device" side to be workable. This is vague, so I am happy to take a look at the proposal and provide feedback in more concrete terms.
1. The distributed loop should allow us to analyze the full iteration space without complex pattern matching (or heavyweight machinery).

This might be moot now, but one thing I was thinking about after yesterday's ODM is that `scf.parallel` is actually not a `map` operation (it might be more of a `map-reduce`). In a map operation you take a lambda and apply it to data; the shape of the data determines the shape of the result. The analogous representation here would be that, even without a reduction, the region of `scf.parallel` yields a value.

``````%3 = scf.parallel (%iv) = .... {
scf.yield %t : tensor<?xf32>
} -> tensor<?x?xf32>
``````

i.e. if the `yield` is a tensor of `M` dimensions, and the `scf.parallel` has `K` induction variables, then the result has type `tensor<MxKxf32>`: basically a concat of all the results (like Albert suggested). Then an extra `reduce` region informs how to collapse the extra dimensions of the result.
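Spelled out with invented syntax on the 1-D example, the map-like form could look like this; without the `reduce` region, the result would keep the extra induction-variable dimension:

``````// Each iteration yields its tile; the loop result is conceptually the
// concat of all yielded tiles along an extra dimension.
%3 = scf.parallel (%iv) = (%c0) to (%n) step (%c1) {
%tile = ... : tensor<4xf32>
scf.yield %tile : tensor<4xf32>
} reduce {
// Optional region saying how to collapse the concat dimension,
// e.g. an elementwise add for a sum reduction.
^bb0(%lhs : tensor<4xf32>, %rhs : tensor<4xf32>):
%s = ... : tensor<4xf32>
scf.yield %s : tensor<4xf32>
} -> tensor<4xf32>
``````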

Just wanted to upvote this. This would solve a lot of issues, and is what I was going for with "Solution 1" all the way back in the first post.

Looking forward to getting input from folks with more "functional programming" background! This is exciting.

What is the type of %foo? Is this the full tensor?

My problem here is that I don't understand how to avoid this magic. In your example of "yielding the full tensor" in an `scf.for`, what is the terminator doing, and how do we get from "each iteration yields a full tensor" to "the loop produces a single tensor"?
"Something" has to merge/concat the tensors produced by each iteration, and papering over this seems exactly like "there is a magic terminator" (or "magic in between the terminator and the op").

Right, this is a variant on the terminator/loop contract to produce the final value. Instead of providing the location for the slice, it becomes implicit: it essentially makes it a `concat`. Looks more like what Maksim (I think?) was mentioning during the meeting (they mentioned returning `list[tensor]`, which I think looks like what you're doing?)

I suspect the limitation here is that because you can't control the exact location / n-D insertion of the slice, and you're limited to a simple `concat`, you may need non-trivial data reshuffling after the loop.

Edit: after talking offline to Mahesh, this can be seen as a question of "fusion". You may either have the shuffle fused into the loop (through the terminator I showed above, for example) or have it as a separate step after the fact.
It will be interesting to see how the unfused version plays with bufferization and later codegen stages if we go with it!

@mehdi_amini I am trying to formulate everything in terms of subset operations, so that there is an "evolved" version of `tiled_loop` that has two regions. One region yields subsets of tensors/memrefs, and the other contains computations on the tensor/memref types of these subsets. That way the second region actually yields a subset/tile instead of the full tensor.

2D row reduction:

``````%sum = st.loop (%i, %j) = (%c0, %c0) to (%c80, %c60) step (%c4, %c4)
ins (%in_ = %in: tensor<80x60xf32>, %cst_ = %cst: f32)
outs (%out_ = %out: tensor<80xf32>)
iterators["parallel", "sequential"]
subsets {
%in_sub = subset.tile %in_[%i, %j] [4, 4] [1, 1]
: tensor<80x60xf32> to !st.tile<4x4xf32>
%out_sub = subset.tile %out_[%i] [4] [1]
: tensor<80xf32> to !st.tile<4xf32>
subset.yield ins (%in_sub: !st.tile<4x4xf32>) outs (%out_sub: !st.tile<4xf32>)
}
compute ins (%in_sub: tensor<4x4xf32>) outs(%out_sub: tensor<4xf32>) {
%reduction = linalg.generic {
indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
affine_map<(d0, d1) -> (d0)>],
iterator_types = ["parallel", "reduction"]}
ins(%in_sub : tensor<4x4xf32>)
outs(%out_sub : tensor<4xf32>) {
^bb0(%a: f32, %b: f32):
%0 = arith.addf %a, %b : f32
linalg.yield %0 : f32
} -> tensor<4xf32>
st.yield %reduction : tensor<4xf32>
}
``````

1D cwise

``````st.loop (%i) = (%c0) to (%dim) step (%c10)
ins (%in_ = %in: tensor<100xf32>)
outs (%out_ = %out: tensor<100xf32>)
subsets {
%in_sub = subset.tile %in[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
%out_sub = subset.tile %out[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
subset.yield ins (%in_sub: !st.tile<10xf32>) outs (%out_sub: !st.tile<10xf32>)
}
compute ins (%in_sub: tensor<10xf32>) outs (%out_sub: tensor<10xf32>) {
%abs = linalg.generic {
indexing_maps =  [#identity_1D, #identity_1D],
iterator_types = ["parallel"]}
ins(%in_sub: tensor<10xf32>)
outs(%out_sub: tensor<10xf32>) {
^bb(%a: f32, %b: f32) :
%0 = math.abs %a : f32
linalg.yield %0 : f32
} -> tensor<10xf32>
st.yield %abs : tensor<10xf32>
}
}
``````

1D cwise bufferized

``````%dim = memref.dim %in, %c0: memref<100xf32>
st.loop (%i) = (%c0) to (%dim) step (%c10)
ins (%in_ = %in: memref<100xf32>)
outs (%out_ = %out: memref<100xf32>)
subsets {
%in_sub = subset.tile %in[%i][10][1] : memref<100xf32> to !st.tile<10xf32>
%out_sub = subset.tile %out[%i][10][1] : memref<100xf32> to !st.tile<10xf32>
subset.yield ins (%in_sub: !st.tile<10xf32>) outs (%out_sub: !st.tile<10xf32>)
}
compute ins (%in_sub: memref<10xf32, #map>) outs (%out_sub: memref<10xf32, #map>) {
%abs = linalg.generic {
indexing_maps =  [#identity_1D, #identity_1D],
iterator_types = ["parallel"]}
ins(%in_sub: memref<10xf32, #map>)
outs(%out_sub: memref<10xf32, #map>) {
^bb(%a: f32, %b: f32) :
%0 = math.abs %a : f32
linalg.yield %0 : f32
} -> memref<10xf32>
st.yield %abs : memref<10xf32>
}
}
``````

In that case the fusion into `st.loop` becomes mostly about constructing a composition of subsets. Subset types can be extended, etc.

1 Like