Representing tiling on tensors + parallelism

The slice is first extracted, updated, and then reinserted. This isn't atomic, though, and the assumption here is that each iteration gets its own tile and there is no overlap in the updates.
(as far as I understand :slight_smile: )

1 Like

Thinking a bit more: maybe it could fall into the category of "atomic updates" that make such a "reduction" fully parallel/independent again?

Well, you don't need such semantics (linalg.tiled_loop does have this now, FWIW). Leave aside the parallelism question for now. Then the destructive update is equivalent to doing a read-modify-write of the tensor within the loop body, i.e. it reads a slice of %arg0 using tensor.extract_slice, computes a new slice to use in its place, inserts it into %arg0, and yields the new tensor created. The body of the loop could use a reduction update to compute the new slice to be inserted.
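To make that concrete, here is a minimal sketch of the read-modify-write pattern on tensors (the tile size %ts and all names are made up for illustration, and the computation of the new tile is elided):

%res = scf.for %iv = %c0 to %ub step %ts iter_args(%arg0 = %init) -> (tensor<?xf32>) {
  %slice = tensor.extract_slice %arg0[%iv] [%ts] [1] : tensor<?xf32> to tensor<?xf32>
  // ... compute a new tile %new from %slice (possibly a reduction update) ...
  %next = tensor.insert_slice %new into %arg0[%iv] [%ts] [1] : tensor<?xf32> into tensor<?xf32>
  scf.yield %next : tensor<?xf32>
}

Each iteration yields the whole (updated) tensor, so the SSA form of the loop stays sequential even when the tiles do not overlap.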

FWIW, if you do tiling on memref type, you can generate the tiled loops as scf.parallel ops and that is the way to preserve the parallelism today.
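For reference, a minimal sketch of what that looks like (names, sizes and the strided layout #map of the subviews are assumed); the parallelism stays explicit because the loop body only has side effects on memrefs:

scf.parallel (%iv) = (%c0) to (%dim) step (%ts) {
  %in_sub  = memref.subview %in[%iv] [%ts] [1]  : memref<?xf32> to memref<?xf32, #map>
  %out_sub = memref.subview %out[%iv] [%ts] [1] : memref<?xf32> to memref<?xf32, #map>
  linalg.generic {indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
      ins(%in_sub : memref<?xf32, #map>) outs(%out_sub : memref<?xf32, #map>) {
    ^bb0(%a : f32, %b : f32):
      %s = math.sqrt %a : f32
      linalg.yield %s : f32
  }
}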

Agreed that this is part of the weirdness here. It's a representation issue. When working with SSA values, there doesn't seem to be a way around it. This is a way to side-step the inherent representation issue, with the hope that later on in the compilation, past bufferization, the representation issue is resolved. The challenge is to carry the parallel semantics up to that level.

Yeah, I think I'd agree with that. It is easier to think of it in terms of IREE's representation. The tiled computation in IREE is represented as:

#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
scf.for %iv = %c0 to %2 step %t {
  %4 = affine_min #map1(%iv)[%t, %2]
  %5 = flow.dispatch.tensor.load %input[%iv] [%4] [1] : !flow.dispatch.tensor<?xf32> -> tensor<?xf32>
  %6 = flow.dispatch.tensor.load %init[%iv] [%4] [1] : !flow.dispatch.tensor<?xf32> -> tensor<?xf32>
  %7 = linalg.generic {
         indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
         ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) {
       ^bb0(%arg0 : f32, %arg1 : f32):
         %8 = math.sqrt %arg0 : f32
         linalg.yield %8 : f32
       } -> tensor<?xf32>
  flow.dispatch.tensor.store %7 into %init[%iv] [%4] [1] : tensor<?xf32> -> !flow.dispatch.tensor<?xf32>
}

The !flow.dispatch.tensor values are "ABI types", i.e. they allow reads/writes. The flow.dispatch.tensor.load/store ops let you read/write them, and the modeling here is that "somehow" these ops extract/insert the requested slice from/into the flow.dispatch.tensor value. These are atomic, and any overlap is undefined behavior. The trick is that after bufferization you can just treat the flow.dispatch.tensor as a regular memref. In my mind I always take the pure scf.for version and reason about it in terms of how it is represented above, because I don't think there is a way to model parallelism at the tensor level apart from what is done in scf.parallel, and that has different trade-offs.

1 Like

For folks tagging along, @MaheshRavishankar agreed to brainstorm this topic in the public meeting tomorrow. No presentation/slides, we’ll take it from code sample and try to play with this!

1 Like

Following up on this morning, and extending a bit on @albertcohen's mention of "concat is parallel", why don't we embed the "reduction" in the terminator:

#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0)[s0, s1] -> (s0, -d0 + s1)>
%1 = linalg.init_tensor [..] : tensor<?xf32>
%2 = tensor.dim %1, %c0 : tensor<?xf32>
%3 = scf.parallel (%iv) = (%c0) to (%2) step (%t) init(%1) -> tensor<?xf32> {
       %4 = affine_min #map1(%iv)[%t, %2]
       %5 = tensor.extract_slice %0[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
       %6 = tensor.extract_slice %1[%iv] [%4] [1] : tensor<?xf32> to tensor<?xf32>
       %7 = linalg.generic {
            indexing_maps = [#map0, #map0], iterator_types = ["parallel"]}
            ins(%5 : tensor<?xf32>) outs(%6 : tensor<?xf32>) {
          ^bb0(%arg0 : f32, %arg1 : f32):
            %8 = math.sqrt %arg0 : f32
            linalg.yield %8 : f32
          } -> tensor<?xf32>
      scf.reduce_insert_slice %7 at [%iv] [%4] [1] : tensor<?xf32>
}

I made up this scf.reduce_insert_slice (maybe this does not belong in scf and we should use another loop than scf.parallel, but let's leave that for later): the semantics is that each iteration produces a slice of the output tensor and expresses where to insert it in the result. This is like scf.reduce, except that since the reduction is a "concat" it can be fully parallel.
It seems like it addresses your concern @MaheshRavishankar in that we don’t materialize the full tensor at each iteration (or for each reduction step)?

I think this also closes the loop and brings us back to where we arrived 3 months ago?

Here: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #9 by mehdi_amini
I wrote at the time:

...
  %tile_range = tensor.tile_range [%j, %i][%c20, %c10][%c1, %c1] : !tensor.tile_range
  %out_sub = tensor.extract_slice %out(%tile_range) // only if %out_sub is used as input to the computation!
  ...
  linalg.tiled_yield %transpose_sub at %tile_range : tensor<10x10xf32>
}

And @_sean_silva thought of a variant: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #35 by _sean_silva

  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c20][%c1, %c1]
 
  %transpose_sub = ...

  linalg.tiled_loop_terminator {
    tiled_yield %transpose_sub at [%j, %i][%c20, %c10][%c1, %c1]
  }
}

(grouping the "yield" ops in a region might help to better represent the possible multiple results of the loop)

I don't know if @pifon2a made progress on their idea about something more flexible / expressive though: [RFC] Changes to linalg::TiledLoopOp to unblock reductions - #27 by pifon2a

These points have indeed been discussed, and some even prototyped, in the past months.

The current design has settled on 1-d control flow + n-d data representation for multiple intersecting reasons: as usual it’s tradeoffs everywhere.

One particular reason we currently do "1-d/n-d" is that (I believe) n-d control flow + an n-d terminator makes it tricky to represent imperfect nestedness in a stable form at this point in the transformation pipeline. In particular, the big problem I see with forcing the control flow and the data into the same abstraction (i.e. a containing op and its terminator) is that "the value n ends up being the same" for control flow and data.

I have heard a few times something along the lines of: “if you don’t want n-d scf.parallel you can just break them down into 1-d scf.parallel”. Well let’s unpack this on the simplest imperfectly nested example possible:

%t = scf.parallel %i ... {
   //
   // I want to do something special here (prefetch, pack, compute)
   // 
   %tt = scf.parallel %j ... {
      // compute 2-D tile
      terminator1?
   }
   terminator2?
}

What possibilities do we have for terminator1 and terminator2?

Possibility 1: Symmetric terminators

If both terminator1 and terminator2 are symmetric, both can only insert 1-D slices: %j does not dominate terminator2. This breaks the ability to perform n-D block insertions and is a non-starter for this body of work (it may be perfectly valid as an abstraction to have, but we don't want to use it in this instance).

Possibility 2: Asymmetric terminators

Alternatively, terminator1 can be the special reduce terminator and terminator2 can be a simple scf.yield. While this could technically work, it is still very brittle and there are numerous implications.

Let’s make the example a little more interesting: I claim there is such a thing as fusing 1-D and 2-D loop nests.

%t0, %t1 = scf.parallel %i ... {
   //
   // I actually do something here.
   // 
   // compute 1-D tile
  %t = some_pointwise_op_1d()

   %tt = scf.parallel %j ... {
      // compute 2-D tile
      fancy_terminator?
   }

   terminator2?
}

What should terminator2 look like? I am not sure.
How does this generalize to more results and different types? I’m afraid to ask.
What are the implications on analysis and transformations? oh my …

Possibility 3: Reduce is not a terminator

Separately, we have been discussing with @ftynse and @dcaballe the fact that a generic reduce is needed to capture more advanced combinators. This would be generally useful everywhere if it were decoupled from the semantics of an enclosing op. In particular, reduce on vectors is something that we consider needed to increase the applicability of generic n-D vectorization while keeping it short.

%t0, %t1 = scf.parallel %i ... {
  %t = some_pointwise_op_1d()

   %tt = scf.parallel %j ... {
      // compute 2-D tile
      %ttt = reduce (%small_tile, %big_tile) @some_init {some_body} -> type(%big_tile)
      scf.yield %ttt: %big_tile
   }

   scf.yield %tt, %t
}

I believe something like this was previously proposed but had some caveats: @some_init was represented as a function, which required all passes to become module passes, but this should be easily fixable.

The real problems

Given the multiple interactions I’ve had over the past few months, I’d venture that this is just a symptom of some difficult tradeoffs that I will list here.

IR design vs analysis for transformations

linalg.generic is perfectly nested by design. This is a conscious and carefully weighted tradeoff between expressiveness and the 3 harbingers of transformations (legality, applicability and profitability). I won't repeat the points of the Rationale, but you'll note there is a section on closedness under transformations.

Once tiling and fusion are performed, we are in imperfectly-nested land on the tiled loops, but still perfectly nested within the linalg.generic inside the tiled loops. At this point there is no escaping our imperfect-nestedness condition. The positive is that the transformations usually needed on tiled loops are a much simpler bunch. Still, coming back to the imperfect nestedness:

  1. we can push forward on the n-D abstractions with implicit load/store, in which case we should do the hard work of introducing tensor<... x tensor< ...>> and codegen with it as a first-class citizen. This is also very desirable as our way into richer semantics that can easily be codegen'ed (ragged, jagged, sparse, trees, databases, etc.)

  2. we can use traditional loops and write the analyses on these explicit subset load/store ops, gradually introducing better abstractions and interfaces to simplify and generalize. However much I dislike unnecessary complexity, we do have to bite the bullet and do some analysis. Luckily we are in SSA-land, and a lot of this is implemented and functional in the sequential case. This is significantly simpler than memory-based analysis and extends to more subset/subset_insert operation pairs and data types. Trying to cut corners here is a very bad idea.

(spoiler: we should do both).

scf.parallel, linalg.tiled_loop and "smart" terminators simply are not up to the job we need them for: they sit at an abstraction level that is half-explicit / half-implicit, and that weird symmetry does not compose or generalize well in an imperfectly nested world on tensors.

Big tile, small tile, no tile

You will note here that we seem to have given up on “yielding the small tile” and that the various terminators proposed “yield the big tile”.

It seems this is a trigger point for people, but it is the result of applying the first principles outlined above related to implicit vs explicit load/store (and, more importantly, of implementing them in practice and getting a fully functional system up and running).

This is also fundamentally what triggers the race condition if we try to materialize the parallelism in the IR.

In other words, the problem comes from trying to write something like this on tensors:

// Boo races here, races there, races everywhere.
%t = looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
  ...
  whatever_terminator %foo : full_tensor
} -> full_tensor

The simple thing to remark is that one should not try to do that, and should instead carry enough information in the IR to materialize the parallelism after bufferization:

looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
  // yay side-effecting computations on buffers
} 

You will note that this is exactly what scf.parallel and linalg.tiled_loop do, while additionally introducing the other undesirable things mentioned above.

You will also note that IREE has decided to introduce side effects early and to have a mixed model: "buffers in the ether accessed by name" at the boundary and tensors inside. This also comes with its many implications and tradeoffs that are really out of scope here.

Bottom line: while we should also push for tensor<... x tensor< ...>>, we also need something like:

%t = looping_control_flow_with_parallel_semantics %i = %lb, %ub, %step 
   {attributes_and_operands_to_specify_how_loop_is_distributed} 
{
  ...
  scf.yield %foo
} -> full_tensor

While we should have an scf.reduce, it is important IMO that the terminator not do any magic, given the first principles above.

At which point the jump to scf.for seems pretty natural to me:

%t = scf.for %i = %lb, %ub, %step 
   {attributes_and_operands_to_specify_how_loop_is_distributed} 
{
  ...
  scf.yield %foo
} -> full_tensor

We could put lipstick on it and call it something else, fine.
I tend to want to avoid having to duplicate a lot of the canonicalization and folding transformations on scf.for, but as often, maybe this requires new interfaces.

The real fundamental problem

Instead of going down another deep path and starting to talk about transformations, I'll try to step back and remark that the fundamental problem is much simpler and has not been addressed, IMO.

I claim we do not have a good representation for a simple async construct with regions that operates solely on values; for instance, some notional:

%0 = struct.undef : struct<i32, i32>
%thread1, %thread2 = fork(2) : thread, thread

%r = parallel_region %thread1 (%0) {
  %i = constant 0: i32
  %1 = struct.insertvalue %i, %0[0]: struct<i32, i32>
  yield %1: struct<i32, i32>
} parallel_region %thread2 (%0) {
  %i = constant 1: i32
  %1 = struct.insertvalue %i, %0[1]: struct<i32, i32>
  yield %1: struct<i32, i32>
} : struct<i32, i32>

Going through memory to represent this is the only thing that comes to my feeble mind.

Do people know how to express this in SSA-land without going through side effects?
If we have good ways, I’d expect we’d want to generalize them to work with looping constructs.

Epilogue

In the end, I'd love to see prototypes that run end-to-end for the particular solutions that are proposed, and to see how they handle imperfect nestedness: it is easy to overlook "unknown unknowns" that only surface when e2e prototypes are built.

Thanks for reading this wall of text!

Being able to accumulate in a single SSA value in the terminator captures the information we need without breaking SSA and without involving implicit copies/materialization of the full tensor.
A remaining concern would be that you generally have more than one value to accumulate in a loop… A single terminator does not suit that purpose.

Quick heads up: we had a deep dive with Adam Paszke on this and related topics, and it is likely that this line of work has reached the point where we are ready to concretely engage with our functional friends.

In the next few days we expect to share a proposal and start some prototyping.
This should help move forward on multiple fronts that I am very excited about.

Stay tuned!

Great news Nicolas.
Adam’s Dex work does have a bunch of good ideas in representing iterators on arrays or tensors which could be helpful here.

Ok, I'll wait for the proposal, but I also want to re-iterate some requirements that would be needed for this to be usable in IREE (of course that is not a blocking requirement, but if these aren't satisfied, we just won't be able to use it). Without going into details, the main practical requirements are:

  1. The modeling should be usable at different stages of the compilation. For example, the same construct should be able to represent distribution to blocks (when targeting CUDA, say) and distribution to threads. In other words, it should not force an end-to-end solution to have to reason about both the "host" side and the "device" side to be workable. This is vague, so I am happy to take a look and provide feedback on the proposal in more concrete terms.
  2. The distributed loop should allow us to analyze the full iteration space without having to do complex pattern matching (or use heavyweight machinery).

This might be moot now, but one thing I was thinking about after yesterday's ODM is that scf.parallel is actually not a map operation (it might be more of a map-reduce). In a map operation you take a lambda and apply it to data; the shape of the data determines the shape of the result. The analogous representation here would be that, even without a reduction, the region of scf.parallel yields a value.

%3 = scf.parallel (%iv) = .... {
  scf.yield %t : tensor<?xf32>
} -> tensor<?x?xf32>

i.e. if the yield is a tensor of M dimensions, and the scf.parallel has K induction variables, then the result has a type tensor<MxKxf32>, basically a concat of all the results (like Albert suggested). Then an extra reduce region informs how to collapse the M dimensions of the result.

Just wanted to upvote this. This will solve a lot of issues, and it is what I was going for with "Solution 1" all the way back in the first post.

Looking forward to getting the inputs from folks with more of a "functional programming" background! This is exciting :slight_smile:

What is the type of %foo? Is this the full tensor?

My problem here is that I don't understand how to avoid this magic. In your example of "yielding the full tensor" in an scf.for, what is the terminator doing, and how do we get from "each iteration yields a full tensor" to "the loop produces a single tensor"?
"Something" has to merge/concat the tensors produced by each iteration, and papering over this seems exactly like "there is a magic terminator" (or "magic in between the terminator and the op").

Right, this is a variant on the terminator/loop contract to produce the final value. Instead of providing the location for the slice, it becomes implicit: it really makes it a concat. Looks more like what Maksim (I think?) was mentioning during the meeting (they were mentioning returning list[tensor], which I think looks like what you're doing?)

I suspect the limitation here is that because you can't control the exact location / n-D insertion of the slice, and you're limited to a simple concat, you may need non-trivial data reshuffling after the loop.

Edit: after talking offline to Mahesh, it can be seen as a question of "fusion". You may either have the shuffle fused into the loop (through the terminator I showed above, for example) or have it as a separate step after the fact.
It will be interesting to see how the unfused version plays with bufferization and later codegen stages if we go with it!

@mehdi_amini I am trying to formulate everything in terms of subset operations so that there is an “evolved” version of tiled_loop that has two regions. One region yields subsets of tensors/memrefs and the other contains computations on the tensor/memref types of these subsets. That way the second region actually yields a subset/tile instead of the full tensor.

2D row reduction:

%sum = st.loop (%i, %j) = (%c0, %c0) to (%c80, %c60) step (%c4, %c4)
  ins (%in_ = %in: tensor<80x60xf32>, %cst_ = %cst: f32)
  outs (%out_ = %out: tensor<80xf32>)
  iterators["parallel", "sequential"]
  subsets {
    %in_sub = subset.tile %in_[%i, %j] [4, 4] [1, 1]
      : tensor<80x60xf32> to !st.tile<4x4xf32>
    %out_sub = subset.tile %out_[%i] [4] [1]
      : tensor<80xf32> to !st.tile<4xf32>
    subset.yield ins (%in_sub: !st.tile<4x4xf32>) outs (%out_sub: !st.tile<4xf32>)
  }
  compute ins (%in_sub: tensor<4x4xf32>) outs(%out_sub: tensor<4xf32>) {
    %reduction = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                         affine_map<(d0, d1) -> (d0)>],
        iterator_types = ["parallel", "reduction"]}
        ins(%in_sub : tensor<4x4xf32>)
        outs(%out_sub : tensor<4xf32>) {
      ^bb0(%a: f32, %b: f32):
        %0 = arith.addf %a, %b : f32
        linalg.yield %0 : f32
    } -> tensor<4xf32>
    st.yield %reduction : tensor<4xf32>
  }
}

1D cwise

st.loop (%i) = (%c0) to (%dim) step (%c10)
  ins (%in_ = %in: tensor<100xf32>)
  outs (%out_ = %out: tensor<100xf32>)
  subsets {
    %in_sub = subset.tile %in[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    %out_sub = subset.tile %out[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    subset.yield ins (%in_sub: !st.tile<10xf32>) outs (%out_sub: !st.tile<10xf32>)
  }
  compute ins (%in_sub: tensor<10xf32>) outs (%out_sub: tensor<10xf32>) {
    %abs = linalg.generic {
      indexing_maps =  [#identity_1D, #identity_1D],
      iterator_types = ["parallel"]}
     ins(%in_sub: tensor<10xf32>)
     outs(%out_sub: tensor<10xf32>) {
      ^bb(%a: f32, %b: f32) :
        %0 = math.abs %a : f32
        linalg.yield %0 : f32
    } -> tensor<10xf32>
    st.yield %abs : tensor<10xf32>
  }
}

1D cwise bufferized

%dim = memref.dim %in, %c0: memref<100xf32>
st.loop (%i) = (%c0) to (%dim) step (%c10)
  ins (%in_ = %in: memref<100xf32>)
  outs (%out_ = %out: memref<100xf32>)
  subsets {
    %in_sub = subset.tile %in[%i][10][1] : memref<100xf32> to !st.tile<10xf32>
    %out_sub = subset.tile %out[%i][10][1] : memref<100xf32> to !st.tile<10xf32>
    subset.yield ins (%in_sub: !st.tile<10xf32>) outs (%out_sub: !st.tile<10xf32>)
  }
  compute ins (%in_sub: memref<10xf32, #map>) outs (%out_sub: memref<10xf32, #map>) {
    %abs = linalg.generic {
      indexing_maps =  [#identity_1D, #identity_1D],
      iterator_types = ["parallel"]}
     ins(%in_sub: memref<10xf32, #map>)
     outs(%out_sub: memref<10xf32, #map>) {
      ^bb(%a: f32, %b: f32) :
        %0 = math.abs %a : f32
        linalg.yield %0 : f32
    } -> memref<10xf32>
    st.yield %abs : memref<10xf32>
  }
}

In that case the fusion into st.loop becomes mostly about constructing a composition of subsets. Subset types can be extended, etc.

1 Like

This is TBD because it really depends on explicit vs implicit and how we want to hide the abstraction gap between “sequential semantics on tensors” and “materialized parallel semantics on buffers”.

In the sequential case on tensors, this is well-defined.
Folks may complain about "insert"-like operations and analyses (which, by the way, also exist everywhere in vector and LLVM), but this is still well-defined.

In the parallel case, the idea of adding a first-class distribution attribute to scf.for that only materializes after bufferization is one way of attaching distribution information to sequential IR at the tensor level. We are looking for better ones.

Spoiler: this action at a distance between "unmaterialized parallel IR on tensors" and "materialized parallel IR on buffers" is also present in functional land, FWIU, at the moment. These are all various forms of "sequential IR + hiding the abstraction gap" that can only resolve after bufferization.

How would one represent a tiled and fused DAG?

Seems that in this representation, multiple interleaved (subset-compute)+ regions would be needed?

Could you also discuss the multidimensional case esp. wrt imperfect-nestedness considerations that I explained above?

I am a bit late to the party but will still throw my two cents in since I’ve been summoned to this thread by explicit mentions.

I am not supportive of extending the scope of the scf.for operation to support distribution or parallelism, precisely for the reason why Mahesh likes this operation: "the scf.for is very straight-forward in what it represents". I believe it should remain straightforward, and adding extra semantics makes it less so.

Regarding the difference between scf.for and scf.parallel, there are several aspects:

  1. scf.parallel indicates that its iterations may be executed in parallel (but they may be sequentialized; it doesn't prescribe parallel execution or anything else);
  2. scf.parallel is a multi-for nest, intended to guarantee permutability of individual loops;
  3. scf.parallel has explicit support for reductions via scf.reduce regions, whereas scf.for provides secondary induction variables (aka iter_args) that may be used to express reductions but are not necessarily reductions (they can also be used to express, e.g., iterating on parallel containers similarly to llvm::zip, where each iteration increments the pointer carried around as iter_args but the result of the loop is unused); see the sketch after this list.
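To make the contrast in (3) concrete, here is a hedged sketch (in roughly the scf syntax of the time; names and types are assumed):

// Reduction via scf.for iter_args: each iteration sees exactly the
// accumulator value yielded by the previous iteration.
%sum = scf.for %i = %c0 to %n step %c1 iter_args(%acc = %zero) -> (f32) {
  %v = tensor.extract %t[%i] : tensor<?xf32>
  %next = arith.addf %acc, %v : f32
  scf.yield %next : f32
}

// Reduction via scf.parallel + scf.reduce: the first block argument of the
// reduce region is only guaranteed to be the init value or a value yielded
// by some other iteration, in an unspecified order.
%sum2 = scf.parallel (%i) = (%c0) to (%n) step (%c1) init (%zero) -> f32 {
  %v = tensor.extract %t[%i] : tensor<?xf32>
  scf.reduce(%v) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %r = arith.addf %lhs, %rhs : f32
    scf.reduce.return %r : f32
  }
}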

We can discard (2) for the sake of simplicity and assume we only care about single-for loops.

(1) feels important here: what we seem to need to express is distribution-to-independent-processing-units, or required concurrency, not parallelism-as-independence-of-iterations. For example, one cannot deterministically express a simple signal/wait communication using scf.parallel:

scf.parallel %i = 0 to 2 {
  scf.if %i == 0 {
    wait();
  }
  scf.if %i == 1 {
    signal();
  }
}

as it may or may not deadlock depending on iterations being distributed to different processing units or not.

This issue is partially orthogonal, but expressing reductions usually requires some sort of communication between threads or other concurrent processing units, and not being able to express it is problematic for modeling.

(3) is what is being mostly discussed in this thread, and it is partially connected to (1) in how the results are recombined. Ignoring what I see as semantic sugar about there being a reduction across loop iterations, the only difference that matters is scf.for guaranteeing that the iter_args in an iteration have the values yielded by the previous iteration, while scf.parallel only guarantees that the first argument of each scf.reduce region is the value yielded by some other iteration or the init value. It does not prescribe parallel reduction. Note that this doesn't directly allow for a tree reduction, because that would require knowing the neutral value, which may or may not be the init value.

The actual compute pattern that we seem to need is map/reduce on tensors. This includes:

  • the tiling/subsetting/data-splitting operation that partitions the inputs for each independent computation;
  • the main computation with additional specification where it may or must be executed concurrently with other computations;
  • the result combinator (inverse of tiling/subsetting/data-splitting operation) with optional additional indication on how it can be executed concurrently.

There are two simple cases of combinators: computing independent results (what Nicolas refers to as yielding the small tile), or systematically combining results from concurrent computations with the same "accumulator" value (reductions as we have them). Yielding subtensors / small tiles from the operation, like tiled_loop seems to be doing, is just putting the "concatenate" combinator into the operation semantics. Yielding full tensors / big tiles from the loop, like linalg+for is currently doing, is just putting the "overwrite" combinator into the operation semantics. There are more complex cases, like partially overlapping results that need to be reduced on the boundary but are independent in the middle, which we may want to model one day. I feel like if we have IR modeling at this level, it will be the right abstraction. Ideally, subsetting, parallelism/concurrency and combinators are designed as independent, reusable constructs.
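For what it's worth, a rough sketch of the two simple combinators (the small-tile form uses a made-up op and terminator, since no upstream op has exactly this shape; names, shapes and tile computations are assumed or elided):

// "Overwrite" combinator / big tile (linalg + scf.for today): each iteration
// inserts its tile into the loop-carried tensor and yields the whole value.
%r0 = scf.for %i = %c0 to %n step %ts iter_args(%acc = %init) -> (tensor<?xf32>) {
  // ... compute %new_tile : tensor<?xf32> ...
  %full = tensor.insert_slice %new_tile into %acc[%i] [%ts] [1]
      : tensor<?xf32> into tensor<?xf32>
  scf.yield %full : tensor<?xf32>
}

// "Concatenate" combinator / small tile (tiled_loop-style): each iteration
// yields only its tile, and reassembly is part of the enclosing op's semantics.
%r1 = some.tiled_loop (%i) = (%c0) to (%n) step (%ts) outs(%init : tensor<?xf32>) {
  // ... compute %new_tile : tensor<?xf32> ...
  some.tile_yield %new_tile at [%i] [%ts] [1] : tensor<?xf32>
}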

3 Likes

There are always two regions no matter how many producers we fuse.

Example:

exponential(add(a,b))
st.loop (%i) = (%c0) to (%dim) step (%c10)
  ins (%a_ = %a: tensor<100xf32>, %b_ = %b: tensor<100xf32>)
  outs (%out_ = %out: tensor<100xf32>)
  subsets {
    %a_sub = subset.tile %a[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    %b_sub = subset.tile %b[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    %out_sub = subset.tile %out[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    subset.yield ins (%a_sub: !st.tile<10xf32>, %b_sub: !st.tile<10xf32>)
                        outs (%out_sub: !st.tile<10xf32>)
  }
  compute ins (%a_sub: tensor<10xf32>, %b_sub: tensor<10xf32>)
                 outs (%out_sub: tensor<10xf32>) {
    %add = linalg.generic // %a_sub + %b_sub
    %exp = linalg.generic // %exp(%add)
    st.yield %exp : tensor<10xf32>
  }
}

Apologies for resurrecting such an old thread; I just wanted to check whether there have been any recent developments / discussions regarding reductions in parallel loops operating on tensors.

Yes, there has been recent progress, and I should be able to share more details in the next few days/weeks (~2 weeks, as I will be travelling a bit).