Representing tiling on tensors + parallelism

These points have indeed been discussed, and some even prototyped, in the past months.

The current design has settled on 1-d control flow + n-d data representation for multiple intersecting reasons: as usual it’s tradeoffs everywhere.

One particular reason we currently do “1-d/n-d” is that (I believe) n-d control flow + n-d terminator makes it tricky to represent imperfect nestedness in a stable form at this point in the transformation pipeline. In particular, the big problem I see with forcing the control flow and data into the same abstraction (i.e. a containing op and its terminator) is that “the value n ends up being the same” for control flow and data.

I have heard a few times something along the lines of: “if you don’t want n-d scf.parallel you can just break them down into 1-d scf.parallel”. Well, let’s unpack this with the simplest imperfectly nested example possible:

%t = scf.parallel %i ... {
   //
   // I want to do something special here (prefetch, pack, compute)
   // 
   %tt = scf.parallel %j ... {
      // compute 2-D tile
      terminator1?
   }
   terminator2?
}

What possibilities do we have for terminator1 and terminator2?

Possibility 1: Symmetric terminators

If terminator1 and terminator2 are both symmetric (i.e. the same kind of terminator), both can only insert 1-D slices: %j does not dominate terminator2. This breaks the ability to perform n-D block insertions and is a non-starter for this body of work (it may be a perfectly valid abstraction to have, but we don’t want to use it in this instance).
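
To make the dominance problem concrete, here is a hedged sketch in the same notional syntax as above (symmetric_insert, %tile2d, %inner_acc and %outer_acc are made up for illustration): at terminator2, only %i is in scope, so at best a 1-D insertion is expressible.

%t = scf.parallel %i ... {
   %tt = scf.parallel %j ... {
      // compute 2-D tile
      // Fine: %i and %j are both in scope, so a 2-D (%i, %j) insertion is expressible.
      symmetric_insert %tile2d into %inner_acc[%i, %j]
   }
   // Not fine: %j does not dominate this point, so only a 1-D insertion
   // indexed by %i alone is expressible here.
   symmetric_insert %tt into %outer_acc[%i]
}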

Possibility 2: Asymmetric terminators

Alternatively, terminator1 can be the special reduce terminator and terminator2 can be a simple scf.yield. While this could technically work, it is still very brittle and there are numerous implications.
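
Concretely, on the simple example above, this would look something like the following hedged sketch (still purely notional syntax; reduce_into is made up to stand for the special terminator):

%t = scf.parallel %i ... {
   //
   // something special here (prefetch, pack, compute)
   //
   %tt = scf.parallel %j ... {
      // compute 2-D tile
      reduce_into %small_tile at [%i, %j]   // terminator1: the "smart" terminator
   }
   scf.yield %tt                            // terminator2: a plain yield
}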

Let’s make the example a little more interesting: I claim there is such a thing as fusing 1-D and 2-D loop nests.

%t0, %t1 = scf.parallel %i ... {
   //
   // I actually do something here.
   // 
   // compute 1-D tile
   %t = some_pointwise_op_1d()

   %tt = scf.parallel %j ... {
      // compute 2-D tile
      fancy_terminator?
   }

   terminator2?
}

What should terminator2 look like? I am not sure.
How does this generalize to more results and different types? I’m afraid to ask.
What are the implications on analysis and transformations? oh my …

Possibility 3: Reduce is not a terminator

Separately, we have been discussing with @ftynse and @dcaballe the fact that a generic reduce is needed to capture more advanced combinators. This would be generally useful everywhere if it were decoupled from the semantics of an enclosing op. In particular, reduce on vectors is something we consider necessary to increase the applicability of generic n-D vectorization while keeping it short.

%t0, %t1 = scf.parallel %i ... {
   %t = some_pointwise_op_1d()

   %tt = scf.parallel %j ... {
      // compute 2-D tile
      %ttt = reduce (%small_tile, %big_tile) @some_init {some_body} -> type(%big_tile)
      scf.yield %ttt : type(%big_tile)
   }

   scf.yield %tt, %t
}

I believe something like this was previously proposed but had some caveats: @some_init was represented as a function, which required all passes to become module passes. This should be easily fixable.

The real problems

Given the multiple interactions I’ve had over the past few months, I’d venture that this is just a symptom of some difficult tradeoffs that I will list here.

IR design vs analysis for transformations

linalg.generic is perfectly nested by design. This is a conscious and carefully weighted tradeoff between expressiveness and the 3 harbingers of transformations (legality, applicability and profitability). I won’t repeat the points of the Rationale, but you’ll note there is a section on closedness under transformations.
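
For reference, a hedged sketch of what “perfectly nested by design” means in practice (syntax approximate and version-dependent; some_scalar_op is made up): the indexing maps and iterator types imply the full loop nest around the scalar body, so there is no place to put anything “between” two loop levels.

#map = affine_map<(i, j) -> (i, j)>
%0 = linalg.generic
       {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]}
       ins(%A : tensor<?x?xf32>) outs(%B : tensor<?x?xf32>) {
  ^bb0(%a: f32, %b: f32):
    // The implied i/j loops are perfectly nested around this scalar body.
    %r = "some_scalar_op"(%a, %b) : (f32, f32) -> f32
    linalg.yield %r : f32
} -> tensor<?x?xf32>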

Once tiling and fusion are performed, we are in imperfectly-nested land on the tiled loops, but still perfectly nested within the linalg.generic ops inside those loops. At this point there is no escaping our imperfect-nestedness condition. The positive is that the transformations usually needed on tiled loops are a much simpler bunch. Still, coming back to the imperfect nestedness:

  1. we can push forward on the n-D abstractions with implicit load/store, in which case we should do the hard work of introducing tensor<... x tensor<...>> and codegen with it as a first-class citizen. This is also very desirable as our way into richer semantics that can easily be codegen’ed (ragged, jagged, sparse, trees, databases, etc.)

  2. we can use traditional loops and write the analyses on explicit subset load/store ops, gradually introducing better abstractions and interfaces to simplify and generalize (see the sketch after this list). However much I dislike unnecessary complexity, we do have to bite the bullet and do some analysis. Luckily we are in SSA-land, and a lot of this is already implemented and functional in the sequential case. This is significantly simpler than memory-based analysis and extends to more subset/subset_insert operation pairs and data types. Trying to cut corners here is a very bad idea.

(spoiler: we should do both).
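
As a hedged illustration of option 2 (and of “yielding the big tile”, more on that below), assuming the tensor.extract_slice / tensor.insert_slice op names and made-up sizes (%ts, %n) and computation (some_tile_computation): the loop carries the full tensor as an iter_arg, the subsets are loaded and stored explicitly in SSA form, and the terminator does nothing special.

%res = scf.for %i = %lb to %ub step %step iter_args(%iter = %init) -> (tensor<?x?xf32>) {
  // Explicit subset load of the small tile from the big tensor.
  %tile = tensor.extract_slice %iter[%i, 0] [%ts, %n] [1, 1]
            : tensor<?x?xf32> to tensor<?x?xf32>
  // Compute on the small tile (perfectly nested linalg.generic inside).
  %computed = "some_tile_computation"(%tile) : (tensor<?x?xf32>) -> tensor<?x?xf32>
  // Explicit subset store of the small tile back into the big tensor.
  %updated = tensor.insert_slice %computed into %iter[%i, 0] [%ts, %n] [1, 1]
               : tensor<?x?xf32> into tensor<?x?xf32>
  // The terminator does no magic: it just yields the big tensor.
  scf.yield %updated : tensor<?x?xf32>
}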

scf.parallel, linalg.tiled_loop and “smart” terminators are simply not up to the job we need them for: they sit at an abstraction level that is half-explicit / half-implicit, and that weird symmetry does not compose or generalize well in an imperfectly nested world on tensors.

Big tile, small tile, no tile

You will note here that we seem to have given up on “yielding the small tile” and that the various terminators proposed “yield the big tile”.

It seems this is a trigger point for people, but it is the result of applying the first principles outlined above regarding implicit vs. explicit load/store (and, more importantly, of implementing them in practice and getting a fully functional system up and running).

This is also fundamentally what triggers the race condition if we try to materialize the parallelism in the IR.

In other words, the problem comes from trying to write something like this on tensors:

// Boo races here, races there, races everywhere.
%t = looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
  ...
  whatever_terminator %foo : full_tensor
} -> full_tensor

The simple thing to remark is that one should not try to do that, and should instead carry enough information in the IR to materialize the parallelism after bufferization:

looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
  // yay side-effecting computations on buffers
} 
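
A hedged sketch of what this could look like once fleshed out (memref.subview is a real op, while the enclosing loop op, some_tile_computation, the sizes %ts/%n and the #strided layout placeholder are notional): each processor owns a disjoint subview of the output buffer, so the side effects cannot race.

looping_control_flow_with_parallel_semantics %i = %proc_id, %ub, step = %num_procs {
  // Each iteration touches a disjoint slice of the output buffer.
  %out_tile = memref.subview %out[%i, 0] [%ts, %n] [1, 1]
                : memref<?x?xf32> to memref<?x?xf32, #strided>
  // Side-effecting computation on the buffer tile; no terminator magic needed.
  "some_tile_computation"(%out_tile) : (memref<?x?xf32, #strided>) -> ()
}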

You will note that this is exactly what scf.parallel and linalg.tiled_loop do, while additionally introducing the other undesirable things mentioned above.

You will also note that IREE has decided to introduce side effects early and has a mix of “buffers in the ether accessed by name” at the boundary and tensors inside. This also comes with many implications and tradeoffs that are really out of scope here.

Bottom line: while we should push for tensor<... x tensor<...>>, we also need something like:

%t = looping_control_flow_with_parallel_semantics %i = %lb, %ub, %step 
   {attributes_and_operands_to_specify_how_loop_is_distributed} 
{
  ...
  scf.yield %foo
} -> full_tensor

While we should have an scf.reduce, it is important IMO that the terminator not do any magic, given the first principles above.

At which point the jump to scf.for seems pretty natural to me:

%t = scf.for %i = %lb, %ub, %step 
   {attributes_and_operands_to_specify_how_loop_is_distributed} 
{
  ...
  scf.yield %foo
} -> full_tensor

We could put lipstick on it and call it something else, fine.
I tend to want to avoid duplicating a lot of the canonicalization and folding transformations on scf.for but, as is often the case, maybe this requires new interfaces.

The real fundamental problem

Instead of going down another deep path and starting to talk about transformations, I’ll try to step back and remark that the fundamental problem is much simpler and has not been addressed, IMO.

I claim we do not have a good representation for a simple async construct with regions that operates solely on values; for instance, some notional:

%0 = struct.undef : struct<i32, i32>
%thread1, %thread2 = fork(2) : thread, thread

%r = parallel_region %thread1 (%0) {
  %i = constant 0: i32
  %1 = struct.insertvalue %i, %0[0]: struct<i32, i32>
  yield %1: struct<i32, i32>
} parallel_region %thread2 (%0) {
  %i = constant 1: i32
  %1 = struct.insertvalue %i, %0[1]: struct<i32, i32>
  yield %1: struct<i32, i32>
} : struct<i32, i32>

Going through memory to represent this is the only thing that comes to my feeble mind.
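
For completeness, a hedged sketch of that memory-based fallback, reusing the notional fork / parallel_region ops from above: each thread stores to a disjoint element of a buffer, and the aggregate only becomes a value again after the join.

%buf = memref.alloc() : memref<2xi32>
%thread1, %thread2 = fork(2) : thread, thread

parallel_region %thread1 {
  %i = constant 0 : i32
  %idx = constant 0 : index
  memref.store %i, %buf[%idx] : memref<2xi32>
} parallel_region %thread2 {
  %i = constant 1 : i32
  %idx = constant 1 : index
  memref.store %i, %buf[%idx] : memref<2xi32>
}
// Only after joining can the struct be rebuilt as a value, e.g. by loading
// %buf[0] and %buf[1] and re-inserting them with struct.insertvalue.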

Do people know how to express this in SSA-land without going through side effects?
If we have good ways, I’d expect we’d want to generalize them to work with looping constructs.

Epilogue

In the end, I’d love to see prototypes that run end-to-end for the particular solutions proposed, showing how they handle imperfect nestedness: it is easy to overlook “unknown unknowns” that only surface when e2e prototypes are built.

Thanks for reading this wall of text!