Representing tiling on tensors + parallelism

This is TBD because it really depends on explicit vs implicit and how we want to hide the abstraction gap between “sequential semantics on tensors” and “materialized parallel semantics on buffers”.

In the sequential case on tensors, this is well-defined.
Folks may complain about "insert"-like operations and the analyses they require (which, by the way, also exist everywhere in the vector and llvm dialects), but this is still well-defined.
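
For concreteness, this is what such "insert"-like operations look like on tensors, using the current upstream op names (tensor.extract_slice / tensor.insert_slice); the value names here are illustrative:

// Pure value semantics: insert_slice returns a *new* tensor, so ordinary
// SSA use-def analysis applies, much like vector.insert or llvm.insertvalue.
%tile = tensor.extract_slice %t[%i] [10] [1] : tensor<100xf32> to tensor<10xf32>
%new = tensor.insert_slice %small into %t[%i] [10] [1]
    : tensor<10xf32> into tensor<100xf32>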

In the parallel case, the idea of adding a first-class distribution attribute to scf.for that only materializes after bufferization is one way of attaching distribution information to sequential IR at the tensor level. We are looking for better ones.
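
A purely hypothetical sketch of that idea (the distribution attribute name and payload below do not exist upstream; they only illustrate the mechanism):

// Hypothetical: a discardable attribute records how this loop should be
// distributed once bufferization materializes the parallel form; the
// sequential tensor-level semantics of scf.for are untouched.
%r = scf.for %i = %c0 to %c128 step %c32
    iter_args(%acc = %init) -> (tensor<128xf32>) {
  // ... tiled computation on tensors ...
  scf.yield %next : tensor<128xf32>
} {distribution = "block_id.x"}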

Spoiler: this action at a distance between "unmaterialized parallel IR on tensors" and "materialized parallel IR on buffers" is also present in functional land, from what I understand, at the moment. These are all various forms of "sequential IR + hiding the abstraction gap" that can only resolve after bufferization.

How would one represent a tiled and fused DAG?

It seems that, in this representation, multiple interleaved (subset-compute)+ regions would be needed?

Could you also discuss the multidimensional case, especially with respect to the imperfect-nestedness considerations that I explained above?

I am a bit late to the party but will still throw my two cents in since I’ve been summoned to this thread by explicit mentions.

I am not supportive of extending the scope of the scf.for operation to support distribution or parallelism, precisely for the reason Mahesh likes this operation: "the scf.for is very straight-forward in what it represents". I believe it should remain straightforward, and adding extra semantics makes it less so.

Regarding the difference between scf.for and scf.parallel, there are several aspects:

  1. scf.parallel indicates that its iterations may be executed in parallel (but they may also be sequentialized; it does not prescribe parallel execution or anything else);
  2. scf.parallel is a multi-for nest, intended to guarantee permutability of individual loops;
  3. scf.parallel has explicit support for reductions via scf.reduce regions, whereas scf.for provides secondary induction variables (aka iter_args) that may be used to express reductions but are not necessarily reductions; they can also be used to express, e.g., iterating over parallel containers similarly to llvm::zip, where each iteration increments the pointer carried around as iter_args but the result of the loop is unused (see the sketch just after this list).
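
As a minimal sketch of the iter_args pattern from (3) (the buffer %buf, bounds, and constants are assumed for illustration):

// A sum reduction carried through iter_args: each iteration is guaranteed to
// observe the value yielded by the previous iteration.
%sum = scf.for %i = %c0 to %n step %c1 iter_args(%acc = %cst0) -> (f32) {
  %v = memref.load %buf[%i] : memref<?xf32>
  %next = arith.addf %acc, %v : f32
  scf.yield %next : f32
}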

We can discard (2) for the sake of simplicity and assume we only care about single-for loops.

(1) feels important here: what we seem to need to express is distribution-to-independent-processing-units, or required concurrency, not parallelism-as-independence-of-iterations. For example, one cannot deterministically express a simple signal/wait communication using scf.parallel:

scf.parallel (%i) = (%c0) to (%c2) step (%c1) {
  %first = arith.cmpi eq, %i, %c0 : index
  scf.if %first {
    func.call @wait() : () -> ()
  }
  %second = arith.cmpi eq, %i, %c1 : index
  scf.if %second {
    func.call @signal() : () -> ()
  }
}

as it may or may not deadlock, depending on whether the iterations are distributed to different processing units or not.

This issue is partially orthogonal, but expressing reductions usually requires some sort of communication between threads or other concurrent processing units, and not being able to express it is problematic for modeling.

(3) is what is mostly being discussed in this thread, and it is partially connected to (1) through how the results are recombined. Ignoring what I see as semantic sugar about there being a reduction across loop iterations, the only difference that matters is that scf.for guarantees the iter_args in an iteration have the values yielded by the previous iteration, while scf.parallel only guarantees that the first argument of each scf.reduce region is the value yielded by some other iteration or the init value. It does not prescribe parallel reduction. Note that this does not directly allow for a tree reduction either, because that would require knowing the neutral value, which may or may not be the init value.
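
For reference, the scf.parallel/scf.reduce form of the same reduction looks roughly like this (the exact printed syntax has varied across MLIR versions, so treat it as a sketch):

// The scf.reduce region is a combinator: it receives *some* pair of partial
// values in an unspecified order, unlike the strictly ordered iter_args chain.
%sum = scf.parallel (%i) = (%c0) to (%n) step (%c1) init (%cst0) -> f32 {
  %v = memref.load %buf[%i] : memref<?xf32>
  scf.reduce(%v : f32) {
  ^bb0(%lhs: f32, %rhs: f32):
    %r = arith.addf %lhs, %rhs : f32
    scf.reduce.return %r : f32
  }
}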

The actual compute pattern that we seem to need is map/reduce on tensors. This includes:

  • the tiling/subsetting/data-splitting operation that partitions the inputs for each independent computation;
  • the main computation, with an additional specification of whether it may or must be executed concurrently with other computations;
  • the result combinator (the inverse of the tiling/subsetting/data-splitting operation), with an optional additional indication of how it can be executed concurrently.

There are two simple cases of combinators: computing independent results (what Nicolas refers to as yielding a small tile), or systematically combining results from concurrent computations with the same "accumulator" value (reductions as we have them). Yielding subtensors / small tiles from the operation, like tiled_loop seems to be doing, is just putting the "concatenate" combinator into the operation semantics. Yielding full tensors / big tiles from the loop, like linalg+for is currently doing, is just putting the "overwrite" combinator into the operation semantics (see the sketch below). There are more complex cases, like partially overlapping results that need to be reduced on the boundary but are independent in the middle, which we may want to model one day. I feel that if we have IR modeling at this level, it will be the right abstraction. Ideally, subsetting, parallelism/concurrency, and combinators are designed as independent, reusable constructs.
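
To make the "overwrite" combinator concrete, this is roughly the shape that linalg+for produces today when tiling on tensors (compute payload elided; op names are the current upstream tensor.extract_slice / tensor.insert_slice):

%res = scf.for %i = %c0 to %c100 step %c10
    iter_args(%acc = %init) -> (tensor<100xf32>) {
  %in_tile = tensor.extract_slice %a[%i] [10] [1]
      : tensor<100xf32> to tensor<10xf32>
  // ... compute %out_tile : tensor<10xf32> from %in_tile ...
  // The "overwrite" combinator, baked into the loop: insert the small tile
  // into the loop-carried full tensor.
  %next = tensor.insert_slice %out_tile into %acc[%i] [10] [1]
      : tensor<10xf32> into tensor<100xf32>
  scf.yield %next : tensor<100xf32>
}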


There are always two regions no matter how many producers we fuse.

Example: exponential(add(a, b))
st.loop (%i) = (%c0) to (%dim) step (%c10)
  ins (%a_ = %a: tensor<100xf32>, %b_ = %b: tensor<100xf32>)
  outs (%out_ = %out: tensor<100xf32>)
  subsets {
    %a_sub = subset.tile %a[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    %b_sub = subset.tile %b[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    %out_sub = subset.tile %out[%i][10][1] : tensor<100xf32> to !st.tile<10xf32>
    subset.yield ins (%a_sub: !st.tile<10xf32>, %b_sub: !st.tile<10xf32>)
                 outs (%out_sub: !st.tile<10xf32>)
  }
  compute ins (%a_sub: tensor<10xf32>, %b_sub: tensor<10xf32>)
          outs (%out_sub: tensor<10xf32>) {
    %add = linalg.generic // %a_sub + %b_sub
    %exp = linalg.generic // exp(%add)
    st.yield %exp : tensor<10xf32>
  }
}

Apologies for resurrecting such an old thread; I just wanted to check whether there have been any recent developments / discussions regarding reductions in parallel loops operating on tensors.

Yes, there has been recent progress, and I should be able to share more details in the next few days/weeks (~2 weeks, as I will be travelling a bit).