[RFC] Parallel Abstraction For Tensors and Buffers

[RFC] Parallel Abstraction For Tensors and Buffers

Co-authored by: @apaszke

There is currently no good way of representing IR that involves parallelism with explicit thread/processor ids, tensors types and subset abstractions.
For more context on the discussion and alternatives, see the prior discussion in which alternatives involving scf.for, scf.parallel and the now retired linalg.tile_loop have been discussed at length.

This limits the ability to write transformations that benefit from subset abstractions on tensor SSA values.
As a reminder, structured codegen on tensors has been shown to provide a low-surprise codegen path that runs at close to peak (sequential) performance.

The object of this RFC is to propose a retargetable abstraction that supports parallelism with explicit thread/processor ids and operates on either tensors or buffers.
This will allow scaling sequential structured codegen to the parallel case and be reusable with either the async or the GPU-inspired processor grid models of execution.

Proposed abstraction

At a high-level, the proposed ops specify that a region that is evaluated multiple times in parallel, once per value of the single associated thread_id block argument. It represents a target-independent parallel function application operation.
The operation is not isolated from above and captures every value implicitly.

The proposed abstraction is a combination of atoms proposed during the prior discussion, with additional insights coming from mixing them to build an end-to-end prototype.

The proposed abstraction is called xxx.foreach_thread, with xxx denoting uncertainty about the dialect in which this should live.

Side Effecting Form

A parallel matmul on buffers can be expressed using xxx.foreach_thread as follows:

//
// Sequential context.
//
xxx.foreach_thread (%thread_id_1, %thread_id_2) in (%num_threads_1, %num_threads_2) {
  //
  // Parallel context, each thread with id = (%thread_id_1, %thread_id_2) runs its version of the code.
  //
  // f, g, h represent pseudo-IR, the details of which chunk of matmul is read/written by a 
  // particular thread are irrelevant in this example.
  %sA = memref.subview %A[f((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
  %sB = memref.subview %B[g((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
  %sC = memref.subview %C[h((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
  matmul ins(%sA, %sB) outs(%sC)
}
//
// Sequential context.
//

The details of which chunk of matmul is read/written by a particular thread is irrelevant.

This can be further lowered to materializing parallelism with e.g. nested async dialect as follows:

//
// Sequential context.
//
%group_1 = async.create_group %num_threads_1: !async.group
scf.for %thread_id_1 = 0 to %num_threads_1 step %c1 { 
  %token_1 = async.execute {
    //
    // Parallel context, each thread with id = %thread_id_1 runs its version of the code.
    //
    %group_2 = async.create_group %num_threads_2: !async.group
    scf.for %thread_id_2 = 0 to %num_threads_2 step %c1 { 
      %token_2 = async.execute {
        //
        // Parallel context, each thread with id = %thread_id_2 runs its version of the code.
        //
        %sA = memref.subview %A[f((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
        %sB = memref.subview %B[g((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
        %sC = memref.subview %C[h((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
        matmul ins(%sA, %sB) outs(%sC)
        asyc.yields
      }
      async.add_to_group %token_2, %group_2 : !async.token
    }
    async.await_all %group_2
    async.yield
  }
  async.add_to_group %token_1, %group_1 : !async.token
}
async.await_all %group
//
// Sequential context.
//

Or to a CUDA-like processor-grid execution model:

//
// Sequential context.
//
  ///
  /// Surrounding kernel launch and kernel encapsulation IR omitted.
  /// 
  ... {
    //
    // Parallel context, each thread with id = (%thread_id_1, %thread_id_2) runs its version of the code.
    //
    %thread_id_1 = gpu.block_dim x
    %thread_id_2 = gpu.block_dim y
    %sA = memref.subview %A[f((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
    %sB = memref.subview %B[g((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
    %sC = memref.subview %C[h((%thread_id_1, %thread_id_2))]: memref<?x?xT> to memref<?x?xT>
    matmul ins(%sA, %sB) outs(%sC)
  }
//
// Sequential context.
//

The order of reads and writes to memory is unspecified across iterations.

Note in particular that this form allows representing the simple signal/wait construct, for which it was previously noted that scf.parallel is not expressive enough:

//
// Sequential context.
//
xxx.foreach_thread (%thread_id) in (%c2) {
  //
  // Parallel context, each thread with id = (%thread_id) runs its version of the code.
  //
  scf.if %thread_id == 0 {
    wait();
  }
  scf.if %thread_id == 1 {
    signal();
  }
}
//
// Sequential context.
//

Pure Form

In its pure form, the op has an extensible concurrent terminator region containing explicit parallel_insert ops. The ops themselves do not create new values, rather the terminator yields a concurrently assembled tensor. This allows maintaining a clean separation between the subset and full tensor. The terminator specifies how the results of all parallel invocations should be reconciled into a full value that will be returned from xxx.foreach_thread.
Multi-return values are encoded by including multiple operations inside the xxx.perform_concurrently block.

//
// Sequential context.
//
%matmul_and_pointwise:2 = xxx.foreach_thread %thread_id in %num_threads -> (tensor<?x?xT>, tensor<?xT>) {
  //
  // Parallel context, each thread with id = (%thread_id_1, %thread_id_2) runs its version of the code.
  //
  %sA = tensor.extract_slice %A[f((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[g((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %C[h((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sD = matmul ins(%sA, %sB) outs(%sC)

  %spointwise = subtensor %pointwise[i((%thread_id_1, %thread_id_2))]: tensor<?xT> to tensor<?xT>
  %sE = add ins(%spointwise) outs(%sD)

  xxx.perform_concurrently {
    // First op within the parallel terminator contributes to producing %matmul_and_pointwise#0. 
    xxx.parallel_insert_slice %sD into %C[h((%thread_id_1, %thread_id_2))]: tensor<?x?xT> into tensor<?x?xT>

    // Second op within the parallel terminator contributes to producing %matmul_and_pointwise#1.
    xxx.parallel_insert_slice %spointwise into %pointwise[i((%thread_id_1, %thread_id_2))]: tensor<?xT> into tensor<?xT>
  }
}
//
// Sequential context.
//

The op does not use the notion of iter_args or init, instead the xxx.perform_concurrently captures tensor values that are “inserted into”.
Similarly to the result of tensor.insert_slice, the resulting tensor is a new tensor with the same value as the into tensor except in the places updated by the xxx.parallel_insert_slice ops.
The order of the parallel updates is unspecified.

From a bufferization perspective, the op is considered to be in destination-passing style where a result #n and the #n's parallel_insert operation destination are “tied”.
Bufferization is guaranteed to occur inplace: in the example above, the buffer for %matmul_and_pointwise#0 is bufferized to the same buffer as for %C. If any external conflicts occur (e.g. RAW conflict on %Cneeded after the end of execution of the parallel op) then appropriate clones of %C will be introduced by bufferization.

Internal conflicts (e.g. overlapping xxx.parallel_insert_slice) are legitimate races that will turn into side-effecting races after bufferization.

Nested Forms

The following illustrates a 2-level 1-D parallel version with xxx.foreach_thread:

//
// Sequential context.
//
%7 = xxx.foreach_thread %thread_id_1 in %c125 -> (tensor<250x1020xf32>) {
  %8 = affine.apply affine_map<(d0) -> (d0 * 2)>(%thread_id)
  %10 = tensor.extract_slice %6[%8, 0] [2, 1020] [1, 1] : tensor<250x1020xf32> to tensor<2x1020xf32>
  %11 = xxx.foreach_thread %thread_id_2 in %c255 -> (tensor<2x1020xf32>) {
    %13 = affine.apply affine_map<(d0) -> (d0 * 4)>(%%thread_id_2)
    %15 = tensor.extract_slice %10[0, %13] [2, 4] [1, 1] : tensor<2x1020xf32> to tensor<2x4xf32>
    %16 = tensor.extract_slice %3[%8, 0] [2, 500] [1, 1] : tensor<250x500xf32> to tensor<2x500xf32>
    %17 = tensor.extract_slice %4[0, %13] [500, 4] [1, 1] : tensor<500x1020xf32> to tensor<500x4xf32>
    %18 = linalg.matmul ins(%16, %17 : tensor<2x500xf32>, tensor<500x4xf32>) outs(%15 : tensor<2x4xf32>) -> tensor<?x?xf32>
    xxx.perform_concurrently {
      xxx.parallel_insert_slice %18 into %10[0, %13] [2, 4] [1, 1] : tensor<2x4xf32> into tensor<2x1020xf32>
    }
  }
  xxx.perform_concurrently {
    xxx.parallel_insert_slice %11 into %6[%8, 0] [2, 1020] [1, 1] : tensor<2x1020xf32> into tensor<250x1020xf32>
  }
}
//
// Sequential context.
//

After inplace bufferization, 1-D slices properly combine into a single 2-D slice, and the IR becomes:

//
// Sequential context.
//
xxx.foreach_thread %thread_id in %c125 -> () {
  %3 = affine.apply affine_map<(d0) -> (d0 * 2)>(%arg0)
  %5 = memref.subview %2[%3, 0] [2, 1020] [1, 1] : memref<250x1020xf32> to memref<2x1020xf32, affine_map<(d0, d1)[s0] -> (d0 * 1020 + s0 + d1)>>
  xxx.foreach_thread %thread_id_2 in %c255 -> () {
    %6 = affine.apply affine_map<(d0) -> (d0 * 4)>(%arg1)
    %8 = memref.subview %5[0, %6] [2, 4] [1, 1] : memref<2x1020xf32, affine_map<(d0, d1)[s0] -> (d0 * 1020 + s0 + d1)>> to memref<2x4xf32, affine_map<(d0, d1)[s0] -> (d0 * 1020 + s0 + d1)>>
    %9 = memref.subview %0[%3, 0] [2, 500] [1, 1] : memref<250x500xf32> to memref<2x500xf32, affine_map<(d0, d1)[s0] -> (d0 * 500 + s0 + d1)>>
    %10 = memref.subview %1[0, %6] [500, 4] [1, 1] : memref<500x1020xf32> to memref<500x4xf32, affine_map<(d0, d1)[s0] -> (d0 * 1020 + s0 + d1)>>
    linalg.matmul ins(%9, %10 : memref<2x500xf32, affine_map<(d0, d1)[s0] -> (d0 * 500 + s0 + d1)>>, memref<500x4xf32, affine_map<(d0, d1)[s0] -> (d0 * 1020 + s0 + d1)>>) outs(%8 : memref<2x4xf32, affine_map<(d0, d1)[s0] -> (d0 * 1020 + s0 + d1)>>)
  }
}
//
// Sequential context.
//

By virtue of memref.subview operations—and more generally pointer arithmetic—composing, we obtain N-D slices of data and compute after bufferization.

Multiple nested 1-D xxx.foreach_thread may be transformed into a single flattened n-D xxx.foreach_thread.
This process may require computation duplication when the nested 1-D form is not perfectly nested and is generally not a straightforward transformation.

Observations

Some observations surfaced during the design of these abstractions and during the prior discussion:

  • Abstractions related to parallelism, distribution, loops and sub-tensor-SSA values only truly materialize post bufferization. The exercise is to invent representations on tensors and sub-tensor-SSA values that allow transformations to be specified and carried past bufferization without inconsistencies or abstraction gaps.
  • Extra care about the full/sub-tensor type of operands is key and has historically been limited.
  • A thread-first parallel iteration model is deemed important (i.e. thread ids explicit and visible). In the current incantations of MLIR, the movement from data-first to thread-first happens late in the pipeline (e.g. when lowering to GPU). New abstractions expose these concepts earlier and analyze/preserve parallelism the same way early vectorization preserves vector types. This will also avoid pigeonholing to existing dense and regular abstractions; dynamic sizes, load-balancing and future types (e.g. lists) must not be precluded. This requires going beyond a (lower bound, upper bound, step)-representation and embracing a thread-first representation.
  • In addition to enabling more parallel codegen, we expect such an explicit thread-first parallel iteration model to be a first step towards distributed tensors.

Question

What xxx dialect is appropriate for such a thread-first parallel construct ?

@MaheshRavishankar @apaszke @ftynse

6 Likes

I have a couple of questions:

  1. Just to clarify my understanding, this models concurrency / required parallelism?
  2. This mentions extensibility of the terminator, but only discusses parallel_insert_slice. What are the other possible combinators? In particular, can there be some combinators that require non-trivial reconciliation such as adding co-indexed elements (think outer-product + parallel-add reduction to implement matmul)?
  3. Why is the perform_concurrently terminator necessary and why it has to be a terminator? It looks like the combination semantics are those of parallel_insert_slice that feels like it can be placed just anywhere in the foreach_thread body.

Regarding the dialect, foreach_thread sounds like it could be in the scf dialect since it is a control flow construct. parallel_insert_slice isn’t as good of a fit there though. I am pondering the idea of having a parallel dialect that would include scf.parallel as parallel.for + these constructs.

1 Like

I didn’t get this from my reading, so it is worth making it clear: is a sequential implementation (that is: a machine with a single core executing this with a single thread) always legal? If not: why?

I’m also interested in how any kind of reduction would work?

How does it relate to the work on the OpenMP resp. OpenAcc dialects?

Thanks Nicolas!

Is there a branch or patch somewhere where the prototype is accessible?

For the naming and xxx part, I have a couple (probably naive) questions.

  1. Why wouldn’t it go in the SCF dialect?
  2. Does it have to be an Operation or can it be an Interface? For example, the GPU dialect has similar semantics for its Operation that represents a kernel function, but it has also has some extensions. So it makes sense that one would want to use the idea for different kinds of operations with small differences.
  3. Related to 2, the _thread in the name seems to clash with how one might want to think about what the operation is representing (e.g. mapping over blocks vs subgroups vs threads on a GPU).

Thanks for the proposal, Nicolas! A few questions for my understanding:

  • Is also one of the goals to provide a common landing pad for the implementation of multiple parallel constructs (worksharing, tasks, nd-range, etc.)? If so, this could be used as a common layer for different programming models (OpenMP, OpenAcc, OpenCL, etc.).

  • I infer that the proposed ops will be very specific to thread creation/forking execution. They won’t abstract away details about iteration space distribution or data visibility/ownership (shared, private, etc.) among threads. This makes the level of abstraction relatively low and explicit, right?

  • What about thread synchronization? Do we need new ops to describe barriers and groups of threads that can be synchronized? Could we have a barrier within the op region? Is there an implicit barrier at the end of the region?

IIUC, in addition to the duplicated computation, the nested example illustrates a case of hierarchical parallelism (i.e., every independent thread in the first level of parallelism will fork and create a second level of threads), which would model something significantly different to the flattened counterpart, isn’t that the case?

+1. It would be great if we finally had a powerful model for reductions that includes arbitrary initializers and combiners.

+1. That makes sense to me.

The op is aa abstraction to enable transformations on tensors with an explicit parallel context.
Implication on transformations relate to crossing the sequential / parallel boundary; for instance, hoisting a temporary buffer allocated within the op out of the parallel context requires allocating a copy per thread.
There is no strictly required parallelism and the abstraction does not prevent one from shooting oneself in the foot. Lowering the wait/signal example to a sequential loop will deadlock.

This proposal does not address reductions or general representations of parallelism. We have multiple lower-level representations, all working on buffers (async, omp, gpu dialects). The proposal is directed at bridging the abstraction gap that prevents us from representing parallelism + subset + tensors, as was discussed previously.

Regarding parallel reductions, we had discussed offline with @dcaballe and @ftynse adding a reduction operation that does not depend on an enclosing op; this proposal does not touch on this.

The terminator extensibility mention is related to the fact that these can also work with other types that will require other op spellings to combine in parallel (e.g. sparse), not to different combining semantics.

This goes back to the discussion on full and partial tensors: the operation needs to yield full tensors and compute partial tensors internally. parallel_insert_slice collectively produce a full tensor but none of the instances returns a full tensor to the local thread: threads never have access to a partially updated full tensor. I am not sure what else than a terminator is appropriate to encode such behavior, do you have a suggestion as to some alternative way of spelling this?

I am not sure how that would work with e.g. surrounding control flow, encapsulating in a terminator with strict rules seems safer to me.

Similarly to the way it relates to async and gpu dialects, it can lower to them.

This is prototyped in IREE, here is a test (it is called linalg_ext.in_parallel there) and has been connected to e2e parallel execution in the sandbox.

It could go there or in other places that other suggested, no strong opinion on the location on my end.
A parallel dialect is interesting but it seems a significantly larger endeavor than the operation in this RFC.

In the prototype impl, the op is call in_parallel, I am not hung up on the name and am happy to change it.

Not at this point: this seems significantly larger scope than what is proposed here and I am not sure how to spell these in tensor-land. I have found that there is a very fine line on can tread to build abstractions that connect end-to-end from tensors to LLVM.

It is still a retargetable abstraction so it still sits on top of async and similar dialects. The thread “creation/forking execution” are abstracted away until lowered into a particular parallelism implementation dialect.

There is an implicit barrier at the end of the region but no barriers or synchronization within the region: the tensor values do not escape and partial updates or temporary tensor values are not visible across threads. I do not know that such synchronization concepts that are very dependent on side effects and memory translate to tensor-land.

I think that reductions should fit into the same framework very naturally. Instead of performing parallel_insert_slice, you could have a reduce operation in the perform_concurrently terminator, and have the overall foreach_thread return an extra output that contains the fully reduced value. The code idea here is that perform_concurrently specifies a contribution of every thread to the final output, but crucially in a way which doesn’t allow any single thread to see a partially formed value. For every op in the perform_concurrently block, there is an extra output from foreach_thread, representing the final value.

I think that with lowering to async and LLVM coroutines it can actually work even in a sequential case, wait will be just “suspend and yield” and signal will trigger the continuation. Unlikely will work out of the box today, but conceptually should be doable.

Should threads have different “thread spaces” (similar to memref memory space)? E.g. on CPU there is just one “thread space” == CPU thread, on GPU there a blocks and threads and even sub-groups (Cooperative Groups: Flexible CUDA Thread Programming | NVIDIA Technical Blog)

How would thread synchronization look like for cooperative groups and blocks?

1. Implicit?

xxx.foreach_thread %block in %blocks { 
 xxx.foreach_thread %group in %groups {
   xxx.foreach_thread %thread in %threads {
   }
   // Implicit barrier: synchronize group
   xxx.foreach_thread %thread in %threads {
   }
 }
 // Implicit barrier: synchronize block
 xxx.foreach_thread %group in %groups {
 }
}

2. Explicit?

xxx.foreach_thread %block, %group, %thread in (%blocks, %groups, %threads) {
  // Synchronizes all threads within a group
  xxx.barrier %block, %group : !xxx.thread_id<space=0>, !xxx.thread_id<space=1>
  // Synchronizes all threads within a block
  xxx.barrier %block : !xxx.thrread_id<space=0>
}

3. Something else?

UPD:

I guess because threads do not have access to each others state, it should be implicit as in (1), explicit synchronization primitives don’t make any sense in tensor land with immutable SSA values.

Then it is desirable to specify that the only combination semantics that is supported is generalized concatenation (i.e. there is no possibility to see two “small pieces” coming from different threads at the same time), though it is up to types to define what concatenation means for them.

Let me unpack a bit. The foreach_thread operation needs to produce full tensors while its body only has partial tensors. That is totally fine. However, parallel_insert_slice does not produce anything, and neither does perform_concurrently. The former is just a regular operation with operands, one of which is the partial tensor. The semantics as I read them are as follows: parallel_insert_slice records that a partial tensor is to be inserted into some full tensor, but doesn’t perform its actual insertion; the semantics of perform_concurrently is the union of all nested parallel_insert_slice, still without the actual insertion, plus the control flow transfer back to the parent op; finally, it’s the foreach_thread that actually performs all insertions concurrently after it obtains the control back from its body. There may be a different interpretation that says perform_concurrently actually performs the parallel insertion, forms full tensors and then forwards them to the parent foreach_thread. This requires perform_concurrently to “communicate” across threads, and potentially know about the types of values it should produce.

The alternative is to drop perform_concurrently and just have the “record+delay” semantics of parallel_insert_slice with foreach_thread handling the execution of insertions. Something like

%matmul_and_pointwise:2 = xxx.foreach_thread %thread_id in %num_threads -> (tensor<?x?xT>, tensor<?xT>) {
  %sA = tensor.extract_slice %A[f((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[g((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %C[h((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sD = matmul ins(%sA, %sB) outs(%sC)

  %spointwise = subtensor %pointwise[i((%thread_id_1, %thread_id_2))]: tensor<?xT> to tensor<?xT>
  %sE = add ins(%spointwise) outs(%sD)

  xxx.parallel_insert_slice %sD into %C[h((%thread_id_1, %thread_id_2))]: tensor<?x?xT> into tensor<?x?xT>
  xxx.parallel_insert_slice %spointwise into %pointwise[i((%thread_id_1, %thread_id_2))]: tensor<?xT> into tensor<?xT>
}

or even

%matmul_and_pointwise:2 = xxx.foreach_thread %thread_id in %num_threads -> (tensor<?x?xT>, tensor<?xT>) {
  %sA = tensor.extract_slice %A[f((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sB = tensor.extract_slice %B[g((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sC = tensor.extract_slice %C[h((%thread_id_1, %thread_id_2))]: tensor<?x?xT> to tensor<?x?xT>
  %sD = matmul ins(%sA, %sB) outs(%sC)
  xxx.parallel_insert_slice %sD into %C[h((%thread_id_1, %thread_id_2))]: tensor<?x?xT> into tensor<?x?xT>

  %spointwise = subtensor %pointwise[i((%thread_id_1, %thread_id_2))]: tensor<?xT> to tensor<?xT>
  %sE = add ins(%spointwise) outs(%sD)
  xxx.parallel_insert_slice %spointwise into %pointwise[i((%thread_id_1, %thread_id_2))]: tensor<?xT> into tensor<?xT>
}

is perfectly okay in my mind. perform_concurrently doesn’t seem to add any meaningful semantics as currently proposed. It does give some convenience by grouping combinator ops in one region (block? unclear if its region is single-block), and a conceptual cue that insertions might happen in an arbitrary order, but that’s all it does.

I see the point about control flow, but perform_concurrently does not restrict its body in any way. One can have loops and functions calls that have parallel_insert_slice inside with the same effect as having loops with parallel_insert_slice inside the body of foreach_thread with no terminator. Furthermore, this particular restriction can be implemented by parallel_insert_slice itself requiring to have foreach_thread as immediate parent.

Those rules need to be specified. It then becomes possible to consider whether they can only be implemented by a terminator.

1 Like

An example would be helpful…

IMO, it’s not perform_concurrently itself that specifies the contribution of each thread, but the ops contained in it. Otherwise, I would expect it to take the tensors representing contributions as operands, for example. Rather, the “reduction” operation is taken then. Usually, reductions are binary functions, that’s why I’d like to understand how exactly it is supposed to be represented here given the clear desire not to give access to the partial value / accumulator.

1 Like

I imagine it would look like this:

// assuming func @sum(f32, f32) -> f32, and func @prod(i32, i32) -> i32
%sum_netural = constant 0.0 : f32
%prod_neutral = constant 1 : i32
%sum, %prod = xxx.foreach_thread %thread_id in %num_threads -> (f32, i32) {
  %thread_partial_sum = ... : f32
  %thread_partial_prod = ... : i32
  xxx.perform_concurrently {
    xxx.parallel_reduce %sum_neutral @sum %thread_partial_sum : f32
    xxx.parallel_reduce %prod_neutral @prod %thread_partial_prod : i32
  }
}

The issue with this IR is that @sum and @prod are referencing the result of the overall operation, which violates ssa properties. In the design of reduction in scf.parallel (which unfortunately still does not have great tensor level semantics) we had the same issue and decided to use the order in which the scf.reduce operations appear. Likely not ideal but similar in nature. scf.reduce also carried a body with the actual reduction function, so that you know how to combine the partial sums.

An alternative would be to not model parallel reductions, at all. Instead, one can rewrite the program into one that uses sequential reductions within a parallel xxx.foreach_thread that computes a vector of partial sums. The final reduction of said vector could then be a simple sequential loop. That encodes how parallel reduction is implemented but given that the foreach_thread primitive is fairly low-level wrt. concurrency, making the approach explicit seems fine.

Another interesting case to think about is a scatter operation with conflict resolution or more general patterns that do (partial) updates to a tensor. These are either non-deterministic or need access to previous values of the partially updated tensor. I think that a specialized terminator (or similar) is the way to go in that case. Whenever possible, I’d try to not give it access to the full tensor that is being produced but limiting to specific access patterns. That makes later lowering to atomics etc. easier.

And, supporting @ftynse questions, we did not use terminators in scf.parallel, instead relying on the order of operations within the body region and immediate nesting. As long as the partial updates are not visible and as long as there is a single update operation per result, there is no real difference between the two approaches. In either case, their effect is only visible once the entire parallel operation completes. (In case of the scan example, once the current thread of compute completes but that does not matter much as ordering between threads is undefined).

@pifon2a FYI

The issue with this IR is that @sum and @prod are referencing the result of the overall operation, which violates ssa properties.

Sorry, I don’t understand that. Can you please elaborate? The “result of the overall operation” is stored in %sum and %prod, which is not in scope in @sum/@prod or at parallel_reduces, so I don’t see the violation.

An alternative would be to not model parallel reductions, at all.

I don’t think that would be a good outcome. This is meant to be useful for representing tiling. Now, if we’re tiling a “map” dimension, then parallel_insert_slice is the natural operator to target. But if you tile a “reduce” dimension, then you need to model reductions.

I agree they can later get lowered to another form, such as the sequential loop you suggested, but that shouldn’t be necessary.

Another interesting case to think about is a scatter operation with conflict resolution or more general patterns that do (partial) updates to a tensor. … I think that a specialized terminator (or similar) is the way to go in that case. … Whenever possible, I’d try to not give it access to the full tensor that is being produced but limiting to specific access patterns.

Completely agree. The two last sentences I included above are precisely the main design principles behind this primitive.

And, supporting @ftynse questions, we did not use terminators in scf.parallel , instead relying on the order of operations within the body region and immediate nesting. …

I like the terminator approach, and especially its name perform_concurrently, because it indicates that the effects happen in any order. In particular, there is no sequencing even within a single thread, which seems like a natural interpretation when the ops are ordered within a single parallel block. So while both representations might target the same eventual semantics, I like the explicitness of this approach.

You are right. I misread the @sum for a %sum and did not notice the comment on top.

Even for tiling, you could make the strategy of tiling the reduction (i.e. how to further reduce partial sums) explicit. What do you gain from expressing this with the presented construct? For the parallel case, I see the benefit that the construct adds the ability to produce a tensor from components with clean semantics. Reductions do not have that issue.

I am not arguing against designing it that way, I am curious to understand the motivation as we lean towards modeling tiled reductions explicitly.

I don’t quite get why this is relevant in a value-based semantics (tensor level)? None of this seems observable to me in any case.

You may want to take a look at how reductions are modeled in the OpenMP dialect then - 'omp' Dialect - MLIR. In short, it’s a special declaration construct (you can still call a function from within it) that connects the combinators and the neutral value.

I am having a hard time reconciling the reduction functions that expectedly have two arguments with the claimed desire to avoid providing access to the partially-reduced value. At least one of the function operands will be just that in some of the calls. It is possible to underspecify the op in a way that makes the access to the partial value useless (e.g., there is no requirement for there to be an accumulator, or for it to be passed at a specific argument position, to allow for things like tree reductions), but it can be accessed.

So far, it sounds like this boils down to esthetics more than to pragmatics. Using the terminator as a hook for additional restrictions (one can only attach the verifier to the op) on what can go inside, such as only the ops that do not produce results or ops that implement some “parallel combiner” interface, or alternatively disallow the ops inside to use values defined in the same region so there is no value communication possible, would make having the terminator more compelling. So would proposing a second terminator, even if it is not going to be implemented initially. Readability of the textual IR is a secondary concern and even accessing the operations from the flat form would be shorter (foreachThreadOp.getBody().getOps<ParallelInsertSliceOp>() than from the nested form (foreachThreadOp.getBody().getTerminator().cast<PerformConcurrently>().getBody().without_terminator()).

One win of not making this explicit is that foreach_thread can always be sequentialized afterwards. That is, in the pure form foreach_thread can express potential parallelism, not guaranteed parallelism.

Right, lack of observability is actually a good argument for the terminator approach IMO. Otherwise, if you put ops such as parallel_insert_slice in the body of foreach_thread then:

  • if they don’t return anything, then how do you even get your hands on the result? Should the fact that ops are merely present in the body block, but don’t terminate it change the signature of the surrounding op?
  • if they do return something, what does this value represent? Is it a partially updated initial value? How do those partial values get reconciled into a full view after the parallel op ends?

Sure, I’m never claimed that the representation I suggested is the only valid one. You could easily use the OpenMP way with foreach_thread as well.

If you want to customize reductions, then there’s no way around providing a partial accumulator to the reduction function one way or another. You can make a similar argument that if someone gives you a non-associative function you’ll similarly do something weird, but that’s just an assumption this op makes. For as long as the reducer is pure and associative, it should work well. (As a sidenote, in Dex we have an effect system that truly constraints the reducer to be pure, eliminating the need for this discussion.)

There’s always a subjective component in IR design, so I agree to disagree. Ultimately you could just scrape all this MLIR nonsense and go straight to assembly, since that’s the most pragmatic form of representing computation. :wink:

As far as I understand the whole idea behind incremental lowering is to make small steps with well defined intermediate stages, and I’d argue that foreach_thread is a nice way to represent one of those stages. And scf.parallel (or a slight modification of it) might be as well!

1 Like

The latter is much more efficient though.