The op is an abstraction that enables transformations on tensors with an explicit parallel context.
The implications for transformations relate to crossing the sequential/parallel boundary; for instance, hoisting a temporary buffer allocated within the op out of the parallel context requires allocating one copy per thread.
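As a rough illustration of that boundary crossing, here is a sketch on buffers (the `foreach_thread` syntax is abbreviated and the spellings are illustrative, not the exact RFC form):

```mlir
// Before hoisting: each thread instance allocates its own temporary.
scf.foreach_thread (%tid) in (%num_threads) {
  %tmp = memref.alloc() : memref<16xf32>  // private scratch buffer
  // ... compute into %tmp ...
  memref.dealloc %tmp : memref<16xf32>
}

// After hoisting the allocation out of the parallel context, a single
// buffer would race; one copy per thread must be allocated instead.
%tmps = memref.alloc(%num_threads) : memref<?x16xf32>
scf.foreach_thread (%tid) in (%num_threads) {
  %tmp = memref.subview %tmps[%tid, 0] [1, 16] [1, 1]
      : memref<?x16xf32> to memref<16xf32, strided<[1], offset: ?>>
  // ... compute into %tmp ...
}
memref.dealloc %tmps : memref<?x16xf32>
```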
Parallelism is not strictly required, and the abstraction does not prevent one from shooting oneself in the foot: lowering the wait/signal example to a sequential loop will deadlock.
**@ftynse:**
can there be some combinators that require non-trivial reconciliation such as adding co-indexed elements (think outer-product + parallel-add reduction to implement matmul)?
**@mehdi_amini:**
I’m also interested in how any kind of reduction would work?
**@dcaballe:**
+1. It would be great if we finally had a powerful model for reductions that includes arbitrary initializers and combiners.
This proposal does not address reductions or general representations of parallelism. We have multiple lower-level representations, all working on buffers (the `async`, `omp`, and `gpu` dialects). The proposal is directed at bridging the abstraction gap that prevents us from representing parallelism + subset + tensors, as was discussed previously.
Regarding parallel reductions, we had discussed offline with @dcaballe and @ftynse adding a reduction operation that does not depend on an enclosing op; this proposal does not touch on this.
The terminator extensibility mention is related to the fact that these can also work with other types that will require other op spellings to combine in parallel (e.g. sparse), not to different combining semantics.
**@ftynse:**
Why is the `perform_concurrently` terminator necessary, and why does it have to be a terminator? It looks like the combination semantics are those of `parallel_insert_slice`, which feels like it can be placed just anywhere in the `foreach_thread` body.
This goes back to the discussion on full and partial tensors: the operation needs to yield full tensors and compute partial tensors internally. The `parallel_insert_slice` ops collectively produce a full tensor, but none of the instances returns a full tensor to the local thread: threads never have access to a partially updated full tensor. I am not sure what else than a terminator is appropriate to encode such behavior; do you have a suggestion for an alternative way of spelling this?
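For reference, a minimal sketch of how the terminator encodes this (the syntax is abbreviated from the RFC; names like `%dest` and the exact op spellings are illustrative):

```mlir
// Each of the %num_threads instances computes a partial tensor and
// contributes it through the terminator; the full tensor<64xf32> only
// exists as the op's result, after the implicit barrier.
%result = scf.foreach_thread (%tid) in (%num_threads) -> tensor<64xf32> {
  %off = affine.apply affine_map<(d0) -> (d0 * 4)>(%tid)
  // ... compute a partial tile %partial : tensor<4xf32> ...
  // Terminator: all instances collectively assemble the full result
  // in %dest, but no instance ever observes a partially updated value.
  scf.foreach_thread.perform_concurrently {
    scf.foreach_thread.parallel_insert_slice %partial into %dest[%off] [4] [1]
        : tensor<4xf32> into tensor<64xf32>
  }
}
```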
**@ftynse:**
It looks like the combination semantics are those of `parallel_insert_slice`, which feels like it can be placed just anywhere in the `foreach_thread` body.
I am not sure how that would work with e.g. surrounding control flow, encapsulating in a terminator with strict rules seems safer to me.
**@tschuett:**
How does it relate to the work on the OpenMP and OpenACC dialects?
Similarly to the way it relates to the `async` and `gpu` dialects: it can lower to them.
**@cbate:**
Is there a branch or patch somewhere where the prototype is accessible?
This is prototyped in IREE; here is a test (it is called `linalg_ext.in_parallel` there), and it has been connected to e2e parallel execution in the sandbox.
**@cbate:**
Why wouldn’t it go in the SCF dialect?
It could go there or in other places that others suggested; no strong opinion on the location on my end.
A `parallel` dialect is interesting, but it seems a significantly larger endeavor than the operation in this RFC.
**@cbate:**
Related to 2, the `_thread` in the name seems to clash with how one might want to think about what the operation is representing (e.g. mapping over blocks vs subgroups vs threads on a GPU).
In the prototype impl, the op is called `in_parallel`; I am not hung up on the name and am happy to change it.
**@dcaballe:**
Is it also one of the goals to provide a common landing pad for the implementation of multiple parallel constructs (worksharing, tasks, nd-range, etc.)?
Not at this point: this seems significantly larger in scope than what is proposed here, and I am not sure how to spell these in tensor-land. I have found that there is a very fine line one can tread to build abstractions that connect end-to-end from tensors to LLVM.
**@dcaballe:**
I infer that the proposed ops will be very specific to thread creation/forking execution. They won’t abstract away details about iteration space distribution or data visibility/ownership (shared, private, etc.) among threads. This makes the level of abstraction relatively low and explicit, right?
It is still a retargetable abstraction, so it still sits on top of `async` and similar dialects. Thread creation/forking execution is abstracted away until lowered into a particular parallelism implementation dialect.
**@dcaballe:**
What about thread synchronization? Do we need new ops to describe barriers and groups of threads that can be synchronized? Could we have a barrier within the op region? Is there an implicit barrier at the end of the region?
There is an implicit barrier at the end of the region, but no barriers or synchronization within the region: the tensor values do not escape, and partial updates or temporary tensor values are not visible across threads. I do not know that such synchronization concepts, which depend heavily on side effects and memory, translate to tensor-land.