[RFC] Parallel Abstraction For Tensors and Buffers

The op is an abstraction to enable transformations on tensors with an explicit parallel context.
The implications for transformations relate to crossing the sequential/parallel boundary; for instance, hoisting a temporary buffer allocated within the op out of the parallel context requires allocating one copy per thread (sketched below).
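For illustration, here is a minimal sketch of the hoisting implication, in hypothetical syntax loosely following the prototype (op name, shapes and sizes are illustrative only):

```mlir
// Before: the temporary is allocated inside the parallel region, so there
// is one logical instance per thread.
%r = in_parallel %num_threads -> tensor<128xf32> {
^bb0(%tid : index):
  %tmp = linalg.init_tensor [32] : tensor<32xf32>
  // ... compute a partial result into %tmp, then combine via the terminator.
}

// After hoisting: the allocation gains a leading thread dimension and each
// thread extracts its private slice.
%tmps = linalg.init_tensor [%num_threads, 32] : tensor<?x32xf32>
%r = in_parallel %num_threads -> tensor<128xf32> {
^bb0(%tid : index):
  %tmp = tensor.extract_slice %tmps[%tid, 0] [1, 32] [1, 1]
      : tensor<?x32xf32> to tensor<32xf32>
  // ... same body as before.
}
```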
Parallel execution is not strictly required, and the abstraction does not prevent one from shooting oneself in the foot: lowering the wait/signal example to a sequential loop will deadlock, since an iteration blocking on a signal that only a later iteration would produce can never complete.

This proposal does not address reductions or general representations of parallelism. We have multiple lower-level representations, all working on buffers (async, omp, gpu dialects). The proposal is directed at bridging the abstraction gap that prevents us from representing parallelism + subset + tensors, as was discussed previously.

Regarding parallel reductions, we discussed offline with @dcaballe and @ftynse adding a reduction operation that does not depend on an enclosing op; this proposal does not touch on that.

The mention of terminator extensibility relates to the fact that these ops can also work with other types (e.g. sparse tensors) that will require other op spellings to combine results in parallel; it does not refer to different combining semantics.

This goes back to the discussion on full and partial tensors: the operation needs to yield full tensors while computing partial tensors internally. The parallel_insert_slice ops collectively produce a full tensor, but no single instance returns a full tensor to its local thread: threads never have access to a partially updated full tensor. I am not sure what, other than a terminator, is appropriate to encode such behavior (see the sketch below); do you have a suggestion for an alternative way of spelling this?
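For concreteness, a minimal sketch of what the region and terminator could look like (hypothetical syntax loosely following the prototype; the terminator name perform_concurrently, the 1-d distribution and the "compute" placeholder are illustrative):

```mlir
%res = in_parallel %num_threads -> tensor<128xf32> {
^bb0(%tid : index):
  %off = affine.apply affine_map<(d0) -> (d0 * 32)>(%tid)
  %slice = tensor.extract_slice %init[%off] [32] [1]
      : tensor<128xf32> to tensor<32xf32>
  // Each thread only ever sees its own partial tensor.
  %partial = "compute"(%slice) : (tensor<32xf32>) -> tensor<32xf32>
  // The terminator collectively assembles the full result; no SSA value
  // for a partially updated tensor<128xf32> exists inside the region.
  perform_concurrently {
    parallel_insert_slice %partial into %init[%off] [32] [1]
        : tensor<32xf32> into tensor<128xf32>
  }
}
```

The key property is that the full tensor only exists as the op's result, after the implicit barrier; inside the region there is no SSA value a thread could use to observe another thread's updates.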

I am not sure how that would work with, e.g., surrounding control flow; encapsulating the combining ops in a terminator with strict rules seems safer to me.

Similarly to the way it relates to async and gpu dialects, it can lower to them.
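As a sketch: after bufferization, one plausible lowering is to scf.parallel, from which the existing async-parallel-for and parallel-loops-to-gpu conversions can take over (the "compute" placeholder and the 1-d distribution are assumptions for illustration):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
// %out is the bufferized result; %num_threads comes from the op.
scf.parallel (%tid) = (%c0) to (%num_threads) step (%c1) {
  %off = affine.apply affine_map<(d0) -> (d0 * 32)>(%tid)
  // parallel_insert_slice becomes an in-place write into a subview of the
  // shared output buffer; the implicit barrier becomes the end of the loop.
  %view = memref.subview %out[%off] [32] [1]
      : memref<128xf32> to memref<32xf32, strided<[1], offset: ?>>
  "compute"(%view) : (memref<32xf32, strided<[1], offset: ?>>) -> ()
}
```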

This is prototyped in IREE; here is a test (the op is called linalg_ext.in_parallel there), and it has been connected to end-to-end parallel execution in the sandbox.

It could go there or in one of the other places that others suggested; I have no strong opinion on the location.
A parallel dialect is interesting, but it seems a significantly larger endeavor than the operation in this RFC.

In the prototype implementation the op is called in_parallel; I am not hung up on the name and am happy to change it.

Not at this point: this seems significantly larger in scope than what is proposed here, and I am not sure how to spell these in tensor-land. I have found that there is a very fine line one can tread to build abstractions that connect end-to-end from tensors to LLVM.

It is still a retargetable abstraction, so it sits on top of async and similar dialects. Thread creation and forking of execution are abstracted away until lowering into a particular parallelism implementation dialect.

There is an implicit barrier at the end of the region but no barriers or synchronization within the region: the tensor values do not escape, and partial updates or temporary tensor values are not visible across threads. I do not know that such synchronization concepts, which are very dependent on side effects and memory, translate to tensor-land.