[RFC] Parallel Abstraction For Tensors and Buffers

You’re right that the design of this RFC concerns a simple 1-1 mapping of scf.foreach_thread without cyclic/block-cyclic concerns and focuses on getting the tensor + parallelism semantics to work e2e without abstraction gaps.

On my end I should be able to start experimenting with these questions in ~2 weeks.

The way I was thinking of evolving this is by folding the “virtual” thread indices created by scf.foreach_thread onto fewer physical threads when lowering the op away. The folding function could then be quite general, but this would probably need to happen after bufferization.
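To make the folding idea concrete, here is a minimal Python sketch of two possible folding functions (cyclic and blocked) that assign virtual thread indices to a smaller set of physical threads. This is only an illustration of the mapping, not MLIR code and not the actual lowering: the names `fold_cyclic`, `fold_blocked`, and `run_folded` are hypothetical, and the physical threads are emulated sequentially.

```python
def fold_cyclic(num_virtual, num_physical):
    """Cyclic folding: physical thread p runs virtual ids p, p + P, p + 2P, ..."""
    return [list(range(p, num_virtual, num_physical))
            for p in range(num_physical)]


def fold_blocked(num_virtual, num_physical):
    """Blocked folding: each physical thread runs one contiguous chunk."""
    chunk = -(-num_virtual // num_physical)  # ceiling division
    return [list(range(p * chunk, min((p + 1) * chunk, num_virtual)))
            for p in range(num_physical)]


def run_folded(body, assignment):
    """Emulate the folded execution sequentially.

    `assignment[p]` lists the virtual thread ids executed by physical
    thread p; `body` is called once per virtual thread id.
    """
    for virtual_ids in assignment:
        for v in virtual_ids:
            body(v)


if __name__ == "__main__":
    # 8 virtual threads folded onto 3 physical threads.
    print(fold_cyclic(8, 3))
    print(fold_blocked(8, 3))
```

Either function covers every virtual index exactly once, so the body of the parallel op still runs once per virtual thread; only the assignment to physical threads changes.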

Having something with more advanced distribution semantics, like you propose, also makes sense to me. The distribution would be set, which has pros and cons. The abstraction would also work on tensors + actual thread ids. This then becomes a “required parallelism” construct and seems to come with further tradeoffs.

I’m happy to iterate to a form that works well, informed by having all the pieces connected.