Is synchronization missed for RAW-dependent ops during thread distribution inside IREE?

@ThomasRaoux Thanks for the explanation. It's very clear.

Could I ask another question? You have achieved a lot in optimizing matmul for CUDA. Have you thought about implementing an efficient conv2d for CUDA inside IREE? (This is what drove me to learn a bit more about linalg, which in turn led to the question above.)

As you probably know, implicit GEMM is an efficient way to implement convolution on GPUs, but it seems non-trivial to express in linalg. I once asked a related question in "Is it possible to add parameter for indexing_maps of linalg.generic?" and got some explanations. Roughly, mod and div in index mappings are not supported, because the subview / tensor_insert / tensor_extract operations require hyper-rectangular regions.
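To make the div/mod issue concrete, here is a minimal sketch of the index mapping that implicit GEMM relies on. It is plain Python, not IREE or linalg code, and it assumes an NHWC input, an RSCK filter, stride 1 and no padding. The function names (`gemm_to_conv_indices`, `implicit_gemm_conv2d`, `direct_conv2d`) are hypothetical and only for illustration.

```python
def gemm_to_conv_indices(m, k, P, Q, S, C):
    """Map GEMM row m and reduction index k back to convolution indices.

    GEMM view (assumed layout): M = N*P*Q, K_gemm = R*S*C.
    The div/mod here is what an affine indexing map cannot express
    as a hyper-rectangular access.
    """
    n, pq = divmod(m, P * Q)   # batch index, flattened output pixel
    p, q = divmod(pq, Q)       # output row, output column
    r, sc = divmod(k, S * C)   # filter row, flattened (filter col, channel)
    s, c = divmod(sc, C)       # filter column, input channel
    return n, p, q, r, s, c


def implicit_gemm_conv2d(x, w):
    """x: [N][H][W][C], w: [R][S][C][K]. Returns a flat [N*P*Q][K] output."""
    N, H, W, C = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    R, S, K = len(w), len(w[0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1          # stride 1, no padding
    gemm_m, gemm_k = N * P * Q, R * S * C
    out = [[0] * K for _ in range(gemm_m)]
    for m in range(gemm_m):
        for kk in range(gemm_k):
            n, p, q, r, s, c = gemm_to_conv_indices(m, kk, P, Q, S, C)
            a = x[n][p + r][q + s][c]    # A[m, kk] gathered on the fly, no im2col buffer
            for f in range(K):
                out[m][f] += a * w[r][s][c][f]
    return out


def direct_conv2d(x, w):
    """Reference: the usual seven-loop convolution, same layouts and output."""
    N, H, W, C = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    R, S, K = len(w), len(w[0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    out = [[0] * K for _ in range(N * P * Q)]
    for n in range(N):
        for p in range(P):
            for q in range(Q):
                row = out[(n * P + p) * Q + q]
                for r in range(R):
                    for s in range(S):
                        for c in range(C):
                            for f in range(K):
                                row[f] += x[n][p + r][q + s][c] * w[r][s][c][f]
    return out
```

The point of the sketch is that the A operand of the GEMM is never materialized: each element is fetched from the input tensor through `divmod`, so a tile of A does not correspond to a single hyper-rectangular slice of the input.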

Since IREE has several attractive features, I would like to revisit whether it is feasible to implement implicit GEMM for GPU inside IREE. I have no answer yet: could it be extended to support non-hyper-rectangular accesses, or is there a solution that does not break the hyper-rectangular requirement?
Thanks.