Tiling Linalg ops into MVMs and mapping each one to a different compute unit

Hello,

I am currently trying to tile Linalg operations, specifically matrix multiplication (matmul) and 2D convolution (conv2d). My objective is to tile these operations into many matrix-vector multiplications (MVMs) and then map each resulting MVM to a separate compute unit.
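
To make the goal concrete, here is a minimal sketch of the decomposition for matmul. It is plain Python, not MLIR. The function name `matmul_as_mvms`, the choice of one MVM per column of B, and the round-robin assignment to compute units are all assumptions made for illustration.

```python
def matmul_as_mvms(A, B, num_units):
    """Compute C = A @ B as one matrix-vector multiplication per column of B.

    A is M x K and B is K x N, both given as lists of rows.
    Returns (C, assignment), where assignment[j] is the hypothetical
    compute unit that handles the MVM for column j.
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    assignment = {}

    for j in range(N):
        # Round-robin mapping of MVMs to compute units (an assumption).
        assignment[j] = j % num_units

        # Extract column j of B as the vector operand.
        col = [B[k][j] for k in range(K)]

        # One MVM: A (M x K) times col (K) gives column j of C (M).
        out = [sum(A[i][k] * col[k] for k in range(K)) for i in range(M)]
        for i in range(M):
            C[i][j] = out[i]

    return C, assignment


C, assignment = matmul_as_mvms([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_units=2)
print(C)           # [[19, 22], [43, 50]]
print(assignment)  # {0: 0, 1: 1}
```

In Linalg terms, this would roughly correspond to tiling the N dimension of the matmul with tile size 1, so that each tile is a single matrix-vector product that can be placed on its own unit.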

I have looked into the existing transformations but have not found one that fits this use case. Could you recommend an approach for achieving this?

Thanks!