Can you explain this aspect a little more? As proposed, I only see an operation to load a matrix but no way to compute on it, and the new formulation is incompatible with the existing gpu dialect ops since it uses a different type.
So how would you model the computation itself? By exposing the result of the load as a vector, the IR creates the impression that it can actually be accessed like any regular vector. A similar approach is taken in the AMX dialect with its tiles, so maybe we should not treat this as related to the gpu dialect, whose aim is to abstract over hardware, and instead make it part of the vector dialect family.
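To make the contrast concrete, here is a sketch of what a vector-result load would look like. The op name `my.wmma_load` and its exact types are hypothetical, and the AMX syntax is reproduced from memory, so details may differ:

```mlir
// Hypothetical load whose result is an ordinary vector type. The value
// *looks* fully addressable, even though on the hardware each warp lane
// only holds a fragment of it.
%frag = my.wmma_load %mem[%c0, %c0]
    : memref<16x16xf16> -> vector<16x16xf16>

// The AMX dialect takes a similar route, loading into a dedicated tile
// value rather than going through gpu dialect abstractions, e.g. roughly:
//   %t = amx.tile_load %buf[%c0, %c0] : memref<16x64xi8> into vector<16x64xi8>
```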
My understanding was that we were okay with the mma ops because they did not require exposing implementation details. We could therefore imagine them being used to abstract AMD matrix core operations or other vendors' equivalents, and the representation is also compatible with SPIR-V cooperative matrices. Here it feels like we are crossing a line: we need to expose how the hardware is expected to map the data onto the warp lanes, so it is unlikely to apply to any other case.
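For reference, the existing gpu dialect ops keep the matrix opaque, which is what makes them retargetable. The `!gpu.mma_matrix` type says nothing about how elements are distributed across warp lanes, so NVIDIA WMMA, AMD matrix cores, and SPIR-V cooperative matrices can all lower from it. The exact syntax below is reproduced from memory and may differ in detail:

```mlir
// Loads produce an opaque per-operand fragment type; the lane mapping is
// deferred to the target-specific lowering.
%a = gpu.subgroup_mma_load_matrix %mem[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%b = gpu.subgroup_mma_load_matrix %mem[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">

// The compute op consumes the opaque fragments directly; no individual
// element of the matrices is ever addressed in the IR.
%d = gpu.subgroup_mma_compute %a, %b, %c
    : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">,
      !gpu.mma_matrix<16x16xf32, "COp"> -> !gpu.mma_matrix<16x16xf32, "COp">
```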
I do not see how this is exposed in the IR. How is this conceptually different from optimizing memory layout for a cache hierarchy when tiling? There, too, the ops do not expose the specifics of the hardware being targeted.
Is the goal to use a target-specific operation so that it is visible, from the choice of operation alone, which target the IR is being compiled for?