[RFC] Add NV-GPU dialect (HW specific extension of GPU dialect for Nvidia GPUs)

Can you explain this aspect a little more? As proposed, I see that we only have an operation to load a matrix, but no operation to compute on the loaded values. The new formulation is also incompatible with the existing gpu dialect ops, as it uses a different type.
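To make the type mismatch concrete, here is an illustrative sketch (the exact op names and attributes are my assumption of what the RFC proposes, not verbatim from it):

```mlir
// The existing gpu dialect ops operate on the opaque !gpu.mma_matrix type:
%a = gpu.subgroup_mma_load_matrix %src[%c0, %c0] {leadDimension = 16 : index}
    : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">

// The proposed load instead yields a plain vector, which the existing
// gpu.subgroup_mma_compute op cannot consume:
%b = nvgpu.ldmatrix %src[%c0, %c0] {numTiles = 4 : i32, transpose = false}
    : memref<16x16xf16, 3> -> vector<4x2xf16>
```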

So how would you model the computation itself? Exposing the result of the load as a vector creates the impression in the IR that the result can be accessed like any regular vector. A similar approach is taken in the AMX dialect with its tiles, so maybe this should not be treated as related to the gpu dialect, whose aim is to abstract over hardware, and should instead be made part of the vector dialect family.
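For example, nothing in the type system would stop IR like the following, even though the per-lane element layout of the loaded fragment is hardware-defined (again an illustrative sketch, assuming the proposed result type):

```mlir
%m = nvgpu.ldmatrix %src[%c0, %c0] {numTiles = 4 : i32, transpose = false}
    : memref<16x16xf16, 3> -> vector<4x2xf16>
// Looks like an ordinary vector element access, but which matrix element
// this actually touches depends on the hardware fragment layout:
%e = vector.extract %m[0, 1] : vector<4x2xf16>
```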

I do not see how this is exposed in the IR. How is this conceptually different from optimizing the memory layout for a cache hierarchy during tiling, where the ops also do not expose the specifics of the hardware being targeted?
Is the goal that the choice of operation makes it visible which target the IR is being compiled for?