Limitations of the vector dialects

Hello,

I’m exploring the limits of the vector dialects, and I have discovered that elementary arithmetic operations, such as add, xor, etc., seem to be missing. Is this to be expected?

Also, how complex would it be to add such an operation to the dialect? I guess that adding the operation itself is not too complicated, but adding the lowering routines (e.g. to AVX or NEON) is.

Furthermore, it appears that the specialized dialects (e.g. x86vector, arm_neon) only contain operations that are not already part of the general-purpose vector dialect. This seems to mean that lowering from vector goes directly to LLVM. Is this the case? If so, where can I find documentation on which vector extensions are implemented in LLVM?

If I decide to contribute in this area (with a few operations on one or two ISAs), is it acceptable to implement the lowering only for the ISAs I’m interested in? Here is a list of AVX opcodes I’m interested in that do not seem to be covered (if I’m wrong, please point me to the right operations):

  • Could probably go to vector: vpxor, vaddps
  • Should probably go to x86vector:
    • 256-bit: vmovsldup, vmovshdup, vunpcklps, vunpckhps, vunpcklpd, vunpckhpd
    • 128-bit: punpckldq, punpckhdq, punpcklqdq, punpckhqdq.

Another issue: I don’t really understand how the MLIR LLVM dialect works as an interface between the vector dialects of MLIR and LLVM. For instance, it’s not clear to me where the architecture-specific ops of x86vector are mapped into the LLVM dialect…

Best regards,
Dumitru

Have you read any of the previously posted Case Studies Docs on Vector Dialect CPU Codegen?

Arithmetic operations are supported by using vector types on the operations of the standard dialect.
For example

%a = addf %m, %v2 : vector<16xf32>
store %a, %arg2[] : memref<vector<16xf32>>

lowers to something like this.

vaddps zmm2, zmm2, zmmword ptr [rax + 128]
vmovaps zmmword ptr [rax + 128], zmm2

Thanks @aartbik! I had read them, but it slipped my mind.

However, this covers only two of the operations I need (addf and xor).

Over the past weeks I have spent some time trying to understand where BLAS gets its performance. I looked into matrix multiplication and discovered that the packing and unpacking vector operations mentioned above are crucial for performance… Hence my question.

Please note that even here our first approach should be to try to keep the vector dialect as architecture-neutral as possible (for example, by passing generic data rearranging intrinsics to the backend and relying on LLVM to eventually pick the best shuffle/unpack sequence for a particular target platform).
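
For example, a data rearrangement can be written with the architecture-neutral vector.shuffle op and left to the backend to map onto a concrete unpack/shuffle instruction. A minimal sketch (function name and vector shapes chosen for illustration, not taken from existing code):

func @interleave_low(%a: vector<4xf32>, %b: vector<4xf32>) -> vector<4xf32> {
  // Interleave the low halves of %a and %b: (a0, b0, a1, b1).
  // This lowers to LLVM's shufflevector; the backend is then free to select,
  // e.g., an unpcklps on x86 or a zip1 on Neon.
  %r = vector.shuffle %a, %b [0, 4, 1, 5] : vector<4xf32>, vector<4xf32>
  return %r : vector<4xf32>
}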

We could, of course, provide an almost 1:1 correspondence between target-specific vector dialect operations and SIMD instructions (as we did in a few restricted cases), but we have to be careful that we don’t turn MLIR into a glorified assembler for targets like Neon or AVX512! Rather, we would like to find out about situations where MLIR lowering + LLVM backend support does not result in the right SIMD instructions, and learn from such mistakes to improve the overall infrastructure in a generic way.


How can I specify “generic data rearranging intrinsics”? In matrix multiplication, there are a few rearrangements I have discovered during the past weeks that I consider particularly difficult to express (understanding a truly efficient matrix multiplication was really a sobering experience).

The first one rearranges 4x2xf32 tiles of the input matrix into 1x8xf32 tiles of the output matrix using this pattern:
ae
bf    →    abcdefgh
cg
dh
How would you encode this and pass it to LLVM so that a full matrix can be tiled with it?

(addendum: OK, this one seems feasible with some tweaking involving vector.transpose and some conversions from vector<2x4xf32> to vector<8xf32>; a sketch of what I mean follows. I will try it later to check whether it works.)
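
A minimal sketch of what I have in mind (untested; the function name is mine):

func @tile_4x2_to_1x8(%t: vector<4x2xf32>) -> vector<8xf32> {
  // Transpose the 4x2 tile so that its columns (abcd and efgh) become rows.
  %tt = vector.transpose %t, [1, 0] : vector<4x2xf32> to vector<2x4xf32>
  // Flatten row-major to obtain abcdefgh.
  %r = vector.shape_cast %tt : vector<2x4xf32> to vector<8xf32>
  return %r : vector<8xf32>
}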

The second performs a dot product by loading (with duplication) the lower and upper halves of a vector into separate YMM registers, which are then individually multiplied with other vectors containing other duplicated values. I can see how this improves data locality and reduces the sheer number of operations. I don’t see how a back-end, by itself, can automatically come up with this sort of code.

(addendum 2: I will look into the code generated using vector.contract, sketched below, but the problem here is loading the data, not processing it.)
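
For reference, the compute part I plan to express looks roughly like this dot-product sketch (my own assumption of how vector.contract would be used; the duplicated-halves loads are the part I do not know how to express):

func @dot8(%a: vector<8xf32>, %b: vector<8xf32>, %acc: f32) -> f32 {
  // Single reduction dimension: %acc + sum_i a[i] * b[i].
  %d = vector.contract {
         indexing_maps = [affine_map<(i) -> (i)>,
                          affine_map<(i) -> (i)>,
                          affine_map<(i) -> ()>],
         iterator_types = ["reduction"]
       } %a, %b, %acc : vector<8xf32>, vector<8xf32> into f32
  return %d : f32
}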

Should you be willing, I can show you snippets of code already converted from assembly to intrinsics, which only await conversion to MLIR.

Of course, the end goal is not to be stuck at 30% or 40% of peak performance (as naïve vectorization is) but to go all the way to 95-100% of BLAS performance.

My assumption was that MLIR can provide the general vector constructs (which are always quite simple), let back-ends implement one or another, and let smart lowering engines choose among them, possibly automatically (but you have to be able to express the palette of transformations).

Of course! I am very glad you already found some architecture-neutral ops that bring you closer to a solution, but you may be right that peak performance can only be obtained by introducing a few architecture-specific ops (and please feel free to do so; we have done that ourselves several times when we thought it was necessary). I would like to see your examples and learn from them!