I am trying to follow the very nice tutorial on Vector Dialect AArch64 Codegen for Matrix-Matrix Multiplication. However, I’m not generating code for AArch64, but for my x86_64 PC (under macOS).
When compiling the first example (on page 2), I noticed that the generated code contains no fused multiply-add instructions. I thought x86_64 had such instructions for vectorized code.
The compilation command I’m using is:
mlir-opt --convert-vector-to-llvm='reassociate-fp-reductions=1' \
--convert-std-to-llvm sample.mlir | \
mlir-translate --mlir-to-llvmir | \
llc -o sample.s
Is this normal?
PS: Would it be possible to also provide the code that calls function @vector_outerproduct_matmul_2d_4x4x4xf32_kernel to perform a 32x32x32 or 512x512x32 matrix multiplication? I’d like to see what the correct way of loading the vector registers is.
PS2: I noticed (in the last table) the performance loss when moving the kernel size from 64 to 512. Does this mean that the outer loops are not well optimized?
Try --fp-contract=on with llc: FMA instructions won’t be used by default. (You’ll find more information at https://llvm.org/docs/LangRef.html#floating-point-environment or in llc --help.)
I added --fp-contract=on to the llc call, but there is still no change. The beginning of the generated code looks like this:
movq 16(%rsp), %rax
movaps (%rsi), %xmm4
movaps 16(%rsi), %xmm5
movaps 32(%rsi), %xmm6
movaps 48(%rsi), %xmm11
movaps (%r8), %xmm7
movaps 16(%r8), %xmm10
movaps 32(%r8), %xmm9
movaps 48(%r8), %xmm8
movaps %xmm4, %xmm0
shufps $0, %xmm4, %xmm0 ## xmm0 = xmm0[0,0],xmm4[0,0]
mulps %xmm7, %xmm0
addps (%rax), %xmm0
movaps %xmm5, %xmm1
shufps $0, %xmm5, %xmm1 ## xmm1 = xmm1[0,0],xmm5[0,0]
mulps %xmm7, %xmm1
addps 16(%rax), %xmm1
There are 16 mulps/addps pairs in the whole code; I didn’t copy all of them here.
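For anyone reading the listing: each mulps/addps pair appears to broadcast one element of the first operand (shufps $0), multiply it by a row of the second operand, and accumulate into a row of the result. Below is a scalar C sketch of that pattern, assuming row-major 4x4 operands. The function name, the parameter names, and the loop order are hypothetical and only serve to illustrate; this is not the tutorial's kernel.

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Scalar sketch of the 16 multiply/add pairs: c += a * b for 4x4 matrices.
 * Assumption: all three matrices are stored row-major in flat arrays.
 * The inner j loop corresponds to one 4-wide mulps/addps pair. */
void matmul_4x4x4_accumulate(const float *a, const float *b, float *c) {
    assert(a != NULL && b != NULL && c != NULL);
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i) {
            /* Broadcast a[i][k] across the vector (shufps $0 in the listing). */
            float a_ik = a[i * N + k];
            for (int j = 0; j < N; ++j) {
                /* Separate multiply and add; a fused multiply-add would
                 * perform both steps in a single instruction. */
                c[i * N + j] += a_ik * b[k * N + j];
            }
        }
    }
}
```

With 4 values of k and 4 rows, this gives 4 x 4 = 16 vector multiply/add steps, which matches the number of pairs counted above.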
I’m on commit 71b823dd68f67d9594d83f8b33c46f7a60d1b305 of llvm-project, from March 22.
PS: Is it correct to assume that the code generator uses SSE instead of AVX because the input vector size is 4x4? If the input vectors were 8xf32, would/should it have used the ymm registers?
It’s probably because of the target. (You are using llc with the default options.) Please be aware of the CPU it’s generating instructions for.
With --march=x86-64 -mcpu=core-avx2 (even without --fp-contract=on), it issues fmadd instructions. However, it still stays on XMM registers (SSE). What can I do to move to YMM (AVX)?
I think you should really post this on the LLVM forum. At this point it really has nothing to do with MLIR anymore, only with LLVM code generation!