When compiling the first example (on page 2) I noticed that the generated code contains no fused multiply-add instructions. I thought x86_64 had such instructions for vectorized code.
PS: would it be possible to also provide the code that calls function @vector_outerproduct_matmul_2d_4x4x4xf32_kernel to perform a 32x32x32 or 512x512x32 matrix multiplication? I’d like to see what the correct way of loading the vector registers is.
PS2: I noticed (in the last table) the performance loss when the kernel size moves from 64 to 512. Does this mean that the outer loops are not well optimized?
There are 16 mulps/addps pairs in the whole code; I didn’t copy all of them here.
I’m on the 71b823dd68f67d9594d83f8b33c46f7a60d1b305 commit of llvm-project, from March 22.
PS: Is it correct to assume that the code generator uses SSE instead of AVX because the input vector size is 4x4? If the input vectors were 8xf32, would/should it have used the ymm registers?
It’s probably because of the target (you are using llc with its default options). Be aware of which arch and CPU it’s generating instructions for.
Thanks! With --march=x86-64 -mcpu=core-avx2 (even without --fp-contract=on) it issues fmadd instructions. However, it still sticks to XMM registers (SSE). What can I do to move to YMM (AVX)?
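For reference, here is a minimal IR sketch (a hypothetical standalone file, not taken from the post above) of the kind of 8-wide kernel that has a reason to use ymm: with `<4 x float>` operands the x86 backend only ever needs xmm, so widening the vector type to `<8 x float>` is what lets a 256-bit register come into play.

```llvm
; fma8.ll -- hypothetical 8-wide FMA kernel (illustration only).
; Compiled with e.g. `llc -march=x86-64 -mcpu=core-avx2 fma8.ll -o -`,
; the fmuladd intrinsic should lower to a vfmadd*ps on ymm registers.
define <8 x float> @fma8(<8 x float> %a, <8 x float> %b, <8 x float> %c) {
  %r = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> %a,
                                            <8 x float> %b,
                                            <8 x float> %c)
  ret <8 x float> %r
}
declare <8 x float> @llvm.fmuladd.v8f32(<8 x float>, <8 x float>, <8 x float>)
```

So for a 4x4 kernel built from 4xf32 vectors, staying on xmm is expected; getting ymm out of the backend takes 8xf32 vectors (or letting the vectorizer widen the loop).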