Hi all,
I am not sure whether this is a known issue or easy to fix. I was trying to reproduce the tiling strategy implemented by a famous Arm library, ACL. It uses an 8x12 inner tile (see the kernel implementation here).
I was trying to reproduce the same implementation through iree-llvm-sandbox (GitHub: iree-org/iree-llvm-sandbox, a sandbox for quick iteration and experimentation on projects related to IREE, MLIR, and LLVM). Digging deeper, however, I observed heavy spilling in the generated assembly, which degrades performance quite significantly.
I was able to come up with a simple example written in Vector dialect:
func @outerproduct(%arg0: vector<8xf32>, %arg1: vector<12xf32>) -> vector<8x12xf32> {
  %c0 = constant 0 : index
  %c2048 = constant 2048 : index
  %c2 = constant 2 : index
  %2 = vector.outerproduct %arg0, %arg1 : vector<8xf32>, vector<12xf32>
  %3 = scf.for %arg5 = %c0 to %c2048 step %c2 iter_args(%arg6=%2) -> vector<8x12xf32>{
    %4 = vector.outerproduct %arg0, %arg1, %arg6 : vector<8xf32>, vector<12xf32>
    scf.yield %4 : vector<8x12xf32>
  }
  return %3 : vector<8x12xf32>
}
If I compile this with
mlir-opt test_outer_product.mlir --convert-vector-to-scf --lower-affine --convert-scf-to-std --convert-vector-to-llvm --convert-std-to-llvm --reconcile-unrealized-casts \
  | mlir-translate --mlir-to-llvmir \
  | ~/llvm-project/build/bin/llc -O3 -march=aarch64 -mcpu=neoverse-n1 -filetype=asm -o outer.s
I can observe quite a lot of spilling going on. I will post a small extract here:
ldr s7, [sp, #1200]
mov v6.s[2], v7.s[0]
mov v16.s[2], v7.s[0]
mov v21.s[2], v4.s[0]
ldr s7, [sp, #1232]
mov v12.s[2], v7.s[0]
mov v3.s[2], v7.s[0]
ldr s7, [sp, #1208]
mov v6.s[3], v7.s[0]
mov v16.s[3], v7.s[0]
mov v21.s[3], v5.s[0]
ldr s7, [sp, #1240]
mov v12.s[3], v7.s[0]
mov v3.s[3], v7.s[0]
.Ltmp0:
.loc 1 13 11 prologue_end // <stdin>:13:11
fmul v11.4s, v21.4s, v0.s[0]
fmul v9.4s, v16.4s, v0.s[0]
.loc 1 21 11 // <stdin>:21:11
fmul v8.4s, v21.4s, v0.s[1]
fmul v31.4s, v16.4s, v0.s[1]
.loc 1 29 11 // <stdin>:29:11
fmul v29.4s, v21.4s, v0.s[2]
fmul v28.4s, v16.4s, v0.s[2]
.loc 1 37 11 // <stdin>:37:11
fmul v26.4s, v21.4s, v0.s[3]
fmul v25.4s, v16.4s, v0.s[3]
.loc 1 45 11 // <stdin>:45:11
fmul v23.4s, v21.4s, v1.s[0]
fmul v22.4s, v16.4s, v1.s[0]
.loc 1 53 11 // <stdin>:53:11
fmul v20.4s, v21.4s, v1.s[1]
fmul v19.4s, v16.4s, v1.s[1]
.loc 1 61 11 // <stdin>:61:11
fmul v18.4s, v21.4s, v1.s[2]
fmul v17.4s, v16.4s, v1.s[2]
This is the same behaviour I see on my more complex Matrix Multiply examples.
My questions are:
a) Is this also a problem on x86 machines?
b) Is this MLIR’s fault or LLVM’s fault?
c) Do you think this can be fixed (not necessarily easily)?
Thank you for any kind of support!
Additional info
I spent the day playing with this and have some more information:
a) Power-of-two outer-product shapes do not spill: 8x4 and 8x8 work fine
b) Non-power-of-two outer-product shapes always spill: 8x5, 8x6 and 8x12
c) If I remove the loop in MLIR, and I only leave two outer-products, it does not spill
d) I debugged the LLVM pipeline, and it seems that the pass aarch64-local-dynamic-tls-cleanup introduces the spilling. In particular, for an 8x5xf32 outer product we have:
liveins: $x8, $q0, $q1, $s2, $s3, $s4, $s5, $s6
%129:fpr32 = COPY $s6
%128:fpr32 = COPY $s5
%127:fpr32 = COPY $s4
%126:fpr32 = COPY $s3
%125:fpr32 = COPY $s2
%124:fpr128 = COPY $q1
%123:fpr128 = COPY $q0
%122:gpr64 = COPY $x
while for an 8x8xf32 outer product we have:
liveins: $x8, $q0, $q1, $q2, $q3
%54:fpr128 = COPY $q3
%53:fpr128 = COPY $q2
%52:fpr128 = COPY $q1
%51:fpr128 = COPY $q0
So it looks like the 5xf32 vector is passed in five scalar (S) registers, one per element, while the 8xf32 vector is passed in two 128-bit (Q) registers.
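As a sanity check that the spilling is not forced by the tile shape itself, here is a rough register-budget estimate in Python. It is only a sketch: it assumes every accumulator row is packed into 128-bit Q registers (4 f32 lanes each) and that both input vectors stay resident, and the helper names `q_regs` and `tile_budget` are made up for this post. AArch64 has 32 such vector registers.

```python
# Back-of-the-envelope register budget for an MxN f32 outer-product tile
# on AArch64 NEON: 32 vector registers, 128 bits each (4 f32 lanes).
NEON_REGS = 32
LANES_F32 = 4


def q_regs(n_elems, lanes=LANES_F32):
    """Number of 128-bit registers needed to hold n_elems f32 values."""
    return -(-n_elems // lanes)  # ceiling division


def tile_budget(m, n):
    """Registers for the MxN accumulator tile plus the two input vectors.

    Assumption: each accumulator row of n elements is packed into Q
    registers, so a row of 5 or 6 elements still occupies 2 registers.
    """
    accumulators = m * q_regs(n)
    inputs = q_regs(m) + q_regs(n)
    return accumulators + inputs


if __name__ == "__main__":
    for m, n in [(8, 4), (8, 8), (8, 12), (8, 5), (8, 6)]:
        total = tile_budget(m, n)
        print(f"{m}x{n}: {total} Q registers, fits={total <= NEON_REGS}")
```

Under these assumptions every shape I tried fits in the register file (8x12 is the tightest), so the spills seem to come from how the odd-sized vectors are represented rather than from a genuine lack of registers.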
I will dig more on this topic, but the root cause is likely to be:
a) The LLVM vectorizer usually produces “normal” vector sizes, mostly powers of two. It would never produce a 5xf32 vector
b) With MLIR such vectors are likely to show up, because we want specific tiling strategies
Still preliminary, but I see two different solutions:
a) Make the LLVM backend support unusual vector lengths (5x, 6x, 12x)
b) Legalize on the MLIR side so that only power-of-two vectors are used
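To make option b) a bit more concrete, this is the kind of decomposition I have in mind, written as a small Python sketch rather than an actual MLIR pass. The function name `split_pow2` and the default cap of 4 lanes (one 128-bit register of f32) are my own assumptions for illustration, not an existing MLIR or LLVM API.

```python
def split_pow2(n, max_lanes=4):
    """Decompose a vector of n lanes into power-of-two chunks.

    No chunk is wider than max_lanes, which is assumed to be a power of
    two (4 f32 lanes corresponds to one 128-bit NEON register).
    """
    if n < 0 or max_lanes < 1 or max_lanes & (max_lanes - 1):
        raise ValueError("n must be >= 0 and max_lanes a power of two")
    chunks = []
    width = max_lanes
    while n > 0:
        # Shrink the chunk width until it fits in what is left.
        while width > n:
            width //= 2
        chunks.append(width)
        n -= width
    return chunks


if __name__ == "__main__":
    for lanes in (5, 6, 8, 12):
        print(lanes, "->", split_pow2(lanes))
```

With this scheme a 12xf32 operand would become three 4xf32 pieces, and the 8x12 outer product would become three 8x4 outer products, which is the shape that did not spill in my experiments.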
Maybe you guys are already aware of some legalizations? I wasn’t able to find much. If not, what would be the best approach?
Thanks again!