Hi everyone,
I’ve got a question about how the (confusingly named) “min-legal-vector-width” attribute mixes with the vector ABI.
I’m not sure if this is specific to the x86 backend or if it’s a common infrastructure issue. I know the problem I’m about to describe occurs with the x86 backend, but the general design applies to other backends as well. I just don’t know how they handle it.
The problem is this: if some IR pass transforms a series of scalar calls into a single vector call (using the vector variants attribute, for example), and the vector width chosen for the call doesn’t match the preferred vector width for the target architecture, the backend will split the vector argument to match the preferred vector width of the function’s subtarget unless “min-legal-vector-width” is set to the width of the vector argument. For example, this:
define dso_local void @foo(i32* nocapture %a) local_unnamed_addr #0 {
entry:
  %vec_a = bitcast i32* %a to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %vec_a, align 4
  %foo = call <16 x i32> @_ZGVeN16v__Z3fooi(<16 x i32> %wide.load)
  ret void
}

declare <16 x i32> @_ZGVeN16v__Z3fooi(<16 x i32>)

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="skylake-avx512" }
Becomes
foo:                                    # @foo
        push rax
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        call _ZGVeN16v__Z3fooi
        pop rax
        vzeroupper
        ret
The big problem with this is that the call created does not match the expected vector ABI for any subtarget. If the subtarget didn’t support 512-bit vectors, the 512-bit vector argument would be passed in memory, not in two 256-bit registers. Since the target does support 512-bit vectors, the ABI says the 512-bit argument should be passed in a 512-bit register, regardless of the preferred vector width for the compilation unit.
If I compile a function that’s defined with a 512-bit vector argument in my source code, the front end will set “min-legal-vector-width” to 512, because the front end knows about the ABI requirements and knows how to manipulate things to get the right ABI. However, the backend and the optimizer have no such knowledge of the ABI, so they aren’t able to handle this situation.
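For comparison, this is roughly the shape of the IR the front end emits for a function defined with a 512-bit vector parameter in the source (the function name is made up, and I’ve trimmed the attribute group down to the relevant entries):

define dso_local <16 x i32> @takes_v16si(<16 x i32> %x) local_unnamed_addr #1 {
entry:
  ret <16 x i32> %x
}

attributes #1 = { "min-legal-vector-width"="512" "target-cpu"="skylake-avx512" }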
Note that if I set “min-legal-vector-width”=“512” on the calling function, things work correctly when the subtarget has 512-bit registers, but when the subtarget does not have 512-bit registers, the backend again generates a call that (incorrectly) passes the 512-bit vector argument in two 256-bit registers.
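Concretely, the version that lowers the way I expect is just the example above with the caller’s attribute group changed to:

attributes #0 = { "min-legal-vector-width"="512" "target-cpu"="skylake-avx512" }

With that, the 512-bit argument is passed in a single 512-bit register, as the ABI requires. Swap “target-cpu” for something without 512-bit registers (plain “skylake”, say) and, as described above, the argument is still split across two 256-bit registers instead of being passed in memory.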
See: Compiler Explorer
It seems to me that the current design, wherein the backend splits arguments to make them “legal” in accordance with the “min-legal-vector-width” attribute (or the known preferred vector width for the subtarget), is broken. Splitting arguments because the subtarget really doesn’t support the requested vector size is also wrong (in at least some cases), since the ABI calls for passing such arguments in memory rather than in pieces across smaller registers.
I see two options:

1. The backend can report an error if it encounters a vector argument or return value that is larger than the largest legal size for the target. (Something like this already happens with scalar floating-point arguments in some cases.)
2. Something in the optimizer or backend needs knowledge of the ABI constraints so that it can fix up calls generated by the optimizer.
I’d like to see option 2 as the long-term solution, but I guess option 1 is much easier to implement.
Thoughts?