Vector ABI and min-legal-vector-width

Hi everyone,

I’ve got a question about how the (confusingly named) “min-legal-vector-width” attribute interacts with the vector ABI.

I’m not sure if this is specific to the x86 backend or if it’s a common infrastructure issue. I know the problem I’m about to describe occurs with the x86 backend, but the general design applies to other backends as well. I just don’t know how they handle it.

The problem is this: if some IR pass transforms a series of scalar calls into a single vector call (using the vector variants attribute, for example), and the vector width chosen for the call doesn’t match the preferred vector width for the target, the backend will split the vector argument to match the preferred vector width for the function’s subtarget unless “min-legal-vector-width” is set to the width of the vector argument. For example:

define dso_local void @foo(i32* nocapture %a) local_unnamed_addr #0 {
entry:
  %vec_a = bitcast i32* %a to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %vec_a, align 4
  %foo = call <16 x i32> @_ZGVeN16v__Z3fooi(<16 x i32> %wide.load)
  ret void
}
declare <16 x i32> @_ZGVeN16v__Z3fooi(<16 x i32>)

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="skylake-avx512" }

Becomes

foo:                                    # @foo
        push    rax
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        call    _ZGVeN16v__Z3fooi
        pop     rax
        vzeroupper
        ret

The big problem with this is that the generated call does not match the expected vector ABI for any subtarget. If the subtarget didn’t support 512-bit vectors, the 512-bit vector argument would be passed in memory, not in two 256-bit registers. Since this target does support 512-bit vectors, the ABI says the 512-bit argument should be passed in a single 512-bit register, regardless of the preferred vector width for the compilation unit.

If I compile a function that’s defined with a 512-bit vector argument in my source code, the front end will set “min-legal-vector-width” to 512, because the front end knows about the ABI requirements and knows how to manipulate things to get the right ABI. However, the backend and the optimizer have no such knowledge of the ABI, so they aren’t able to handle this situation.
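
For illustration, here is a minimal example of that front-end behavior. The function name and flags are just for the example, and the exact attribute value is my assumption based on how I’ve seen clang behave:

// A function whose signature involves a 512-bit vector type. When clang
// compiles this (e.g. with -march=skylake-avx512 -mprefer-vector-width=256),
// the expectation is that it attaches "min-legal-vector-width"="512" to the
// emitted function so the backend knows it may not split the argument or
// return registers.
typedef int v16si __attribute__((vector_size(64)));  // 16 x i32 == 512 bits

v16si scale_v16si(v16si v) {
  return v + v;  // the body is irrelevant; the signature drives the attribute
}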

Note that if I set “min-legal-vector-width”=“512” for the call, it works correctly when the subtarget has 512-bit registers, but if the subtarget does not have 512-bit registers, the backend will again generate a call that (incorrectly) passes the 512-bit vector argument in two 256-bit registers.

See: Compiler Explorer

It seems to me that the current design wherein the backend will split arguments to make them “legal” in accordance with the “min-legal-vector-width” attribute (or the known preferred vector width for the subtarget) is broken. Also, splitting arguments because the subtarget really doesn’t support the requested vector size is wrong (in at least some cases).

I see two options:

  1. The backend can report an error if it encounters a vector argument or return value that is larger than the largest legal size for the target. (Something like this happens with scalar floating point arguments in some cases.)

  2. Something in the optimizer or backend needs knowledge of the ABI constraints so that it can fix up calls generated in the optimizer.

I’d like to see option 2 as the long-term solution, but I guess option 1 is much easier to implement.
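
To make option 1 concrete, here is a rough sketch of the kind of check I have in mind. The helper name and the use of the diagnostics machinery are my own choices, not an existing pass, and a real implementation would presumably live in the backend and consult the subtarget rather than only the attribute:

// Hypothetical helper: flag calls that pass a fixed-width vector wider than
// the caller's declared "min-legal-vector-width". This is only a sketch of
// option 1, not an existing LLVM pass.
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static void checkVectorArgWidths(Function &F) {
  // Width the function claims it needs; treated as 0 if the attribute is absent.
  uint64_t MinLegal = 0;
  if (F.hasFnAttribute("min-legal-vector-width"))
    F.getFnAttribute("min-legal-vector-width")
        .getValueAsString()
        .getAsInteger(/*Radix=*/10, MinLegal);

  for (Instruction &I : instructions(F)) {
    auto *CB = dyn_cast<CallBase>(&I);
    if (!CB)
      continue;
    for (Value *Arg : CB->args()) {
      auto *VT = dyn_cast<FixedVectorType>(Arg->getType());
      if (!VT)
        continue;
      uint64_t Bits = VT->getPrimitiveSizeInBits().getFixedValue();
      if (Bits > MinLegal)
        // Report instead of letting the backend silently split the argument.
        F.getContext().diagnose(DiagnosticInfoUnsupported(
            F, "vector call argument is wider than min-legal-vector-width"));
    }
  }
}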

Thoughts?

Do you mean fp80? Unfortunately, we mainly rely on the front end check; the backend doesn’t check for it at present. Compiler Explorer

I think the intent of “min-legal” is to require the backend to provide physical registers no smaller than the requested size. A 512-bit requirement conflicts with a non-AVX512 target, so it is misuse/UB rather than a broken design.

I think the design of argument handling in LLVM assumes all arguments are passed in registers (pointers to memory are treated as integers). The splitting is by design: getNumRegistersForCallingConv and getRegisterTypeForCallingConv are called for any type that can be passed in one or more registers. FWIW, there’s no such interface for turning an illegal type into a memory argument.
OTOH, passing illegal types in memory is specified by the C/C++ ABI. ABIs from other front ends may have different requirements and may already rely on the current behavior, so I don’t think we can change the handling here just for one ABI.

No, I mean using float return types when SSE is disabled. Compiler Explorer

If that’s what “min-legal-vector-width” means (I couldn’t find it documented anywhere), then I can see that setting “min-legal-vector-width”=“512” when the subtarget doesn’t support 512-bit registers would be an error in the IR. In that case, an error should be reported somewhere.

However, that doesn’t change the fact that the optimizer doesn’t currently have the ability to transform the IR to handle the vector ABI correctly.

I’m not interested in whether this is happening by design or not. My point is, the result is wrong. Therefore, if it is by design, the design is also wrong.

We do need to know something about the ABI that’s intended. That’s true. However, we can’t just do something arbitrary (which creates a new de facto ABI) when the types don’t fit our expectations. In this case, when the call is created we know that the call is supposed to follow the vector ABI, but the optimizer doesn’t currently have the facilities to pick the arguments to match that ABI. In our current design, that knowledge is all in the front end.

Sorry, I made a mistake here. The target-independent code splits illegal types into legal types in virtual registers rather than physical ones; it’s the backend’s calling convention that determines whether they end up in physical registers or on the stack. That means we do have a chance, in the backend, to put illegal 512-bit vectors on the stack for an AVX2 target.

However, there are at least three big obstacles stopping us.

  1. The ABI declarations are already quite complicated. Currently, we declare ABIs in the backend as [32,64] x [Win,Linux,Darwin,…] x [Common,VectorCall,Fast,…]. To make illegal vectors work, we would have to multiply that by [Scalar,V128,V256,V512]. I think X86CallingConv.td would need to be redesigned rather than directly expanded to four times its current size.
  2. Even if we decided on ABIs that differ by vector size, it still wouldn’t work under the current mechanism. We would have to work around the illegal-vector splitting in the target-independent code, which may affect other targets, and we would need complex combinations of TLI hooks for each of them.
    For example, if we chose an X64_Linux_V256_Common ABI for an AVX2 x86-64 target, we would have 8 vector registers for passing 128- or 256-bit vectors. However, if the target-independent code splits a 512-bit vector into two 256-bit halves, those halves will still be passed in registers.
  3. Another problem is return values. Clang turns an illegal vector return into an indirect return, which becomes the first argument of the function, and the backend handles return values and argument passing in different phases. I’m not sure we can do the same thing in the backend that the front end does here.

I think one of the reasons is that we need to handle arbitrary types, e.g., i65, <17 x float>, <9 x i1>, etc. I agree we have a de facto LLVM ABI, which relies on the backend implementation. But we should admit that widening + splitting is the concise way to handle all of them.
OTOH, the backend doesn’t expect these types to come from any known ABI. In a word, we shouldn’t have expectations about the codegen of a 512-bit IR type on AVX2.

What do you mean by “we shouldn’t have expectations”? Do you mean we shouldn’t expect it to work? Or do you mean that we should expect the backend to be free to do anything it wants? If we shouldn’t expect it to work, the backend should report an error when it occurs. I would not agree that the backend is free to do anything it wants, but that is the current behavior.

I think this means that while the backend can handle arbitrary vector sizes in most instructions and transform them in any way that maintains the semantics, we don’t have freedom to change the signature of a function call or function definition. So if we see a type that isn’t legal as an argument or return value, shouldn’t that be an error?

Let me clarify that I’m not particularly interested in the case where a 512-bit vector is used as an argument with an AVX2 target. That was just an extra example to demonstrate the extent of the current problems. The case I’m most interested in is the case where a 512-bit vector is used as an argument with an AVX512 target, but the “min-legal-vector-width” isn’t set to 512. In this case, the 512-bit vector type isn’t actually illegal for the target, but we pretend it is because of the way the “prefer-vector-width” handling was implemented. In this case, there is no justification for changing the argument or return value type.

@efriedma-quic brought up the ABI issue in this patch: D41096 [X86] Initial support for prefer-vector-width function attribute. If I understand the comments from @topperc there correctly, he was suggesting that there should be some pass run before the vectorizer that sets “min-legal-vector-width” to 512 if any code requires vectors that wide. The problem with that is that it changes the preferred vector width everywhere in the function, which is not really desirable.