Calling conventions for YMM registers on AVX

Hi,

What are the calling conventions for YMM registers? According to the documents I have seen so far, the YMM registers are scratch registers and are not saved by the callee.
This is also the default behavior of the Intel compiler.

In X86InstrControl.td, the YMM registers are not in the "Defs" set of the call instructions.

- Elena

Hi,

> What is the calling conventions for YMM. According to documents I saw till now, the YMMs are scratch and not saved in callee.
> This is also the default behavior of the Intel Compiler.

Non-Windows x86_64 targets use the rules defined in the x86_64 ABI!

> In X86InstrControl.td the YMMs are not in "defs" set of call.

The XMM registers are subregisters of the YMM registers, and they are in the list. That
should be sufficient for clobbering the YMM registers.

I'll explain what we see in the code:
1. The caller saves XMM registers across the call when needed (according to the Defs definition). The YMM registers are not in that set, so the caller does not take care of them.
2. The callee preserves the XMM registers but uses the YMM registers, clobbering them.
3. So after the call, the upper part of the YMM register is gone.

- Elena

> I'll explain what we see in the code.
> 1. The caller saves XMM registers across the call if needed (according to DEFS definition).
> YMMs are not in the set, so caller does not take care.

This is not how the register allocator works. It saves the registers holding values; it doesn't care which alias is clobbered.

Are you saying that only the xmm part of a ymm register gets spilled before a call?

> 2. The callee preserves XMMs but works with YMMs and clobbering them.
> 3. So after the call, the upper part of YMM is gone.

Are you on Windows? As Bruno said, all xmm and ymm registers are call-clobbered on non-Windows platforms.

/jakob

This thread has lots of interesting information: http://software.intel.com/en-us/forums/showthread.php?t=59291

I wasn't able to find a formal Win64 ABI spec, but according to http://www.agner.org/optimize/calling_conventions.pdf, xmm6-xmm15 are callee-saved on Win64, but the high bits of ymm6-ymm15 are not.

That's not currently correctly modelled in LLVM. To fix it, create a pseudo-register YMMHI_CLOBBER that aliases ymm6-ymm15. Then add YMMHI_CLOBBER to the registers clobbered by WINCALL64*.

/jakob

Yes, we support Win64.
We defined the upper part of YMM like this:

  // Upper 128-bit halves of the YMM registers.
  // These are actually only needed for implementing the Win64 CC with AVX.
  def XMM0b: Register<"xmm0b">, DwarfRegNum<[17, 21, 21]>;
  def XMM1b: Register<"xmm1b">, DwarfRegNum<[18, 22, 22]>;
  def XMM2b: Register<"xmm2b">, DwarfRegNum<[19, 23, 23]>;
  def XMM3b: Register<"xmm3b">, DwarfRegNum<[20, 24, 24]>;
  def XMM4b: Register<"xmm4b">, DwarfRegNum<[21, 25, 25]>;
  def XMM5b: Register<"xmm5b">, DwarfRegNum<[22, 26, 26]>;
  def XMM6b: Register<"xmm6b">, DwarfRegNum<[23, 27, 27]>;
  def XMM7b: Register<"xmm7b">, DwarfRegNum<[24, 28, 28]>;

  // X86-64 only
  def XMM8b: Register<"xmm8b">, DwarfRegNum<[25, -2, -2]>;
  def XMM9b: Register<"xmm9b">, DwarfRegNum<[26, -2, -2]>;
  def XMM10b: Register<"xmm10b">, DwarfRegNum<[27, -2, -2]>;
  def XMM11b: Register<"xmm11b">, DwarfRegNum<[28, -2, -2]>;
  def XMM12b: Register<"xmm12b">, DwarfRegNum<[29, -2, -2]>;
  def XMM13b: Register<"xmm13b">, DwarfRegNum<[30, -2, -2]>;
  def XMM14b: Register<"xmm14b">, DwarfRegNum<[31, -2, -2]>;
  def XMM15b: Register<"xmm15b">, DwarfRegNum<[32, -2, -2]>;

  // YMM Registers, used by AVX instructions
  let SubRegIndices = [sub_xmm, sub_xmmb] in {
  def YMM0: RegisterWithSubRegs<"ymm0", [XMM0, XMM0b]>, DwarfRegNum<[17, 21, 21]>;
  def YMM1: RegisterWithSubRegs<"ymm1", [XMM1, XMM1b]>, DwarfRegNum<[18, 22, 22]>;
  def YMM2: RegisterWithSubRegs<"ymm2", [XMM2, XMM2b]>, DwarfRegNum<[19, 23, 23]>;
  def YMM3: RegisterWithSubRegs<"ymm3", [XMM3, XMM3b]>, DwarfRegNum<[20, 24, 24]>;
  def YMM4: RegisterWithSubRegs<"ymm4", [XMM4, XMM4b]>, DwarfRegNum<[21, 25, 25]>;
  def YMM5: RegisterWithSubRegs<"ymm5", [XMM5, XMM5b]>, DwarfRegNum<[22, 26, 26]>;
  def YMM6: RegisterWithSubRegs<"ymm6", [XMM6, XMM6b]>, DwarfRegNum<[23, 27, 27]>;
  def YMM7: RegisterWithSubRegs<"ymm7", [XMM7, XMM7b]>, DwarfRegNum<[24, 28, 28]>;
  def YMM8: RegisterWithSubRegs<"ymm8", [XMM8, XMM8b]>, DwarfRegNum<[25, -2, -2]>;
  def YMM9: RegisterWithSubRegs<"ymm9", [XMM9, XMM9b]>, DwarfRegNum<[26, -2, -2]>;
  def YMM10: RegisterWithSubRegs<"ymm10", [XMM10, XMM10b]>, DwarfRegNum<[27, -2, -2]>;
  def YMM11: RegisterWithSubRegs<"ymm11", [XMM11, XMM11b]>, DwarfRegNum<[28, -2, -2]>;
  def YMM12: RegisterWithSubRegs<"ymm12", [XMM12, XMM12b]>, DwarfRegNum<[29, -2, -2]>;
  def YMM13: RegisterWithSubRegs<"ymm13", [XMM13, XMM13b]>, DwarfRegNum<[30, -2, -2]>;
  def YMM14: RegisterWithSubRegs<"ymm14", [XMM14, XMM14b]>, DwarfRegNum<[31, -2, -2]>;
  def YMM15: RegisterWithSubRegs<"ymm15", [XMM15, XMM15b]>, DwarfRegNum<[32, -2, -2]>;
  }

- Elena

Here is a test case that produces wrong code:

declare <16 x float> @foo(<16 x float>)

define <16 x float> @test(<16 x float> %x, <16 x float> %y) nounwind {
entry:
  %x1 = fadd <16 x float> %x, %y
  %call = call <16 x float> @foo(<16 x float> %x1) nounwind
  %y1 = fsub <16 x float> %call, %y
  ret <16 x float> %y1
}
./llc -mattr=+avx -mtriple=x86_64-win32 < test.ll
        .def test;
        .scl 2;
        .type 32;
        .endef
        .text
        .globl test
        .align 16, 0x90
test: # @test
# BB#0: # %entry
        pushq %rbp
        movq %rsp, %rbp
        subq $64, %rsp
        vmovaps %xmm7, -32(%rbp) # 16-byte Spill
        vmovaps %xmm6, -16(%rbp) # 16-byte Spill
        vmovaps %ymm3, %ymm6
        vmovaps %ymm2, %ymm7
        vaddps %ymm7, %ymm0, %ymm0
        vaddps %ymm6, %ymm1, %ymm1
        callq foo
        vsubps %ymm7, %ymm0, %ymm0
        vsubps %ymm6, %ymm1, %ymm1
        vmovaps -16(%rbp), %xmm6 # 16-byte Reload
        vmovaps -32(%rbp), %xmm7 # 16-byte Reload
        addq $64, %rsp
        popq %rbp
        ret

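ymm6 and ymm7 hold live values across the call, but only their low halves (xmm6, xmm7) are preserved by the callee, so the full ymm6 and ymm7 values are not saved across the call.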

I have a fix and can send it for review.

- Elena

> This is the wrong code:
>
> [test case snipped]

Thanks.

> ./llc -mattr=+avx -mtriple=x86_64-win32 < test.ll
> [...]
>        vmovaps %xmm7, -32(%rbp) # 16-byte Spill
>        vmovaps %xmm6, -16(%rbp) # 16-byte Spill
>        vmovaps %ymm3, %ymm6
>        vmovaps %ymm2, %ymm7
> [...]
>        callq foo
> [...]
>        vmovaps -16(%rbp), %xmm6 # 16-byte Reload
>        vmovaps -32(%rbp), %xmm7 # 16-byte Reload
> [...]
>
> ymm6,ymm7 are not saved across the call.

The xmm spills and reloads are correct: they are the prologue and epilogue code preserving the callee-saved xmm6 and xmm7 registers.

However, you are correct that ymm6 and ymm7 can't be used as callee-saved registers.

> We support Win64, that's right.
> We defined the upper part of YMM like this
>
> // XMM Registers, used by the various SSE instruction set extensions.
> // Theses are actually only needed for implementing the Win64 CC with AVX.
> def XMM0b: Register<"xmm0b">, DwarfRegNum<[17, 21, 21]>;
> [...]
> def XMM15b: Register<"xmm15b">, DwarfRegNum<[32, -2, -2]>;

There is no need to define all these fake registers. One is enough:

def YMM_UPPER : Register<"ymmupper"> {
  let Aliases = [ YMM0, YMM1, ..., YMM15 ];
};

It doesn't need to be a sub-register either. Aliasing is good enough.

/jakob