Possible bug in x86 frame lowering with SSE instructions?

Hello, everyone.

I'm looking for some insight into a bug I encountered while testing
some custom IR passes on Solaris (x86) and Linux. I don't know if it's
a bug with the x86 backend or the way the frame is set up by Solaris
-- or if I'm simply doing something I shouldn't be doing. The bug
manifests even if I don't run any of my passes, so I'm certain those
aren't the issue.

Given the following test C code:

    int main(int argc, char **argv) {
      int x[10] = {1,2,3};
      return 0;
    }

I compile it to IR with the following arguments:

    clang --target=i386-sun-solaris -S -emit-llvm -Xclang \
      -disable-O0-optnone -x c -c array-test.c -o array-test.ll

This yields the following IR:

    target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128"
    target triple = "i386-sun-solaris"

    ; Function Attrs: noinline nounwind
    define dso_local i32 @main(i32 %0, i8** %1) #0 {
      %3 = alloca i32, align 4
      %4 = alloca i32, align 4
      %5 = alloca i8**, align 4
      %6 = alloca [10 x i32], align 4
      store i32 0, i32* %3, align 4
      store i32 %0, i32* %4, align 4
      store i8** %1, i8*** %5, align 4
      %7 = bitcast [10 x i32]* %6 to i8*
      call void @llvm.memset.p0i8.i32(i8* align 4 %7, i8 0, i32 40, i1 false)
      %8 = bitcast i8* %7 to [10 x i32]*
      %9 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 0
      store i32 1, i32* %9, align 4
      %10 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 1
      store i32 2, i32* %10, align 4
      %11 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 2
      store i32 3, i32* %11, align 4
      ret i32 0
    }

    ; Function Attrs: argmemonly nounwind willreturn writeonly
    declare void @llvm.memset.p0i8.i32(i8* nocapture writeonly, i8, i32, i1 immarg) #1

    attributes #0 = { noinline nounwind
        "correctly-rounded-divide-sqrt-fp-math"="false"
        "disable-tail-calls"="false" "frame-pointer"="all"
        "less-precise-fpmad"="false" "min-legal-vector-width"="0"
        "no-infs-fp-math"="false" "no-jump-tables"="false"
        "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
        "no-trapping-math"="true" "stack-protector-buffer-size"="8"
        "target-cpu"="pentium4"
        "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87"
        "unsafe-fp-math"="false" "use-soft-float"="false" }
    attributes #1 = { argmemonly nounwind willreturn writeonly }

Normally, I would run custom passes at this point via opt. But the
error I'm getting occurs with or without this step.

Without changing anything else, I run this IR through llc with the
following arguments:

    llc --x86-asm-syntax=intel --filetype=asm array-test.ll -o=array-test.s

This results in the following assembly:

            .text
            .intel_syntax noprefix
            .file "/home/user/code/array-test.ll"
            .globl main # -- Begin function main
            .p2align 4, 0x90
            .type main,@function
    main: # @main
    # %bb.0:
            push ebp
            mov ebp, esp
            sub esp, 56
            mov dword ptr [ebp - 4], 0
            xorps xmm0, xmm0
            movaps xmmword ptr [ebp - 56], xmm0
            movaps xmmword ptr [ebp - 40], xmm0
            mov dword ptr [ebp - 20], 0
            mov dword ptr [ebp - 24], 0
            mov dword ptr [ebp - 56], 1
            mov dword ptr [ebp - 52], 2
            mov dword ptr [ebp - 48], 3
            xor eax, eax
            add esp, 56
            pop ebp
            ret
    .Lfunc_end0:
            .size main, .Lfunc_end0-main
                                            # -- End function
            .ident "clang version 12.0.0 (https://github.com/llvm/llvm-project.git 62dbbcf6d7c67b02fd540a5a1e55c494bf88adea)"
            .section ".note.GNU-stack","",@progbits

Other than the target being i386-sun-solaris, this is exactly the same
code that is generated if I target i386-pc-linux-gnu.

If I run this on Linux (Ubuntu 18.04 in this case), there are no
problems. If I run this on Solaris, however, a segfault occurs on the
first `movaps` instruction. I believe the issue is that the stack is
only 4-byte aligned on Solaris whereas it's 16-byte aligned on Linux,
so the 56- and 40-byte offsets for the array stores just happen to
work on Linux -- while they end up being 8 bytes off on Solaris.
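
To make the arithmetic concrete, here is a small C sketch of the
frame-pointer prologue above. The helper names and the entry ESP
values are made up for illustration; the only hard fact it relies on
is that `movaps` needs a 16-byte-aligned memory operand.

```c
#include <stdbool.h>
#include <stdint.h>

/* EBP after `push ebp; mov ebp, esp`, given ESP at function entry. */
static uint32_t frame_pointer(uint32_t entry_esp) {
    return entry_esp - 4u;  /* push ebp moves ESP down by 4 bytes */
}

/* movaps faults unless its memory operand is 16-byte aligned. */
static bool movaps_ok(uint32_t addr) {
    return (addr & 15u) == 0;
}

/* Address used by the first vector store, `movaps [ebp - 56], xmm0`. */
static uint32_t first_store_addr(uint32_t entry_esp) {
    return frame_pointer(entry_esp) - 56u;
}
```

With a hypothetical entry ESP of 0xffffd00c (so that ESP + 4 is a
multiple of 16), first_store_addr() gives 0xffffcfd0, which passes
movaps_ok(). With a hypothetical entry ESP of 0xffffd004 (only 4-byte
aligned), it gives 0xffffcfc8, which is 8 bytes off and would fault.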

Running llc with --stackrealign fixes the problem:

    main: # @main
    # %bb.0:
            push ebp
            mov ebp, esp
            and esp, -16
            sub esp, 64
            mov dword ptr [esp + 12], 0
            xorps xmm0, xmm0
            movaps xmmword ptr [esp + 16], xmm0
            movaps xmmword ptr [esp + 32], xmm0
            mov dword ptr [esp + 52], 0
            mov dword ptr [esp + 48], 0
            mov dword ptr [esp + 16], 1
            mov dword ptr [esp + 20], 2
            mov dword ptr [esp + 24], 3
            xor eax, eax
            mov esp, ebp
            pop ebp
            ret
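
The effect of the realigning prologue can be sketched the same way in
C. This only mirrors the four prologue instructions shown above; the
function name and the entry ESP values are hypothetical.

```c
#include <stdint.h>

/* ESP after `push ebp; mov ebp, esp; and esp, -16; sub esp, 64`. */
static uint32_t realigned_esp(uint32_t entry_esp) {
    uint32_t esp = entry_esp - 4u;  /* push ebp */
    esp &= ~UINT32_C(15);           /* and esp, -16: round down to 16 */
    return esp - 64u;               /* sub esp, 64 keeps the alignment */
}
```

Whatever 4-byte-aligned value ESP has on entry, the result is a
multiple of 16, so the `[esp + 16]` and `[esp + 32]` operands of the
two `movaps` stores are aligned as well.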

Running clang with -fomit-frame-pointer also fixes the problem, but I
have no idea why. Adding --stack-alignment=16 does *not* fix the
problem. If I explicitly pass -O0 to llc, the array stores are no
longer lowered to `movaps` (the choice made by
`X86TargetLowering::getOptimalMemOpType()`); the memset stays a
library call:

    main: # @main
    # %bb.0:
            push ebp
            mov ebp, esp
            push esi
            sub esp, 68
            mov eax, dword ptr [ebp + 12]
            mov ecx, dword ptr [ebp + 8]
            xor edx, edx
            mov dword ptr [ebp - 8], 0
            lea esi, [ebp - 48]
            mov dword ptr [esp], esi
            mov dword ptr [esp + 4], 0
            mov dword ptr [esp + 8], 40
            mov dword ptr [ebp - 52], eax # 4-byte Spill
            mov dword ptr [ebp - 56], ecx # 4-byte Spill
            mov dword ptr [ebp - 60], edx # 4-byte Spill
            call memset
            mov dword ptr [ebp - 48], 1
            mov dword ptr [ebp - 44], 2
            mov dword ptr [ebp - 40], 3
            mov eax, dword ptr [ebp - 60] # 4-byte Reload
            add esp, 68
            pop esi
            pop ebp
            ret

I've spent the better part of ten hours trying to debug the X86
backend code (and I am, admittedly, not the best at knowing where to
look). I determined the `X86FrameLowering::emitPrologue()` function
will *only* emit the stack realignment (the `and esp, -16` above) if
`X86RegisterInfo::needsStackRealignment()` returns `true`, and the
only thing that seems to force it to return `true` is if
--stackrealign is used (which sets the "stackrealign" function
attribute on `main`).
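
One way to read that behaviour, written as a C sketch. This is a
guess at the shape of the decision, not the LLVM source: the function
and parameter names are hypothetical, and the rule itself (realign
only when forced, or when some stack object wants more alignment than
the backend assumes the ABI already provides) is an assumption.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the realignment decision.
 * max_object_align:    largest alignment wanted by any stack object
 * assumed_stack_align: alignment the backend believes the ABI provides
 * stackrealign_attr:   true when the "stackrealign" attribute is set */
static bool needs_realign_model(unsigned max_object_align,
                                unsigned assumed_stack_align,
                                bool stackrealign_attr) {
    return stackrealign_attr || max_object_align > assumed_stack_align;
}
```

Under this model, a 16-byte object on a stack the backend assumes to
be 16-byte aligned gets no realignment unless "stackrealign" is set,
which would match what I see. If the assumed alignment were 4, the
same object would trigger it.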

I don't know if this is truly a bug in the X86 backend (an assumption
about the ABI on Linux vs. Solaris? Maybe? I'm truly guessing...) or
if this is a result of me using -disable-O0-optnone in Clang without
-O0 in llc.

Any insight would be helpful, and thanks for reading my rather verbose message.

Hi Jonathan,

It seems trunk already fixes this problem: https://godbolt.org/z/Y1Wdbj
I took a look at the x86 ABI: https://gitlab.com/x86-psABIs/i386-ABI/-/tree/hjl/x86/1.1#
It says "In other words, the value (%esp + 4) is always a multiple of 16 (32 or 64) when control is transferred to the function entry point."
So if the OS follows the ABI, ESP should always be 0xXXXXXXXC on entry to a function, and it becomes 0xXXXXXXX8 after "push ebp", which happens to be 8-byte aligned.

Thanks
Pengfei

Interesting. Thank you.

I'm still curious to know what commit fixed this problem, although it
sounds like it's also a problem with how Solaris is implementing the
ABI.

I suppose it's time for me to go hunting through commits.

For what it's worth, I found the patch:

https://reviews.llvm.org/D87615