Problems with custom calling convention on Mac OS X

Hi all,

I'm working on using LLVM as a back-end for the Haskell GHC compiler. As part of that work I have modified LLVM to include a new custom calling convention for performance reasons as outlined previously in a conversation on this mailing list:

http://nondot.org/sabre/LLVMNotes/GlobalRegisterVariables.txt

This custom calling convention on x86-32 needs to handle just 4 parameters, passing them in %ebx, %ebp, %esi, %edi. These are all callee saved registers. To implement the custom calling convention I changed LLVM in two places:

1) X86CallingConv.td : Add new calling convention.
2) X86RegisterInfo.cpp : Modify 'getCalleeSavedRegs' to remove the above registers from the callee saved registers.
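
To make the second change concrete, here is a minimal standalone sketch in C of the idea: registers that carry arguments under the custom convention can no longer be treated as callee saved. This is a model, not LLVM code. The enum names, the zero-terminated list layout, and the choice to select the list per calling convention are all assumptions made for illustration; the thread does not say whether the actual patch was conditional on the convention.

```c
#include <stddef.h>

/* Hypothetical register ids; REG_NONE terminates a list. */
enum reg { REG_NONE = 0, REG_EBX, REG_EBP, REG_ESI, REG_EDI };

/* Hypothetical calling-convention ids. */
enum callconv { CC_C, CC_GHC };

/* Under the ordinary C convention these four registers are callee saved. */
static const enum reg callee_saved_c[] = {
    REG_ESI, REG_EDI, REG_EBX, REG_EBP, REG_NONE
};

/* Under the custom convention all four carry arguments, so none is saved. */
static const enum reg callee_saved_ghc[] = { REG_NONE };

/* Return the zero-terminated callee-saved list for a convention. */
const enum reg *callee_saved_regs(enum callconv cc) {
    return cc == CC_GHC ? callee_saved_ghc : callee_saved_c;
}

/* Count the entries in the list for a convention. */
size_t count_callee_saved(enum callconv cc) {
    size_t n = 0;
    for (const enum reg *r = callee_saved_regs(cc); *r != REG_NONE; ++r)
        ++n;
    return n;
}

/* Nonzero if register r is callee saved under convention cc. */
int is_callee_saved(enum callconv cc, enum reg r) {
    for (const enum reg *p = callee_saved_regs(cc); *p != REG_NONE; ++p)
        if (*p == r)
            return 1;
    return 0;
}
```

In this model, is_callee_saved(CC_C, REG_EBP) is nonzero while is_callee_saved(CC_GHC, REG_EBP) is zero, which is the property the custom convention relies on.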

This mostly works. On Linux, the generated code passes the GHC testsuite. On Mac, however, the GHC testsuite fails on any code which uses the FFI (which is implemented by libffi [http://sourceware.org/libffi/]). Failing programs segfault with the error '__dyld_misaligned_stack_error'. From my investigations, the issue seems to be that the stack should be 16-byte aligned at the FFI call, as per Mac OS X's ABI.

I'm hoping someone is able to confirm that my changes would have introduced this bug and how to go about fixing it.

Another minor issue is that the generated code has a strong tendency to manipulate the stack pointer when it's not required. For a large number of functions, the generated code will start and finish with sp manipulation to give the function some space, despite the function not otherwise using the stack.

e.g.

Utils_doStatefulOp1_entry:
   subl $4, %esp
   movl 4(%ebp), %eax
   movl 8(%ebp), %ecx
   movl (%ebp), %esi
   movl %eax, 8(%ebp)
   movl %ecx, 4(%ebp)
   addl $4, %ebp
   addl $4, %esp
   jmp stg_ap_pp_fast

It would be nice to fix this up as well.

Cheers,
David

David Terei wrote:

Hi all,

I'm working on using LLVM as a back-end for the Haskell GHC compiler. As
part of that work I have modified LLVM to include a new custom calling
convention for performance reasons as outlined previously in a
conversation on this mailing list:

http://nondot.org/sabre/LLVMNotes/GlobalRegisterVariables.txt

This custom calling convention on x86-32 needs to handle just 4
parameters, passing them in %ebx, %ebp, %esi, %edi. These are all callee
saved registers. To implement the custom calling convention I change
llvm in two places:

1) X86CallingConv.td : Add new calling convention.
2) X86RegisterInfo.cpp : Modify 'getCalleeSavedRegs' to remove the above
registers from the callee saved registers.

This works fine mostly. On Linux, the generated code passes the GHC
testsuite. On Mac however the GHC testsuite fails on any code which uses
the ffi (which is implemented by libffi
[http://sourceware.org/libffi/]). Programs which fail segfault with the
error '__dyld_misaligned_stack_error'. The issue seems to be from my
investigations that the ffi call should be 16-byte aligned as per Mac OS
X's ABI.

That's correct. Therefore, when generating code, you must ensure that
the stack is 16-byte aligned, one way or another, if you want to make a
dylib call on Mac OS X.
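
As a small illustration of what "16-byte aligned" means in practice, the following C sketch checks an address for alignment and rounds it down to the previous 16-byte boundary. Masking with ~15 is the same operation that an x86 `andl $-16, %esp` performs on the stack pointer. The function names and the sample address are hypothetical; this is only the arithmetic, not a fix for the code generator.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if align is a power of two (required for the mask trick). */
int is_power_of_two(uintptr_t align) {
    return align != 0 && (align & (align - 1)) == 0;
}

/* Round sp down to a multiple of align. The stack grows downwards on
 * x86, so rounding down only ever reserves extra space. */
uintptr_t align_down(uintptr_t sp, uintptr_t align) {
    assert(is_power_of_two(align));
    return sp & ~(align - 1);
}

/* Nonzero if sp is already a multiple of align. */
int is_aligned(uintptr_t sp, uintptr_t align) {
    assert(is_power_of_two(align));
    return (sp & (align - 1)) == 0;
}
```

For example, with a hypothetical stack pointer of 0xbffff7ec, is_aligned(sp, 16) is zero and align_down(sp, 16) gives 0xbffff7e0.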

I'm hoping someone is able to confirm that my changes would have
introduced this bug and how to go about fixing it.

Something I'm doing right now may be of interest to you.

Just today I added support for a new 'alignstack' function attribute.
With it, you can force the stack to be 16-byte aligned (or n-byte
aligned, if you so desire) in your functions. This way, you can make
calls to dylibs on Mac OS X without triggering the misaligned stack error.

Of course, it's a no-op right now (I still have to do the backend work).
But if you add this attribute to the emitted LLVM functions, when I do
finish this, the stack will be properly aligned and you won't get this
error anymore. It will cost you, though, in terms of code size and
performance.

Another minor issue is that the generated code has a strong tendency to
manipulate the stack pointer when it's not required. For a large number
of functions, the generated code will start and finish with sp
manipulation to give the function some space despite the function not
otherwise using the stack.

e.g

Utils_doStatefulOp1_entry:
   subl $4, %esp
   movl 4(%ebp), %eax
   movl 8(%ebp), %ecx
   movl (%ebp), %esi
   movl %eax, 8(%ebp)
   movl %ecx, 4(%ebp)
   addl $4, %ebp
   addl $4, %esp
   jmp stg_ap_pp_fast

It would be nice to fix this up as well.

That's normal. But it would be nice to fix this.

Chip

Sounds great. I'd appreciate if you would ping me or the list when
this lands. Do you think it will make the 2.7 release?

I was hoping to fix this up now, though, and solve the actual issue. Are
you or anyone else aware of how to do this?

~ David

David Terei wrote:

Sounds great. I'd appreciate if you would ping me or the list when
this lands. Do you think it will make the 2.7 release?

Landed! (For x86, at least.)

Chip

Charles Davis wrote:

Just today I added support for a new 'alignstack' function attribute.
With it, you can force the stack to be 16-byte aligned (or n-byte
aligned, if you so desire) in your functions. This way, you can make
calls to dylibs on Mac OS X without triggering the misaligned stack error.

I finally got around to properly playing around with 'alignstack' today and encountered a problem. It works as specified, aligning the stack properly, but it interacts badly with the GHC calling convention. The problem is that the GHC calling convention uses unconventional registers for argument passing. On x86-32 these are the four registers ebx, ebp, esi and edi. The 'alignstack' attribute causes the ebp register to be clobbered.

e.g.

_s1eJ_ret:
## BB#0:
    pushl %ebp
    movl %esp, %ebp
    andl $-16, %esp
    subl $32, %esp
    movl %ebp, 16(%esp) ## 4-byte Spill
    [...]
    movl 16(%esp), %eax ## 4-byte Reload
    movl %edi, -4(%eax)
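
To spell out why the listing above loses the argument, here is a small C model that steps through the same prologue on a simulated stack. It is a sketch under stated assumptions: the struct, the function names, the base address 0x1000 and the stack size are all hypothetical. It shows that the value stored by the spill is the new frame pointer rather than the incoming ebp argument, which by then only survives in the slot written by the push.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 64 };

/* Tiny model: two registers plus a word-addressed stack region. */
struct machine {
    uint32_t ebp, esp;
    uint32_t base;               /* lowest modelled address */
    uint32_t stack[STACK_WORDS];
};

/* Map a modelled address to its storage slot. */
uint32_t *slot(struct machine *m, uint32_t addr) {
    assert(addr >= m->base && addr < m->base + 4u * STACK_WORDS);
    assert(addr % 4 == 0);
    return &m->stack[(addr - m->base) / 4];
}

/* Start with ebp holding an incoming argument, esp near the top. */
struct machine machine_init(uint32_t incoming_ebp) {
    struct machine m = {0};
    m.base = 0x1000u;
    m.esp = m.base + 4u * STACK_WORDS - 4u;
    m.ebp = incoming_ebp;
    return m;
}

/* Run the prologue from the listing; return the spilled value. */
uint32_t run_aligned_prologue(struct machine *m) {
    m->esp -= 4;                       /* pushl %ebp           */
    *slot(m, m->esp) = m->ebp;
    m->ebp = m->esp;                   /* movl %esp, %ebp      */
    m->esp &= ~(uint32_t)15;           /* andl $-16, %esp      */
    m->esp -= 32;                      /* subl $32, %esp       */
    *slot(m, m->esp + 16) = m->ebp;    /* movl %ebp, 16(%esp)  */
    return *slot(m, m->esp + 16);
}

/* The incoming ebp value now lives only at the address held in ebp. */
uint32_t saved_incoming_ebp(struct machine *m) {
    return *slot(m, m->ebp);
}
```

Starting from machine_init(0xCAFE0000), run_aligned_prologue returns the frame address held in ebp, not 0xCAFE0000, while saved_incoming_ebp still yields 0xCAFE0000. The later reload and store through %eax therefore address memory relative to the frame rather than through the argument.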

I'm not really sure what to do about this at the moment. I will keep investigating, but perhaps you have an idea?

Cheers,
David

Hello, David

I'm not really sure what to do about this at the moment. I will keep
investigating, but perhaps you have an idea?

What is used as the frame pointer register for the GHC CC?