Best way to interface with MSVC _ftol2 runtime function for fptoui?

Hi everyone. On i386--win32 targets, LLVM tries to use the MSVCRT
routine _ftol2 for floating-point to unsigned conversions, but this
function has a nonstandard calling convention LLVM doesn't understand.
It takes its input operand on the x87 stack as ST0, which it pops off
of the stack before returning. The return value is given in EDX:EAX.
In effect, I need to call it like this:

%1 = call i64 asm "call __ftol2",
"=A,{st},~{dirflag},~{fpsr},~{flags}" (double %x) nounwind

but with the added consideration that the input operand is popped by
the call, so the caller can't emit its own fstp instruction afterward.
LLVM inline asm doesn't appear to be capable of communicating this. In
#llvm it was suggested to write a custom instruction for the call, but
it looks like there are a few layers of abstraction in the X86 target
for dealing with x87, and I can't quite grasp exactly how instruction
stack effects are communicated. What would be the best approach to
implement proper support for this runtime call?

-Joe
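To make the stack effect concrete, here is a minimal C++ sketch of the contract described above. It is only a toy model: `X87Stack` and `model_ftol2` are hypothetical names invented for illustration, the vector stands in for the x87 register stack, and the conversion assumes the input fits in a 64-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the x87 register stack; the back of the vector plays ST0.
struct X87Stack {
    std::vector<double> regs;

    void fld(double v) { regs.push_back(v); }  // push a new ST0
    double st0() const { assert(!regs.empty()); return regs.back(); }
    void pop() { assert(!regs.empty()); regs.pop_back(); }  // like fstp %st(0)
    std::size_t depth() const { return regs.size(); }
};

// Hypothetical stand-in for the _ftol2 contract: read ST0, pop it, and
// return the converted value (EDX:EAX viewed as one 64-bit integer).
// Only meaningful for inputs that fit in int64_t.
std::int64_t model_ftol2(X87Stack &fp) {
    double in = fp.st0();
    fp.pop();                              // the callee pops its own argument
    return static_cast<std::int64_t>(in);  // C++ casts truncate toward zero
}
```

After the modeled call the stack is already one entry shallower, so any additional pop emitted on the calling side would discard an unrelated live value.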

Hi everyone. On i386--win32 targets, LLVM tries to use the MSVCRT
routine _ftol2 for floating-point to unsigned conversions, but this
function has a nonstandard calling convention LLVM doesn't understand.
It takes its input operand on the x87 stack as ST0, which it pops off
of the stack before returning. The return value is given in EDX:EAX.
In effect, I need to call it like this:

%1 = call i64 asm "call __ftol2",
"=A,{st},~{dirflag},~{fpsr},~{flags}" (double %x) nounwind

but with the added consideration that the input operand is popped by
the call, so the caller can't emit its own fstp instruction afterward.
LLVM inline asm doesn't appear to be capable of communicating this.

This should work:

%1 = call i64 asm "call __ftol2", "=A,{st},~{dirflag},~{fpsr},~{flags},~{st}" (double %x) nounwind

See http://llvm.org/viewvc/llvm-project/llvm-gcc-4.2/trunk/gcc/reg-stack.c?view=markup
And the INLINEASM handling in http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86FloatingPoint.cpp?view=markup

In #llvm it was suggested to write a custom instruction for the call,
but it looks like there are a few layers of abstraction in the X86
target for dealing with x87, and I can't quite grasp exactly how
instruction stack effects are communicated. What would be the best
approach to implement proper support for this runtime call?

If the inline asm works for you, use it. It is currently the only way of supporting callee-popped arguments. Normal instructions can't do it because it depends on treating the inline asm clobber list differently from otherwise clobbered registers.

/jakob

Thanks Jakob, the ~{st} constraint does the trick. It wasn't clear to
me that "clobbers" means "pops" for x87 registers.

-Joe
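The "clobbers means pops" rule can be sketched as a small bitmask computation. This is a simplified model of the GCC-style convention that an x87 input which is also listed as clobbered is popped by the asm; it is not the actual X86FloatingPoint.cpp code. All names here (`Kind`, `StOperand`, `StMasks`, `collectMasks`, `poppedInputs`) are hypothetical, and inputs tied to outputs are not modeled.

```cpp
#include <cassert>
#include <vector>

// One x87 operand of an inline asm, in simplified form.
enum class Kind { Use, Def, Clobber };
struct StOperand {
    Kind kind;
    unsigned stIndex;  // 0 means ST0, 1 means ST1, ...
};

// Bit i of each mask corresponds to ST(i).
struct StMasks {
    unsigned uses = 0;
    unsigned defs = 0;
    unsigned clobbers = 0;
};

StMasks collectMasks(const std::vector<StOperand> &ops) {
    StMasks m;
    for (const StOperand &op : ops) {
        assert(op.stIndex < 8 && "x87 has eight stack registers");
        unsigned bit = 1u << op.stIndex;
        switch (op.kind) {
        case Kind::Use:     m.uses |= bit; break;
        case Kind::Def:     m.defs |= bit; break;
        case Kind::Clobber: m.clobbers |= bit; break;
        }
    }
    return m;
}

// An input that is also clobbered is treated as popped by the asm.
unsigned poppedInputs(const StMasks &m) { return m.uses & m.clobbers; }
```

In this model the first constraint string in the thread, with only `{st}`, yields no popped inputs, while adding `~{st}` marks ST0 as popped.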

Forgive me for being slow, but what would be the best way to implement
the equivalent of that inline asm as a custom lowering for an
instruction? Can I just create a CallInst and tell it to lower that
instead, or do I need to replicate the functionality manually as DAG
nodes?

-Joe

On second thought, it might not be the best approach to produce inline asm during lowering.

How many of these libcalls do you need to implement? What exactly is the calling convention? Which registers are clobbered, etc.?

/jakob

There is only one (that I know about so far). The MSVCRT `_ftol2`
function implements floating-point-to-unsigned conversion for i386
targets, and LLVM 3.0 calls it with the cdecl calling convention for
`fptoui to i64` when targeting i386-pc-win32. However, it has its own
calling convention: The input value is taken from ST0 and popped off
of the x87 stack, and the return value is given in EDX:EAX. EAX, EDX,
and ST0 are clobbered (the latter by popping the stack). The function
creates a stack frame. It messes with the x87 control word internally,
but the original control word is restored before returning.

-Joe
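For readers unfamiliar with the EDX:EAX return convention mentioned above, the following C++ sketch shows how a 64-bit result maps onto the two 32-bit registers. `EdxEax`, `splitResult`, and `joinResult` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// The two 32-bit halves of a 64-bit return value.
struct EdxEax {
    std::uint32_t eax;  // low 32 bits
    std::uint32_t edx;  // high 32 bits
};

// Split a 64-bit value the way it would be returned in EDX:EAX.
EdxEax splitResult(std::int64_t v) {
    std::uint64_t u = static_cast<std::uint64_t>(v);
    return {static_cast<std::uint32_t>(u),
            static_cast<std::uint32_t>(u >> 32)};
}

// Reassemble the value on the caller's side.
// The final cast is modular on two's-complement targets.
std::int64_t joinResult(EdxEax r) {
    std::uint64_t u = (static_cast<std::uint64_t>(r.edx) << 32) | r.eax;
    return static_cast<std::int64_t>(u);
}
```

A caller that only needs a 32-bit result can simply ignore the EDX half.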

Alright. We definitely don't want to model it as a general call, then. Normal calls clobber lots of registers.

The options are:

1. Use a pseudo-instruction that X86FloatingPoint understands and turns into a call after arranging for the argument to be in ST0.
   You should emit:

   %ST0 = COPY %vreg13; RFP80:%vreg13
   %EAX, %EDX = FTOL2 %ST0<kill>
   %vreg16 = COPY %EAX<kill>
   %vreg17 = COPY %EDX<kill>

   Then teach X86FloatingPoint that FTOL2 pops its argument, like FISTP64m.

2. Use inline asm, which is pretty gross. You would need to construct a SelectionDAG node identical to the one produced by real inline asm. There should be some help in include/llvm/InlineAsm.h.

I am not sure which is worse, but if there are multiple libcalls like this, you should go with something like 1.

Please investigate if there are other libcalls like this. If so, we should work out a proper solution instead of the inlineasm hack.

/jakob

The integer runtime functions (_allmul, _alldiv, etc. for 64-bit
integer arithmetic) all appear to be straight-up stdcall. _ftol2 is
the only weird one. (There is an _ftol routine with the same calling
convention as _ftol2, but AFAIK it's only for backward compatibility
with older MSVC runtimes.) I'm far from an MSVC expert, though.

Are there any docs for X86FloatingPoint? At a glance the FISTP etc.
definitions look just like the FIST etc. definitions; where is the
stack handled?

-Joe

The integer runtime functions (_allmul, _alldiv, etc. for 64-bit
integer arithmetic) all appear to be straight-up stdcall. _ftol2 is
the only weird one. (There is an _ftol routine with the same calling
convention as _ftol2, but AFAIK it's only for backward compatibility
with older MSVC runtimes.) I'm far from an MSVC expert, though.

Thanks.

Are there any docs for X86FloatingPoint?

X86FloatingPoint.cpp with comments is all you get.

/jakob

Thanks for your help, Jakob. Attached is a first-pass attempt at a
patch. I don't want to post to -commits yet because I have no idea if
this is fully correct, but it seems to work in simple test cases. Am I
on the right track? Could this patch ever break in cases where the
operand's vreg doesn't happen to get mapped to ST0? I'm still a bit
foggy on the internals of X86FloatingPoint.

One thing I noticed is that fptosi and fptoui both seem to always emit
a redundant SSE load/store when SSE is enabled, because of the check
at Target/X86/X86ISelLowering.cpp:7948. Can this check be easily
modified so it doesn't store if the operand is already in memory and
not actually in an SSE register? Should FP_TO_INTHelper switch over to
using CVTTS?2SI insns when SSE is available?

-Joe

llvm-ftol2.diff (12.9 KB)

X86FloatingPoint.cpp with comments is all you get.

Thanks for your help, Jakob. Attached is a first-pass attempt at a
patch. I don't want to post to -commits yet because I have no idea if
this is fully correct, but it seems to work in simple test cases. Am I
on the right track?

Yes, your definition of the new instruction looks sane.

However, you shouldn't expand the instruction right away in EmitInstrWithCustomInserter(), and leaving the pseudo and call instructions side by side is not going to work.

Just leave the pseudo-instruction alone until it hits X86FloatingPoint, where you can rewrite it.

Could this patch ever break in cases where the
operand's vreg doesn't happen to get mapped to ST0?

Yes, exactly. You need to make some more complicated test cases.

I'm still a bit
foggy on the internals of X86FloatingPoint.

Look at the code handling INLINE_ASM. You need to do the same, except you have fixed arguments STUses=1 and STClobbers=1, ST*=0. That should greatly simplify the code you need.
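The concern about the operand not sitting in ST0 can be illustrated with a toy stackifier. This is a sketch under simplifying assumptions, not the real X86FloatingPoint pass: `FpStackModel`, `moveToTop`, and `lowerFtol2` are hypothetical names, virtual registers are plain integers, and the argument is assumed to be killed by the call (if it were still live afterward, it would have to be duplicated first, which is not modeled).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model: which virtual FP register lives in each x87 stack slot.
// The back of `slots` is ST0.
struct FpStackModel {
    std::vector<int> slots;            // virtual register numbers
    std::vector<std::string> emitted;  // instructions we would emit

    // Bring `vreg` to ST0, emitting an fxch if it sits deeper in the stack.
    void moveToTop(int vreg) {
        auto it = std::find(slots.begin(), slots.end(), vreg);
        assert(it != slots.end() && "vreg is not live on the stack");
        std::size_t depth = static_cast<std::size_t>(slots.end() - it) - 1;
        if (depth != 0) {  // depth 0 means it is already ST0
            std::iter_swap(it, slots.end() - 1);
            emitted.push_back("fxch %st(" + std::to_string(depth) + ")");
        }
    }

    // Lower the pseudo: the argument must be in ST0, and the callee pops it.
    void lowerFtol2(int vreg) {
        moveToTop(vreg);
        emitted.push_back("call __ftol2");
        slots.pop_back();  // popped by the callee, so no fstp is emitted
    }
};
```

A test case that keeps another x87 value live across the conversion forces the exchange path here, which is the kind of situation a simple test would miss.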

One thing I noticed is that fptosi and fptoui both seem to always emit
a redundant SSE load/store when SSE is enabled, because of the check
at Target/X86/X86ISelLowering.cpp:7948. Can this check be easily
modified so it doesn't store if the operand is already in memory and
not actually in an SSE register? Should FP_TO_INTHelper switch over to
using CVTTS?2SI insns when SSE is available?

When SSE is available, x87 registers are only ever used for f80.

/jakob

Yes, your definition of the new instruction looks sane.

However, you shouldn't expand the instruction right away in EmitInstrWithCustomInserter(), and leaving the pseudo and call instructions side by side is not going to work.

Just leave the pseudo-instruction alone until it hits X86FloatingPoint, where you can rewrite it.

Look at the code handling INLINE_ASM. You need to do the same, except you have fixed arguments STUses=1 and STClobbers=1, ST*=0. That should greatly simplify the code you need.

That makes sense; thanks for the tip. Are the getCopyToReg(ST0) and
addReg(ST0, ImplicitKill) calls on the expanded MI necessary at all,
then, since X86FloatingPoint seems to manage all of that internally?

When SSE is available, x87 registers are only ever used for f80.

It looks like it always tries to use fisttp when converting to i64.
This bitcode:

define i64 @foo(double %x) nounwind readnone {
init:
  %0 = fptosi double %x to i64
  ret i64 %0
}

gets compiled by LLVM 3.0 to:

_foo: # @foo
# BB#0: # %init
  subl $20, %esp
  movsd 24(%esp), %xmm0
  movsd %xmm0, 8(%esp)
  fldl 8(%esp)
  fisttpll (%esp)
  movl (%esp), %eax
  movl 4(%esp), %edx
  addl $20, %esp
  ret

with a seemingly redundant movsd pair before the fisttp instruction.

-Joe

Here is a patch that integrates the WIN_FTOL pseudo-insn a bit better
with the floating-point register allocator by following a simplified
version of what InlineAsm does, as you suggested. Does this look
closer to being right? Thanks again for all your help.

-Joe

llvm-ftol2-2.diff (13.4 KB)