Hi everyone. On i386--win32 targets, LLVM tries to use the MSVCRT
routine _ftol2 for floating-point to unsigned conversions, but this
function has a nonstandard calling convention LLVM doesn't understand.
It takes its input operand on the x87 stack as ST0, which it pops off
of the stack before returning. The return value is given in EDX:EAX.
In effect, I need to call it like this:
%1 = call i64 asm "call __ftol2",
"=A,{st},~{dirflag},~{fpsr},~{flags}" (double %x) nounwind
but with the added consideration that the input operand is popped by
the call, so the caller must not emit its own fstp instruction afterward.
LLVM inline asm doesn't appear to be capable of communicating this. In
#llvm it was suggested to write a custom instruction for the call, but
it looks like there are a few layers of abstraction in the X86 target
for dealing with x87, and I can't quite grasp exactly how instruction
stack effects are communicated. What would be the best approach to
implement proper support for this runtime call?
-Joe
This should work:
%1 = call i64 asm "call __ftol2", "=A,{st},~{dirflag},~{fpsr},~{flags},~{st}" (double %x) nounwind
See http://llvm.org/viewvc/llvm-project/llvm-gcc-4.2/trunk/gcc/reg-stack.c?view=markup
And the INLINEASM handling in http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86FloatingPoint.cpp?view=markup
If the inline asm works for you, use it. It is currently the only way of supporting caller-popped arguments. Normal instructions can't do it because it depends on treating the inline asm clobber list differently from otherwise clobbered registers.
/jakob
Thanks Jakob, the ~{st} constraint does the trick. It wasn't clear to
me that "clobbers" means "pops" for x87 registers.
-Joe
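To make the "clobbers means pops" reading concrete, here is a minimal C++ sketch that splits a constraint string into outputs, inputs, and clobbers and reports whether a register is both an input and a clobber. The type `AsmConstraints` and the functions `classifyConstraints` and `isPoppedInput` are hypothetical names for illustration, not LLVM APIs, and the parser ignores constraint modifiers other than the leading `=` and `~`.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper type, not an LLVM class: the three buckets of an
// inline asm constraint string.
struct AsmConstraints {
  std::set<std::string> outputs, inputs, clobbers;
};

// Split a comma-separated constraint string. A leading '=' marks an output,
// a leading '~' marks a clobber, and anything else is treated as an input.
inline AsmConstraints classifyConstraints(const std::string &s) {
  AsmConstraints c;
  std::istringstream in(s);
  std::string item;
  while (std::getline(in, item, ',')) {
    assert(!item.empty() && "empty constraint");
    if (item[0] == '=')
      c.outputs.insert(item.substr(1));
    else if (item[0] == '~')
      c.clobbers.insert(item.substr(1));
    else
      c.inputs.insert(item);
  }
  return c;
}

// An x87 register that is both an input and a clobber is one the asm pops.
inline bool isPoppedInput(const AsmConstraints &c, const std::string &reg) {
  return c.inputs.count(reg) != 0 && c.clobbers.count(reg) != 0;
}
```

With the original constraint string `{st}` is only an input; adding `~{st}` is what makes it a popped input under this reading.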
Forgive me for being slow, but what would be the best way to implement
the equivalent of that inline asm as a custom lowering for an
instruction? Can I just create a CallInst and tell it to lower that
instead, or do I need to replicate the functionality manually as DAG
nodes?
-Joe
On second thought, it might not be the best approach to produce inline asm during lowering.
How many of these libcalls do you need to implement? What exactly is the calling convention? Which registers are clobbered etc.
/jakob
There is only one (that I know about so far). The MSVCRT `_ftol2`
function implements floating-point-to-unsigned conversion for i386
targets, and LLVM 3.0 calls it with the cdecl calling convention for
`fptoui to i64` when targeting i386-pc-win32. However, it has its own
calling convention: The input value is taken from ST0 and popped off
of the x87 stack, and the return value is given in EDX:EAX. EAX, EDX,
and ST0 are clobbered (the latter by popping the stack). The function
creates a stack frame. It messes with the x87 control word internally,
but the original control word is restored before returning.
-Joe
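The convention described above can be sketched as a toy model in C++. This only models the register traffic (ST0 consumed and popped, result split across EDX:EAX); `X87Stack`, `EdxEax`, and `modelFtol2` are hypothetical names, the truncation toward zero is an assumption based on C cast semantics, and out-of-range inputs are not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy x87 register stack: back() plays the role of ST0.
struct X87Stack {
  std::vector<long double> regs;
  void push(long double v) { regs.push_back(v); }
  long double pop() {
    assert(!regs.empty() && "x87 stack underflow");
    long double v = regs.back();
    regs.pop_back();
    return v;
  }
};

// Result register pair: EDX holds the high half, EAX the low half.
struct EdxEax {
  std::uint32_t eax;
  std::uint32_t edx;
};

// Model of the convention only: consume ST0 (popping it), truncate toward
// zero, and hand back the 64-bit result split across EDX:EAX.
inline EdxEax modelFtol2(X87Stack &st) {
  long double v = st.pop();
  std::int64_t r = static_cast<std::int64_t>(v);
  std::uint64_t bits = static_cast<std::uint64_t>(r);
  return {static_cast<std::uint32_t>(bits),
          static_cast<std::uint32_t>(bits >> 32)};
}
```

The point of the model is that the stack is one entry shorter after the call, so a caller that also pops would unbalance it.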
Alright. We definitely don't want to model it as a general call, then. Normal calls clobber lots of registers.
The options are:
1. Use a pseudo-instruction that X86FloatingPoint understands and turns into a call after arranging for the argument to be in ST0.
You should emit:
%ST0 = COPY %vreg13; RFP80:%vreg13
%EAX, %EDX = FTOL2 %ST0<kill>
%vreg16 = COPY %EAX<kill>
%vreg17 = COPY %EDX<kill>
Then teach X86FloatingPoint that FTOL2 pops its argument, like FISTP64m.
2. Use inline asm. Which is pretty gross. You would need to construct a SelectionDAG node identical to the one produced by real inline asm. There should be some help in include/llvm/InlineAsm.h
I am not sure which is worse, but if there are multiple libcalls like this, you should go with something like 1.
Please investigate if there are other libcalls like this. If so, we should work out a proper solution instead of the inlineasm hack.
/jakob
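Option 1 can be illustrated with a toy stackifier in C++: bring the operand to ST0 (with an fxch if needed), emit the call, and record that the callee popped the slot. This is a sketch under stated assumptions, not code from X86FloatingPoint.cpp; `ToyFPStack`, `bringToTop`, and `lowerFtol2` are hypothetical names, and the sketch assumes the operand is killed by the call (a value that stays live afterward would need to be duplicated first).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for x87 stackifier state: stack.back() is ST0 and each
// entry names the virtual register currently living in that slot.
struct ToyFPStack {
  std::vector<int> stack;
  std::vector<std::string> emitted;

  // Bring vreg to ST0, emitting an fxch only if it is not already there.
  void bringToTop(int vreg) {
    auto it = std::find(stack.begin(), stack.end(), vreg);
    assert(it != stack.end() && "vreg is not live on the x87 stack");
    // Distance from the top of the stack; 0 means the value is in ST0.
    std::size_t depth = static_cast<std::size_t>(stack.end() - it) - 1;
    if (depth != 0) {
      std::iter_swap(it, stack.end() - 1);
      emitted.push_back("fxch %st(" + std::to_string(depth) + ")");
    }
  }

  // Lower a hypothetical FTOL2 pseudo: operand to ST0, emit the call, and
  // drop the slot because the callee popped it. No fstp is emitted.
  void lowerFtol2(int vreg) {
    bringToTop(vreg);
    emitted.push_back("call __ftol2");
    stack.pop_back();
  }
};
```

When the operand already sits in ST0 the only output is the call itself, which is the simple case the first-pass tests are likely to hit.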
The integer runtime functions (_allmul, _alldiv, etc. for 64-bit
integer arithmetic) all appear to be straight-up stdcall. _ftol2 is
the only weird one. (There is an _ftol routine with the same calling
convention as _ftol2, but AFAIK it's only for backward compatibility
with older MSVC runtimes.) I'm far from an MSVC expert, though.
Are there any docs for X86FloatingPoint? At a glance the FISTP etc.
definitions look just like the FIST etc. definitions; where is the
stack handled?
-Joe
Thanks.
Are there any docs for X86FloatingPoint?
X86FloatingPoint.cpp with comments is all you get.
/jakob
Thanks for your help, Jakob. Attached is a first-pass attempt at a
patch. I don't want to post to -commits yet because I have no idea if
this is fully correct, but it seems to work in simple test cases. Am I
on the right track? Could this patch ever break in cases where the
operand's vreg doesn't happen to get mapped to ST0? I'm still a bit
foggy on the internals of X86FloatingPoint.
One thing I noticed is that fptosi and fptoui both seem to always emit
a redundant SSE load/store when SSE is enabled, because of the check
at Target/X86/X86ISelLowering.cpp:7948. Can this check be easily
modified so it doesn't store if the operand is already in memory and
not actually in an SSE register? Should FP_TO_INTHelper switch over to
using CVTTS?2SI insns when SSE is available?
-Joe
llvm-ftol2.diff (12.9 KB)
Yes, your definition of the new instruction looks sane.
However, you shouldn't expand the instruction right away in EmitInstrWithCustomInserter(), and leaving the pseudo and call instructions side by side is not going to work.
Just leave the pseudo-instruction alone until it hits X86FloatingPoint, where you can rewrite it.
Could this patch ever break in cases where the
operand's vreg doesn't happen to get mapped to ST0?
Yes, exactly. You need to make some more complicated test cases.
I'm still a bit
foggy on the internals of X86FloatingPoint.
Look at the code handling INLINE_ASM. You need to do the same, except you have fixed arguments STUses=1 and STClobbers=1, ST*=0. That should greatly simplify the code you need.
One thing I noticed is that fptosi and fptoui both seem to always emit
a redundant SSE load/store when SSE is enabled, because of the check
at Target/X86/X86ISelLowering.cpp:7948. Can this check be easily
modified so it doesn't store if the operand is already in memory and
not actually in an SSE register? Should FP_TO_INTHelper switch over to
using CVTTS?2SI insns when SSE is available?
When SSE is available, x87 registers are only ever used for f80.
/jakob
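One way to read "STUses=1 and STClobbers=1" is as bitmasks over ST0..ST7, where a register that is used and also defined or clobbered counts as popped. The C++ sketch below works through that reading; the formula and the names `STMasks`, `poppedMask`, and `numPopped` are assumptions for illustration, so check them against the INLINEASM handling in X86FloatingPoint.cpp before relying on them.

```cpp
#include <cassert>
#include <cstdint>

// Bit i stands for ST(i). Hypothetical struct for illustration.
struct STMasks {
  std::uint8_t uses;
  std::uint8_t defs;
  std::uint8_t clobbers;
};

// Registers that are read and then overwritten or clobbered are treated
// as popped inputs.
inline std::uint8_t poppedMask(const STMasks &m) {
  return static_cast<std::uint8_t>(m.uses & (m.defs | m.clobbers));
}

// Number of stack slots removed, assuming pops are contiguous from ST0
// (the mask must look like 0b0..01..1).
inline unsigned numPopped(const STMasks &m) {
  std::uint8_t p = poppedMask(m);
  assert((p & (p + 1)) == 0 && "popped registers must be ST0..ST(n-1)");
  unsigned n = 0;
  while (p & 1) {
    ++n;
    p = static_cast<std::uint8_t>(p >> 1);
  }
  return n;
}
```

For the fixed FTOL2 case the masks are uses = 1, defs = 0, clobbers = 1, which gives a popped mask of 1: exactly ST0 is consumed.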
That makes sense; thanks for the tip. Are the getCopyToReg(ST0) and
addReg(ST0, ImplicitKill) calls on the expanded MI at all necessary
then since X86FloatingPoint seems to manage that all internally?
When SSE is available, x87 registers are only ever used for f80.
It looks like it always tries to use fisttp when converting to i64.
This bitcode:
define i64 @foo(double %x) nounwind readnone {
init:
%0 = fptosi double %x to i64
ret i64 %0
}
gets compiled by LLVM 3.0 to:
_foo: # @foo
# BB#0: # %init
subl $20, %esp
movsd 24(%esp), %xmm0
movsd %xmm0, 8(%esp)
fldl 8(%esp)
fisttpll (%esp)
movl (%esp), %eax
movl 4(%esp), %edx
addl $20, %esp
ret
with a seemingly redundant movsd pair before the fisttp instruction.
-Joe
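For reference, a rough C++ counterpart of the `@foo` function above is a plain cast, which truncates toward zero in the same way fisttp does. `fooEquivalent` is a hypothetical name, and the assertion documents the assumption that the input is finite and fits in a signed 64-bit integer, since the conversion is undefined otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Rough counterpart of `fptosi double %x to i64`: truncate toward zero.
inline std::int64_t fooEquivalent(double x) {
  // 9223372036854775808.0 is 2^63; anything at or beyond it is out of range.
  assert(std::isfinite(x) && std::fabs(x) < 9223372036854775808.0 &&
         "out-of-range input is undefined");
  return static_cast<std::int64_t>(x);
}
```

Compiling a function like this for a 32-bit x86 target with SSE enabled is one way to look for the movsd pair in other compiler versions.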
Here is a patch that integrates the WIN_FTOL pseudo-insn a bit better
with the floating-point register allocator by following a simplified
version of what InlineAsm does, as you suggested. Does this look
closer to being right? Thanks again for all your help.
-Joe
llvm-ftol2-2.diff (13.4 KB)