FENV_ACCESS and floating point LibFunc calls

Hi Andy,

I’m interested to try out your patches…

I understand the scope of FENV_ACCESS is relatively wide, however I’m still curious if you managed to figure out how to prevent the SelectionDAGLegalize::ExpandNode() FP_TO_UINT lowering of the FPToUI intrinsic from producing the predicate logic that incorrectly sets the floating point accrued exceptions due to unconditional execution of the ISD::FSUB node attached to the SELECT node. It’s a little above my head to try to solve this issue with my current understanding of LLVM but I could give it a try. I’d need some guidance as to how the lowering of SELECT can be controlled. i.e. where LLVM decides whether and how to lower a select node as a branch vs predicate logic.

I’d almost forgotten that we microbenchmarked this and found the branching version is faster with regular input (< LLONG_MAX).

All, Where does LLVM decide to lower select as predicate logic vs branches and how does the cost model work? I’m curious about a tuning flag to generate branches instead of computing both values and using conditional moves…

Best,
Michael.

Hi Michael,

To be honest I haven’t started working on FP to integer conversions for FENV_ACCESS yet.

To some extent I would consider the issue that you found independent of what I’m doing to constrain FP behavior. That is, I think we ought to make the change you’re asking for even apart from the FENV_ACCESS work. When I get to the conversions for FENV_ACCESS support it may require some additional constraints, but I think if the branching conversion is usually faster (and it looks like it will be) then that should be the default behavior.

I’ll try to look into that. I’d offer to give you advice on putting together a patch, but I’m still learning my way around the ISel code myself. I think I know enough to figure out what to do but not enough to tell someone else how to do it without a bunch of wrong turns.

-Andy

Sounds like the select lowering issue is definitely separate from the FENV work.

Is there a bug report with a C or IR example? You want to generate compare and branch instead of a cmov for something like this?

int foo(float x) {
if (x < 42.0f)
return x;
return 12;
}

define i32 @foo(float %x) {
%cmp = fcmp olt float %x, 4.200000e+01
%conv = fptosi float %x to i32
%ret = select i1 %cmp, i32 %conv, i32 12
ret i32 %ret
}

$ clang -O2 cmovfp.c -S -o -
movss LCPI0_0(%rip), %xmm1 ## xmm1 = mem[0],zero,zero,zero
ucomiss %xmm0, %xmm1
cvttss2si %xmm0, %ecx
movl $12, %eax
cmoval %ecx, %eax
retq

Hi Sanjay,

The issue that Michael is talking about is that in SelectionDAGLegalize::ExpandNode() we’re doing this:

case ISD::FP_TO_UINT: {

SDValue True, False;

EVT VT = Node->getOperand(0).getValueType();

EVT NVT = Node->getValueType(0);

APFloat apf(DAG.EVTToAPFloatSemantics(VT),

APInt::getNullValue(VT.getSizeInBits()));

APInt x = APInt::getSignMask(NVT.getSizeInBits());

(void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);

Tmp1 = DAG.getConstantFP(apf, dl, VT);

Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),

Node->getOperand(0),

Tmp1, ISD::SETLT);

True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));

// TODO: Should any fast-math-flags be set for the FSUB?

False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,

DAG.getNode(ISD::FSUB, dl, VT,

Node->getOperand(0), Tmp1));

False = DAG.getNode(ISD::XOR, dl, NVT, False,

DAG.getConstant(x, dl, NVT));

Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);

Results.push_back(Tmp1);

break;

}

This results in code that speculatively executes the FSUB and then uses a cmov instruction (based on whether or not the original value was less than LLONG_MAX) to select between the direct FP_TO_SINT result and the result of the FSUB. If we assume that for most data sets a significant majority of values being converted this way will be less than LLONG_MAX then the processor should be able to predict a branch here pretty reliably and so branching code would be faster than code that does a speculative FSUB followed by a cmov.

The question is, how do we need to change this to get the back end to generate a branch instead of a cmov. I’m guessing that there are circumstances in which the target-specific backend would make that decision on its own and sink the FSUB, but for the X86 backend that doesn’t seem to be happening in this case. My inclination was to just rewrite the ExpandNode() code above to explicitly create a branch, but I haven’t tried it yet.

As a side effect, which is what originally drew Michael’s attention to this conversion, the INEXACT floating point flag is set even for exact conversions as a result of the speculative FSUB. That’s how this got related to FENV_ACCESS support.

I don’t know if there has been a bug opened for this issue. There was an earlier discussion here: Redirecting to Google Groups

-Andy

Thanks, Andy. I’m not sure how to solve that or my case given the DAG’s basic-block limit. Probably CodeGenPrepare or SelectionDAGBuilder…or we wait until after isel and try to split it up in a machine instruction pass.

I filed my example here:
https://bugs.llvm.org/show_bug.cgi?id=33013

Feel free to comment there and/or open a new bug for the FP_TO_UINT case.

Hi Sanjay,

I can’t comment on the bug as I don’t have an LLVM bugzilla account. I’ve emailed bugs-admin to gain access to bugzilla so I can comment on the bug.

I note that on your bug that you have stated that the branch is faster than the conditional move. Faster code is a side effect of the fix in this particular case. The real issue is as Andrew mentioned, it that the predicate logic; the ISD:FSUB in particular is causing FE_INEXACT accrued exception state for exact conversions.

Here is some code that shows the error and has a workaround by forcing the compiler to create a branch by using inline asm for the conversion.

https://godbolt.org/g/huIgK1

The bug affects both float to u64 and double to u64 conversions.

Expected output:

$ clang fcvt-bug.cc
$ ./a.out
1 exact
1 inexact
1 exact
1 inexact
1 exact
1 inexact
1 exact
1 inexact

Erroneous output:

$ clang -DDISABLE_WORKAROUND fcvt-bug.cc
$ ./a.out
1 exact
1 inexact
1 inexact
1 inexact
1 exact
1 inexact
1 inexact
1 inexact

Michael.

On the contrary: the faster code is pretty much the only reason this
can happen before the rest of the FENV support lands.

It's been said before, but I'll reiterate: LLVM IR does not model the
FENV on its instructions. CodeGen and other passes are free to
de-conditionalize exceptions, remove them, or add spurious ones just
for the giggles. What LLVM does now is not incorrect.

Tim.

Hi Michael,

My main concern is performance in default mode, so I see the FENV fix as the side effect. :slight_smile:

I looked at this very briefly before leaving my dev machine for the day. I think we’re going to need a cmov conversion pass that happens very late (maybe this already exists but just isn’t firing for the cases I looked at). Trying to convert these any sooner will likely be too harmful to the DAG. PowerPC has this functionality, so we should already have a model to share or steal from.

OK. So we are in fact lucky that the correct case is actually faster, and it’s a bug in the predicate lowering i.e. speculative execution and conditional move being slower than a branch.

I’m curious how the select lowering models the cost, when I figure out where to look in the codebase…

Michael.

Just as a few data points on the x86 branch predictor.

I have 6 small integer benchmarks that I am using to test a RISC-V to x86 binary translator and I was using perf last night to read the performance counters. I had these stats in my command line history as I was curious about branch predictor accuracy. It seems branch prediction accuracy in all my experiments is > 99%. Note the test programs are compiled by RISC-V GCC. RISC-V has no conditional moves and branch mis-predict latency is only 3 cycles on Rocket, so its also an architecture that prefers branches over predication. We are translating RISC-V branches to x86 branches. We don’t use conditional moves in any of our translations. I believe a predicted branch is just 1 cycle latency on x86. Here is the translator: http://rv8.io/ (BTW - the RISC-V interpreter rv-sim seems to be a pathological test case for the Clang/LLVM optimiser, with the Clang/LLVM code running at just over half the speed of the GCC generated code, of course the translator is not really affected by the speed of Clang, as we spend most time in the JIT code).

$ perf stat -e cycles,instructions,branches,branch-misses ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-sha512

Performance counter stats for ‘./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-sha512’:

2,386,668,826 cycles
8,226,368,806 instructions # 3.45 insn per cycle
556,426,385 branches
1,120,630 branch-misses # 0.20% of all branches

0.766480608 seconds time elapsed

$ perf stat -e cycles,instructions,branches,branch-misses ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-aes

Performance counter stats for ‘./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-aes’:

3,390,012,091 cycles
8,165,055,539 instructions # 2.41 insn per cycle
166,612,327 branches
393,687 branch-misses # 0.24% of all branches

0.999783799 seconds time elapsed

$ perf stat -e cycles,instructions,branches,branch-misses ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-primes

Performance counter stats for ‘./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-primes’:

585,513,229 cycles
1,570,274,312 instructions # 2.68 insn per cycle
199,550,674 branches
1,373,005 branch-misses # 0.69% of all branches

0.180905897 seconds time elapsed

$ perf stat -e cycles,instructions,branches,branch-misses ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-miniz

Performance counter stats for ‘./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-miniz’:

7,181,383,837 cycles
12,171,106,005 instructions # 1.69 insn per cycle
1,309,704,230 branches
10,246,710 branch-misses # 0.78% of all branches

2.120649526 seconds time elapsed

$ perf stat -e cycles,instructions,branches,branch-misses ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-dhrystone

Performance counter stats for ‘./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-dhrystone’:

1,705,866,284 cycles
5,902,622,960 instructions # 3.46 insn per cycle
852,430,738 branches
65,576 branch-misses # 0.01% of all branches

0.530201822 seconds time elapsed

$ perf stat -e cycles,instructions,branches,branch-misses ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-qsort

Performance counter stats for ‘./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-qsort’:

951,218,523 cycles
2,060,457,742 instructions # 2.17 insn per cycle
432,171,433 branches
3,844,290 branch-misses # 0.89% of all branches

0.288089656 seconds time elapsed