Hi Roman,
> Is "test" actually an implementation of a 64-bit-wide multiplication
> compiler-rt builtin?
> Then I'd think the main problem is that it is being optimized in the
> first place; you could end up with endless recursion…
No, this is not a compiler-rt builtin. My example happens to be taken from the implementation of a signed multiply, but as I said, it has nothing to do with the rt-builtins; I'm just using that code to show the issue. The function cannot create any recursion because it is named 'test', unlike any rt-builtin. You can replace the multiply in the source code by an addition if you want to avoid calling rt-functions, but that does not change what I am trying to show. Also, it is not meant to be 64 bits wide, but 32 bits wide: the targets I am testing are 16-bit, so ints are 16 bits and longs are 32 bits. This is again the function I am testing:
long test (long a, long b)
{
  int neg = 0;
  long res;

  if (a < 0)
  {
    a = -a;
    neg = 1;
  }
  if (b < 0)
  {
    b = -b;
    neg = !neg;
  }
  res = a*b;
  if (neg)
    res = -res;
  return res;
}
> LLVM, not clang.
I'm not sure what you mean by that. The LLVM IR shown is produced by running the "clang" command line, so that is what I am attempting to show: it is effectively the front-end invocation that performs such undesired optimisations sometimes, not only the LLVM back-end. This is in part why I am saying this is not right. See below, copied again, the IR that gets generated for the C code I posted before. This IR, including the expensive shift (%a.lobit = lshr i32 %a, 31), is generated when -mllvm -phi-node-folding-threshold=1 is specified on the command line, or when the target implements getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) to return TCC_Expensive for operand types that are wider than the target register size.
; ModuleID = 'main.c'
source_filename = "main.c"
target datalayout = "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16"
target triple = "msp430"

; Function Attrs: norecurse nounwind optsize readnone
define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:
  %cmp = icmp slt i32 %a, 0
  %sub = sub nsw i32 0, %a
  %spec.select = select i1 %cmp, i32 %sub, i32 %a
  %a.lobit = lshr i32 %a, 31
  %0 = trunc i32 %a.lobit to i16
  %cmp1 = icmp slt i32 %b, 0
  br i1 %cmp1, label %if.then2, label %if.end4

if.then2:                                         ; preds = %entry
  %sub3 = sub nsw i32 0, %b
  %1 = xor i16 %0, 1
  br label %if.end4

if.end4:                                          ; preds = %if.then2, %entry
  %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
  %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
  %mul = mul nsw i32 %b.addr.0, %spec.select
  %tobool5 = icmp eq i16 %neg.1, 0
  %sub7 = sub nsw i32 0, %mul
  %spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7
  ret i32 %spec.select18
}

attributes #0 = { norecurse nounwind optsize readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 2}
!1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git 6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"}
As you can see, clang produces a 31-bit shift right (%a.lobit = lshr i32 %a, 31); that is the fourth instruction in the IR above. The shift moves the sign bit of %a into bit 0, i.e. it computes neg = (a < 0) arithmetically. So a shift is produced instead of a branch to a new block, which is what the C source code actually expresses.
Just as a matter of information, this is the implementation of the getOperationCost function that causes clang to correctly replace selects by branches (desirable), but also to generate shifts to fold expensive selects (undesirable):
unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy)
{
  // Operations on types wider than the 16-bit native register
  // size are expensive on this target.
  unsigned OpSize = Ty->getScalarSizeInBits();
  if (OpSize > 16)
    return TTI::TCC_Expensive;
  return BaseT::getOperationCost(Opcode, Ty, OpTy);
}
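For reference, TTI::TCC_Expensive is defined as 4 in TargetTransformInfo.h (TCC_Free = 0, TCC_Basic = 1), while the default value of -phi-node-folding-threshold is 2; this is presumably why returning TCC_Expensive for wide types flips the folding decision in the same way that lowering the threshold to 1 does.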
If the getOperationCost function above were not implemented, clang would generate the usual series of 'select's. But that is even worse, because selects imply speculative execution of expensive instructions, or duplicated branches created by the back-end, which cannot easily be avoided.
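For illustration, this is roughly what the select-based IR for the same function looks like (a hand-written, simplified sketch, not actual compiler output): every 32-bit negation is executed unconditionally and a select picks the result afterwards:

entry:
  %cmp = icmp slt i32 %a, 0
  %suba = sub nsw i32 0, %a                 ; -a computed whether needed or not
  %absa = select i1 %cmp, i32 %suba, i32 %a
  %cmpb = icmp slt i32 %b, 0
  %subb = sub nsw i32 0, %b                 ; -b computed whether needed or not
  %absb = select i1 %cmpb, i32 %subb, i32 %b
  %mul = mul nsw i32 %absa, %absb
  %neg = xor i1 %cmp, %cmpb                 ; sign of the result
  %subr = sub nsw i32 0, %mul               ; -res computed whether needed or not
  %res = select i1 %neg, i32 %subr, i32 %mul
  ret i32 %res

On a 16-bit target each of those i32 subs expands to a multi-instruction sequence, all executed on every call regardless of the operand signs.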
Ideally, the IR above should just emit an 'if.then' block for the if (a < 0) statement in the C source code, exactly as it already does for if (b < 0), instead of attempting to replace the select with a shift (!)
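In other words, the entry of the function would look something like this (again a hand-written sketch, mirroring the if.then2/if.end4 structure that clang already generates for the second 'if'):

entry:
  %cmp = icmp slt i32 %a, 0
  br i1 %cmp, label %if.then, label %if.end

if.then:                                    ; preds = %entry
  %sub = sub nsw i32 0, %a
  br label %if.end

if.end:                                     ; preds = %if.then, %entry
  %a.addr.0 = phi i32 [ %sub, %if.then ], [ %a, %entry ]
  %neg.0 = phi i16 [ 1, %if.then ], [ 0, %entry ]
  ; ... the if.then2/if.end4 blocks would then use %neg.0 instead of %0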
If you want to play with the two scenarios, (1) IR generated with branches and (2) IR generated with selects, they can easily be reproduced for the MSP430 target by compiling with the following options:
(1) -mllvm -phi-node-folding-threshold=1 -c -S -Os
(2) -mllvm -phi-node-folding-threshold=2 -c -S -Os
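For example, a complete invocation for scenario (1) would presumably look like this (the file name main.c is taken from the ModuleID above, and -emit-llvm is added to inspect the IR rather than the final assembly):

clang -target msp430 -emit-llvm -S -Os -mllvm -phi-node-folding-threshold=1 main.c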
For 16-bit targets without a select instruction, or with expensive selects, the overall code is better with (1), because it avoids the separate jump that (2) would cause the back-end to create for every 'select'. However, the presence of the 'shift' instruction in (1) spoils it all.
Again, ideally, the use of shifts as a replacement for selects should be avoided, and an 'if.then' block should be used, as per the original C code.
I hope this is clear now.
John.