How to tell LLVM to treat Commutable library calls as such, for example multiplication?

A few library calls are commutable by definition, for example multiplications.

I defined them as LibCalls for my architecture. However, I found that arguments are always passed in the order they are generated by Clang thus missing possible optimisations. For example, the following IR code

; Function Attrs: minsize norecurse nounwind optsize readnone
define dso_local i16 @multTest(i16 %a, i16 %b) local_unnamed_addr #0 {
entry:
%mul = mul i16 %b, %a
ret i16 %mul
}

is lowered as it is to a __mulhi3 library call with the arguments reversed as shown (b,a). This is suboptimal because later on this requires an intermediate register to swap them.

SelectionDAG has 9 nodes:
t0: ch = EntryToken
t4: i16,ch = CopyFromReg t0, Register:i16 %1
t2: i16,ch = CopyFromReg t0, Register:i16 %0
t5: i16 = mul t4, t2
t7: ch,glue = CopyToReg t0, Register:i16 $r0, t5
t8: ch = CPU74ISD::RET_FLAG t7, Register:i16 $r0, t7:1

Given the commutative nature of multiplication, the arguments could be passed in the same order they are received (a,b) so that they would not need to be swapped before the __mulhi3 call

What am I missing?,

Thanks,

John Lluch.

Hi Joan,

I defined them as LibCalls for my architecture. However, I found that arguments are always passed in the order they are generated by Clang thus missing possible optimisations. For example, the following IR code

Have you tried to assess the potential benefit over a larger codebase?
It looks like it could be something that comes up in trivial tests,
but disappears in larger functions where the register allocator has
more freedom.

Given the commutative nature of multiplication, the arguments could be passed in the same order they are received (a,b) so that they would not need to be swapped before the __mulhi3 call

It would be pretty ugly to implement during instruction selection,
given how call lowering works in LLVM. The call's registers (if a call
even puts its arguments in registers) aren't decided until you get
into LowerCall in target-specific code. You'd probably need some kind
of target-hook to answer whether it was worthwhile.

If I had my heart set on this optimization, I'd probably implement it
as a MachineIR pass after register allocation that searches upwards
from any call to __mulhi3 etc and deletes compatible COPYs.

Cheers.

Tim.

A few library calls are commutable by definition, for example multiplications.

I defined them as LibCalls for my architecture. However, I found that arguments are always passed in the order they are generated by Clang thus missing possible optimisations. For example, the following IR code

; Function Attrs: minsize norecurse nounwind optsize readnone
define dso_local i16 @multTest(i16 %a, i16 %b) local_unnamed_addr #0 {
entry:
  %mul = mul i16 %b, %a
  ret i16 %mul
}

is lowered as it is to a __mulhi3 library call with the arguments reversed as shown (b,a). This is suboptimal because later on this requires an intermediate register to swap them.

SelectionDAG has 9 nodes:
  t0: ch = EntryToken
      t4: i16,ch = CopyFromReg t0, Register:i16 %1
      t2: i16,ch = CopyFromReg t0, Register:i16 %0
    t5: i16 = mul t4, t2
  t7: ch,glue = CopyToReg t0, Register:i16 $r0, t5
  t8: ch = CPU74ISD::RET_FLAG t7, Register:i16 $r0, t7:1

Given the commutative nature of multiplication, the arguments could be passed in the same order they are received (a,b) so that they would not need to be swapped before the __mulhi3 call

What am I missing?,

I don't think that you're missing anything, but I don't think that we have anything that currently allows for this in a generic way. You should be able to account for this in your target's call lowering for functions that you happen to know to be commutative, at least in cases where it's possible to recognize the source of the arguments as coming from incoming function arguments (this seems similar to how tail-call lowering works).

We do have the ability to mark instructions is commutative and the register coalescer can use that capability. The problem is that, by the time we reach that stage, the call arguments have already (likely) been lowered into copies to physical registers.

-Hal

Thanks,

John Lluch.

Hi Finkel and Tim

Thanks for your responses.

About this topic, and partly as a matter of experimentation, I have implemented a function that gets called for ‘commutable’ libcalls just at the beginning of my LowerCall implementation. I conveniently called it ‘AnalizeCommutableLibCall’. This function performs a recursive search up the operand tree starting from the libcall operands until it finds a ‘CopyFromReg’ or another LibCall operation action. The function determines the register number encountered at the end of the search, either physical or virtual, for each of the two commutable operands. In case the first operand got a bigger register number than the second operand, then the operands are swapped. This apparently simple thing seems to work in all cases because, ultimately, physical register assignations go in ascending number for formal arguments and function returns. By placing the (commutable) operands in ascending register numbers, even if these register numbers were found way up on the operand tree, the compiler never needs to perform any explicit or implicit register swaps, so this often results in less register overhead and shorter code. Much like if the compiler already knew that the call operands were ‘commutable’.

For larger codebases, certainly the compiler has some extra opportunities to swap registers through intermediate instruction results, but the problem remains that registers or intermediate results must still be swapped at some time if Clang dictated so. This circumstance repeats again at every new libcall instruction, regardless of the length of the function, so even if some calls may not get affected, most apparently are.

For now I’m leaving this (initially experimental) function on my target implementation, I can post the code in case there’s some interest, as I found that it systematically produces shorter code and it can't produce any side effects (all what it ultimately does is swapping commutative arguments).

John.