Possible missed optimization? 2.0

Hello, i’ve noticed a new possible missed optimization while testing more trivial code.
This time it’s not a with a xor but with a multiplication instruction and the example is little bit more involved.

C code:

typedef short t;
t foo(t a, t b)
{
t a4 = a*b;
return a4;
}

argument “a” is passed in R15:R14, argument “b” in R13:R12, the return value is stored in R15:R14.
The mul instruction takes in two 8bit regs and returns a 16bit result in R1:R0, this is handled in the selectionDAG same way as x86 (btw mul is marked as commutable).

Asm code:

mul r12, r15
mov r8, r0
mul r12, r14
mov r9, r0
mov r10, r1
add r10, r8
mul r13, r14
mov r15, r0
add r15, r10
mov r14, r9

This can be tuned further to the following:

mov r8, r14
mov r9, r15
mul r12, r8
mov r14, r0
mov r15, r1
mul r12, r9
add r15, r0
mul r13, r8
add r15, r0

The difference between both versions is that the second has one instruction less and saves a scratch register.
If we start by multiplying the lower parts of both arguments instead of mixing upper and lower parts from a start we can save r8 in the first example and a later move, notice that the second version stores directly the result of a.low*b.low into R15:R14. I’m unsure if this is related to http://llvm.org/bugs/show_bug.cgi?id=8112
I’ve attached a txt file with the regcoalescing output incase it’s useful like requested in the previous emails.

Thanks

reg.txt (5.02 KB)

Hello, i've noticed a new possible missed optimization while testing more trivial code.
This time it's not a with a xor but with a multiplication instruction and the example is little bit more involved.

C code:

typedef short t;
t foo(t a, t b)
{
    t a4 = a*b;
    return a4;
}

argument "a" is passed in R15:R14, argument "b" in R13:R12, the return value is stored in R15:R14.
The mul instruction takes in two 8bit regs and returns a 16bit result in R1:R0, this is handled in the selectionDAG same way as x86 (btw mul is marked as commutable).

Note that the isCommutable flag is only really useful for two-address instructions. If the two inputs are not constrained, nothing is really won by swapping them.

[...]

The difference between both versions is that the second has one instruction less and saves a scratch register.
If we start by multiplying the lower parts of both arguments instead of mixing upper and lower parts from a start we can save r8 in the first example and a later move, notice that the second version stores directly the result of a.low*b.low into R15:R14. I'm unsure if this is related to 8112 – Missed optimization during register coalescing
I've attached a txt file with the regcoalescing output incase it's useful like requested in the previous emails.

I haven't looked closely, but on the surface it doesn't sound like a coalescing issue.

It sounds like you want different scheduling, or even different selection DAGs.

Does the -view-*-dags output look correct?

Note that the isCommutable flag is only really useful for two-address instructions. If the two inputs are not constrained, nothing is really won by swapping them.
Ahh i see, good to know that.

Does the -view-*-dags output look correct?
They do look correct, there are three Xmul_lohi blocks, one returns the low part copied into R14 and the rest of combinations get added and merged into R15.

Here is my selectionDAG code, i used X86’s MUL code and adapted it to my target:

case ISD::SMUL_LOHI:
case ISD::UMUL_LOHI:
{
SDValue Op1 = N->getOperand(0);
SDValue Op2 = N->getOperand(1);
unsigned LoReg = R0, HiReg = R1;
unsigned Opc = MULRdRr;

SDValue InFlag = SDValue(CurDAG->getMachineNode(Opc,
dl,
MVT::Flag,
Op1,
Op2), 0);

// Copy the low half of the result, if it is needed.
if (!SDValue(N, 0).use_empty())
{
SDValue Result = CurDAG->getCopyFromReg(CurDAG->getEntryNode(),
dl,
LoReg,
NVT,
InFlag);
InFlag = Result.getValue(2);
ReplaceUses(SDValue(N, 0), Result);
}

// Copy the high half of the result, if it is needed.
if (!SDValue(N, 1).use_empty())
{
SDValue Result = CurDAG->getCopyFromReg(CurDAG->getEntryNode(),
dl,
HiReg,
NVT,
InFlag);
InFlag = Result.getValue(2);
ReplaceUses(SDValue(N, 1), Result);
}

return NULL;
}

ISD::MUL is set to be expanded.

Hello Jakob, i’ve been trying different things like adding customized code for the multiplication instruction instead of letting LLVM expand it however it produced exactly the same code in all the alternatives i tested.
Please let me know when you have time if there’s a way to improve the emitted code or if it’s a missed optimization. I dont know in which other ways i could improve this since it’s so basic code.

As a side note, it would be great if you could answer some doubts i wrote in the “Register design decision for backend” thread.

Thanks for your help.

I gave you some pointers to get you started on your own analysis of this. Take a look at the DAGs and the scheduling.

I don't want to take away this opportunity for you to learn more about how the LLVM code generator works.

/jakob