Backend optimizations

Hi,

I'm writing an intrinsic for the X86 platform that replaces every
'call' instruction with a 'push ret_addr' followed by a 'jmp func_addr'.
I'm doing this in the X86ISelLowering class with a custom inserter.

So if I have something like this:

0x0 call foobar
0x1 ...

the call will be replaced like this:

0x0 push 0x2
0x1 jmp foobar_addr
0x2 ...

This works fine at -O0, but when I compile at -O1/2/3, my program
fails. Since there is no 'call' instruction anymore, the backend
optimizations remove the operations that set up the function's arguments
(I guess they think the arguments are not needed anymore).

How can I resolve this problem? Is there a specific pass that removes
unused values that I can disable?

Cheers

From the department of ignorance and stupid suggestions: Run this
conversion after other passes that deal with call instructions?

Side question: is this targeting some unusual x86 architecture? I
believe this would be quite detrimental to performance on modern CPUs:
they use matched call/return pairs for return-address prediction, so if
you remove the CALL, the RET will "unbalance" the CPU's return-address
predictor and lead to slower execution.

From the department of ignorance and stupid suggestions: Run this
conversion after other passes that deal with call instructions?

Yes, but my modifications are not made in a MachineFunctionPass; it's a
custom inserter for an intrinsic... I'm not sure when intrinsic lowering
is applied, but I guess before any MachineFunctionPass? So I'm not sure
I can choose the order at this point.

Side question: is this targeting some unusual x86 architecture? I
believe this would be quite detrimental to performance on modern CPUs:
they use matched call/return pairs for return-address prediction, so if
you remove the CALL, the RET will "unbalance" the CPU's return-address
predictor and lead to slower execution.

The goal here is to add some obfuscation to the final binary, so some
performance loss is expected!

Cheers

This is almost certainly the way to do it.

The immediate problem is probably that the jmp isn't being marked with
the usual call operands telling LLVM that it uses the
argument-marshalling registers and defines the result ones. But even
if you fix that, x86's JMP is marked isTerminator, isBarrier and so on;
that means LLVM's optimisers assume it only comes at the end of a
basic block. Putting it in the middle is just wrong code waiting to
happen.

So I'd suggest either expanding a normal call in X86MCInstLower.cpp,
or creating a new call pseudo-instruction (with sensible flags) that
gets expanded there.
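
Roughly (and untested), the expansion in X86MCInstLower.cpp could look
like the sketch below. OBF_CALL64pcrel32 is a made-up name for the new
pseudo, MCInstLowering is the X86MCInstLower instance already available
there, and the exact spelling of the streamer calls differs between LLVM
versions:

case X86::OBF_CALL64pcrel32: {
  MCContext &Ctx = OutStreamer->getContext();
  // Fresh label for the return address, i.e. the instruction after the jmp.
  MCSymbol *RetSym = Ctx.createTempSymbol();
  // Callee symbol taken from the pseudo's global-address operand.
  MCSymbol *Callee = MCInstLowering.GetSymbolFromOperand(MI->getOperand(0));

  // push $ret_label
  MCInst Push;
  Push.setOpcode(X86::PUSH64i32);
  Push.addOperand(MCOperand::createExpr(MCSymbolRefExpr::create(RetSym, Ctx)));
  OutStreamer->emitInstruction(Push, getSubtargetInfo());

  // jmp callee
  MCInst Jmp;
  Jmp.setOpcode(X86::JMP_4);
  Jmp.addOperand(MCOperand::createExpr(MCSymbolRefExpr::create(Callee, Ctx)));
  OutStreamer->emitInstruction(Jmp, getSubtargetInfo());

  // ret_label: -- the address the pushed operand resolves to.
  OutStreamer->emitLabel(RetSym);
  return;
}

Because the label is emitted right after the jmp, the return address
doesn't have to be known in advance: the assembler resolves it for you.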

Cheers.

Tim.

Actually, my code already works that way: each 'jmp' is at the end of a
basic block. For example, this test function:

/* declarations inferred from the calls below */
void plop(int, int);
void plop2(const char *);

int main(int argc, char **argv) {
  plop2("PLOP");
  plop(1, 2);
  return 0;
}

turns into (before optimisations; this is a dump of the MachineFunction
in the custom inserter):

# Machine code for function main: SSA
Frame Objects:
  fi#0: size=8, align=8, at location [SP+8]
  fi#1: size=8, align=8, at location [SP+8]

BB#0: derived from LLVM BB %entry
        MOV64mi32 <fi#0>, 1, %noreg, 0, %noreg, <ga:@plop>; mem:ST8[%ptrToFunc1]
        %vreg2<def> = MOV32ri64 <ga:@plop>; GR32:%vreg2
        %vreg3<def> = SUBREG_TO_REG 0, %vreg2<kill>, 4; GR64:%vreg3 GR32:%vreg2
        MOV64mi32 <fi#1>, 1, %noreg, 0, %noreg, <ga:@plop2>; mem:ST8[%ptrToFunc]
        %vreg4<def> = MOV32ri64 <ga:@plop2>; GR32:%vreg4
        %vreg5<def> = SUBREG_TO_REG 0, %vreg4<kill>, 4; GR64:%vreg5 GR32:%vreg4
        ADJCALLSTACKDOWN64 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
        %vreg6<def> = MOV32ri64 <ga:@.str2>; GR32:%vreg6
        %vreg7<def> = SUBREG_TO_REG 0, %vreg6<kill>, 4; GR64:%vreg7 GR32:%vreg6
        %RDI<def> = COPY %vreg7; GR64:%vreg7
        %vreg11<def> = MOV64ri <BB#1>; GR64:%vreg11
        PUSH64r %vreg11, %RSP<imp-def>, %RSP<imp-use>; GR64:%vreg11
        JMP64r %vreg5; GR64:%vreg5
    Successors according to CFG: BB#1

BB#1:
    Predecessors according to CFG: BB#0
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
        ADJCALLSTACKDOWN64 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
        %vreg8<def> = MOV32ri 1; GR32:%vreg8
        %vreg9<def> = MOV32ri 2; GR32:%vreg9
        %EDI<def> = COPY %vreg8; GR32:%vreg8
        %ESI<def> = COPY %vreg9; GR32:%vreg9
        %vreg12<def> = MOV64ri <BB#2>; GR64:%vreg12
        PUSH64r %vreg12, %RSP<imp-def>, %RSP<imp-use>; GR64:%vreg12
        JMP64r %vreg3; GR64:%vreg3
    Successors according to CFG: BB#2

BB#2:
    Predecessors according to CFG: BB#1
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
        %vreg10<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg10
        %EAX<def> = COPY %vreg10; GR32:%vreg10
        RETQ %EAX

# End machine code for function main.

So maybe 'just' telling the optimiser that some registers are used is
sufficient?
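
Something like this is what I have in mind in the custom inserter (a
rough sketch; BB, MI, DL, TII and TargetReg are the usual
custom-inserter locals, and which implicit-use registers are needed
depends on the call site):

// Assumed to run inside EmitInstrWithCustomInserter; BB, MI, DL, TII and
// TargetReg (the vreg holding the callee address) come from that context.
const MachineFunction &MF = *BB->getParent();
const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();

// Build the JMP64r that replaces the call.
MachineInstrBuilder MIB =
    BuildMI(*BB, MI, DL, TII->get(X86::JMP64r)).addReg(TargetReg);

// Keep the argument registers of this call site alive across the jump,
// e.g. RDI for plop2("PLOP").
MIB.addReg(X86::RDI, RegState::Implicit);

// Mark everything a normal call would clobber as clobbered here too.
MIB.addRegMask(TRI->getCallPreservedMask(MF, CallingConv::C));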

Cheers

You could solve the unbalanced call/return by replacing the return with a pop and jump, but I don't think that's going to fool anybody for long...
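
At the MC level that replacement would look roughly like this (a sketch
only; it assumes you are willing to clobber a caller-saved scratch
register such as R11 when expanding the RETQ):

// popq %r11   -- pull the return address off the stack.
MCInst Pop;
Pop.setOpcode(X86::POP64r);
Pop.addOperand(MCOperand::createReg(X86::R11));
OutStreamer->emitInstruction(Pop, getSubtargetInfo());

// jmpq *%r11  -- branch back through it instead of RET.
MCInst Jmp;
Jmp.setOpcode(X86::JMP64r);
Jmp.addOperand(MCOperand::createReg(X86::R11));
OutStreamer->emitInstruction(Jmp, getSubtargetInfo());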

But this brings forth an idea: one interesting optimization for internal subroutines that are not internally recursive would be to:

1. Put the return address into a temp.
2. Jump to the entry point (using a PHI instruction to collect the arguments).
3. Use a computed branch on the temp instead of return.

This would be applicable for things that are too big to inline and don't allocate a lot of variables on the stack frame. One wouldn't want to do it all the time, because kernel coders will occasionally segregate stub routines that need a large stack for temporary storage out of the main data path, to keep the frame size down in hot paths (to reduce cache footprint). I've gotten a 5-10% improvement in overall performance by doing that (with putnext, which was allocating a very large frame to format an error message that essentially never occurred).
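
Very roughly, and only as an illustration (the helper names are made up,
blockaddress/indirectbr only work once the callee's body has been cloned
into the caller's function, the argument PHIs at the entry block are
elided, and older LLVM releases spell the pointer type getInt8PtrTy()
rather than getPtrTy()), the IR-level shape would be:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Caller side: 1. record the return point in a temp, 2. branch to the
// (cloned) entry block of the callee.
static void emitJumpStyleCall(IRBuilder<> &B, BasicBlock *CalleeEntry,
                              BasicBlock *RetBB, AllocaInst *RetAddrSlot) {
  B.CreateStore(BlockAddress::get(RetBB->getParent(), RetBB), RetAddrSlot);
  B.CreateBr(CalleeEntry);
}

// Callee side: 3. computed branch on the recorded address instead of 'ret'.
static void emitComputedReturn(IRBuilder<> &B, AllocaInst *RetAddrSlot,
                               ArrayRef<BasicBlock *> ReturnPoints) {
  Value *Addr = B.CreateLoad(B.getPtrTy(), RetAddrSlot);
  IndirectBrInst *IBr = B.CreateIndirectBr(Addr, ReturnPoints.size());
  for (BasicBlock *RetBB : ReturnPoints)
    IBr->addDestination(RetBB);
}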

Of course, now somebody will tell me LLVM is already doing this one :slight_smile:

The goal here is to add some obfuscation to the final binary, so some
performance loss is expected!

You could solve the unbalanced call/return by replacing the return with
a pop and jump, but I don't think that's going to fool anybody for long...

Yeah I know, you should not use that alone... But I have some other
stuff :wink:

But this brings forth an idea: one interesting optimization for
internal subroutines that are not internally recursive would be to:

1. Put the return address into a temp.
2. Jump to the entry point (using a PHI instruction to collect the
arguments).
3. Use a computed branch on the temp instead of return.

It should also be possible to do that, I guess.

But actually I have some other problems. I tried to expand the call
like Tim Northover said, but I was not able to make it work. You have
to push the return address, and I did not find a way to put the future
address of the next instruction in the 'push'.

I tried to keep my solution in ISelLowering, so I can create the new
basic block, get its address and push it, add the 'jmp', and leave the
normal call in place so the arguments are not destroyed; then, in
MCInstLower, every time there is a call, I check whether my intrinsic is
present and delete the call. But this doesn't work. Something else
reorders the basic blocks and it fails at link time because the basic
block's address cannot be found :frowning:

Cheers