[RFC] [X86] Mov to push transformation in x86-32 call sequences

Hello all,

In r223757 I’ve committed a patch that, for the 32-bit x86 calling convention, transforms the MOV instructions used to store outgoing function arguments to the stack into actual PUSH instructions.

For example, it will transform this:

subl $16, %esp
movl $4, 12(%esp)
movl $3, 8(%esp)
movl $2, 4(%esp)
movl $1, (%esp)
calll _func
addl $16, %esp

Into this:

pushl $4
pushl $3
pushl $2
pushl $1
calll _func
addl $16, %esp

The main motivation for this is code size (a “pushl $4” is 2 bytes, a “movl $4, 12(%esp)” is 8 bytes), but there are some other advantages, as shown below.

The way this works in r223757 is by intercepting call frame simplification in the Prolog/Epilog Inserter, and replacing the mov sequence with pushes. Right now it only handles functions which do not have a reserved call frame (a small minority of cases), and I’d like to extend it to cover other cases where it is profitable.

The currently implemented approach has a couple of drawbacks:

  1. Push vs. having a reserved call frame:
    This transformation is always profitable when we do not have a reserved call frame. When a reserved frame can be used, however, there is a trade-off. For example, in a function that contains only one call site, and no other stack allocations, pushes are a clear win, since having a reserved call frame wouldn’t save any instructions. On the other hand, if a function contains 10 call sites, and only one of them can use pushes, then it is most probably a loss – not reserving a call frame will cost us 10 add instructions, and the pushes gain very little. I’d like to be able to make the decision on whether we want to have a reserved frame or pushes by considering the entire function. I don’t think this can be done in the context of PEI.

Note that in theory we could have both a reserved call frame and have some specific call sites in the function use pushes, but this is fairly tricky because it requires much more precise tracking of the stack pointer state. That is something I’m not planning to implement at this point.
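To make the trade-off concrete, here is a minimal, self-contained sketch of the kind of whole-function cost comparison I have in mind. The helper structure and the byte counts are illustrative assumptions for this sketch only, not actual pass code or measured encodings:

#include <cstdio>

// Illustrative per-call-site summary; in a real pass this would be derived
// from the MachineFunction. Hypothetical, for the sketch only.
struct CallSiteInfo {
  int NumStackArgs;  // arguments passed on the stack
  bool CanUsePushes; // every store to the outgoing area is convertible
};

// Rough cost model: with a reserved call frame, each outgoing argument costs
// a full mov (~7 bytes) but there is no per-call ESP adjustment. Without a
// reserved frame, convertible call sites use pushes (~2 bytes per argument)
// but every call site pays an extra "addl $N, %esp" (~3 bytes) to pop its
// arguments. Prologue/epilogue costs are ignored.
static bool shouldGiveUpReservedFrame(const CallSiteInfo *Calls, int N) {
  int ReservedBytes = 0, PushBytes = 0;
  for (int i = 0; i < N; ++i) {
    ReservedBytes += Calls[i].NumStackArgs * 7;
    PushBytes += 3; // per-call "addl $N, %esp"
    PushBytes += Calls[i].NumStackArgs * (Calls[i].CanUsePushes ? 2 : 7);
  }
  return PushBytes < ReservedBytes;
}

int main() {
  // Single call site, four stack arguments: pushes are a clear win.
  CallSiteInfo One[] = {{4, true}};
  std::printf("one call site -> use pushes: %d\n",
              shouldGiveUpReservedFrame(One, 1));

  // Ten call sites, only one of them convertible: keep the reserved frame.
  CallSiteInfo Ten[10];
  for (int i = 0; i < 10; ++i)
    Ten[i] = {2, i == 0};
  std::printf("ten call sites -> use pushes: %d\n",
              shouldGiveUpReservedFrame(Ten, 10));
}

Obviously the constants are made up; the point is only that the decision needs a whole-function view, which is why PEI seems like the wrong place for it.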

  2. Register allocation inefficiency:
    Ideally, pushes can be used to perform direct memory-to-memory moves, freeing up registers and saving quite a lot of code.

For example, for this (this is obviously a constructed example, but code of this kind does exist in the wild):

void foo(int a, int b, int c, int d, int e, int f, int g, int h);

struct st { int arr[8]; };

void bar(struct st* p)
{
  foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3],
      p->arr[4], p->arr[5], p->arr[6], p->arr[7]);
}

We currently generate (with -m32 -O2) this:

pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %edi
pushl %esi
subl $44, %esp
movl 8(%ebp), %eax
movl 28(%eax), %ecx
movl %ecx, -20(%ebp) # 4-byte Spill
movl 24(%eax), %edx
movl 20(%eax), %esi
movl 16(%eax), %edi
movl 12(%eax), %ebx
movl 8(%eax), %ecx
movl %ecx, -24(%ebp) # 4-byte Spill
movl (%eax), %ecx
movl %ecx, -16(%ebp) # 4-byte Spill
movl 4(%eax), %eax
movl -20(%ebp), %ecx # 4-byte Reload
movl %ecx, 28(%esp)
movl %edx, 24(%esp)
movl %esi, 20(%esp)
movl %edi, 16(%esp)
movl %ebx, 12(%esp)
movl -24(%ebp), %ecx # 4-byte Reload
movl %ecx, 8(%esp)
movl %eax, 4(%esp)
movl -16(%ebp), %eax # 4-byte Reload
movl %eax, (%esp)
calll _foo
addl $44, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
retl

Which is fairly horrible.

Some parameters get mov-ed up to four times - a mov from the struct into a register, a register spill, a reload, and finally a mov onto the stack.

What we’d like to generate is something like:

pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
pushl 28(%eax)
pushl 24(%eax)
pushl 20(%eax)
pushl 16(%eax)
pushl 12(%eax)
pushl 8(%eax)
pushl 4(%eax)
pushl (%eax)
calll _foo
addl $32, %esp
popl %ebp
retl

To produce the code above, the transformation has to run pre-reg-alloc. Otherwise, even if we fold loads into the push, it’s too late to recover from spills.

The direction I’d like to take with this is:

  1. Add an X86-specific MachineFunctionPass that does the mov → push transformation and runs pre-reg-alloc.

It will:

  • Make a decision on whether promoting some (or all) of the call sites to use pushes is worth giving up on the reserved call frame.

  • If it is, perform the mov → push transformation for the selected call sites.

  • Fold loads into the pushes while doing the transformation.

As an alternative, I could try to teach the peephole optimizer to do it (right now it won’t even try to do this folding because PUSHes store to memory), but getting it right in the general case seems complex.

I think I’d rather do folding of the simple (but common) cases on the fly.
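For illustration, here is a rough, self-contained sketch of the simple folding case: a virtual register that is defined by a load and whose only use is the immediately following push gets folded into a memory-operand push. This is a toy model, not LLVM's MachineInstr API:

#include <cstddef>
#include <vector>

// Toy instruction model (hypothetical; stands in for MachineInstr).
enum Opcode { LoadRM, PushR, PushM, Other };

struct Instr {
  Opcode Op;
  int DefReg = -1; // virtual register defined (for loads)
  int UseReg = -1; // virtual register used (for pushes)
};

// Fold "vreg = load [mem]; push vreg" into "push [mem]" when the push
// immediately follows the load and vreg has no later uses. Anything more
// general (intervening instructions, aliasing stores) is left to a real
// peephole/folding framework.
static void foldLoadsIntoPushes(std::vector<Instr> &Seq) {
  for (std::size_t i = 0; i + 1 < Seq.size(); ++i) {
    Instr &Load = Seq[i];
    Instr &Push = Seq[i + 1];
    if (Load.Op != LoadRM || Push.Op != PushR || Push.UseReg != Load.DefReg)
      continue;
    bool HasLaterUse = false;
    for (std::size_t j = i + 2; j < Seq.size(); ++j)
      if (Seq[j].UseReg == Load.DefReg)
        HasLaterUse = true;
    if (HasLaterUse)
      continue;
    Push.Op = PushM;   // the push now reads memory directly (a la PUSH32rmm)
    Push.UseReg = -1;
    Load.Op = Other;   // the load is now dead and can be erased later
  }
}

int main() {
  // "%v0 = load [p+28]; push %v0"  ->  "push [p+28]"
  std::vector<Instr> Seq = {{LoadRM, 0, -1}, {PushR, -1, 0}};
  foldLoadsIntoPushes(Seq);
  return Seq[1].Op == PushM ? 0 : 1;
}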

  2. Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly.

Doing the mov->push transformation before PEI means I’d have to leave the ADJCALLSTACKDOWN/UP pair unbalanced.

E.g. something like:

ADJCALLSTACKDOWN32 0, %ESP, %EFLAGS<imp-def,dead>, %ESP
%vreg9<def,dead> = COPY %ESP; GR32:%vreg9
PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP, %ESP; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 0, %noreg, %ESP, %ESP; GR32:%vreg0
CALLpcrel32 ga:foo, , %ESP, %ESP
ADJCALLSTACKUP32 32, 0, %ESP, %EFLAGS<imp-def,dead>, %ESP

This, rightly, gets flagged by the verifier.

My proposal is to add an additional parameter to ADJCALLSTACKDOWN to express the amount of adjustment the call sequence itself does. This is somewhat similar to the second parameter of ADJCALLSTACKUP, which allows adjustment for callee stack clean-up.

So, in this case, we will get a “ADJCALLSTACKDOWN32 32, 32” instead of the “ADJCALLSTACKDOWN32 0”. The verifier will be happy, and PEI will know it doesn’t need to do any stack pointer adjustment.
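As a sanity check on the arithmetic, here is a tiny sketch of what PEI would do with the extra operand. This is illustrative code written for this mail, not the actual PEI change:

#include <cassert>
#include <cstdio>

// The frame-setup pseudo would carry both the total call-frame size and the
// number of bytes the call sequence itself already pushes; PEI only has to
// materialize the difference as an explicit "subl $N, %esp".
static int spAdjustmentToEmit(int CallFrameSize, int BytesPushedByCallSeq) {
  assert(BytesPushedByCallSeq <= CallFrameSize && "over-adjusted call frame");
  return CallFrameSize - BytesPushedByCallSeq;
}

int main() {
  // "ADJCALLSTACKDOWN32 32, 32": all 32 bytes come from the pushes,
  // so PEI emits no subl before the call at all.
  std::printf("%d\n", spAdjustmentToEmit(32, 32)); // 0

  // A mov-based call sequence would carry "ADJCALLSTACKDOWN32 32, 0":
  // PEI still emits "subl $32, %esp", roughly what happens today.
  std::printf("%d\n", spAdjustmentToEmit(32, 0));  // 32
}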

Does this sound like the right approach?

Any suggestions, as well as warnings of potential pitfalls, are welcome. :)

Thanks,

Michael

According to the Intel performance guidelines, pushes are significantly slower than movs, to the extent that they should be avoided as much as possible. It's been a decade since I was dealing with this, so I don't remember the numbers, but I'm pretty sure the changes you are proposing will slow the code down.

People who care about speed more than code size are probably not going to like this very much...

Hi Michael,

You're probably in a better position than anyone to check this:

My recollection is that the microcode on some Intel processors uses hidden registers, integrated into the register renaming pipeline, for the top few stack slots, but that all of these are pushed into the store buffer and become real writes if the stack pointer is moved or the stack-top cache line is subject to any coherency messages. This means that esp-relative movs are significantly cheaper than pushes and pops. Is this still true with any modern x86 processors? (Or was it never true and I'm misremembering?)

David

Which performance guidelines are you referring to?

I’m not that familiar with decade-old CPUs, but to the best of my knowledge, this is not true on current hardware.

There is one specific circumstance where PUSHes should be avoided – for Atom/Silvermont processors, the memory form of PUSH is inefficient, so the register-freeing optimization described above may not be profitable (see 14.3.3.6 and 15.3.1.2 in the Intel optimization reference manual).

Having said that, one distinct possibility is to have the heuristic make different decisions depending on the optimization flags, that is, be much more aggressive for optsize functions.

Hi David,

I'm not that familiar with the micro-architecture (and even if I were, I doubt I could comment on this sort of detail :) ), but I don't believe there is a penalty on modern x86.

In any case, if it turns out this will introduce performance regressions, either for this or for another reason, then we can enable it for optsize only.
I don't expect that to happen, but it's a possible fallback. I'll do the benchmarking once I have a sane prototype.

Michael

Table C-21 in “Intel® 64 and IA-32 Architectures Optimization Reference Manual”, September 2014. It hasn’t changed. It still lists push and pop instructions as 2-3 times more expensive than mov. And that’s not taking into account any optimizations that might get undone by the stack pointer changing. I’m just speculating, but I suspect that mov being faster has to do with not having to modify a register every time. With that as a basis, the fastest entry/exit sequences just use subl/addl on the stack pointer and don’t use push at all. For most C functions, you don’t even have to materialize a frame pointer (if the unwind mechanisms are set up to handle that). Not that I am recommending changing the x86_32 code generation to do that.


> Which performance guidelines are you referring to?

Table C-21 in "Intel(r) 64 and IA-32 Architectures Optimization Reference Manual", September 2014.

It hasn't changed. It still lists push and pop instructions as 2-3 times more expensive as mov.

And verified by Agner Fog's independent measurements:

The relevant Haswell numbers are on pages 186 - 187.

-Chuck

Very true (and confirmed downthread), but this code size optimization is still very interesting for functions using the minsize function attribute, which is specifically designed to favor code size over performance:
http://llvm.org/docs/LangRef.html#function-attributes

The linux kernel folks really care about the size of the bootloader, for example.

-Chris

The official Intel guide has less resolution here than we need.
Consider that the MOV instruction described in C-21 only uses the ALU execution unit, not memory.

If we look at Agner's table for Haswell, the relevant form of the MOV instruction (m,r) has a latency of 3 cycles and a reciprocal throughput of 1, same as a register PUSH.
The same applies for other recent x86 processors.

Every FreeBSD clang upgrade involves having to remove some code from the bootloader, so we'd also be very interested in -Oz working well to reverse this trend. There's some low-hanging fruit (I did manage to improve on the size by running clang -O0 and then opt with a hand-tweaked set of passes), but it would be great to have some regression tests for this.

We'd be happy to provide the FreeBSD bootloader code as the source for one of these. It's pretty small and self-contained, but if the size of the output grows over a certain limit then we can no longer boot.

David

But the r,m move has a reciprocal throughput of 0.5 while the pop is 1.

It sounds like this should be optional as you have already proposed.

Going by Agner's, the register form of pop, on Haswell, has a reciprocal throughput of 0.5, just like the equivalent mov.
The memory form of pop has a reciprocal throughput of 1, but that has no equivalent mov instruction, since x86 can't do a MOVmm; it's equivalent to a MOVrm followed by a MOVmr.
In any case, that's not really relevant since I'm planning on only adding pushes, not pops.

I'm not saying this can't result in any performance issues (as David noted there may be more complex microarchitectural issues here), just that it's not expected to be a problem on an instruction-per-instruction basis.