RegAllocFast uses too much stack

I discovered recently that RegAllocFast spills all the registers before every function call. This is the root cause of one of our recursive functions that takes about 150 bytes of stack when built with gcc (same at -O0 and -O2, or 120 bytes at llc -O2) taking 960 bytes of stack when built by llc -O0. That’s pretty bad for situations where you have small stacks, which is not uncommon for threaded software on a 32-bit architecture, or a signal handler.

I realize that some of this is intentional and we don’t want to do optimization at -O0, but I’m really hoping there’s something we could sensibly do to improve this. Here’s an example:

extern “C” void foo(int);

void test() {
foo(0);
foo(1);
foo(2);
}

This doesn’t just spill out all the registers to the stack before each call, we also set up 0, 1 and 2 into regs first, then spill them and don’t even get a chance to reuse stack slots. That’s just bad:

pushq %rax

movl $2, %edi

movl $1, %eax
movl $0, %ecx
movl %edi, 4(%rsp) # 4-byte Spill
movl %ecx, %edi
movl %eax, (%rsp) # 4-byte Spill
callq foo
movl (%rsp), %edi # 4-byte Reload
callq foo
movl 4(%rsp), %edi # 4-byte Reload
callq foo
popq %rax
ret

Does anyone have any ideas what we could do that doesn’t add to the compile time?

Nick

Does anyone have any ideas what we could do that doesn't add to the
compile time?

There has been some discussion about it on PR10075.

Nick

Cheers,
Rafael

We probably won't be having virtual registers live across calls. That requires quite a bit of analysis which would be too expensive compile time wise.

Which means that we should avoid creating live ranges spanning calls. This is what fast isel does now:

# After Instruction Selection:
# Machine code for function test:

BB#0: derived from LLVM BB %entry
        %vreg0<def> = MOV32ri 2; GR32:%vreg0
        %vreg2<def> = MOV32ri 1; GR32:%vreg2
        %vreg4<def> = MOV32ri 0; GR32:%vreg4
        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
        %EDI<def> = COPY %vreg4; GR32:%vreg4
        %AL<def> = MOV8ri 0
        CALL64pcrel32 <ga:@foo>, %AL, %EDI, %RAX<imp-def>, %RDI<imp-def,dead>, %RSP<imp-use>, ...
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
        %vreg5<def> = COPY %EAX; GR32:%vreg5
        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
        %EDI<def> = COPY %vreg2; GR32:%vreg2
        %AL<def> = MOV8ri 0
        CALL64pcrel32 <ga:@foo>, %AL, %EDI, %RAX<imp-def>, %RDI<imp-def,dead>, %RSP<imp-use>, ...
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
        %vreg3<def> = COPY %EAX; GR32:%vreg3
        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
        %EDI<def> = COPY %vreg0; GR32:%vreg0
        %AL<def> = MOV8ri 0
        CALL64pcrel32 <ga:@foo>, %AL, %EDI, %RAX<imp-def>, %RDI<imp-def,dead>, %RSP<imp-use>, ...
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
        %vreg1<def> = COPY %EAX; GR32:%vreg1
        RET

I don't know why it materializes those constants at the top of the block. I think they should just be materialized immediately before they are used.

/jakob

This seems odd. I'd think that fast-isel should be able to materialize the constants when we want them rather than at the beginning of the block.

-eric

I'm not entirely sure why, but FastISel does intentionally materialize
constants at the beginning of the block. See
FastISel::enterLocalValueArea etc. Maybe Dan knows why?

-Eli

Yeah, I'd seen that. Curious. Might be a remnant of the old top down method to avoid materializing constants too many times.

-eric

Going bottom-up, FastISel doesn't know when it'll see the first use of a value
in a block. Cleverer schemes are possible.

Dan

Or less clever by not caching the result? :slight_smile:

-eric

At least, we shouldn't cache stuff across function calls.

/jakob

Special casing "instructions that clobber everything", or just calls, would make a lot of sense.

-Chris