Function calls keep increasing the stack usage

Hi everyone,

I found that LLVM generates redundant code when calling functions with constant parameters, with optimizations disabled.

Consider the following C code snippet:

int foo(int x, int y);

void bar()
{
foo(1, 2);
foo(3, 4);
}

Clang/LLVM 6.0 generates the following assembly code:
_bar:
subl $32, %esp
movl $1, %eax
movl $2, %ecx
movl $1, (%esp)
movl $2, 4(%esp)
movl %eax, 28(%esp)
movl %ecx, 24(%esp)
calll _foo
movl $3, %ecx
movl $4, %edx
movl $3, (%esp)
movl $4, 4(%esp)
movl %eax, 20(%esp)
movl %ecx, 16(%esp)
movl %edx, 12(%esp)
calll _foo
movl %eax, 8(%esp)
addl $32, %esp
retl

Note how the constants are stored in registers but when saving the parameters on the stack for the call the immediate values are used. The registers are still stored on the stack probably because it’s the caller’s responsibility once they were used (which seems expected).
I think the problem comes from the fact that LLVM unconditionally allocates a register for each parameter value regardless if it’s used later or not.
If the stack space of the program is sufficiently large this is probably not a problem, but otherwise if there is a large number of such calls, despite not recursive, it can lead to stack overflow. Do you think I should create a bug report for this?

(Similarly, the return value of the function could be not saved but the LLVM IR code that Clang generates has the call with assignment so at this point LLVM couldn’t possibly know.

define void @bar() #0 {
%call = call i32 @foo(i32 1, i32 2)
%call1 = call i32 @foo(i32 3, i32 4)
ret void
}

)

Thanks,
Alpar

Can’t say I’ve observed that behavior (though I’m just building from top-of-tree rather than 6.0, compiling for x86-64 on linux), perhaps you could provide more detail (what target are you compiling for - possibly provide the -cc1 command line, etc).

bar: # @bar
.cfi_startproc

%bb.0: # %entry

pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
subq $16, %rsp
movl $1, %edi
movl $2, %esi
callq foo
movl $3, %edi
movl $4, %esi
movl %eax, -4(%rbp) # 4-byte Spill
callq foo
movl %eax, -8(%rbp) # 4-byte Spill
addq $16, %rsp
popq %rbp
.cfi_def_cfa %rsp, 8
retq

Or on 32-bit X86:

bar: # @bar
.cfi_startproc

%bb.0: # %entry

pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
subq $16, %rsp
movl $1, %edi
movl $2, %esi
callq foo
movl $3, %edi
movl $4, %esi
movl %eax, -4(%rbp) # 4-byte Spill
callq foo
movl %eax, -8(%rbp) # 4-byte Spill
addq $16, %rsp
popq %rbp
.cfi_def_cfa %rsp, 8
retq

I think this is a bug in fastisel. It sometimes generates unnecessary materializations like this. It’s unclear why they end up getting spilled, though.

Sorry I missed that important detail. The relevant part of the command line is:
-cc1 -S -triple i386-pc-win32

I don’t expect it matters if it’s for Windows or Linux in this case.

Still not seeing it on ToT, so maybe it’s been fixed?

$ clang-tot -cc1 -S -triple i386-pc-win32 stack.c

_bar:
subl $16, %esp
movl $1, (%esp)
movl $2, 4(%esp)
calll _foo
movl $3, (%esp)
movl $4, 4(%esp)
movl %eax, 12(%esp)
calll _foo
movl %eax, 8(%esp)
addl $16, %esp
retl

$ clang-tot --version
clang version 8.0.0 (trunk 342200) (llvm/trunk 342202)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Thanks for confirming that it seems to be a bug.
I was studying the code you mentioned and X86FastISel::fastLowerCall() calls getRegForValue() for each argument. Then the second iteration over the arguments (after it called CCInfo.AnalyzeCallOperands()) may decide to call X86FastEmitStore() if not VA.isRegLoc() and ArgVal is a constant (ignoring the register, but that was already marked used I guess by getRegForValue).
http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86FastISel.cpp?view=markup

Could we maybe delay the call to getRegForValue() after the analyze so that we don’t call it when not needed?

Thanks for checking, I suppose it may have been fixed then, I don’t have the latest version to try it now.
Curious what could have fixed it, because X86FastISel::fastLowerCall() still has the calls to getRegForValue() (or maybe that’s not the problem).

I suspect it was fixed by my local value sinking change, which delete unused local value materializations like these.

  • Questions like this are better suited for bugs.llvm.org

  • Please provide the full clang invocation. In this case I suspect your problem will go away if you use -O2 or similar.

  • Matthias

Not sure - I think it’s pretty fine for discussion here, when someone’s not sure it’s a bug/intended, etc.

  • Please provide the full clang invocation. In this case I suspect your problem will go away if you use -O2 or similar.

Right, I think the original post calls this out as specifically being about -O0.

  • Dave

Not sure - I think it’s pretty fine for discussion here, when someone’s not sure it’s a bug/intended, etc.

  • Please provide the full clang invocation. In this case I suspect your problem will go away if you use -O2 or similar.

Right, I think the original post calls this out as specifically being about -O0.

Ah sorry, missed that bit.

It’s a bit harder to argue about unoptimize/-O0 code quality. If you can improve the generated code without making the compiler slower (and without increasing complexity) then that may be a change we want to make, but generally we rather want fast/simple code for the -O0 case.
Concrete case here does indeed come out with dead values out of fast-isel. Deciding whether this can be improved inside fast-isel or whether adding a dead code elimination pass will affect compile time would need someone to investigate/benchmark…

  • Matthias