Cross-module function inlining

I've developed a working LLVM back-end (based on LLVM 2.6) for a custom architecture with its own tool chain. This tool chain creates stand-alone programs from a single assembly. We used to use GCC, which supported producing a single machine assembly from multiple source files.

I modified Clang to accept the architecture, but discovered that clang-cc (or the Clang Tool subclass inside Clang) doesn't allow multiple source files to be lowered to a single machine assembly. The ToolChain subclasses inside Clang make use of the normal system linker to combine multiple modules, but this isn't possible on our system.

So, I created a new Clang ToolChain subclass that forms a tool pipeline based on the following:
- Run the existing Clang tool on each source file, using -emit-llvm to generate a .bc file for each module.
- Run llvm-link to merge them into a single .bc file.
- Run llc to generate a complete machine assembly.
The last two were implemented together in a single Tool, performing the job of the linker. Optimisation options are passed on to each tool.
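
The three steps above can be sketched as a small shell script. This is only an illustration, not the actual ToolChain code: the file names (a.c, b.c, program.bc, program.s) and the helper names `run` and `build_single_assembly` are hypothetical, and the commands are printed rather than executed unless RUN=1 is set, since running them needs the LLVM 2.6-era tools on the PATH.

```shell
#!/bin/sh
# Print each command; execute it only when RUN=1 is set in the environment.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical helper: compile each source to bitcode, merge, then lower once.
build_single_assembly() {
    objs=""
    for src in "$@"; do
        bc="${src%.c}.bc"
        # Step 1: one bitcode module per source file.
        run clang -O3 -emit-llvm -c "$src" -o "$bc" || return 1
        objs="$objs $bc"
    done
    # Step 2: merge all modules into a single bitcode file.
    # $objs is left unquoted on purpose so it splits into one word per file.
    run llvm-link $objs -o program.bc || return 1
    # Step 3: lower the merged module to a single machine assembly file.
    run llc program.bc -o program.s
}

build_single_assembly a.c b.c
```

Note that nothing in this sequence runs an inliner over the merged module, which matches the behaviour described below.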

This does the trick.

However, with optimisations enabled, the resulting code is not as efficient as it would be if all the code were in a single module. In particular, function inlining is only performed by clang (i.e. only on a module-by-module basis), and not by llvm-link or llc. This can be seen in the resulting pass options with -O3 (obtained using '-Xclang -debug-only=Execution' and '-Xlinker -debug-only=Execution'):

Clang:
Pass Arguments: -raiseallocs -simplifycfg -domtree -domfrontier -mem2reg -globalopt -globaldce -ipconstprop -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -functionattrs -inline -argpromotion -simplify-libcalls -instcombine -jump-threading -simplifycfg -domtree -domfrontier -scalarrepl -instcombine -break-crit-edges -condprop -tailcallelim -simplifycfg -reassociate -domtree -loops -loopsimplify -domfrontier -lcssa -loop-rotate -licm -lcssa -loop-unswitch -instcombine -scalar-evolution -lcssa -iv-users -indvars -loop-deletion -lcssa -loop-unroll -instcombine -memdep -gvn -memdep -memcpyopt -sccp -instcombine -break-crit-edges -condprop -domtree -memdep -dse -adce -simplifycfg -strip-dead-prototypes -print-used-types -deadtypeelim -constmerge

llc:
Pass Arguments: -preverify -domtree -verify -loops -loopsimplify -scalar-evolution -iv-users -loop-reduce -lowerinvoke -unreachableblockelim -codegenprepare -stack-protector -machine-function-analysis -machinedomtree -machine-loops -machinelicm -machine-sink -unreachable-mbb-elimination -livevars -phi-node-elimination -twoaddressinstruction -liveintervals -simple-register-coalescing -livestacks -virtregmap -linearscan-regalloc -stack-slot-coloring -prologepilog -machinedomtree -machine-loops -machine-loops

I'm sure I can hack away to manually add these passes, but I'd prefer an informed opinion on the best way to achieve this, or if there's a more proper way to achieve the same thing (i.e. inter-module function inlining).

Also, I've noticed another problem with this approach: when function declarations are 'inline __attribute__((always_inline))' in header files, where the corresponding function definition is in a separate module to where the function is being called, LLVM will not inline the function call at the call site, but will happily strip away the function body, resulting in broken code. Is there a way to stop this?

Any guidance is much appreciated.

Regards,

- Mark

Mark Muir wrote:

However, with optimisations enabled, the resulting code is not as efficient as it would be if all the code were in a single module. In particular, function inlining is only performed by clang (i.e. only on a module-by-module basis), and not by llvm-link or llc. This can be seen in the resulting pass options with -O3 (obtained using '-Xclang -debug-only=Execution' and '-Xlinker -debug-only=Execution'):

It sounds like you're not running the LTO optimizations. You could try replacing llvm-link with llvm-ld, which does run them, or running 'opt -std-link-opts' between llvm-link and llc.

Clang:
Pass Arguments: -raiseallocs -simplifycfg -domtree -domfrontier -mem2reg -globalopt -globaldce -ipconstprop -deadargelim -instcombine -simplifycfg -basiccg -prune-eh -functionattrs -inline -argpromotion -simplify-libcalls -instcombine -jump-threading -simplifycfg -domtree -domfrontier -scalarrepl -instcombine -break-crit-edges -condprop -tailcallelim -simplifycfg -reassociate -domtree -loops -loopsimplify -domfrontier -lcssa -loop-rotate -licm -lcssa -loop-unswitch -instcombine -scalar-evolution -lcssa -iv-users -indvars -loop-deletion -lcssa -loop-unroll -instcombine -memdep -gvn -memdep -memcpyopt -sccp -instcombine -break-crit-edges -condprop -domtree -memdep -dse -adce -simplifycfg -strip-dead-prototypes -print-used-types -deadtypeelim -constmerge

This pass list is fine, it's equivalent to 'opt -std-compile-opts'.

Nick
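
The second suggestion, running 'opt -std-link-opts' between llvm-link and llc, might look roughly like the sketch below. The file names and the helper names `run` and `link_with_lto` are hypothetical; the commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Print each command; execute it only when RUN=1 is set in the environment.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical helper: $1 is the output base name, the rest are .bc inputs.
link_with_lto() {
    base="$1"; shift
    # Merge the per-file modules into one.
    run llvm-link "$@" -o "$base.linked.bc" || return 1
    # Run the standard link-time optimisations on the merged module,
    # so that they can see across the original module boundaries.
    run opt -std-link-opts "$base.linked.bc" -o "$base.opt.bc" || return 1
    # Lower the optimised module to a single machine assembly file.
    run llc "$base.opt.bc" -o "$base.s"
}

link_with_lto program a.bc b.bc
```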

Yep, that sorted inlining. Thanks.

But... now there's a small problem with library calls. Symbols such as 'memset', 'malloc', etc. are being removed by global dead code elimination. They are implemented in one of the bitcode modules that are linked together (implementations are based on newlib). I get the same behaviour of them being stripped even when they are live, by the following:

opt -internalize -globaldce

Other (not standard-library) functions implemented in different modules than where they are called, are correctly seen as live. So, could this be something to do with what is declared as a built-in? I haven't provided any list of built-ins (or overridden the defaults), nor could I figure out how exactly to do that.

I've also noticed other problems related to built-ins - in one example, code made use of abs(), but didn't #include <stdlib.h>. It compiled without warning or error, but the generated code was broken, because the arguments were not seen as live, e.g.:

Without #include <stdlib.h>:

  0x181e8b0: i32 = TargetGlobalAddress <i32 (...)* @abs> 0 [TF=1]
=> JUMP_CALLi <ga:abs>[TF=1], %r2<imp-def>, %r3<imp-def>, %r4<imp-def,dead>, %r5<imp-def,dead>, %r6<imp-def,dead>, %r7<imp-def,dead>, %r8<imp-def,dead>, %r9<imp-def,dead>, %r10<imp-def,dead>

With #include <stdlib.h>:

  0x181e8b0: i32 = TargetGlobalAddress <i32 (i32)* @abs> 0 [TF=1]
=> JUMP_CALLi <ga:abs>[TF=1], %r3<kill>, %r2<imp-def>, %r3<imp-def>, %r4<imp-def,dead>, %r5<imp-def,dead>, %r6<imp-def,dead>, %r7<imp-def,dead>, %r8<imp-def,dead>, %r9<imp-def,dead>, %r10<imp-def,dead>

Where r2 is the link register, and r3 to r10 are argument/retval registers. LowerFormalArguments() doesn't see any arguments in the former, and consequently doesn't add input register nodes to the DAG.

I guess I need help with the concept of built-ins, and what code is related to them in the Clang driver and back-end.

Regards,

- Mark

Mark Muir wrote:

  • Run the existing Clang tool on each source file, using -emit-llvm to generate a .bc file for each module.
  • Run llvm-link to merge them into a single .bc file.
  • Run llc to generate a complete machine assembly.

However, with optimisations enabled, the resulting code is not as efficient as it would be if all the code were in a single module. In particular, function inlining is only performed by clang (i.e. only on a module-by-module basis), and not by llvm-link or llc.

It sounds like you’re not running the LTO optimizations. You could try replacing llvm-link with llvm-ld, which does run them, or running ‘opt -std-link-opts’ between llvm-link and llc.

Yep, that sorted inlining. Thanks.

But… now there’s a small problem with library calls. Symbols such as ‘memset’, ‘malloc’, etc. are being removed by global dead code elimination. They are implemented in one of the bitcode modules that are linked together (implementations are based on newlib).

And what problems does that cause? If malloc is linked in, we’re free to inline it everywhere and delete the symbol. If you meant for it to be visible to the optimizers but you don’t want it to be part of the code generated for your program (i.e., you’ll link it against newlib later), you should mark the functions with available_externally linkage.

I get the same behaviour of them being stripped even when they are live, by the following:

opt -internalize -globaldce

Other (not standard-library) functions implemented in different modules than where they are called, are correctly seen as live. So, could this be something to do with what is declared as a built-in? I haven’t provided any list of built-ins (or overridden the defaults), nor could I figure out how exactly to do that.

Alternately, if you wanted malloc, memset and friends to be externally visible (compiled as part of your program and dlsym’able), you could create a public API file which contains a one-per-line list of the names of the functions that may not be marked internal linkage by internalize. Pass that to opt with -internalize-public-api-file filename …other flags…

Nick
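
A public API file of the kind described above could be produced and used roughly as follows. This is a sketch under assumptions: the symbol names listed (main, malloc, free, memset) are just the ones mentioned in this thread and would need to match the names as they appear in the IR, the file names and the helper names `run` and `write_public_api` are hypothetical, and the opt command is only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Print each command; execute it only when RUN=1 is set in the environment.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical helper: write one symbol name per line to the file in $1.
write_public_api() {
    file="$1"; shift
    printf '%s\n' "$@" > "$file"
}

api_file=$(mktemp)
write_public_api "$api_file" main malloc free memset

# Symbols named in the file keep external linkage, so a following
# -globaldce has no licence to delete them.
run opt -internalize-public-api-file "$api_file" -internalize -globaldce \
    program.bc -o program.dce.bc
```

The maintenance cost is that the list has to be kept in step with whichever library functions the program is expected to export.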

But… now there’s a small problem with library calls. Symbols such as ‘memset’, ‘malloc’, etc. are being removed by global dead code elimination. They are implemented in one of the bitcode modules that are linked together (implementations are based on newlib).

And what problems does that cause? If malloc is linked in, we’re free to inline it everywhere and delete the symbol. If you meant for it to be visible to the optimizers but you don’t want it to be part of the code generated for your program (i.e., you’ll link it against newlib later), you should mark the functions with available_externally linkage.

Sorry, I should’ve been more clear - the calls to _malloc and _free weren’t being inlined (see example below). I’m not sure why (happens with or without -simplify-libcalls). So, the resulting .bc file from ‘opt’ contains live references to symbols that were in its input .bc, but for some reason it stripped them.

#include <stdlib.h>

int entries = 3;
int result;

int main()
{
    int i;

    // Allocate and populate the initial array.
    int* values = malloc(entries * sizeof(int));
    for (i = 0; i < entries; i++)
        values[i] = i + 1;

    // Calculate the sum, using a dynamically allocated accumulator.
    int* acc = malloc(sizeof(int));
    *acc = 0;
    for (i = 0; i < entries; i++)
        *acc += values[i];
    result = *acc;

    // Deallocate the memory.
    free(values);
    free(acc);

    return 0;
}

Here’s a fragment of the final machine assembly (with -O3):

_main:
ADDCOMP out=r1 in1=r1 in2=4 conf=`ADDCOMP_SUB
WMEM in=r2 in_addr=r1 conf=`WMEM_SI
CONST_16B out=r3 conf=12
JUMP nl_out=r2/*RA*/ addr_in=&_malloc conf=`JUMP_ALWAYS_ABS // Call

In case this is important, here are the relevant declarations from the ‘stdlib.h’ that is in use:

_PTR _EXFUN(malloc,(size_t __size));
_VOID _EXFUN(free,(_PTR));

where:

#define _PTR void *
#define _EXFUN(name, proto) name proto

and from ‘newlib.c’:

void *
malloc (size_t sz)
{

}

i.e. they look like any other function call, which is why I suspect it has something to do with special behaviour given to built-ins.

Alternately, if you wanted malloc, memset and friends to be externally visible (compiled as part of your program and dlsym’able), you could create a public API file which contains a one-per-line list of the names of the functions that may not be marked internal linkage by internalize. Pass that to opt with -internalize-public-api-file filename …other flags…

I saw that. I was thinking of only using that option as a last resort, due to maintainability.

I guess I need help with the concept of built-ins, and what code is related to them in the Clang driver and back-end.

Thanks.

- Mark

Mark Muir wrote:

    But... now there's a small problem with library calls. Symbols
    such as 'memset', 'malloc', etc. are being removed by global dead
    code elimination. They are implemented in one of the bitcode
    modules that are linked together (implementations are based on
    newlib).

And what problems does that cause? If malloc is linked in, we're free
to inline it everywhere and delete the symbol. If you meant for it to
be visible to the optimizers but you don't want it to be part of the
code generated for your program (i.e., you'll link it against newlib
later), you should mark the functions with available_externally linkage.

Sorry, I should've been more clear - the calls to _malloc and _free
weren't being inlined (see example below). I'm not sure why (happens
with or without -simplify-libcalls). So, the resulting .bc file from
'opt' contains live references to symbols that were in its input .bc,
but for some reason it stripped them.

Okay. Could you post an .ll (run 'llvm-dis < foo.bc') example of where this happens? Just the input and the opt commands to run are fine. It's very frustrating to look at C and assembly when the problem is in the IR -> IR transform itself.

Nick

I've attached the relevant IR (stripped down to the bare minimum). The following commands will reproduce the problem (using vanilla 2.6 versions of the LLVM tools):

  llvm-as test_malloc.ll -o - | opt -std-link-opts -o - | llvm-dis -o -

That strips everything except for @main. The stripping of the two global variables is fine, and there are no references to them left in the IR. But there are live references to @malloc and @free.

The minimum options required for this behaviour are:

  llvm-as test_malloc.ll -o - | opt -internalize -globaldce -o - | llvm-dis -o -

If I use -disable-internalize with -std-link-opts, then global dead code elimination doesn't remove anything, but inlining still takes place. So that is the solution I'm using at the moment. But I'd like to know why this behaviour is happening, and it would be nice to have global DCE so that the resulting machine assembly is easier to work with (for manual debugging on this architecture).

Thanks for looking at this.

Regards,

- Mark

test_malloc.ll.bz2 (2.36 KB)

Mark Muir wrote:

I've attached the relevant IR (stripped down to the bare minimum). The following commands will reproduce the problem (using vanilla 2.6 versions of the LLVM tools):

  llvm-as test_malloc.ll -o - | opt -std-link-opts -o - | llvm-dis -o -

That strips everything except for @main. The stripping of the two global variables is fine, and there are no references to them left in the IR. But there are live references to @malloc and @free.

The minimum options required for this behaviour are:

  llvm-as test_malloc.ll -o - | opt -internalize -globaldce -o - | llvm-dis -o -

If I use -disable-internalize with -std-link-opts, then global dead code elimination doesn't remove anything, but inlining still takes place. So that is the solution I'm using at the moment. But I'd like to know why this behaviour is happening, and it would be nice to have global DCE so that the resulting machine assembly is easier to work with (for manual debugging on this architecture).

Thanks for looking at this.

Thanks, I think it's now pretty clear what's going on. The .ll you posted has a @free function with no calls to it. Since it's never called, it can be deleted after -internalize.

What happened to the free() calls in your C code is that they were turned into 'free' instructions in LLVM IR, so no call to @free is left. You can prevent that by passing -ffreestanding, but be aware that this may cause other optimizations to be missed, as clang/gcc will stop assuming that functions with certain names do certain things.

Nick