LLVM 3.4 performance regressed?

Hi,

It was suggested that I post my question regarding an LLVM 3.4 performance
regression to this mailing list, rather than stackoverflow. So here is
the link:

  https://stackoverflow.com/questions/22902034/llvm-3-4-performance-regressed

Thanks :)
Jens

Hi,

One reason for the regression might be that the SROA pass is now used instead of mem2reg; consider replacing your use of mem2reg with SROA.

I also think it’s meaningless to talk about performance without actually enabling any optimization passes. You mention “Inlining failed?” but don’t enable any passes that would inline functions. If performance matters for you, consider using -O3 or a similar flag.

Best,
Jonas

Hi,

(adding llvm-dev again)

I did enable optimization, but that didn't have an effect on the runtime
performance numbers.

Can you elaborate? For a program such as bzip2, I'd expect the program to be
at least twice as fast with -O3 as with -O0.

I also noticed that you use LLC in the final step.

This is something I'd find suspect as well: the correct sequence of
passes changes from time to time (certainly over several revisions).
I'd be more inclined to compare against a normal clang build (all the
way to object files, or LTO as described below), which would rule out
any dated pass sequencing you might have.
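
A minimal sketch of such a comparison, assuming the bzip2 source files and defines from the SPEC build log in this thread. The output names bzip2-O0 and bzip2-O3 and the RUN dry-run variable are hypothetical; by default the script only prints the commands, and setting RUN to an empty string executes them.

```shell
#!/bin/sh
# Sketch: build the same sources with plain clang at -O0 and -O3, bypassing
# any hand-written opt/llc pipeline, so the two binaries can be timed.
# Output names (bzip2-O0, bzip2-O3) and the RUN variable are hypothetical.
RUN=${RUN-echo}   # default: print commands; RUN= (empty) executes them

SRCS="spec.c blocksort.c bzip2.c bzlib.c compress.c crctable.c decompress.c huffman.c randtable.c"
DEFS="-std=c89 -D_GNU_SOURCE -DSPEC_CPU -DNDEBUG"

build_at() {
    # $1 is the optimization level without the dash, e.g. O0 or O3.
    $RUN clang "-$1" $DEFS $SRCS -o "bzip2-$1"
}

build_at O0
build_at O3
```

Timing both binaries on the same input then shows whether the optimizer itself has any effect, independent of the custom pass sequence.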

Thanks Jonas,

I wasn't aware of the gold linker plugin. Here's what I do, in my
current workflow. First, I use clang to compile each .c file (e.g. for
the bzip2 benchmark, or any other) into a .bc file:

  specmake clean 2> make.clean.err | tee make.clean.out
  rm -rf bzip2 bzip2.exe *.o *.fppized.f*
  find . \( -name \*.o -o -name '*.fppized.f*' \) -print | xargs rm -rf
  rm -rf core
  specmake build 2> make.err | tee make.out
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o spec.o -DSPEC_CPU -DNDEBUG spec.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o blocksort.o -DSPEC_CPU -DNDEBUG blocksort.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o bzip2.o -DSPEC_CPU -DNDEBUG bzip2.c
  bzip2.c:487:27: warning: incompatible pointer to integer conversion
  assigning to 'int' from 'void *' [-Wint-conversion]
     outputHandleJustInCase = NULL;
                          ^ ~~~~
  bzip2.c:614:27: warning: incompatible pointer to integer conversion
  assigning to 'int' from 'void *' [-Wint-conversion]
     outputHandleJustInCase = NULL;
                          ^ ~~~~
  2 warnings generated.
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o bzlib.o -DSPEC_CPU -DNDEBUG bzlib.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o compress.o -DSPEC_CPU -DNDEBUG compress.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o crctable.o -DSPEC_CPU -DNDEBUG crctable.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o decompress.o -DSPEC_CPU -DNDEBUG decompress.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o huffman.o -DSPEC_CPU -DNDEBUG huffman.c
  clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm -c -o randtable.o -DSPEC_CPU -DNDEBUG randtable.c

Once that's done, the SPEC "linker" step actually calls a script of mine
which uses llvm-link to merge all bitcode files into one, and then calls
opt.

Ordinarily this opt call would use

  -simplifycfg -mem2reg <my-passes>

At that point I played around with various optimization switches to find
out how I can get my performance back to that of LLVM 3.1 compiled code.
Using just plain -std-compile-opts to replace my command line didn't
work.

Once opt has produced an optimized bitcode file, I call llc to lower it.

Cheers,
Jens

Hi,

I wasn’t aware of the gold linker plugin.

It doesn’t seem to be that well known indeed… yet it’s been the best way to compile source to bitcode and bitcode to programs, in my experience. Here are a few more hints (that may also be useful to others who stumble upon this thread).

  • To generate bitcode files from Clang, use -flto (or -emit-llvm).
  • During linking, use -flto to accept bitcode files as input.
    Note that their extension matters (.bc gets compiled, LTO’d and linked, .o gets just LTO’d and linked).
  • You can obtain the merged bitcode file that corresponds to the final executable as follows:
      • On Mac, pass -Wl,-save-temps to the compiler. This will give you both a prog.lto.bc and a prog.lto.opt.bc file. One corresponds to the program before link-time optimization passes were applied, the other after link-time optimization.
      • On Linux, pass -Wl,-plugin-opt=also-emit-llvm instead. This only gives you the bitcode before link-time optimizations, unfortunately. You can use the attached patch to make the behavior more consistent between Linux and Mac.
  • If your program creates libraries, you can make them contain LLVM bitcode instead of regular object code.
    On Linux, this requires a few tweaks because nm, ar, ranlib etc. don’t know how to handle LLVM bitcode natively. Usually, passing the following to the configure script helps:
    RANLIB="ar -s --plugin=/path/to/llvm/lib/LLVMgold.so"
    AR="ar -cru --plugin=/path/to/llvm/lib/LLVMgold.so"
    NM="nm --plugin=/path/to/llvm/lib/LLVMgold.so"
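
Put together, the hints above might look like the following on Linux. This is only a sketch, assuming the gold linker and the LLVMgold.so plugin are installed and using the bzip2 sources from earlier in the thread. The -O3 level and the RUN dry-run variable are assumptions; by default the script only prints the commands, and setting RUN to an empty string executes them.

```shell
#!/bin/sh
# Sketch of an LTO build through clang and the gold plugin (Linux).
# The -O3 level and the RUN variable are illustrative assumptions.
RUN=${RUN-echo}   # default: print commands; RUN= (empty) executes them

SRCS="spec.c blocksort.c bzip2.c bzlib.c compress.c crctable.c decompress.c huffman.c randtable.c"

compile_lto() {
    # With -flto, each "object" file actually contains LLVM bitcode.
    for src in $SRCS; do
        $RUN clang -O3 -flto -c "$src" -o "${src%.c}.o"
    done
}

link_lto() {
    objs=""
    for src in $SRCS; do
        objs="$objs ${src%.c}.o"
    done
    # -flto at link time accepts the bitcode files; the plugin option also
    # writes out the merged (pre-LTO) bitcode for inspection.
    $RUN clang -O3 -flto $objs -o bzip2 -Wl,-plugin-opt=also-emit-llvm
}

compile_lto
link_lto
```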

Let us know if this helps. Also, if anybody knows better workflows, I’d be very interested.

Cheers,
Jonas

0001-Change-LTO-to-emit-optimized-BC-file.patch (3.01 KB)