Compilation benchmark: bzip2

I decided to measure clang's performance by compiling bzip2.

bzip2 is available from http://www.bzip.org/

The Makefile used for the benchmark is here:
http://sparcs.kaist.ac.kr/~tinuviel/devel/llvm/Makefile.bzip2

Intel(R) Core(TM)2 CPU T7200
Debian GNU/Linux Sid up-to-date 2007-12-23
bzip2 1.0.4

LLVM and clang SVN r45330 Release
GCC (Debian 4.2.2-4)
tcc (0.9.24)

Results (minimum of 3 runs):

tinuviel@debian:~/clang$ tar zxf src/bzip2-1.0.4.tar.gz
tinuviel@debian:~/clang$ cp make/Makefile.bzip2 bzip2-1.0.4/Makefile
tinuviel@debian:~/clang$ cd bzip2-1.0.4

tinuviel@debian:~/clang/bzip2-1.0.4$ make -s

GCC: real 0.613s user 0.556s sys 0.052s
tcc: real 0.046s user 0.028s sys 0.012s
GCC -S: real 0.555s user 0.488s sys 0.052s
clang: real 0.298s user 0.248s sys 0.040s
clang+llvm-as: real 0.636s user 0.576s sys 0.048s

Analysis:

clang -emit-llvm is about twice as fast as gcc -S.
tcc has an internal assembler.

clang+llvm-as is a bit slower than gcc.
clang+llvm-as spent about half of its time in the assembler.
gcc spent less than 10% of its time in the assembler.
tcc is more than ten times faster than clang or gcc.
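
One way to isolate the assembler's share directly (commands are assumed
here, not taken from the benchmark Makefile; blocksort.c is one of the
bzip2 sources, and as is the GNU assembler):

   time gcc -O0 -S blocksort.c -o blocksort.s   # compile to assembly only
   time as blocksort.s -o blocksort.o           # assemble separately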

I assume the clang binary was an optimized "release" build?

snaroff

Of course. (I stated so in the post...)

I decided to measure clang's performance by compiling bzip2.
bzip2 is available from http://www.bzip.org/
The Makefile used for the benchmark is here:
http://sparcs.kaist.ac.kr/~tinuviel/devel/llvm/Makefile.bzip2

Cool

Results (minimum of 3 runs):

tinuviel@debian:~/clang$ tar zxf src/bzip2-1.0.4.tar.gz
tinuviel@debian:~/clang$ cp make/Makefile.bzip2 bzip2-1.0.4/Makefile
tinuviel@debian:~/clang$ cd bzip2-1.0.4

tinuviel@debian:~/clang/bzip2-1.0.4$ make -s

GCC: real 0.613s user 0.556s sys 0.052s
tcc: real 0.046s user 0.028s sys 0.012s
GCC -S: real 0.555s user 0.488s sys 0.052s
clang: real 0.298s user 0.248s sys 0.040s
clang+llvm-as: real 0.636s user 0.576s sys 0.048s

Just so I understand what is going on here:
   "GCC" -> "gcc -O0 -c"
   "GCC -S" -> "gcc -O0 -S"
   "tcc" -> "tcc -c"
   "clang" -> clang -emit-llvm
   "clang+llvm-as" -> clang -emit-llvm | llvm-as

These are interesting numbers, but not very relevant. GCC -S is doing a *lot* more than clang -emit-llvm. To get a useful comparison of gcc vs. clang codegen, you'd need to link the llvm code generator into clang to get it to emit a native .s file. Likewise, if you want "clang emission of llvm bytecode", you should link the bytecode writer into clang instead of using llvm-as (which is obviously not very fast). A really rough functional approximation would be "clang -emit-llvm | llvm-as | llc -fast -regalloc=local", but this will obviously be much slower than linking the llc components into clang.
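
Spelled out per file, that rough approximation would look something
like this (file name assumed; this also assumes llc reads the bitcode
on stdin):

   # emit LLVM assembly, assemble to bitcode, then run the fast codegen path
   clang bzlib.c -emit-llvm | llvm-as | llc -fast -regalloc=local -o bzlib.s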

To me, the one interesting thing out of this is that the difference between gcc -c and gcc -S is ~14%. That's a pretty big cost just for an assembler. Maybe someone should work on finishing the llvm .o file emitter at some point ;-).

clang+llvm-as spent about half of its time in the assembler.
gcc spent less than 10% of its time in the assembler.

Right, but these assemblers are not the same thing at all :)

At this point, llvm -O0 code generation is not nearly as fast as it should be. That said, our -O2 or -O3 codegen is a lot faster than GCC in most cases. In the future, we'll put more effort into making -O0 codegen fast.

-Chris

Okay, is this planned in the near future, or can I start looking at it
now? It shouldn't be difficult, right?

Chris Lattner wrote:

I decided to measure clang's performance by compiling bzip2.
bzip2 is available from http://www.bzip.org/
The Makefile used for the benchmark is here:
http://sparcs.kaist.ac.kr/~tinuviel/devel/llvm/Makefile.bzip2

Cool

Results (minimum of 3 runs):

tinuviel@debian:~/clang$ tar zxf src/bzip2-1.0.4.tar.gz
tinuviel@debian:~/clang$ cp make/Makefile.bzip2 bzip2-1.0.4/Makefile
tinuviel@debian:~/clang$ cd bzip2-1.0.4

tinuviel@debian:~/clang/bzip2-1.0.4$ make -s

GCC: real 0.613s user 0.556s sys 0.052s
tcc: real 0.046s user 0.028s sys 0.012s
GCC -S: real 0.555s user 0.488s sys 0.052s
clang: real 0.298s user 0.248s sys 0.040s
clang+llvm-as: real 0.636s user 0.576s sys 0.048s

Just so I understand what is going on here:
   "GCC" -> "gcc -O0 -c"
   "GCC -S" -> "gcc -O0 -S"
   "tcc" -> "tcc -c"
   "clang" -> clang -emit-llvm
   "clang+llvm-as" -> clang -emit-llvm | llvm-as

These are interesting numbers, but not very relevant. GCC -S is doing a *lot* more than clang -emit-llvm. To get a useful comparison of gcc vs. clang codegen, you'd need to link the llvm code generator into clang to get it to emit a native .s file. Likewise, if you want "clang emission of llvm bytecode", you should link the bytecode writer into clang instead of using llvm-as (which is obviously not very fast). A really rough functional approximation would be "clang -emit-llvm | llvm-as | llc -fast -regalloc=local", but this will obviously be much slower than linking the llc components into clang.

To me, the one interesting thing out of this is that the difference between gcc -c and gcc -S is ~14%. That's a pretty big cost just for an assembler. Maybe someone should work on finishing the llvm .o file emitter at some point ;-).

Darn. I was hoping it was further along. I was planning on using it. :)
How much work do you think is left?

clang+llvm-as spent about half of its time in the assembler.
gcc spent less than 10% of its time in the assembler.

Right, but these assemblers are not the same thing at all :)

Indeed!

At this point, llvm -O0 code generation is not nearly as fast as it should be. That said, our -O2 or -O3 codegen is a lot faster than GCC in most cases. In the future, we'll put more effort into making -O0 codegen fast.

Speed is a Really Good Thing, of course. That's one reason why I'm writing the ellsif driver (getting everything done in in-memory passes).

Having said that, I've worked for a *long* time in environments full of people who use compilers (I've been both a compiler vendor and a compiler user), and in all that time I've heard exactly one complaint about compilation speed, and it involved linking. That complaint was from some guys using my linker on an early-nineties workstation (either Sparc or hppa, I can't remember which). Systems are a bit faster today. ;) People are much more concerned about compiler correctness, informative error messages, and debug capability. I cringe every time I get one of those g++ overload messages that forces me to expand my window to 300 columns just to read it. ;)

-Rich

Sanghyeon Seo wrote:

Richard Pennington wrote:
[snip]

Speed is a Really Good Thing, of course. That's one reason why I'm writing the ellsif driver (getting everything done in in-memory passes).

Here are two comparisons of compilation using -O0 and -O5. I'm not sure if these optimization levels are the same as clang's. The sources have been preprocessed by gcc. The only file I/O is reading the .i files; the rest is done in memory. The last few lines with zero timings haven't been implemented yet, but the numbers are fun nonetheless:

[~/elsa/ellsif] dev% ./ellsif -v test/ofmt.i test/sieve.i -time-actions -O5
<premain>: CommandLine Error: Argument 'machine-licm' defined more than once!
ellsif: CommandLine Error: Argument 'machine-licm' defined more than once!
Adding test/ofmt.i as a preprocessed C file
Adding test/sieve.i as a preprocessed C file
Phase: Preprocessing
   test/ofmt.i is ignored during this phase
   test/sieve.i is ignored during this phase
Phase: Translation
   compile test/ofmt.i to become an unoptimized LLVM bitcode file
typechecking results:
   errors: 0
   warnings: 0
   compile test/sieve.i to become an unoptimized LLVM bitcode file
typechecking results:
   errors: 0
   warnings: 0
Phase: Optimization
   optimize ofmt.ubc to become an LLVM bitcode file
   optimize sieve.ubc to become an LLVM bitcode file
Phase: Bitcode linking
   bclink ofmt.bc to become a file that has been linked
   bclink sieve.bc to become a file that has been linked
   bclink a.bc added to the file list
Phase: Bitcode to assembly
   ofmt.bc is ignored during this phase
   sieve.bc is ignored during this phase
   bcassemble a.bc to become an assembly source file
Phase: Assembly
   ofmt.bc is ignored during this phase
   sieve.bc is ignored during this phase
   assemble a.s to become an object file, linker command file, etc
Phase: Linking
   ofmt.bc is ignored during this phase
   sieve.bc is ignored during this phase
   link a.o to become a file that has been linked

Richard Pennington wrote:

Richard Pennington wrote:
[snip]

More data: I now have the ellsif driver doing all the steps except preprocessing. The first steps are done in memory:
1. Compile (C -> bitcode) each file.
2. Optimize each bitcode module.
3. Link the bitcode modules and optimize the result.

The bitcode is then converted to assembly and sent to disk as a .s file.
The final action is to use gcc to assemble and link the result.

[~/elsa/ellsif] dev% time gcc -O5 -o gcc.out -std=c9x test/sieve.i test/ofmt.i
1.228u 0.080s 0:01.38 94.2% 0+0k 0+0io 0pf+0w
[~/elsa/ellsif] dev% time ./ellsif test/ofmt.i test/sieve.i -O5
<premain>: CommandLine Error: Argument 'machine-licm' defined more than once!
ellsif: CommandLine Error: Argument 'machine-licm' defined more than once!
0.852u 0.044s 0:00.90 98.8% 0+0k 0+0io 0pf+0w
[~/elsa/ellsif] dev%

As you can see, at -O5 ellsif is slightly faster. This is a pretty small example; I'll try it on bzip2 soon.

Here are the ellsif phases and the timing by action:

[~/elsa/ellsif] dev% time ./ellsif test/ofmt.i test/sieve.i -O5 -time-actions -v
<premain>: CommandLine Error: Argument 'machine-licm' defined more than once!
ellsif: CommandLine Error: Argument 'machine-licm' defined more than once!
Adding test/ofmt.i as a preprocessed C file
Adding test/sieve.i as a preprocessed C file
Phase: Preprocessing
   test/ofmt.i is ignored during this phase
   test/sieve.i is ignored during this phase
Phase: Translation
   compile test/ofmt.i to become an unoptimized LLVM bitcode file
   compile test/sieve.i to become an unoptimized LLVM bitcode file
Phase: Optimization
   optimize ofmt.ubc to become an LLVM bitcode file
   optimize sieve.ubc to become an LLVM bitcode file
Phase: Bitcode linking
   bclink ofmt.bc to become a file that has been linked
   bclink sieve.bc to become a file that has been linked
   bclink a.bc added to the file list
Phase: Generating
   ofmt.bc is ignored during this phase
   sieve.bc is ignored during this phase
   generate a.bc to become an assembly source file
Phase: Linking
   ofmt.bc is ignored during this phase
   sieve.bc is ignored during this phase
   assemble a.s to become a file that has been linked
Generating Native Executable With:
'/usr/bin/gcc' '-fno-strict-aliasing' '-O3' '-o' 'a.out' 'a.s'

Sure, it should be really really easy: just link in the bcwriter library, and call WriteBitcodeToFile (from llvm/Bitcode/ReaderWriter.h).

There is a bigger question though: do we want to link more and more llvm libraries into clang at this point? In addition to the bitcode writer, you'd eventually want the codegen and target libraries as well. The bigger issue with this is that it increases link times of clang and most people aren't using it right now.

For now, if you want the bcwriter, I'd say go ahead and add it. If you want the target libraries though, I'd suggest building them together into a single "backend" dylib/so file that is loaded by clang. That way we can rebuild clang without relinking all the llvm pieces.

-Chris

There is a bigger question though: do we want to link more and more
llvm libraries into clang at this point? In addition to the bitcode
writer, you'd eventually want the codegen and target libraries as
well. The bigger issue with this is that it increases link times of
clang and most people aren't using it right now.

I'd like to point out that the bitcode writer was already linked into
clang (I suppose for use with AST serialization), and my patch didn't
increase link time at all.

For now, if you want the bcwriter, I'd say go ahead and add it. If
you want the target libraries though, I'd suggest building them
together into a single "backend" dylib/so file that is loaded by
clang. That way we can rebuild clang without relinking all the llvm
pieces.

How do I do that? :)

By the way, my idea was to have a wrapper script that behaves like
gcc: it would call clang -emit-bc where gcc does -c, and llvm-ld
-native where gcc links.
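
A minimal sketch of such a wrapper (untested; the argument handling is
deliberately crude, and a real wrapper would need to translate -c and
-o properly):

   #!/bin/sh
   # behave like gcc: compile sources to bitcode with clang,
   # link bitcode with llvm-ld
   if [ "$1" = "-c" ]; then
     shift
     exec clang -emit-bc "$@"
   else
     exec llvm-ld -native "$@"
   fi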

What is the difference between running opt on individual bitcode files
and running llvm-ld -O2 over all bitcode files?

There is a bigger question though: do we want to link more and more
llvm libraries into clang at this point? In addition to the bitcode
writer, you'd eventually want the codegen and target libraries as
well. The bigger issue with this is that it increases link times of
clang and most people aren't using it right now.

I'd like to point out that the bitcode writer was already linked into
clang (I suppose for use with AST serialization), and my patch didn't
increase link time at all.

Ok!

For now, if you want the bcwriter, I'd say go ahead and add it. If
you want the target libraries though, I'd suggest building them
together into a single "backend" dylib/so file that is loaded by
clang. That way we can rebuild clang without relinking all the llvm
pieces.

How do I do that? :)

I'm not sure what the best way is :).

By the way, my idea was to have a wrapper script that behaves like
gcc: it would call clang -emit-bc where gcc does -c, and llvm-ld
-native where gcc links.

Sure, that sounds like a good short-term solution. Longer term, Anton is working on revamping the 'llvmc' tool into a proper compiler driver, which should solve some of these problems.

What is the difference between running opt on individual bitcode files
and running llvm-ld -O2 over all bitcode files?

They run a very different set of optimization passes.
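
Concretely, the two workflows look something like this (the opt flag
is an assumption; at the time, opt's standard pass list was spelled
-std-compile-opts; file names are taken from the ellsif transcript
above):

   # per-module optimization: each file is optimized in isolation
   opt -std-compile-opts ofmt.ubc -o ofmt.bc
   # link-time optimization: modules are linked first, so interprocedural
   # passes can see across module boundaries
   llvm-ld -O2 -o a ofmt.bc sieve.bc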

-Chris