Running LLVM Analysis on real-world projects.

I have made a few attempts to compile some software packages with LLVM.
My approach is to define Make variables as follows:

export AS=llvm-as
export LD=llvm-ld
export AR=llvm-ar
export CXX=llvm-g++

and then run configure and make.

This approach works with very small code bases only.

For most projects, the build bails out with errors. Some of them are
very clear, such as LLVM not supporting inline assembly yet. Others
do not make much sense. I should note that I first compile the same
code with gcc32, to check that it is acceptable to gcc v3.

Also, I am interested in compiling projects partially when full
compilation is not possible due to missing dependencies. For this, I
usually run make with the '-i' option, which does not stop on errors.
After that, I combine all the generated .o files using the 'llvm-ld
--link-as-library --disable-opt' command. This sometimes causes
problems, such as warnings about duplicate symbols and multiple
definitions of main. Is there a proper approach to running analysis on
a source base which contains .o files (bytecode files) generated by
llvm-gcc?
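
One way to make the merge step repeatable is sketched below, assuming a POSIX shell. The function names (list_objects, link_command), the exclude pattern and the output file name are made up for illustration. Only the --link-as-library and --disable-opt flags come from the post above; -o is assumed to name the output file. The sketch prints the llvm-ld command instead of running it, skips zero-length .o files, and leaves out any path matching the pattern (for example, objects that each define their own main).

```shell
#!/bin/sh
# Sketch: gather the object files that survived a partial build and print
# (not run) the llvm-ld command that would merge them.
# Assumption: paths contain no whitespace.

list_objects() {
    # $1: build tree; $2: grep -E pattern of paths to leave out (may be empty)
    find "$1" -type f -name '*.o' -size +0 | sort |
        if [ -n "$2" ]; then grep -E -v "$2"; else cat; fi
}

link_command() {
    # $1: build tree; $2: output file; $3: optional exclude pattern
    objects=$(list_objects "$1" "$3" | paste -s -d ' ' -)
    echo "llvm-ld --link-as-library --disable-opt -o $2 $objects"
}
```

For instance, link_command ./build all.bc 'test/|main2' would print a command covering every non-empty .o under ./build except those whose path matches the pattern. Inspect the printed command before running it by hand.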

The following is sample output from an attempt to compile httpd with
the setup described above.

/bin/sh /httpd-2.0.55/srclib/apr/libtool --silent --mode=compile
llvm-gcc -g -O2 -DHAVE_CONFIG_H -DLINUX=2 -D_REENTRANT -D_GNU_SOURCE
  -I../include -I../include/arch/unix -c apr_snprintf.c && touch
apr_snprintf.lo
In file included from apr_snprintf.c:22:
../include/apr_network_io.h:122: error: redefinition of `struct in_addr'
make[4]: *** [apr_snprintf.lo] Error 1

However, this source compiles fine with gcc32 and gcc4.

I noticed you are setting CXX but not CC. C, especially C99, is not a
strict subset of C++98, so if you are compiling C code, you should use a
C compiler, and if you are compiling C++, you should be using a C++
compiler. For this reason, below, I am setting both CC and CXX and
letting the program compile itself as it wishes (apologies in advance
if llvm-g++/g++ automatically invoke llvm-gcc/gcc on a .c file and this
point is moot).

Some time ago, I was working on this as well.
My approach and results are here: http://llvm.org/status/
Note that it hasn't been updated since 2004 (one entry in 2005).

You'll note at the time I was able to run xboard, gnuchess, crafty (the
first three played well together; pun intended), mutt, screen, wget,
gnuplot and apache httpd (see below).

In the best case, one of these would work:

% env CC=llvm-gcc CXX=llvm-g++ ./configure [configure options]
or
% ./configure CC=llvm-gcc CXX=llvm-g++ [configure options]

but in other cases I had to patch Makefiles, and sometimes even source
code itself (my screen patch was for an error in screen that GCC
ignored; the patch was accepted into mainline CVS of screen).

In some cases, we needed to specify extra parameters to lli to run the
code via the JIT because gccld wasn't good at generating the run script.
gccld has vastly improved in the time since my tests in 2003 and 2004,
but I don't know how well it works now.

Please note that the status page is hand-written and hand-updated. It
would be great to have an automatically-generated status page where
packages would be compiled and tested (they should have their own
unit/regression tests) just like the nightly tester. Naturally,
compiling and testing something like KDE and Mozilla might take longer
than just one evening. ;)

Brian Gaeke had a framework for external tests (it's linked to from my
status page but it's no longer available). It's written in Perl and
was used to test a bunch of the packages (listed on the status page).
If anyone is interested in using it as the basis for an automated
testing framework for external packages, I can post it for people to
play around with.

We also had some problems compiling apache-2 and did not get to the
bottom of it. We stuck with using apache-1 and I believe the website
http://safecode.cs.uiuc.edu/ was at some point (or still is?) hosted by
Apache-1.3 running in LLVM via the JIT.

If you can narrow this test case down to a manageable size and see if
it's an error in llvm-gcc or if the code is not standards-conformant and
should be fixed, that would be great. If it's a bug in LLVM, please
file it at http://llvm.org/bugs/ .
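
A possible first step in narrowing this down, sketched under assumptions: preprocess the failing file with the same flags plus -E, then list which headers define the struct. The function name find_struct_defs and the file name apr_snprintf.i are hypothetical. The awk pattern only catches definitions whose opening brace is on the same line as the struct keyword.

```shell
#!/bin/sh
# Sketch: given preprocessed output (from the compiler's -E option), print
# the file named by the most recent '# <line> "<file>"' marker before each
# definition of the requested struct.
#
# Producing the input (assumed invocation, reuse the original -D/-I flags):
#   llvm-gcc -E <same -D and -I flags> apr_snprintf.c > apr_snprintf.i

find_struct_defs() {
    # $1: struct name; $2: preprocessed file
    awk -v name="$1" '
        # Remember the file named by each linemarker.
        /^# [0-9]+ "/ { split($0, parts, "\""); current = parts[2]; next }
        # A definition: "struct <name> {" with the brace on the same line.
        $0 ~ ("struct[ \t]+" name "[ \t]*[{]") { print current }
    ' "$2"
}
```

Running find_struct_defs in_addr apr_snprintf.i would then list one header per definition found. Two different headers in the output would point at the pair of conflicting definitions to compare between llvm-gcc and gcc.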

Thanks for your valuable input. By the way, I am exporting CC=llvm-gcc
and RANLIB=llvm-ranlib for projects that use them. It was a mistake not
to mention this in the post, as apache indeed uses gcc. As the error
message shows, llvm-gcc is being used. I will investigate further
along the lines you have mentioned. However, the question that stands
is how best to handle failure to compile a few files, as I am only
interested in generating the bytecode and merging it into one file to
run the analysis, not in executing the bytecode.

As a side note, I usually use "make -k" to continue after an error.
The differences are as follows (from the man page):

-i Ignore all errors in commands executed to remake files.

-k Continue as much as possible after an error. While the target
     that failed, and those that depend on it, cannot be remade, the
     other dependencies of these targets can be processed all the same.

I am not sure if -i or -k will get you further in compilation if the
compile fails.
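
The difference can be seen on a toy Makefile. This is a minimal sketch with made-up target names (broken.o, app.out, other.o), assuming make and mktemp are available. Here broken.o stands in for a file that fails to compile and app.out for something that depends on it.

```shell
#!/bin/sh
# Toy comparison of `make -i` and `make -k`. All file names are made up.
demo_dir=$(mktemp -d)

# "target: deps ; command" keeps each rule on one line, so no tabs are needed.
cat > "$demo_dir/Makefile" <<'EOF'
all: app.out other.o
broken.o: ; false
app.out: broken.o ; touch app.out
other.o: ; touch other.o
EOF

built_with() {
    # $1: the make flag to try; start from a clean tree each time
    rm -f "$demo_dir/app.out" "$demo_dir/other.o"
    make -C "$demo_dir" "$1" all >/dev/null 2>&1
    # List whatever got built, on one line
    (cd "$demo_dir" && ls | grep -v '^Makefile$' | xargs)
}

result_k=$(built_with -k)
result_i=$(built_with -i)
echo "-k built: $result_k"
echo "-i built: $result_i"
```

With -k, only other.o is built, because app.out depends on the failed target. With -i, the failure is ignored and app.out is built as well, even though broken.o was never produced. For a partial build this means -i may run later steps on missing inputs, while -k skips them.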

Very well, sorry about the digression.

If LLVM cannot compile those files, you really have two choices:

1) Figure out what the error is and fix it (*), or
2) Ignore the failure and stick with what you can compile

Naturally, if a file contains one of the major methods in the program,
(2) may not work so you have to abandon the package and either try
another similar package or another version of the same package.

Also, note that while LLVM is wonderful and good, it may not be perfect
(yet). That means that if you cannot run the program, you cannot be
sure that it was compiled correctly (if it runs and passes unit tests,
at least you've got some code coverage). Incorrect compilation will
affect your analysis, as you're not analyzing the original program.

(*) Note that if it's an LLVM bug, file it and it may get automagically
fixed before you know it (though you're welcome to submit patches as
well). If it's a bug in the application source code, you can fix it
yourself (and send a patch upstream).