Adding ClamAV to the llvm testsuite (long)

Hi,

I see that you are looking for new programs for the testsuite, as
described in 'Compile programs with the LLVM compiler' and 'Adding
programs to the llvm testsuite' on llvm.org/OpenProjects.

My favourite "C source code" is ClamAV (www.clamav.net), and I would
like to get it included in the testsuite.

This mail is kind of long, but please bear with me, as I want to clarify
how best to integrate ClamAV into the LLVM testsuite's build system.

Why include it?

It can be useful for finding regressions or new bugs; it has already
uncovered a few bugs in llvm's cbe, llc, and optimizers that I've
reported through bugzilla (and they've mostly been fixed very fast!
Thanks!). ClamAV was also the "victim" of a bug in gcc 4.1.0's
optimizer [see 9)].

It can be useful for testing new and existing optimizations. There
aren't any significant differences in its performance when compiled by
different compilers (gcc, icc, llvm-gcc), so I hope LLVM's optimizers
can (in the future) make it faster ;)

I had a quick look at the build infrastructure, and there are some
issues with getting it to work for programs that use autoconf (such as
ClamAV), since AFAICT testsuite programs aren't allowed to run configure
(listed below).

Building issues aside, there are some more questions:
* ClamAV is GPL (but it includes BSD, LGPL parts), ok for testsuite?
* what version to use? Latest stable, or latest svn?
[In any case I'll wait till the next stable is published, it should be
happening *very soon*]
* what happens if you find bugs that also cause it to fail under gcc
(unlikely)? [I would prefer an entry on clamav's bugzilla then, with
something in its subject saying it is llvm-testsuite related]
* what happens if it only fails under llvm-gcc/llc/clang, etc., and it
is not due to a bug in llvm but to portability issues in the source
code (unlikely)?
I would prefer a clamav bugzilla entry here too; clamav is meant to be
"portable" :)

Also, after I have set it up in the llvm testsuite, is there an easy way
to run clang on it? Currently I have to hack autoconf-generated
makefiles if I want to test clang on it.

1. I've manually run configure and generated a clamav-config.h.
This usually just contains HAVE_* macros for headers, which should all
be available on a POSIX system, so it shouldn't be a problem from this
perspective for llvm's build farm.
However there are some target specific macros:
#define C_LINUX 1
#define FPU_WORDS_BIGENDIAN 0
#define WORDS_BIGENDIAN 0
There are also SIZEOF_INT, SIZEOF_LONG, ..., but those are only used if
the system doesn't have a proper <stdint.h>.
I'm also not sure about this one:
/* ctime_r takes 2 arguments */
#define HAVE_CTIME_R_2 1

What OS and CPU do the machines on llvm's buildfarm have? We could try a
config.h that works on Linux (or MacOSX) and apply it everywhere, though
there might be (non-obvious) failures.

Any solutions for getting these macros defined in the LLVM testsuite
build? (especially the bigendian macro)

2. AFAICT the llvm-testsuite build doesn't support a program that is
built from multiple subdirectories.
libclamav has its source split into multiple subdirectories; gathering
those into one also requires changing #include directives that use
relative paths.
I also get files with the same name in different subdirs, so I have to
rename them to subdir_filename and fix the #includes accordingly.

I have done this manually, and it works (native, llc, cbe work).
I could hack together some perl script to do this automatically, or is
there a better solution?
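A minimal sketch of what such a script would do, in shell (the file and
directory names below are made up for the demo; the real tree is much
bigger):

```shell
#!/bin/sh
# Hypothetical sketch of the flattening step: copy subdir/file to
# subdir_file and rewrite relative #include paths to the flat names.
# The tiny demo tree below stands in for libclamav's real layout.
set -e
mkdir -p libclamav/regex libclamav/nsis flat
echo 'int r;' > libclamav/regex/regex.h
printf '#include "../regex/regex.h"\nint n;\n' > libclamav/nsis/infblock.c

for f in libclamav/*/*; do
  sub=$(basename "$(dirname "$f")")
  # turn  #include "../<dir>/<file>"  into  #include "<dir>_<file>"
  sed 's|#include "\.\./\([A-Za-z0-9_]*\)/\([A-Za-z0-9_.]*\)"|#include "\1_\2"|' \
    "$f" > "flat/${sub}_$(basename "$f")"
done

cat flat/nsis_infblock.c
```

The same idea also resolves name collisions across subdirs, since every
file gets its subdir as a prefix.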

3. Comparing output: I've written a small script that compares the
--debug output. It needs some adjustments, since the --debug output
contains memory addresses that obviously don't match up between runs.
There isn't anything else to compare besides the --debug output (other
than ClamAV saying no virus was found), and that can be a fairly good test.
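The idea behind such a filter can be sketched like this (the log lines
here are fabricated, and the real filterdiff.sh may differ):

```shell
#!/bin/sh
# Sketch: mask hex addresses so --debug logs from two different builds
# can be diffed. The two log lines below are fabricated examples.
set -e
printf 'LibClamAV debug: loaded db at 0x7f3a2c10\n' > native.log
printf 'LibClamAV debug: loaded db at 0x55e01b40\n' > llc.log

normalize() { sed 's/0x[0-9a-fA-F][0-9a-fA-F]*/0xADDR/g' "$1"; }

normalize native.log > a.txt
normalize llc.log   > b.txt
diff a.txt b.txt && echo "debug output matches"
```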

4. What is the input data?
ClamAV is fast :)
It needs a lot of input data if you want to get reasonable timings out
of it (tens to hundreds of MB).
Scanning multiple small files will be I/O bound, and it'd be mostly
useless as a benchmark (though still useful for testing
compiler/optimization correctness).

So I was thinking of using some large files already available in the
testsuite (oggenc has one), and then maybe point it to scan the last
*stable* build of LLVM. Or find some files that are scanned slowly but
don't require lots of disk I/O (an archive, with ratio/size limits
disabled, containing highly compressible data).
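Such an input is easy to generate; a sketch (assuming gzip is available
and the scanner's ratio limits are off):

```shell
#!/bin/sh
# Sketch: megabytes of zeros compress to a few KB, giving a file that
# is small on disk but expands to a lot of data for the scanner.
set -e
dd if=/dev/zero bs=1024 count=4096 2>/dev/null | gzip -9 > zeros.gz
wc -c < zeros.gz
```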
You won't be able to benchmark clamav in a "real world" scenario,
though, since that would involve having it scan malware, and I'm sure
you don't want that on your build farm.

You could have it scan random data, but you'd need that to be
reproducible, so scanning /dev/random or /bin of the current LLVM tree
is not a good choice ;)

There's also the problem of eliminating the initial disk I/O time from
the benchmark; maybe rerun it three times automatically, or something
like that?
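The rerun idea could look like this (with a trivial stand-in for the
actual clamscan invocation):

```shell
#!/bin/sh
# Sketch: run the same workload three times; the first run pays the
# cold-cache I/O cost, so a harness could keep the best of the rest.
for i in 1 2 3; do
  start=$(date +%s)
  cat /etc/passwd > /dev/null   # stand-in for the real scan
  end=$(date +%s)
  echo "run $i: $((end - start))s"
done | tee runs.log
```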

5. Library dependencies
It needs zlib; everything else is optional (bzip2, gmp, ...). I think I
can reasonably assume zlib is available on all systems where the
testsuite is run.

6. Sample output using 126 MB of data as input:

$ make TEST=nightly report
....
Program  | GCCAS  Bytecode LLC compile LLC-BETA compile JIT codegen | GCC   CBE   LLC   LLC-BETA JIT | GCC/CBE GCC/LLC GCC/LLC-BETA LLC/LLC-BETA
clamscan | 7.0729 2074308  *           *                *           | 17.48 17.55 18.81 *        *   | 1.00    0.93    n/a          n/a

7. ClamAV is multithreaded
If you're interested in testing whether llvm-generated code works when
multithreaded (I don't see why it wouldn't, but we're talking about a
testsuite), you'd need to start the daemon (running it as an
unprivileged user is just fine) and then connect to it.
Is it possible to tell the testsuite build system to do this?

8. Code coverage
Testing all of clamav's code with llvm is ... problematic. Unless you
create files with every packer/archiver known to clamav, it is likely
there will be files that are compiled in but never exercised during the
testsuite run. You can still test that these files compile, but that's it.

9. Configure tests
Configure has 3 tests that check for gcc bugs known to break ClamAV (2
of which you already have, since they are in gcc's testsuite too).
Should these be added as separate "programs" to run in the llvm testsuite?

Thoughts?

Best regards,
Edwin

We always welcome more tests. But it looks like there are two issues here.

1. The autoconf requirement. Is it possible to get one configuration working without the need for autoconf?
2. GPL license. Chris?

Evan

Evan Cheng wrote:

We always welcome more tests. But it looks like there are two issues
here.

1. The autoconf requirement. Is it possible to get one configuration
working without the need for autoconf?
  

I could make a clamav-config.h that should work if compiled with llvm-gcc.
Can I assume <endian.h> exists on all your platforms?
[or how else can I detect endianness using only macros from headers?]

I've seen a Makefile that tests $(ENDIAN) in a conditional; can I use
that to pass -DWORDS_BIGENDIAN=... to the compiler?

Or I can create a config.h that assumes the platform is bigendian
(assuming little-endian would SIGBUS on Sparc).
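For what it's worth, a configure-free probe is also possible from plain
shell, assuming od interprets multi-byte units in host byte order (which
POSIX specifies): write the bytes 01 00 and read them back as one native
2-byte integer.

```shell
#!/bin/sh
# Sketch: bytes 01 00 read as a native 2-byte int give 1 on a
# little-endian host and 256 on a big-endian one.
v=$(printf '\001\000' | od -An -td2 | tr -d ' ')
if [ "$v" = "1" ]; then
  def='#define WORDS_BIGENDIAN 0'
else
  def='#define WORDS_BIGENDIAN 1'
fi
echo "$def" | tee endian.h
```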

Thoughts?

Thanks,
Edwin

We always welcome more tests. But it looks like there are two issues
here.

1. The autoconf requirement. Is it possible to get one configuration
working without the need for autoconf?

One way to do this is to add a "cut down" version of the app to the test suite.

2. GPL license. Chris?

Any open source license that allows unrestricted redistribution is fine in llvm-test

-Chris

_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


Assuming endianness or the presence of endian.h won't work; however, we already have an ENDIAN make flag you can use. For example, see how External/SPEC/CINT2000/186.crafty/Makefile handles it.

Thanks, having a better testsuite is very very useful!

-Chris

Chris Lattner wrote:

One way to do this is to add a "cut down" version of the app to the
test suite.

I disabled optional features in clamav-config.h

2. GPL license. Chris?

Any open source license that allows unrestricted redistribution is
fine in llvm-test

Ok, I have created a script that automatically checks out the ClamAV
0.92-stable source code from svn and moves files around to match the
LLVM testsuite's requirements.
It only checks out GPL files; it omits libclamunrar (which is not GPL).

Attaching all files. Let me know if you want to upload the entire
package somewhere.
For the moment I provide a script that automatically checks out sources
from ClamAV svn repository (the tagged stable version).

Please provide feedback on the scripts, and Makefile.

What are the next steps required to add it to the testsuite?

It now works like this:

edwin@lightspeed2:~/llvm-svn/llvm/projects/llvm-test/MultiSource/Applications/ClamAV$ make ENABLE_OPTIMIZED=1 TEST=nightly report
....
Program  | GCCAS  Bytecode LLC compile LLC-BETA compile JIT codegen | GCC   CBE   LLC   LLC-BETA JIT | GCC/CBE GCC/LLC GCC/LLC-BETA LLC/LLC-BETA
clamscan | 6.2900 1728848  7.1709      *                *           | 12.71 12.75 13.49 *        *   | 1.00    0.94    n/a          n/a

Chris Lattner wrote:

Thanks, having a better testsuite is very very useful!

You're welcome. If you encounter any problems with ClamAV (that aren't
due to my makefiles), you are welcome to open a bug on http://bugs.clamav.net
For LLVM-testsuite build problems, contact me directly.

Thanks a lot,
Edwin

clamav-config.h (9.18 KB)

filterdiff.sh (472 Bytes)

header_rename.sh (162 Bytes)

Makefile (3.01 KB)

prepare.sh (1.67 KB)

README.LLVM-tests (1.69 KB)

rename.sh (137 Bytes)

(Attachment target.h is missing)

ls_R (7.08 KB)

Török Edwin wrote:

Attaching all files. Let me know if you want to upload the entire
package somewhere.
[...]
Please provide feedback on the scripts, and Makefile.
[...]
What are the next steps required to add it to the testsuite?

Is there any chance of this going in before the code freeze, or would
you prefer to add it after?

Best regards,
--Edwin

I'll try your patch today and commit it if it works well.

Thanks,

Evan

Hi Edwin,

I ran into two problems.

1. Using your config file and Makefile, I ran into an issue compiling with gcc:

gcc -I/Users/echeng/LLVM/llvm/projects/llvm-test/MultiSource/Applications/ClamAV -I/Users/echeng/LLVM/llvm/projects/llvm-test/MultiSource/Applications/ClamAV -I/Users/echeng/LLVM/llvm/include -I/Users/echeng/LLVM/llvm/projects/llvm-test/include -I../../..//include -I/Users/echeng/LLVM/llvm/include -D_GNU_SOURCE -D__STDC_LIMIT_MACROS -O3 -O2 -mdynamic-no-pic -fomit-frame-pointer -c clamscan_clamscan.c -o Output/clamscan_clamscan.o
clamscan_clamscan.c:39:26: error: clamscan_opt.h: No such file or directory
clamscan_clamscan.c:40:29: error: clamscan/others.h: No such file or directory

llvm-gcc compiles this just fine:

/Users/echeng/LLVM/install/bin/llvm-gcc -I/Users/echeng/LLVM/llvm/projects/llvm-test/MultiSource/Applications/ClamAV -I/Users/echeng/LLVM/llvm/projects/llvm-test/MultiSource/Applications/ClamAV -I/Users/echeng/LLVM/llvm/include -I/Users/echeng/LLVM/llvm/projects/llvm-test/include -I../../..//include -I/Users/echeng/LLVM/llvm/include -D_GNU_SOURCE -D__STDC_LIMIT_MACROS -DHAVE_CONFIG_H -Iinclude/libclamav/regex/ -Iinclude/libclamav/nsis -Iinclude/libclamav/lzma -Iinclude/libclamav -Iinclude/shared -Iinclude/clamscan/ -Iinclude/ -DC_DARWIN -DFPU_WORDS_BIGENDIAN=0 -DWORDS_BIGENDIAN=0 -O2 -mdynamic-no-pic -fomit-frame-pointer -O0 -c clamscan_clamscan.c -o Output/clamscan_clamscan.bc -emit-llvm

Looks like it's a makefile issue?

2. prepare.sh getdb doesn't work for me because I don't have wget.

Is it possible for you to get the complete source in a compilable state and send me a tar file instead?

A request: If it's possible, can you flatten the entire tree? That is, everything in the top level directory?

Thanks,

Evan

Evan Cheng wrote:

Hi Edwin,

I ran into two problems.

1. Using your config file and Makefile, I ran into an issue compiling
with gcc:

Looks like it's a makefile issue?
  
That should be solved with the new flattened source layout and Makefile
(-I. was missing).

2. prepare.sh getdb doesn't work for me because I don't have wget.

Is it possible for you to get the complete source in a compilable
state and send me a tar file instead?
  
Yes, I have uploaded the .tar.gz here:
http://edwintorok.googlepages.com/ClamAV-srcflat.tar.gz
[the inputs dir contains some symlinks; place the ClamAV dir in
llvm/projects/llvm-test/MultiSource/Applications to make the
links point to the right place]

You also need to download main.cvd and place it in the dbdir folder:
http://database.clamav.net/main.cvd
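To address the earlier wget problem, prepare.sh could fall back to curl;
a sketch (it only prints the command it would run, so it stays runnable
offline; the dbdir path is assumed):

```shell
#!/bin/sh
# Sketch: prefer wget, fall back to curl, complain if neither exists.
url=http://database.clamav.net/main.cvd
if command -v wget >/dev/null 2>&1; then
  cmd="wget -O dbdir/main.cvd $url"
elif command -v curl >/dev/null 2>&1; then
  cmd="curl -fLo dbdir/main.cvd $url"
else
  cmd="error: neither wget nor curl found"
fi
echo "$cmd" | tee fetch.cmd
```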

Let me know if you encounter any problems.

A request: If it's possible, can you flatten the entire tree? That
is, everything in the top level directory?
  

I have flattened the tree for the sources, but left the data in a
different dir (dbdir, and inputs).
Is this ok?

Thanks,
Edwin

Török Edwin wrote:

Yes, I have uploaded the .tar.gz here:
http://edwintorok.googlepages.com/ClamAV-srcflat.tar.gz
[the inputs dir contains some symlinks; place the ClamAV dir in
llvm/projects/llvm-test/MultiSource/Applications to make the
links point to the right place]

Hi,

Because llvm bug #1730 got fixed, this testcase can now run with the
JIT on x86-64 :).
I have attached the updated filterdiff.sh script (the JIT has an extra
file descriptor open), and now all tests pass.

I've also uploaded the new tarball here:
http://edwintorok.googlepages.com/ClamAV-srcflat2.tar.gz

TEST-PASS: compile /MultiSource/Applications/ClamAV/clamscan
TEST-RESULT-compile: Total Execution Time: 6.0010 seconds (6.7158 wall clock)

TEST-RESULT-compile: 1728712 Output/clamscan.llvm.bc

TEST-RESULT-nat-time: program 12.880000

TEST-PASS: llc /MultiSource/Applications/ClamAV/clamscan
TEST-RESULT-llc: Total Execution Time: 7.3758 seconds (7.7852 wall clock)
TEST-RESULT-llc-time: program 13.940000

TEST-PASS: jit /MultiSource/Applications/ClamAV/clamscan
TEST-RESULT-jit-time: program 18.320000

TEST-RESULT-jit-comptime:

TEST-PASS: cbe /MultiSource/Applications/ClamAV/clamscan
TEST-RESULT-cbe-time: program 12.750000

Best regards,
--Edwin

filterdiff.sh (598 Bytes)

Hi,

We are getting closer.

1. In Makefile, all the references to CFLAGS should be CPPFLAGS instead.
2. filterdiff.sh uses sed -re. This causes a problem on Mac OS X, where extended regular expressions are enabled with -E, not -r.

sed: illegal option -- r
usage: sed script [-Ealn] [-i extension] [file ...]
        sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

Can this be changed?

3. This triggers an optimizer bug:
/Users/echeng/LLVM/llvm/Release/bin/opt -std-compile-opts -time-passes -info-output-file=/Volumes/Muggles/LLVM/llvm/projects/llvm-test/MultiSource/Applications/ClamAV/Output/clamscan.linked.bc.info Output/clamscan.linked.rbc -o Output/clamscan.linked.bc -f
Assertion failed: (getLoopLatch() && "Loop latch is missing"), function verifyLoop, file /Volumes/Muggles/LLVM/llvm/include/llvm/Analysis/LoopInfo.h, line 517.

I'll file a bug on this.

Evan

Hi,

We are getting closer.

1. In Makefile, all the references to CFLAGS should be CPPFLAGS instead.
2. filterdiff.sh uses sed -re. This causes a problem on Mac OS X, where
extended regular expressions are enabled with -E, not -r.

sed: illegal option -- r
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

Can this be changed?

Looks like the filter might not work correctly on all platforms (at least that appears to be the case on Mac OS X). Can we simply disable those debug outputs?

Thanks,

Evan

Evan Cheng wrote:

Hi,
  
Hi Evan,

We are getting closer.
  
That is good news.

1. In Makefile, all the references to CFLAGS should be CPPFLAGS instead.
  
Done, and attached.

2. filterdiff.sh uses sed -re. This causes a problem on Mac OS X where
-E means using extended regular expression, not -r.
Can this be changed?
  
I'm not really familiar with sed's non-extended regular expression
syntax; I'll have to read its info page.
I can do that if you decide to keep filterdiff.sh, see below.
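The translation is mostly mechanical, though: ERE `x+` becomes BRE
`xx*`, and groups use `\(` `\)`. A sketch on a fabricated log line (the
real filterdiff.sh rules may differ):

```shell
#!/bin/sh
# Sketch: address/fd masking written with POSIX basic regular
# expressions, which every sed accepts without -r or -E.
echo 'opened fd 12 at 0x7fa3b2c0' |
  sed -e 's/0x[0-9a-fA-F][0-9a-fA-F]*/0xADDR/' \
      -e 's/fd [0-9][0-9]*/fd N/' | tee filtered.txt
```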

Evan Cheng wrote:

Looks like the filter might not work correctly for all platforms (at
least that appears to be the case on Mac OS X). Can we simply disable
those debug outputs?

Thanks,
  

If it is causing problems, yes. Or maybe just enable it for certain OSes
(where sed supports -re)?
Which would you prefer?

3. This triggers an optimizer bug:
[...]

I'll file a bug on this.
  

Ok, that bug didn't trigger for me; is it x86/ppc specific? (I use x86-64.)

--Edwin

Makefile (2.93 KB)

Török Edwin wrote:

Done, and attached.
  

Attached the wrong file, sorry. Here is the correct one.

--Edwin

Makefile (2.96 KB)

2. filterdiff.sh uses sed -re. This causes a problem on Mac OS X where
-E means using extended regular expression, not -r.
Can this be changed?

I'm not really familiar with sed's non-extended regular expression
syntax, I'll have to read its info page.
I can do that if you decide to keep filterdiff.sh, see below

Why not just hack the program to not print out those file names?

-Chris

Chris Lattner wrote:

Why not just hack the program to not print out those file names?
  

Thanks for the tip. I just hacked cli_dbgmsg to do a simple puts of its
format argument.
That should still keep the output useful for the testsuite (ensuring
that llvm-compiled code takes the same codepath as gcc-compiled code),
and it also avoids the problems with filterdiff.sh.

Updated tarball is here, let me know what you think:

http://edwintorok.googlepages.com/ClamAV-srcflat3.tar.gz
md5sum:
22bb9be67418ea75da078080e0d396fc ClamAV-srcflat3.tar.gz

Best regards,
Edwin

I've filed
http://www.llvm.org/bugs/show_bug.cgi?id=1912
for the optimizer bug.

Evan

Devang has fixed the bug. I've added ClamAV to the testsuite.

Thanks!

Evan

Evan Cheng wrote:

Devang has fixed the bug. I've added ClamAV to the testsuite.

Thanks!
  
Thanks a lot!

I noticed that the runtime is ~0.27 seconds; if you want it to be higher
(to better highlight differences between native, llc, jit, and cbe),
you can add some more files to inputs/.

For example, symlinks to large binary files that already exist in the
llvm-testsuite, such as large.pcm, mei16v2.m2v, and tune.
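A sketch of the symlinking (with a stand-in file; the real large.pcm
lives elsewhere in the testsuite tree):

```shell
#!/bin/sh
# Sketch: link big files that already exist in the testsuite into
# inputs/ so the scan takes long enough to time reliably.
set -e
mkdir -p elsewhere inputs
dd if=/dev/zero of=elsewhere/large.pcm bs=1024 count=64 2>/dev/null
ln -sf "$(pwd)/elsewhere/large.pcm" inputs/large.pcm
ls -l inputs/large.pcm
```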

Best regards,
--Edwin