Introducing clang-triage: A bot to test clang with fuzzed inputs

Hi,

I've set up a bot that tests Clang by running SVN HEAD of LLVM and Clang
on a large corpus of test cases (from afl-fuzz) that exercises a lot of
different code paths. Whenever a test case causes clang to crash, the
bot records the output and reduces the test case using CReduce or, if
CReduce itself crashes, a dumber reducer. Crashes (assertion failures
and signals) are the only kind of failure detected.
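
In outline, the loop is roughly the following (a simplified sketch
rather than the bot's actual code; the clang flags and the
creduce_reduce/dumb_reduce/record helpers are placeholders):

    import subprocess

    # Assumed invocation; the bot runs SVN-HEAD clang with these flags.
    CLANG = ['clang', '-std=c++11', '-O0', '-c', '-o', '/dev/null']

    def crashes(path):
        p = subprocess.run(CLANG + [path], stderr=subprocess.PIPE,
                           universal_newlines=True)
        # A negative return code means the child died from a signal;
        # assertion failures abort with a message on stderr.
        return p.returncode < 0 or 'Assertion' in p.stderr

    def triage(corpus):
        for case in corpus:
            if crashes(case):
                try:
                    reduced = creduce_reduce(case)        # hypothetical CReduce wrapper
                except Exception:
                    # fall back if CReduce itself crashes on this input
                    reduced = dumb_reduce(case, crashes)  # hypothetical dumb reducer
                record(case, reduced)                     # hypothetical report/DB step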

You can see here the kind of output it produces (running on my desktop
computer for now, so this URL will probably go away at some point):

    http://sli.dy.fi/~sliedes/clang-triage/triage_report.xhtml

Currently the bot only runs the test cases with clang -std=c++11 -O0;
trying different language options would probably require more
afl-fuzzing with that option set to really be effective.

If someone wants to try it with different language parameters (or even
for different frontends), or to set it up on some better
infrastructure, the code and the instructions are here:

    https://github.com/sliedes/clang-triage

The number of test cases that cause a given kind of failure also
roughly corresponds to how likely afl-fuzz is to hit that failure
(though the corpus is minimized in the sense that every test case in
the bot's database should exercise some code path that other test
cases do not). By default afl-fuzz stops recording new crashes once it
has found 5000 crashing inputs. Once some of the most common ones have
been fixed, it would make sense to rerun the fuzzer and add new test
cases to the bot.

  Sami

Sami, this is very cool. Here are a few suggestions/comments/requests:

Are you adding these into the LLVM bugzilla?

A lot of your crashers are syntactically invalid, but it often is the case that crashes on valid code are more interesting than crashes on invalid code. Can you separate out the ones for which you have a valid test case? These might just be the ones that GCC can compile.

Your mail mentions C-Reduce crashing. Please report bugs to us when this happens.

One thing we've seen happen a lot is that running C-Reduce on a crasher will find new crashes. Basically, a reducer is a fuzzer too. But you have to rig your C-Reduce interestingness test to look for these.
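
For reference, a C-Reduce interestingness test is just an executable
that exits 0 while the current candidate is still interesting, so this
check can live in the test itself. A rough sketch (the file name, clang
flags and signature string are illustrative assumptions):

    #!/usr/bin/env python3
    # Interestingness test sketch: exit 0 only while crash.cpp still triggers
    # the original crash; stash inputs that trigger a different crash.
    import os, shutil, subprocess, sys

    ORIGINAL_SIGNATURE = "Assertion `...' failed"   # placeholder for the real message

    p = subprocess.run(['clang', '-std=c++11', '-O0', '-c', 'crash.cpp',
                        '-o', os.devnull],
                       stderr=subprocess.PIPE, universal_newlines=True)
    crashed = p.returncode < 0 or 'Assertion' in p.stderr

    if crashed and ORIGINAL_SIGNATURE not in p.stderr:
        # A new, different crasher found during reduction; keep a copy
        # (assumes ~/new-crashers/ exists).
        shutil.copy('crash.cpp',
                    os.path.expanduser('~/new-crashers/%d.cpp' % os.getpid()))

    sys.exit(0 if crashed and ORIGINAL_SIGNATURE in p.stderr else 1)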

If you have time, it would be helpful to add Csmith into the mix here. I would be very interested to learn how Csmith and AFL work together.

You should definitely turn on LLVM optimizations when doing this testing, unless there's some specific reason why that isn't a good idea.

John

> Sami, this is very cool. Here are a few suggestions/comments/requests:

Thanks!

> Are you adding these into the LLVM bugzilla?

I added many in December:
http://llvm.org/bugs/buglist.cgi?quicksearch=fuzz&list_id=65362

But no, I have not added all.

> A lot of your crashers are syntactically invalid, but it often is the case
> that crashes on valid code are more interesting than crashes on invalid
> code. Can you separate out the ones for which you have a valid test case?
> These might just be the ones that GCC can compile.

I suspect that not very many of the cases produced are going to be
valid code; at least at this stage, parser bugs are clearly prevalent.
Maybe it will be possible to reach deeper bugs once the parser is more
robust, I don't know. Recent releases of afl-fuzz even include support
for extra dictionaries, i.e. making the fuzzer insert whole tokens like
"template" or "typename", which presumably might increase its chances
of producing something that looks more like legitimate code.
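
For reference, such a dictionary is just a plain file of name="token"
lines passed to afl-fuzz with -x; a tiny generator (the keyword list
and file name are only illustrative) could be:

    # Write a minimal afl-fuzz dictionary, used via: afl-fuzz -x cxx.dict ...
    # The keyword list here is only an illustrative sample.
    KEYWORDS = ['template', 'typename', 'namespace', 'operator',
                'decltype', 'constexpr']

    with open('cxx.dict', 'w') as f:
        for kw in KEYWORDS:
            f.write('kw_%s="%s"\n' % (kw, kw))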

Basically the fuzzer treats the source code just like any other binary
input, and the results are generally a mess of text and binary garbage.
I'm actually surprised that it manages to find any crashes that, after
reducing, do not contain binary. I think the more interesting crashing
cases are mostly produced by randomly combining (crossover) different
clang test cases, plus cases discovered by fuzzing and combining, that
trigger new execution paths. I think if GCC accepted those it would be
mostly by chance, unless it's a lot more tolerant of binary garbage
than Clang.

After reducing, the cases often look somewhat more sensible, but
truncated, since with incremental parsing the smallest crashing input
often ends in the middle of some block. GCC clearly would not compile
those either.

At the very least I believe it's safe to say the fuzzer is heavily
biased towards finding parser bugs as long as they are relatively
common. Whether it can go deeper than that remains to be seen.

> Your mail mentions C-Reduce crashing. Please report bugs to us when this
> happens.

I dumped the cases where CReduce failed, together with the
dumb-reduced cases causing the same crash in Clang (some since
fixed, I think) and the assertion failures they caused. I used CReduce
with llvm 3.5. With CReduce using clang, I don't find it that
surprising that it fails on some of the inputs that also cause Clang
to crash.

    http://sli.dy.fi/~sliedes/creduce-failures.tar.gz

> If you have time, it would be helpful to add Csmith into the mix here. I
> would be very interested to learn how Csmith and AFL work together.

A bit hard to see how they would work together. IIUC Csmith tries hard
to produce code that is also a semantically valid C program; afl-fuzz
tries to mangle the input to maximize code coverage in the compiler
and knows nothing about code correctness. Anyway, I believe there are
still likely to be a lot of bugs to be found starting from the clang
test cases; together they contain large amounts of quite different,
even exotic constructs. I've only started to run afl-fuzz; it doesn't
get far through its work queue (I think less than 1%, and the queue
obviously grows) before it hits its limit of 5000 crashing cases found
- though it still takes a few days to get that far; running clang that
many times is resource intensive. Obviously that's the low-hanging,
rather easy-to-find fruit.

> You should definitely turn on LLVM optimizations when doing this testing,
> unless there's some specific reason why that isn't a good idea.

I think it will be a good idea once it no longer hits that many parser
bugs. The drawback is that with -O0 I can do something like 300
executions/second on a 4-core machine, and with that speed I guess it
would already take at least months to go through afl-fuzz's queue
starting from the clang test suite.

  Sami

>> You should definitely turn on LLVM optimizations when doing this testing,
>> unless there's some specific reason why that isn't a good idea.

> I think it will be a good idea once it no longer hits that many parser
> bugs. The drawback is that with -O0 I can do something like 300
> executions/second on a 4-core machine, and with that speed I guess it
> would already take at least months to go through afl-fuzz's queue
> starting from the clang test suite.

I'm wondering how much we can improve on that 300 executions/second. My
guess is that a lot of time is constant-overhead startup code. A back of
the envelope calculation:

300 executions/second * 300 bytes/source file (small files) ~ 100 000
bytes/second.
4 cores * 3 giga instructions/second ~ 10 000 000 000 instructions/second.

So that's about 100 000 instructions per byte, which still seems excessive.

The first thing that comes to mind to hack around that would be to write a
quick tool that uses clang as a library and have afl-fuzz just send it IPC
messages asking it to parse files; the server then forks off a child to
parse, avoiding all the startup overhead and option parsing and stuff
inside clang.
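
As a very rough sketch of what that might look like (hypothetical code
assuming the clang Python bindings, clang.cindex; real integration with
afl-fuzz would need more plumbing):

    import os
    import sys
    import clang.cindex

    def serve(paths):
        index = clang.cindex.Index.create()   # load libclang once, in the parent
        for path in paths:
            pid = os.fork()
            if pid == 0:
                # Child: parse the candidate; a parser crash only kills the child.
                index.parse(path, args=['-x', 'c++', '-std=c++11'])
                os._exit(0)
            _, status = os.waitpid(pid, 0)
            print(path, 'CRASH' if os.WIFSIGNALED(status) else 'ok')

    if __name__ == '__main__':
        # One file name per line, e.g. piped in by the fuzzing driver.
        serve(line.strip() for line in sys.stdin)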

-- Sean Silva

> The first thing that comes to mind to hack around that would be to write a
> quick tool that uses clang as a library and have afl-fuzz just send it IPC
> messages asking it to parse files; the server then forks off a child to
> parse, avoiding all the startup overhead and option parsing and stuff
> inside clang.

afl does this automatically:
http://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html

> afl does this automatically:
> http://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html

Wow neat! Then maybe the overhead is in clang's driver or frontend. From
the ballpark calculation above, it looks like there are orders of magnitude
of performance being left on the table.

-- Sean Silva

What afl-fuzz does, I believe, is fork at the beginning of main() or
somewhere near that point. As far as I understand there's probably no
really good reason why it couldn't fork at the first read from stdin,
but that's just not implemented yet.

Even the 300/s or something like that comes from running clang -cc1
directly; if I run clang without -cc1, I seem to remember that I get
something like 30/s instead.
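
For reference, one way to take the driver out of the loop is to capture
the cc1 command once with the driver's -### flag and then reuse it for
every input; a naive sketch with an illustrative seed file name:

    import shlex
    import subprocess

    # Ask the driver (once) what it would run for a representative seed file;
    # -### prints the subcommands instead of executing them.
    out = subprocess.check_output(
        ['clang', '-std=c++11', '-O0', '-c', 'seed.cpp', '-###'],
        stderr=subprocess.STDOUT, universal_newlines=True)
    cc1_cmd = next(shlex.split(line) for line in out.splitlines()
                   if '"-cc1"' in line)

    def compile_once(path):
        # Naively substitute the input file into the captured cc1 command line.
        return subprocess.call([path if arg == 'seed.cpp' else arg
                                for arg in cc1_cmd])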

  Sami

> At the very least I believe it's safe to say the fuzzer is heavily
> biased towards finding parser bugs as long as they are relatively
> common. Whether it can go deeper than that remains to be seen.

Yep. It'll be pretty interesting if that works.

> with llvm 3.5. With CReduce using clang, I don't find it that
> surprising that it fails on some of the inputs that also cause Clang
> to crash.

C-Reduce uses a clang-based tool to do some of its work, but the main C-Reduce loop is not based on clang and doesn't care that much if the clang part crashes. C-Reduce will work just fine (though of course its reduction power will be somewhat degraded) even if the clang tool crashes every time. I advise simply ignoring these crashes, and if you like I can add a command line option that causes it to hide the crashes from you.

> A bit hard to see how they would work together. IIUC Csmith tries hard
> to produce code that is also a semantically valid C program;

It does, and afl-fuzz will ruin that property, but that isn't important. The important thing is that Csmith will probably cover branches in LLVM that the test suite does not cover, and then afl-fuzz can use this as the basis for exploring still more code, meaning that Csmith+afl-fuzz may well find bugs that are not findable otherwise. I agree that you might as well wait for testing based on the test suite to run out of steam before doing this, but I wouldn't advise reading too much into whether afl-fuzz empties its queue or not, since its queuing policy (as far as I can tell) is more or less arbitrary anyway.

John

One thing that clearly does cause an overhead is the instrumentation
done by afl to get the edge coverage. It could probably also be made
more efficient by turning it into an LLVM pass instead of the current
textual search-and-replace on .s files (I've never before seen
instrumentation done that way), since it currently, among other things,
saves and restores registers at every conditional branch and function
entry point [1] (plus a few non-conditional ones "due to the simplicity
of afl-as" [2]).

  Sami

[1] https://github.com/mcarpenter/afl/blob/master/afl-as.h#L115
[2] https://groups.google.com/d/msg/afl-users/9swBIFHTmpo/P_PNdubKpQwJ

> One thing that clearly does cause an overhead is the instrumentation
> done by afl to get the edge coverage.

Do you have any idea how to quantify the overhead? Like what is the time
difference between an instrumented and non-instrumented clang?

-- Sean Silva

From some quick testing it seems we're talking about roughly 3-5x
execution times compared to an uninstrumented binary.

  Sami

> The number of test cases that cause a given kind of failure also
> roughly corresponds to how likely afl-fuzz is to hit that failure
> (though the corpus is minimized in the sense that every test case in
> the bot's database should exercise some code path that other test
> cases do not). By default afl-fuzz stops recording new crashes once it
> has found 5000 crashing inputs. Once some of the most common ones have
> been fixed, it would make sense to rerun the fuzzer and add new test
> cases to the bot.

Once we start to get clean under afl (from a seed of an empty file), it
would be great to start testing that new commits don't introduce
regressions. Here's my idea: afl-fuzz already requires a starting input to
act as a seed. If there is a test file change in a commit, use that test as
a seed for afl-fuzz to check that revision. What do you think?
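
As a rough sketch of that idea (assuming a git checkout of the sources
and the usual afl-fuzz command line; the paths and filters are
illustrative):

    import os
    import shutil
    import subprocess

    def seeds_from_commit(rev, seed_dir='seeds'):
        # Files touched by the commit; keep only plausible C/C++ test inputs.
        changed = subprocess.check_output(
            ['git', 'diff-tree', '--no-commit-id', '--name-only', '-r', rev],
            universal_newlines=True).splitlines()
        tests = [p for p in changed
                 if p.startswith('test/') and p.endswith(('.c', '.cpp'))]
        if not os.path.isdir(seed_dir):
            os.makedirs(seed_dir)
        for i, path in enumerate(tests):
            ext = os.path.splitext(path)[1]
            shutil.copy(path, os.path.join(seed_dir, '%03d%s' % (i, ext)))
        return seed_dir

    # then, roughly:
    #   afl-fuzz -i seeds -o findings -- ./clang -std=c++11 -O0 -c @@ -o /dev/null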

I've been playing with something like this in my spare time. I've gotten most of the infrastructure built to do this, but my initial couple of runs haven't been real productive. The problem is that most modified test cases are massive and afl-fuzz has a really hard time with that. I've got a couple of ideas on how to address that, but I haven't gotten back to it yet. If others are interested, I can push the code I've got up on github. (It's just some basic python scripting.)

Philip

> I've been playing with something like this in my spare time. I've gotten
> most of the infrastructure built to do this, but my initial couple of runs
> haven't been real productive. The problem is that most modified test cases

Yeah, it seems to me that it might be doable and with some careful
planning probably quite useful too, but with some caveats:

* You cannot really reuse old coverage results from previous runs of
  afl-fuzz on older clang binaries (and these in turn are used for
  splicing test cases when producing new cases), so probably this
  would involve taking, say, all the old cases that have reached new
  paths and rerunning afl-fuzz with the new clang for all those.

   * Currently afl-fuzz runs a timing step for all cases that cover
     new paths. I think that step takes somewhere near 1 second. You
     might get away with reusing timings from the old clang, or
     just disabling the timing logic.

   * Alternatively you could just use the new test case as a single
     input, but then you will be throwing away all the splicing
     possibilities.

* This still involves "throwing away" old results in any case, namely
  the negative results of "this mutation did not cause a new path to
  be executed". With a new clang binary, it might just be that it would
  have covered new paths. Hence, doing this for new clang revisions is
  unlikely to be a substitute for a more comprehensive test run every
  now and then.

> are *massive* and afl-fuzz has a really hard time with that. I've got a
> couple of ideas on how to address that, but I haven't gotten back to it yet.
> If others are interested, I can push the code I've got up on github. (It's
> just some basic python scripting.)

I've been thinking a bit about also minimizing the test cases
themselves, instead of only minimizing their number; those could then
be a reasonable starting corpus for a rerun on a newer clang binary. I
think it would also in a sense be cool to have minimized test cases
where each exercises some unique path in clang. I have no code though.

Currently, minimizing a corpus (with a script that comes with
afl-fuzz) works roughly like this:

  # covered[c]: the set of edges exercised by test case c; size(c): its size.
  # (Both are assumed to be available, e.g. from afl-showmap output.)
  ALL = set().union(*covered.values())   # edges covered by the entire corpus
  REDUCED = set()
  minimized = []
  while REDUCED != ALL:
      # the rarest edge: the one exercised by the fewest test cases
      e = min(ALL - REDUCED, key=lambda x: sum(x in s for s in covered.values()))
      # the smallest test case that exercises e
      c = min((t for t in covered if e in covered[t]), key=size)
      minimized.append(c)                # choose c for the minimized corpus
      REDUCED |= covered[c]              # edges exercised by c

If this is modified to further minimize the test case c (say, by
creduce or equivalent means) subject to the condition that the
resulting case still needs to exercise edge e, the result should be a
larger corpus of much smaller test cases.
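
One hypothetical way to express the "still exercises edge e" condition
is to rerun the candidate under afl-showmap (which ships with afl) and
check that the edge is still in the trace; the target command and the
tuple-per-line output parsing below are assumptions:

    import subprocess

    def still_covers(edge_id, case_path,
                     target=('./clang-instrumented', '-std=c++11', '-O0',
                             '-fsyntax-only')):
        # afl-showmap writes the coverage tuples for one execution to a file.
        subprocess.call(['afl-showmap', '-q', '-o', 'trace.txt', '--'] +
                        list(target) + [case_path])
        with open('trace.txt') as f:
            covered = {int(line.split(':')[0]) for line in f if ':' in line}
        return edge_id in covered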

An even simpler idea that I haven't done so far would be to strip
comments from the test cases before feeding them to afl-fuzz. They're
unlikely to be *that* interesting. Apparently afl-fuzz does some
minimization of its own, but I don't think it's very aggressive.
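
A rough comment stripper for that preprocessing step might look like
the following (illustrative only; it does not understand string
literals, so comment markers inside strings would be mangled):

    import re

    def strip_comments(src):
        src = re.sub(r'/\*.*?\*/', ' ', src, flags=re.S)  # /* ... */ block comments
        src = re.sub(r'//[^\n]*', '', src)                # // line comments
        return src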

  Sami