A Prototype to Track Input Read for Sparse File Fuzzing

Hi everyone,

I wrote a prototype based on the LLVM sanitizer infrastructure to improve fuzzing performance, especially for sparse file formats. I’d like to upstream it if anyone thinks it is useful.

Sparse file formats are formats in which only a small portion of the file data can affect the behavior of the program that parses it. Common examples are archive files and file system images, where only the metadata affects program behavior. When fuzzing such formats, a general-purpose fuzzer selects ranges to mutate at random. Because the formats are sparse, random range selection has a high probability of hitting the “holes” where the data has no influence on the parser. Trimming the input can sometimes improve the hit rate on effective ranges, but it does not always work. For instance, some programs impose a minimum file size that turns out to be fairly large for fuzzing, or the effective ranges are scattered across the entire file rather than concentrated at the beginning.

The tool I wrote leverages the observation that a piece of data can influence the parser’s behavior only if the parser actually reads it, and that the regions read from a sparse file are usually small compared to the entire file. By generating a read map for each input and feeding the map to a modified fuzzer that prioritizes mutating those ranges, we saw over 10x improvement in path discovery at bootstrap time in our test. The modified fuzzer also found crashes within 0.5 hours that the original version had not found after 72 hours, when we ended the test.
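
To make the "prioritize read ranges" idea concrete, here is a minimal sketch in Python. It is not the prototype's code: the function names, the half-open (start, end) range format, and the bias parameter are all assumptions made for illustration, and it assumes a non-empty input.

```python
import random

def pick_offset(size, read_map, rng, bias=0.9):
    """Pick a byte offset to mutate, preferring ranges the parser read.

    read_map is a list of half-open (start, end) ranges (assumed format);
    bias is a hypothetical probability of staying inside those ranges.
    """
    # Clip ranges to the input size and drop empty ones.
    ranges = [(s, min(e, size)) for s, e in read_map if s < min(e, size)]
    if ranges and rng.random() < bias:
        # Weight each range by its length so every read byte is equally likely.
        weights = [e - s for s, e in ranges]
        start, end = rng.choices(ranges, weights=weights, k=1)[0]
        return rng.randrange(start, end)
    # Fall back to a uniform pick so unread bytes still get explored.
    return rng.randrange(size)

def mutate(data, read_map, rng, bias=0.9):
    """Return a copy of data with one bit flipped at a biased offset."""
    out = bytearray(data)
    off = pick_offset(len(out), read_map, rng, bias)
    out[off] ^= 1 << rng.randrange(8)
    return bytes(out)
```

Keeping a small uniform fallback (bias below 1.0) is a design choice in this sketch: a mutation can change which bytes the parser reads, so the map from the previous run may be incomplete.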

At a high level, the tool uses an instrumentation pass to record every memory read in shadow memory, while a runtime tracks buffer propagation from a user-specified buffer (the initial buffer the file is read into) and coalesces the shadow memory for these buffers. Running the instrumented binary on an input file then produces a read map for that file.
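
As a rough illustration of the shadow-memory and coalescing steps, the toy model below keeps one shadow byte per input byte and merges marked bytes into ranges. The real tool works through compiler instrumentation and a sanitizer runtime; the TrackedBuffer class and coalesce function here are hypothetical stand-ins, and buffer propagation (copies of the input into other buffers) is not modeled.

```python
def coalesce(shadow):
    """Turn a per-byte shadow (truthy = byte was read) into half-open ranges."""
    ranges = []
    start = None
    for i, was_read in enumerate(shadow):
        if was_read and start is None:
            start = i                      # a new read region begins
        elif not was_read and start is not None:
            ranges.append((start, i))      # the region ended just before i
            start = None
    if start is not None:
        ranges.append((start, len(shadow)))
    return ranges

class TrackedBuffer:
    """Toy stand-in for the instrumented input buffer (hypothetical)."""

    def __init__(self, data):
        self.data = data
        self.shadow = bytearray(len(data))  # one shadow byte per input byte

    def read(self, offset, length):
        # Mark every byte touched by this read, then return the bytes.
        end = min(offset + length, len(self.data))
        for i in range(offset, end):
            self.shadow[i] = 1
        return self.data[offset:end]

    def read_map(self):
        return coalesce(self.shadow)
```

For example, reading 4 bytes at offset 0, 6 bytes at offset 2, and 8 bytes at offset 32 from a 64-byte buffer yields the read map [(0, 8), (32, 40)]: the overlapping reads merge into a single range.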

I hope this is interesting to some people and I can provide more details. The prototype is not ready to upstream yet, but I would like to work on it if the community is interested.

This topic pops up regularly when discussing fuzzers, and not only for sparse input formats.
I hope to eventually have a reasonable solution in libFuzzer itself.
One way is to couple libFuzzer with dfsan (I even had some code for this, but removed it later).

In the meantime, contributions are very welcome in various forms:

Hi Kostya,

Thanks for getting back to me.

My work is somewhat similar to dfsan at a high level, but it is specifically designed only to identify the read regions of an input file. It uses the fundamental sanitizer infrastructure, so if dfsan can be integrated with libFuzzer, I think my work can as well.

Regarding dfsan with libFuzzer, can you remind me why it was removed?

My work is somewhat similar to dfsan at a high level, but it is specifically
designed only to identify the read regions of an input file.

... which is probably good :)

It uses the fundamental sanitizer infrastructure, so if dfsan can be
integrated with libFuzzer, I think my work can as well.

Regarding dfsan with libFuzzer, can you remind me why it was removed?

It was not used anywhere and prevented me from doing a large refactoring.
I do want to reinstate it, or something similar.

--kcc