I wrote a prototype based on LLVM sanitizer infrastructure to improve fuzzing performance, especially over sparse file format. I’d like to upstream it if the anyone thinks it is useful.
Sparse file format are formats that only a small portion of the file data could have impact on the behavior of the program that parses it. Common examples are archive files or a file system image where only metadata would affect program behavior. When fuzzing those formats, a general fuzzer will randomly select ranges to mutate. Because of the sparse nature of the formats, random range selection has a high probability to hit the “wholes” where data have no influence on the parser. While applying trim over the input could sometimes improve the effective range hit rate, it would not always work. For instance, some program may pose a minimum file size requirement which turns to be fairly large for fuzzing, or the effective ranges are sparsely distributed over an entire file instead of being centralized in the beginning.
The tool I wrote leverages the observation that a piece of data would only have influence on its parser’s behavior only if the data is at least read out by the parser, and the read regions of a sparse file is usually pretty small compared to the entire file. By generating an read map for each input and feeding the map to a modified fuzzer that prioritizes mutating those ranges, we noticed over 10X performance improvement in path discovery at bootstrap time in our test. The modified fuzzer was also able to find crashes in 0.5 hour where the original version couldn’t find in 72 hours when we ended the test.
The high level idea about how the tool works is it uses an instrumentation pass to record any memory read in shadow memory, while a runtime tracks buffer propagation from a user specified buffer (the initial buffer a file is read into), and coalesces shadow memory for these buffers. A read map can be generated for each input file with the instrumented binary.
I hope this is interesting to some people and I can provide more details. The prototype is not ready to upstream yet, but I would like to work on it if the community is interested.