Lib/fuzzer: `TORCW` capacity is too small (and other issues)?

(Note: “new users can only put 5 links in a post” so apologies I couldn’t link all the things I wanted!)

We’ve recently started using -fsanitize=fuzzer to fuzz our library, and I’d like to share what look like some limitations of the fuzzing runtime that we’ve run into, involving the rather small TORCW (Table Of Recent Compares - Word) and code that performs a large number of string comparisons. I’d also like to ask a few questions about the design and see if there is room for improvement here.

We’re using -fsanitize=fuzzer with libprotobuf-mutator to fuzz a C++ library. The protobuf schema describes how to create objects within the API, call methods on them, etc., and we execute that schema against the library/API. So this is a pretty typical instance of what is described in “Fuzzing Stateful APIs”. Part of this API is an initialization step where runtime options and feature flags are specified as a simple array of strings (or a repeated string field in protobuf). For example, one of these string values might be "enableSomeFeatureFlagOrBehaviourChange".
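
For context, a stripped-down sketch of the relevant part of the schema looks roughly like this (message and field names here are invented for illustration; the real schema is much larger):

```proto
// Illustrative only: names are made up, the real schema is much larger.
syntax = "proto2";

message Session {
  // Runtime options / feature flags handed to the library's initialization
  // call, e.g. "enableSomeFeatureFlagOrBehaviourChange".
  repeated string init_flags = 1;

  // ... further messages describing the API calls to perform are elided ...
}
```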

We noticed that only a very small subset of these flags is ever discovered by the fuzzer, even when left to run for days at thousands of execs/s, and despite the undiscovered flags opening up a lot of new branches. If we provide these flags manually in a dictionary file, or seed the corpus with examples, everything works as expected and the coverage paths are found and explored.
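
For reference, the workaround is just a plain libFuzzer/AFL-style dictionary passed via -dict=, something like the following (every flag name other than the first is made up):

```
# flags.dict -- passed to the fuzzer with -dict=flags.dict
flag1="enableSomeFeatureFlagOrBehaviourChange"
flag2="enableSomeOtherMadeUpFlag"
```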

I spent a while looking into this because, although providing a dictionary of these flags works, my understanding is that finding magic strings is exactly what the fuzzer should be good at, so I was worried that we were missing something. After adding a lot of tracing to the compiler-rt/lib/fuzzer code and trying various experiments, I’ve found that:

  • The particular string comparisons having the issues are hooked via __sanitizer_weak_hook_memcmp
  • The memcmp arguments are added to the TORCW (Table Of Recent Compares - Word) table, which stores up to 32 pairs of FixedWord<64> (the pair being the two arguments of the memcmp/strcmp/etc.). An existing table entry is replaced whenever a subsequent comparison’s inputs hash to the same index (see the simplified sketch after this list).
  • Our library hits the memcmp hook hundreds of times on each run just during regular initialization (ignoring any fuzz input), and these all go into the same TORCW table. As a result, a lot of important feature-flag strings recorded earlier in the run get kicked out by collisions.
    • The handful of flags the fuzzer is able to find are the ones that happen to “survive” / not collide with any others during a run. Conversely, the flags that are not found are consistently kicked out by collisions.
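
For anyone not familiar with the code, this is roughly the data structure involved, as I read it in compiler-rt/lib/fuzzer/FuzzerDictionary.h and FuzzerTracePC.h (paraphrased and simplified, so details may not match current trunk exactly):

```cpp
// Paraphrased/simplified from my reading of FuzzerDictionary.h; not the exact
// upstream source. The key point is that Insert() silently overwrites whatever
// pair already occupies the slot.
template <class T, size_t kSizeT>
struct TableOfRecentCompares {
  static const size_t kSize = kSizeT;
  struct Pair { T A, B; };

  void Insert(size_t Idx, const T &Arg1, const T &Arg2) {
    Idx = Idx % kSize;      // a collision evicts the previous pair
    Table[Idx].A = Arg1;
    Table[Idx].B = Arg2;
  }

  Pair Get(size_t I) { return Table[I % kSize]; }

  Pair Table[kSize];
};

// And in TracePC (FuzzerTracePC.h), again from memory:
//   TableOfRecentCompares<Word, 32> TORCW;   // Word == FixedWord<64>
//   MemMemTable<1024> MMT;
```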

So TORCW seems to be too small, at least for the code we are fuzzing, and its contents end up biased towards comparisons that happen towards the end of a run, because earlier ones get kicked out by collisions. Yes, it’s called the Table Of Recent Compares, but this doesn’t seem like a desirable property: there’s no reason “interesting” strings can’t appear early on, and ideally they shouldn’t have a lower chance of being discovered.

As an experiment, I increased the size of the TORCW table from 32 to 1024 (mirroring the size of the MemMemTable). The fuzzer was able to find the particular string I was testing for within ~5 mins consistently, though there were a couple of issues with this:

  1. If TORCW isn’t fully saturated with entries (which depends on how many comparisons the code performs), the extra empty slots cause Mutate_AddWordFromTORC to return early more often, because the random index lands on an empty entry; this reduces the overall success rate of Mutate_AddWordFromTORC. If I change Mutate_AddWordFromTORC to iterate through TORCW from the initial random index until it finds a non-empty entry, my test case improves further, from <5 mins to <2 mins.
  2. The hashing code in AddValueForMemcmp is particularly weak, essentially just doing hash ^= lhs[i] << 8 | rhs[i], with a few other bits from the -use_value_profile=1 logic thrown in. So it’s only 16 bits, and XOR behaves badly: repeated characters (which the fuzzer likes to throw in a lot) just XOR down to either 0 or the character value, anagrams produce the same hash, etc. (see the small demo below). This matters less when the capacity is only 32, but it becomes more of a problem at larger sizes. Switching to SimpleFastHash for the TORCW.Insert(...) call (while leaving the ValueProfileMap alone) increased the number of table entries in the scenario I was testing from ~200 to ~320, i.e. roughly 120 fewer collisions, which seems significant.
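
To make point 2 concrete, here is a tiny standalone approximation of that folding hash (not the actual libFuzzer code) showing how anagrams and runs of repeated bytes collapse onto the same value:

```cpp
#include <cstdint>
#include <cstdio>

// Rough approximation of the per-byte folding in AddValueForMemcmp:
// hash ^= (lhs[i] << 8) | rhs[i]. XOR is order-insensitive and self-cancelling,
// so anagrams on one side collide and repeated pairs cancel to zero.
static size_t WeakHash(const char *Lhs, const char *Rhs, size_t N) {
  size_t Hash = 0;
  for (size_t I = 0; I < N; I++)
    Hash ^= (static_cast<size_t>(static_cast<uint8_t>(Lhs[I])) << 8) |
            static_cast<uint8_t>(Rhs[I]);
  return Hash;
}

int main() {
  // Anagram on the lhs, same rhs: identical 16-bit hash.
  std::printf("%zx %zx\n", WeakHash("abcd", "flag", 4),
              WeakHash("dcba", "flag", 4));
  // An even number of identical byte pairs cancels all the way to 0.
  std::printf("%zx\n", WeakHash("aaaa", "aaaa", 4));
  return 0;
}
```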

So at this point I’m wondering if anybody here is familiar with the code and design choices in this general area. Some questions/comments:

  • Why does TORCW only have a capacity of 32 pairs (and not 64/128/…?), whereas the similar MemMemTable has 1024 (singles)? strcmp/memcmp and strstr/memmem are slightly different operations, but in terms of how many entries we remember I don’t see why there should be such a big difference.
  • Regarding earlier string comparisons having a lower chance of “surviving” in TORCW: while not a perfect solution, maybe the hash function could be randomly seeded at some point (even at the process level), so that an input which is kicked out by a collision under one seed may “survive” under another. I guess this wouldn’t scale well with more comparisons / a higher “load factor” on the table, but maybe there is some other way to randomly cover / drift over the full set of comparisons over time? At the very least, the code we are fuzzing has a critical string that appears in an early comparison but gets kicked out of TORCW almost deterministically, so introducing some variability here would help.
  • As I mentioned earlier, increasing the capacity of TORCW (or MemMemTable, for that matter) will reduce the chance of Mutate_AddWordFromTORC finding/returning a string, as it introduces more empty slots. In the worst case, if there is just one string in TORCW today, the chance of selecting it is ~1/32, and increasing the capacity to 1024 would reduce this to ~1/1024. Maybe a dynamically sized table makes sense here to control the density, and/or it’s better to do a linear search from the initially chosen random index until something is found (this is the experiment I tried in 1. above; see the sketch after this list)? The fuzzer has already decided it wants a TORC string, so it seems to me we should spend a bit longer to actually return one, rather than randomly (with a probability based on the unused capacity) “yielding” back up to the top-level fuzz loop to try another random mutator.
  • The str/mem* hooks in FuzzerTracePC.cpp have an if (!fuzzer::RunningUserCallback) return; early return, presumably to avoid polluting the comparison tables with values seen outside the user’s fuzz function. However, these early returns are not present on any of the other hooks (integer comparisons, switches, etc.). Is there a reason for this?
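
To illustrate the linear-search idea (the experiment from 1. above and the third bullet), here is a sketch against the simplified table shown earlier. This is illustrative only, not a patch against trunk; GetNonEmpty is a name I made up, and it assumes T exposes a size() method, as FixedWord does:

```cpp
// Illustrative sketch, not upstream code: probe forward from the random index
// instead of bailing out when it lands on an empty slot, so the mutator always
// gets an entry as long as the table holds at least one.
template <class T, size_t kSizeT>
struct ProbingTableOfRecentCompares {
  static const size_t kSize = kSizeT;
  struct Pair { T A, B; };

  void Insert(size_t Idx, const T &Arg1, const T &Arg2) {
    Table[Idx % kSize] = {Arg1, Arg2};
  }

  // Return the first non-empty pair at or after a (random) start index.
  // Even a sparsely populated 1024-entry table then yields a word whenever
  // it contains one, instead of ~1/1024 of the time.
  Pair GetNonEmpty(size_t RandIdx) const {
    for (size_t I = 0; I < kSize; I++) {
      const Pair &P = Table[(RandIdx + I) % kSize];
      if (P.A.size() || P.B.size())   // assumes T has size(), like FixedWord
        return P;
    }
    return {};  // the table is completely empty
  }

  Pair Table[kSize] = {};
};
```

Whether a full scan per mutation is acceptable is a separate question; bounding the probe length would be an easy middle ground.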

Comments/thoughts welcome!


Going to try a “bump” and ask if anybody knows who to reach out to about this?

As per the libFuzzer documentation (libFuzzer – a library for coverage-guided fuzz testing — LLVM 19.0.0git):

The original authors of libFuzzer have stopped active work on it and switched to working on another fuzzing engine, Centipede. LibFuzzer is still fully supported in that important bugs will get fixed. However, please do not expect major new features or code reviews, other than for bug fixes.

Looking through google/fuzztest, I see they have a somewhat newer implementation of this in the TableOfRecentlyComparedBuffers class, though they still acknowledge it is lossy. I guess we’ll try this out and see whether it exhibits similar bad behaviour for our use case.