Post mortem on test flake in Sanitizer Common

I’m taking the time here to write up a bit of my investigation into a flaky test in sanitizer common that we saw start cropping up in Fuchsia’s Clang CI for a few reasons:

  1. I think it’s interesting,
  2. it took a lot of time, and
  3. I hope I can rope other community members into the discussion with a good story.

The TLDR; can be found in the original bug report at Sanitizer test failures in symbolize_stack.cpp on AArch64 · Issue #55460 · llvm/llvm-project · GitHub if you’re interested, although all of that information is now in this post (sometimes verbatim) and is hopefully presented here in a more enjoyable way.

So last weekend we saw for the first time a new test failure crop up in the symbolize_stack.cpp test in sanitizer common. Initially it only cropped up on in the UBSAN tests on AArch64, but later it came up in lots of other configurations on x86_64 as well (I did mention it was flaky right?).

So I started the normal course of action, reporting the bug, and looking for which commits caused the failure. Like LLVM, we often batch recent commits to reduce the load on our CI, so there is a blame list for each build, and it’s normally in the range of 3-10 commits. The failing test in question had been around for quite a while, and only ran a single check in FileCheck matching a nested template pattern for C++ vectors (essentially trying to match vector<.*vector.*vector<.*vector.*vector<). So I started looking through the recent commits, and didn’t see anything obvious, so I started bisecting hoping that would reveal something. It didn’t, and for whatever reason I couldn’t reproduce it locally. Moreover, by this point it was clear that the test was only failing sometimes, wasn’t limited to AArch64, and was the result of some non-deterministic failure in the test… Great.

I tried a lot of things to get the test failure to reproduce, and then Petr Hosek noticed that our CI output used some fairly long paths. He wondered if that may be part of the problem since we saw in the FileCheck output a warning diagnostic we thought was the cause of the issue:

==180122==WARNING: Symbolizer buffer too small

Using his insight I was able to construct some ridiculously convoluted path and retry the tests … and it worked I was able to (after two days) reproduce the failure which was often blocking our Toolchain work … well I could reproduce it sometimes.

We had already dug into the test code and the code in sanitizer common, but couldn’t for the life of us see why this test that had been well behaved for years on our bots suddenly started failing.

So after a lot of digging we finally determined the root cause of the issue. The test in symbolize_stack.cpp(llvm-project/symbolize_stack.cpp at 1dfd8e99f91c52879fd361466ad8765293a872be · llvm/llvm-project · GitHub) calls a recursive template, designed to create a function with a long function name. The purpose of the test is to ensure that the symbolizer does not crash when the name is very long(like 45KB).

The SymbolizerProcess class defines a buffer of fixed length(16K)(llvm-project/sanitizer_symbolizer_internal.h at 1dfd8e99f91c52879fd361466ad8765293a872be · llvm/llvm-project · GitHub), and uses that buffer during symbolization.

However, in ReadFromSymbolizer() (llvm-project/sanitizer_symbolizer_libcdep.cpp at 1dfd8e99f91c52879fd361466ad8765293a872be · llvm/llvm-project · GitHub) when reading into the buffer the symbolizer code checks if the buffer is full, and if so, basically aborts reading(llvm-project/sanitizer_symbolizer_libcdep.cpp at 1dfd8e99f91c52879fd361466ad8765293a872be · llvm/llvm-project · GitHub), and discards all data read so far. The buffer has the null terminator written to its first character, and a pointer to it is returned back up the call chain. Thus, when the symbolizer data is eventually flushed, it is truncated. The long function name in question can be over 45K long, far exceeding the size of the underlying buffer, and it just so happens that now suddenly the truncation is occuring before the string FileCheck is trying to match makes it into the output stream.

As mentioned earlier the test in symbolize_stack.cpp is trying to match a set of nested C++ vectors in FileCheck, essentially matching vector<.*vector.*vector<.*vector.*vector<. Historically, the test seems to have just been lucky in how the diagnostic output makes it into FileCheck, and in practice there have always been at least 5 nested vectors available in the output.

Like I mentioned earlier though, in our CI, we have some fairly long path names, and further there have been some changes to how demangling is done recently. I think the confluence of these factors started causing us to hit the test failure with more regularity in our CI.

I do want to point out that this doesn’t appear to be a bug in the demangler, but to be a subtle design flaw in how online symbolization works. Typically when reading into a fixed buffer, you keep working until the buffer is full (or you reach the end of the input), and then process whatever you’ve already received and then after you can discard the old contents, you start filing the buffer again until you’ve reached the end of the input. In the case of the symbolizer I had expected it to read whatever was ready, and if it ran out of buffer space to process the portion of the trace it had read, and then finish reading in the remaining data.

It’s also important to remember here that this code that is doing online symbolization, and it seems unlikely that it would ever be allowed to block as part of its design. Given that I believe blocking would be required to operate in the way I’ve described above, I’m unsure of the correct course of action here.

Sure, we can increase the buffer size. Setting it to the size of a huge page seems like a reasonable compromise. But that only kind of punts the problem down the line. Maybe the fix here is to stop dropping the data that’s been read in, and instead allow the symbolizer to flush what it’s already available – even though it may be incomplete. Afterall, it’s already printed a warning that the buffer was to small, so users won’t be surprised that the output looks less than ideal. I’m not familiar enough w/ the internals of the symbolizer reporting mechanism to know that this would work. For instance it may be impossible to parse partial data. I’m not an expert here, which is one of the reasons I made this writeup.

Back to the issue at hand: what to do about this flaky test?

All of the quick solutions I can see probably render the test useless:

  • increasing the buffer size means we’d need to also increase the number of recursions
  • reducing the number of recursions means we are no longer testing that long function names don’t trigger a crash
  • matching on earlier function names may silently ignore crashes

It also brings up a larger question about what the right way to design such a test would be, because i think using FileCheck here is the wrong approach. We already use GTest, and it IIRC supports death tests, which seems like the right way to test this instead of by proxy with FileCheck.

But beyond that, we may also want to start thinking about alternative designs for online symbolization, and if our tools can do better in these cases by using a different design.

If you have thoughts on any of this they’re welcome. The github issue is also a good place for discussion. If nothing else, then I hope the story was fun or at least mildly entertaining.


It does. You can grep for EXPECT_DEATH in the existing unittests for examples.