Confusing buildbot failure in LLVM on sanitizer-x86_64-linux

Alexey, I got mail from one of the tsan buildbots, claiming a breakage
in tsan tests. But I cannot see anything in the logs it has for the
build.

http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/17916/steps/run%2064-bit%20tsan%20unit%20tests/logs/stdio

Any ideas? Thanks. Diego.

It’s a 20m timeout without output.

If you back up to the build and look at the ‘annotate’ step output, there’s this text:

http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/17916/steps/annotate/logs/stdio

-- Testing: 258 tests, 16 threads --
Testing: 0 .. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3507.624426

The annotator should probably include that timeout text in the failing step, so that sounds like a bug.

Another issue is that tsan times out sometimes. Should we be sending tsan build failures to upstream developers? How often do they break tsan? I suspect that when LLVM breaks tsan, it also breaks ASan, which isn’t as flaky. It might be better to mail the tsan failures to Dmitry or someone and not upstream LLVM devs.
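
As a rough sketch of what surfacing the timeout text in the failing step could look like: the helper below scans a step's log for the marker line quoted above and returns it with a little preceding context. The function name and the `context_lines` parameter are made up for illustration; this is not the actual annotator script, only an assumption about how such a check might be written.

```python
import re

# Marker text copied from the buildbot log quoted in this thread.
TIMEOUT_RE = re.compile(r"command timed out: (\d+) seconds without output")


def timeout_summary(log_text, context_lines=2):
    """Return the timeout line plus a few preceding lines, or None.

    Hypothetical helper, not part of the real annotator.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if TIMEOUT_RE.search(line):
            # Keep a little context so the reader can see which step hung.
            start = max(0, i - context_lines)
            return "\n".join(lines[start:i + 1])
    return None
```

The returned text could then be attached to the failing step's own log, so the timeout is visible without opening the 'annotate' output.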

Also, how often are the timeouts actually indicative of regressions?
Perhaps we could flag them as "exceptional" results, shown in purple (and
possibly not emailing anyone except the buildbot owner), rather than as
red failures.
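
To make the idea concrete, here is a minimal sketch of such a policy. The status strings and recipient labels are hypothetical placeholders, not Buildbot's real API; the only thing taken from the thread is the "seconds without output" text in the log.

```python
# Substring of the timeout message quoted in this thread.
TIMEOUT_MARKER = "seconds without output"


def classify_step(exit_code, log_text):
    """Return (status, recipients) under a hypothetical notification policy."""
    if exit_code == 0:
        return ("success", [])
    if TIMEOUT_MARKER in log_text:
        # A hang is treated as an infrastructure problem: purple, owner only.
        return ("exception", ["bot-owner"])
    # Anything else is an ordinary red failure and goes to the blamelist too.
    return ("failure", ["bot-owner", "blamelist"])
```

Whether a timeout should ever reach the blamelist depends on how often timeouts turn out to be real regressions, which is the open question here.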

+dvyukov

Happened to me again:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/18273/steps/annotate/logs/stdio

In fact, this whole bot has a 20% failure rate with the same failure mode, from looking at the history:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/?numbuilds=50

They all end with this:

[100%] Running ThreadSanitizer tests
-- Testing: 258 tests, 16 threads --
Testing: 0 .. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
command timed out: 1200 seconds without output, attempting to kill

It seems like we’d get a lot more value from this bot if we just disabled the tsan tests, or at least whichever tests have the highest deadlock risk.

Do we know that 14.4 GB of RAM is enough to run tsan tests with
parallelism level 16? I would not be surprised if it is not. Don't yet
have a machine to test.
Alexey, reduce parallelism level for tsan tests to 4 on that bot and
let's see what happens.
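
For a rough sense of the numbers, a back-of-the-envelope calculation (assuming memory is split evenly between test workers and ignoring whatever the OS and the rest of the build use, which is optimistic):

```python
def ram_per_worker_gb(total_ram_gb, workers):
    """Even split of RAM across parallel test workers (simplifying assumption)."""
    return total_ram_gb / workers


# 14.4 GB is the figure quoted above; 16 and 4 are the two parallelism levels.
for workers in (16, 4):
    share = ram_per_worker_gb(14.4, workers)
    print(f"{workers} workers: {share:.1f} GB each")
```

That is about 0.9 GB per worker at 16 versus 3.6 GB at 4. Whether 0.9 GB is too little for a tsan test is exactly what is not known yet.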

So far as I can tell, no one is root-causing this, so in the meantime can we disable check-tsan?