Uncovering non-determinism in LLVM - An Update

Hi All,

I wanted to share a couple of updates on the effort to uncover non-determinism in LLVM through reverse iteration.

1. Reverse iteration has now been enabled for DenseMap (https://reviews.llvm.org/D35043)

2. We have set up a nightly reverse iteration buildbot (http://lab.llvm.org:8011/builders/reverse-iteration).
This builds all LLVM targets with reverse iteration ON and runs ninja check-all. Currently there are 14 unit test failures. Please feel free to fix these.

Also currently, only I receive the nightly email notification for this buildbot run. My plan is to enable sending the nightly notifications to llvm-commits once all 14 failures have been resolved.
Please let me know if the community wants the nightly notifications even with the failures.
As a potential next step, I was thinking about bootstrapping this reverse iteration LLVM to compile itself. Not sure if it can uncover more bugs but maybe worth a shot.
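
For anyone new to the idea, here is a minimal Python sketch (not LLVM code) of why flipping the iteration order exposes hidden order dependence. The `UnorderedSet` class and its `reverse` flag are hypothetical stand-ins for an unordered container such as DenseMap in a build with reverse iteration ON. The two `emit_*` functions are invented for illustration only.

```python
class UnorderedSet:
    """Toy unordered container; `reverse` mimics a reverse-iteration build."""

    def __init__(self, items, reverse=False):
        self._items = list(items)
        self._reverse = reverse

    def __iter__(self):
        # Iteration order is an implementation detail callers must not rely on.
        return iter(reversed(self._items) if self._reverse else self._items)


def emit_fragile(symbols):
    # Bug: the output order leaks the container's internal iteration order.
    return ", ".join(symbols)


def emit_stable(symbols):
    # Fix: impose an explicit order before producing output.
    return ", ".join(sorted(symbols))


names = ["foo", "bar", "baz"]
forward = UnorderedSet(names)
backward = UnorderedSet(names, reverse=True)

# The fragile emitter changes its output when iteration is reversed;
# the stable one does not.
fragile_differs = emit_fragile(forward) != emit_fragile(backward)
stable_matches = emit_stable(forward) == emit_stable(backward)
```

A test that passes in the forward configuration but fails in the reverse one is, in this model, a test that exercises something like `emit_fragile`.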

All comments/suggestions welcome.

Thanks,
Mandeep

To uncover bugs in this configuration, I believe you’d want/need a stage2/stage3 comparison which might be a bit tricky/expensive*, something like:

Build clang twice (once with reverse iteration enabled, once with forward), then use each of those two compilers to build clang or other release binaries (or even the whole release), all in one mode (I don’t think it matters which), and compare the results bit-for-bit. They should be identical.
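
The comparison step could be sketched roughly as follows. This is only an illustration under assumptions: the two output directories are hypothetical, and comparing SHA-256 digests is used here as a practical stand-in for a literal byte-by-byte comparison.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path relative to `root` to its content digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(forward_root, reverse_root):
    """Return sorted relative paths that are missing or differ between trees."""
    forward = tree_digests(forward_root)
    reverse = tree_digests(reverse_root)
    return sorted(
        path
        for path in forward.keys() | reverse.keys()
        if forward.get(path) != reverse.get(path)
    )
```

Calling `compare_trees("stage2-forward/bin", "stage2-reverse/bin")` (both paths made up) would return an empty list when the two builds are identical, and otherwise the files worth investigating.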

* If you want other developers to act on bugs found, the buildbot needs to have a short blame list (this can be done on a slow buildbot by having multiple slaves/builders running in parallel) and preferably also a short cycle time (so failures are reported soon after they’re created). Otherwise expect to do a lot of triage yourself (and possibly leave the emails going only to you, because the blame lists/revision ranges will be too large for people to find them actionable), and then probably to follow up on the specific commit you believe introduced the problem, either fixing it yourself or replying on the commits list to report it to the original contributor.

I agree with what David said here, but I just wanted to say that you
shouldn't feel too discouraged because of it.

As someone that occasionally has to bisect 5h+ worth of revisions, I
can tell you that in time you'll often be able to just look at the
revisions and spot the culprit, or maybe 2-3 candidates that have
likely caused the issue. Given that this bot does something very
specific, you can then probably just inspect the code and see what
caused the problem (if the revision doesn't touch any containers, then
it probably didn't cause the issue, right?). It's a lot easier when
you have a revision range, so it obviously won't take as long to
identify and fix as the initial failures that you are seeing now.
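
When eyeballing the range is not enough, the fallback is a plain bisection. A minimal sketch, assuming a non-empty ordered list of revisions whose last entry is known to fail, a known-good revision just before the range, and a failure that persists once introduced. The `is_bad` callback is hypothetical and stands for "build this revision with reverse iteration ON and run the failing test".

```python
def first_bad_revision(revisions, is_bad):
    """Binary-search an ordered revision range for the first failing revision.

    Assumes `revisions` is non-empty, its last entry is known bad, and the
    failure persists in every revision after the one that introduced it.
    """
    low, high = 0, len(revisions) - 1
    while low < high:
        mid = (low + high) // 2
        if is_bad(revisions[mid]):
            high = mid  # the culprit is at mid or earlier
        else:
            low = mid + 1  # the culprit is after mid
    return revisions[low]
```

Each step halves the candidate range, so even a long blame list needs only a handful of builds, which is why a bounded revision range makes the job so much easier.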

Ultimately, it's up to you to decide how much effort you are willing /
able to put into this. These kinds of failures probably won't even occur
that often in practice, but when they do I think it's important to
find them and fix them. The best way to know for sure is to give it a
try for a while and see how it goes. If you find that it's
impractical, you can always revert to the current configuration.

Thanks David and Diana for your suggestions. Yes, I am looking at setting up the builder to run after every 10 commits (is 10 a reasonable number?) and notify the blame list in case of failures.

My plan is to enable this once all the current failures in the reverse builder are fixed (currently there are 4 failures).

As far as the bit-by-bit comparison of forward vs reverse builders is concerned, I am trying to convince my team to dedicate some resources to this. Not sure how soon I can get this done :)

Also thanks Diana for your ideas on how to debug/fix reverse iteration failures. Actually I have been following a similar strategy to fix these issues so far. Maybe I will update the community on this in a future email thread.

--Mandeep

Rather than picking a fixed number of commits, mostly you probably just end up having the bot run as fast as possible. I think the default configuration maybe has a time-based delay, so it doesn’t fire off immediately on the first commit after a quiet period, but if nothing else comes in for a few minutes, it goes off and runs on a single commit rather than sitting idle.
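
A toy model of that quiet-period behaviour, to show how it shapes blame lists. This is not Buildbot's implementation (Buildbot schedulers have a setting of this kind, `treeStableTimer`). The function name and the timestamps are invented, and the model ignores how long a build takes.

```python
def batch_commits(commit_times, quiet_period):
    """Group commit timestamps into builds using a quiet-period rule.

    A build is triggered once `quiet_period` seconds pass with no new commit;
    every commit that arrived before that point lands in the same blame list.
    """
    batches = []
    current = []
    for timestamp in sorted(commit_times):
        if current and timestamp - current[-1] >= quiet_period:
            batches.append(current)  # the tree was quiet long enough: build
            current = []
        current.append(timestamp)
    if current:
        batches.append(current)
    return batches


# Three commits in a burst, one isolated commit, then two close together.
example = batch_commits([0, 30, 60, 600, 1200, 1210], quiet_period=120)
```

Here `example` is `[[0, 30, 60], [600], [1200, 1210]]`: the burst shares one blame list, while the isolated commit gets a build of its own.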

I guess you’d find more in the zorg repository.

But please keep an eye on how reliable/actionable the emails are - if developers aren’t responding to/fixing issues, or if the bot is sending fail mail for other uninteresting things (like build failures that other buildbots already diagnosed) - please tweak/tune or disable the buildbot. I know there’s already a lot of buildbot email spam, but everything we can do to reduce the noise is really important.