[RFC] Introduce "gate keeper" builders to reduce notification noise from long running bots

Hello everyone,

We have been discussing for a while now how to reduce notification noise from long running bots. One of the ideas which looks quite attractive is to have one or a set of “gate keeper” builders to do quick checks and all not so fast builders build only revisions for which the “gate keepers” reported the green status. The main idea behind this is such dependency filters simple problems and supposedly address cases when long running bots are unfortunate with commit/revert or commit/fix pairs.

I have played with this idea for a while and have a draft of the patch for llvm-zorg.

I live out of the scope of this RFC the idea of having the “gate keeper” builders do actual commits. We can discuss this in a separate thread if we want to.

First of all, let me go through pros and cons of this setup.

Pros:

  • Feedback to trivial problems will be fast, selective (since each commit will be built), and will not have a long tail of failure notifications keep arriving sometime days after the problem in question got resolved. This is a big win, up to me.

  • Long running builds wouldn’t fail on trivial problems if they will not build those revisions in the first place. This wouldn’t make blame list shorter, though, just eliminate “oh, this issue is known, and has been reverted or fixed hours ago” cases.

  • Questionably it would save slower builders resources, since they will not waste time building known to be bad commits.

Cons:

  • We add a new layer of dependencies. Some builders will depend on the “gate keeper” builders, so the “gate keeper” builders become a point of failure. No revision will be built on a significant number of builders if “gate keeper” builders fail on hardware or platform issues, shortage of resources, service outage, and such. I hope this wouldn’t be often, but it could be quite annoying for everyone if doesn’t work well.

  • Not running slower builders on known to be bad commits will not make the blame lists shorter, it will make them longer in fact, because skipped revisions will add to the blame list.

  • “Gate keeper” builders are “gates”, so they are essentially all-or-nothing. The list of hosts is going to be very limited at the beginning. Like x86_64 Linux only, I think. However, this could be a motivating point to have more fast-running reliable builders.

If we move further with this idea, we would get the “gate keeper” builders in place, running for a while to prove the correctness and reliability, then add the dependencies on them for slower running builders.

Notifications fom the “gate keeper” builders will have distinct subject line to attract attention, I think.

There are things I’d like to discuss before moving further:

  1. What should be “gate keeper” build configurations?

We want these builders to be fast, build each and every commit to llvm-project, provide short (about 3 minutes or better) response time, and be representative for catching those “easy to catch” problems in commits. Partially this depends on how fast is a particular worker, and how much resources it has. But also depends on what exactly we build and what tests we run. Obviously, less we build and check - shorter the response time. Ccache helps to make builds “incremental” in a sense. We can make more granular builders to guard particular projects and build only changes to those projects. Effectively that would reduce the testing footprint, as LLVM projects is a good test bench by itself.

  1. Do we want to have some kind of a policy for how long the “gate keeper” builders could be broken before a patch in question gets reverted?

Longer we wait less trivial revert is going to be in general, since some dependencies on that change could be introduced. Since the “gate keeper” builders are going to be a critical infrastructure point, they must be green most of the time and failures must be addressed quickly.

  1. How that policy will be enforced?

Options I see:

  • “Gate keeper” builders do auto-reverts. And manually address only cases when auto-revert is not possible for any reason.
  • Repository lock and “code sheriffs” who will be on a shift and would do a revert or fix it quickly.
  • No repository lock, but “code sheriffs” step up if an author didn’t revert or fix the issue in time.
  • Anything else?

Thanks

Galina

7 Likes

This is a great idea, thanks for working on this.

I think the kind of failures we should work the hardest to prevent are failures that block other people from doing work. The worst kinds of failures, in my opinion, are build failures, because then, I, as a developer can’t do local development on top of trunk. Or as a distributor I can’t do my daily or weekly snapshot.

I think for “gate keeper” configurations, we should have build-only jobs that build all (or a subset of the projects based on the files touched by the commit) projects for the 3 major OSes that people develop on (Mac, Windows, Linux).

I’m not opposed to adding ‘make check’ configurations as well, but I think we should be very selective about this. Trying to make check all projects can introduce a lot of noise in my experience, so I would prefer if we kept the testing limited (e.g. to just llvm or llvm, clang, and lld).

2 Likes

I don’t think solving this problem is particularly urgent. Reverting is a human decision today, and it makes sense to establish a norm to revert more aggressively for gate keeper breakage, and I expect that’s enough for the time being.

If we do want to solve it properly, IMHO that would involve something like bors which maintains a commit queue and uses tests to gate pushing to the main branch in the first place. (As opposed to reverting after the fact.)

1 Like

I really like the idea of gate keeper buildbots, but I think for them to truly be gate keepers, they need to do their work before a commit “gets through the gate” by landing on the main branch.

Having a good implementation of this kind of system would be one of the most helpful things to improve the stability of LLVM in my opinion.

3 Likes

We could call them something else :slight_smile: having one or more smoke-test bots seems like an improvement over the current situation. Let’s not have the excellent be the enemy of the better. Implementing a smoke-test bot doesn’t affect anyone’s workflow, while improving the quality of life. A true gate-keeper would be a distinct workflow change (one I would like to have, but is a bigger deal for the community).

The smoke-test bot(s) would inherently be very selective? I don’t see make-check adding noise, rather it would reduce noise. I made ~50 small commits over the past couple of months, and while I didn’t keep statistics, I think build-fail notices were far less frequent than test-fail notices. Most of those were not actionable on my part, or I had already take the necessary action, so a make-check gate would have distinctly reduced the noise I had to deal with.

3 Likes

Rust uses bors. Nobody can merge PRs. You can only say @bors r+.

On this topic: it would be ideal to not have a single “gate” but multiple. For example the slow Linux clang builders should be gated by a fast Linux clang build, the slow Mac clang gated by a fast MacOS clang builder, and the slow flang builder by a fast flang build, etc.

Something annoying is if a slow builder is not running because of an unrelated failure to what this slow builder is testing, which then just increases the blamelist for the slow builder unnecessarily.

1 Like

I agree, this is a great idea.

This matches my experience, but I would note that only doing a build would still have value.

On occasion there will be a sequence like this (oldest build first):

green
red - build error
<repeats N times>
red - test error

(and I am looking at one right now in fact)

Which complicates things 2 ways:

  • Now you have N* to bisect.
  • You cannot bisect main. You have to make a new tree with the build fix(es) cherry-picked to before the test failure presumably started.

We should absolutely run tests if possible within an acceptable response time and stability. If that’s not possible build only is still valuable.

Those few situations really take up a lot of time for us as maintainers.

It sounds like, lots of support for a build-only bot, with a stretch goal of also running tests if it can happen fast enough? I’ll agree that either of these is an improvement over what we have today.

The top-level gatekeeper does need to be fast enough to iterate on each commit individually. Turnaround time on the order of a handful of minutes.

I agree with @tstellar that @mehdi_amini has a great idea, we can have multiple gates for builders with different goals. That feels like another one-plus though, something we can add on later.

I suggest that once we have the idea of gating bots in place, it will likely be simpler to implement different schemes if the value to the community is there.

2 Likes

Thanks very much for sharing this proposal. I think it’s completely reasonable for people to discuss ideas for longer term changes to how the builders work (it’s a useful exercise to make sure work like this is aligned to such ideas), but I wouldn’t want this improvement to be blocked on e.g. moving to a pre-commit build flow. On a similar note, I agree with @nhaehnle that adding any kind of automation to the revert isn’t necessary for getting this proposal off the ground - just having the gatekeeper bots would be a good incremental improvement on what we have today.

+1 on potentially adding slower running gatekeeper bots that run tests. This would be particularly useful for very slow buildbots, preventing them wasting cycles recreating test issues that were already found on faster hardware.

1 Like

Wanted to chime in with an explicit +1 to designating gate keeper builders ASAP. I think this is a major step forward in terms of practical buildbot experience, and is something we should move forward with immediately.

I do not think we need to adopt automation or even a policy change in advance of trying out this approach. We already have a policy of “revert first, ask questions later”. This isn’t always followed (usually with decent cause), but for “obviously broken” commits it usually is.

In terms of gate keeper selection, the main thing is we need a bot owner to step forward and be willing to support the use case. Personally, I think this needs to be one of the existing “fast” builders which are already reasonably well resourced and have an established track record. We only need one to start with; we can sub-divide at a later point if experience shows it’s worthwhile.

I personally would prefer a “make check” bot over a build-only bot for the gate keeper. I see the potential argument on the cycle times, but would want to see data showing a significant difference in average cycle time before accepting that argument. My own experience doing local builds is that tests are a large part of incremental dev, but much less so when pulling a new commit. However, I care a lot less about this point than that we have some gate keeper bot. I will happily leave this choice to the bot owner.