Stella,
Thank you for raising the question. This is a great discussion for us to have publicly.
So folks know, I am the individual Stella mentioned below. I’ll start with a bit of history so that everyone’s on the same page, then dive into the policy question.
My general take is that buildbots are only useful if failure notifications are generally actionable. A couple months back, I was on the edge of setting up mail filter rules to auto-delete a bunch of bots because they were regularly broken, and decided I should try to be constructive first. In the first wave of that, I emailed a couple of bot owners about things which seemed like false positives.
At the time, I thought it was the bot owner’s responsibility to not be testing a flaky configuration. I got a bit of pushback on that from a couple of sources - Stella was one - and put that question on hold. This thread is a great opportunity to decide what our policy actually is, and document it.
In the meantime, I’ve been working with Galina to document existing practice where we could, and to try to identify best practices on setting up bots. These changes have been posted publicly, and reviewed through the normal process. We’ve been deliberately trying to stick to non-controversial stuff while we got the docs improved. I’ve been actively reaching out to bot owners to gather feedback in this process, but Stella had not yet been one of them.
Separately, this week I noticed a bot which was repeatedly toggling between red and green. I forget the exact ratio, but in the recent build history, there were multiple transitions, seemingly unrelated to the changes being committed. I emailed Galina asking her to address it, and she removed the buildbot until it could be moved to the staging buildmaster, fixed, and then restored. I left Stella off the initial email. Sorry about that, no ill intent, just written in a hurry.
Now, transitioning into a bit of policy discussion…
My personal take is that for a bot to be publicly notifying, “someone” needs to take the responsibility to backstop the normal revert-to-green process. This “someone” can be developers who work in a particular area, the bot owner, or some combination thereof. I view the bot config owner as the person responsible for making sure that backstopping is happening. Not necessarily by doing it themselves, but by having the contacts with developers who can, and following up when the normal flow is not working.
In this particular example, we appear to have a bunch of flaky lldb tests. I personally know absolutely nothing about lldb. I have no idea whether the tests are badly designed, whether the system they’re being run on isn’t yet supported by lldb, or whether some recently introduced code bug is causing the failures. “Someone” needs to take the responsibility of figuring that out, and in the meantime spamming developers with unactionable failure notices seems undesirable.
For context, the bot was disabled until it could be moved to the staging buildmaster. Moving to staging is required (currently) to disable developer notification. In the email from Galina, it seems clear that the bot would be fine to move back to production once the issue was triaged. This seems entirely reasonable to me.
Philip
p.s. One thing I’ll note as a definite problem with the current system is that a lot of this happens in private email, and it’s hard to share so that everyone has a good picture of what’s going on. It makes miscommunications all too easy. Last time I spoke with Galina, we were tentatively planning to start using GitHub issues for bot operation matters to address that, but as that was in the middle of the transition from Bugzilla, we deferred and haven’t gotten back to that yet.
p.p.s. The bot in question is if folks want to examine the history themselves.