Build bot fatigue

My inbox has been filled with llvm.buildmaster@lab.llvm.org build failure notifications lately.

The two problems appear to be:

  1) Getting notifications for breakage that was introduced by an unrelated commit, often in a module I don't work on. Usually the original committer is working on or has already landed the necessary fix.

  2) A cascade of dozens of notifications from various build servers that continue to flood in over the course of 24 hours after the issue was fixed.

These two conflate and produce a high signal-to-noise ratio, and in practice you have to filter them out which means you no longer get a ping on your phone when you need it.

Presumably a full fix is a non-trivial CI engineering problem, but are there simple measures to get the situation back under control?

Doesn't have to be perfect as long as it reduces the dozens of mails every day to something more manageable. Ideas:

  1) Only send direct mail when the recipient is the single name in the blame list.

  2) Set an In-Reply-To header in order to thread all failure notifications related to a specific SVN revision. Most email clients will let you silence the thread once you've confirmed the issue has been resolved.

  3) Or even simpler, don't send failure mail from any builders outside the "fast" set? Otherwise the important failures blocking everyone's work get drowned out in the noise.

Sorry to send a feature request without patches but I'm not familiar with the CI infrastructure and this looks like a fairly recent development (or is it just me?)

Alp.

My inbox has been filled with llvm.buildmaster@lab.llvm.org build failure
notifications lately.

The two problems appear to be:

1) Getting notifications for breakage that was introduced by an unrelated
commit, often in a module I don't work on. Usually the original committer
is working on or has already landed the necessary fix.

2) A cascade of dozens of notifications from various build servers that
continue to flood in over the course of 24 hours after the issue was fixed.

These two conflate and produce a high signal-to-noise ratio, and in
practice you have to filter them out which means you no longer get a ping
on your phone when you need it.

FWIW, this has generally been my experience.

Nit: I think you mean "low" signal-to-noise ratio.

Presumably a full fix is a non-trivial CI engineering problem, but are
there simple measures to get the situation back under control?

Doesn't have to be perfect as long as it reduces the dozens of mails every
day to something more manageable. Ideas:

1) Only send direct mail when the recipient is the single name in the
blame list.

2) Set an In-Reply-To header in order to thread all failure notifications
related to a specific SVN revision. Most email clients will let you silence
the thread once you've confirmed the issue has been resolved.

This seems like it might be simple, depending on where these emails are
being generated (in one of our scripts, or deep inside some CI application).
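A minimal sketch of the threading idea, assuming the mails are built with Python's standard `email` package. The function names, the subject format, and the synthetic "thread root" Message-ID are hypothetical, not taken from the real buildmaster. It also assumes the mail client groups messages by `In-Reply-To`/`References` even when the root message itself was never sent, which is common but not guaranteed.

```python
from email.message import EmailMessage

# Hypothetical domain, used only to build stable Message-ID values.
THREAD_DOMAIN = "lab.llvm.org"


def thread_root_id(revision: int) -> str:
    """Return a deterministic ID shared by every mail about one SVN revision."""
    return f"<build-failure-r{revision}@{THREAD_DOMAIN}>"


def make_failure_mail(revision: int, builder: str, recipient: str) -> EmailMessage:
    """Build a failure notification that threads under its revision."""
    msg = EmailMessage()
    msg["From"] = "llvm.buildmaster@lab.llvm.org"
    msg["To"] = recipient
    msg["Subject"] = f"Build failure on {builder} (r{revision})"
    # Every mail still gets its own unique Message-ID.
    msg["Message-ID"] = f"<build-failure-r{revision}-{builder}@{THREAD_DOMAIN}>"
    # Pointing both headers at the same per-revision ID lets clients
    # collapse all notifications for that revision into one thread.
    root = thread_root_id(revision)
    msg["In-Reply-To"] = root
    msg["References"] = root
    msg.set_content(f"Builder {builder} failed at r{revision}.")
    return msg
```

With this, two builders failing on the same revision produce distinct mails that share one thread, so muting the thread silences the later stragglers.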

-- Sean Silva

1) Only send direct mail when the recipient is the single name in the
blame list.

That would filter out important breakages. I think your option 3 below is
probably the most effective and simplest to implement right now.

An alternative to that would be to filter for the failure's cause (test
failed, file miscompiled, etc) and see if any commit touches that file, and
only send the email to the users that touched any of them. We have all the
info in the page, shouldn't be too hard to grep stuff around...
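A rough sketch of that file-based filter, under the assumption that the failing files and the per-commit file lists have already been extracted from the build page. The `Commit` type and `recipients_for_failure` are hypothetical names. Falling back to the whole blame list when nothing matches is an added assumption, so that a breakage never goes unreported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One entry of a blame list (hypothetical shape)."""
    author: str
    files: frozenset


def recipients_for_failure(blame_list, failing_files):
    """Return the sorted authors whose commits touched a failing file."""
    failing = set(failing_files)
    touched = {c.author for c in blame_list if c.files & failing}
    if not touched:
        # Assumption: with no direct match, mail everyone on the blame list.
        touched = {c.author for c in blame_list}
    return sorted(touched)
```

A failure in `lib/CodeGen/X.cpp` would then only reach committers whose changes include that file, while an unmatched failure still reaches everybody.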

3) Or even simpler, don't send failure mail from any builders outside the
"fast" set? Otherwise the important failures blocking everyone's work get
drowned out in the noise.

An option in the bot configuration to send an email or not would do. I
wouldn't separate "fast" from "slow", but "unique" from the rest.

For instance, we have two "fast" bots, one on A15 and one on A9. Of course,
the A15 is faster, and the A9 repeats a few minutes later. I'd want to
receive it only from one of them.

I also have a test-suite bot that doesn't "check-all", so if that fails,
it's either a compilation failure, or it's a test-suite failure, and I
really want to be warned when it breaks.

A further step would be to manage emails by bot type. Fast-unique bots
report everything (compilation, svn, make, tests), while other unique bots
only report their own stuff, so my test-suite bot would not report
compilation failures. The problem with that would be if the compilation
*only* happens on the test-suite bot, and then we'd need an extra layer to
diff between error reports, and that would be massive. I don't expect this
to happen ever.
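To make the "unique" flag and the per-type reporting concrete, here is a small sketch. The builder names, the `unique` and `mail_on` keys, and the step names are all made up for illustration and do not reflect the real buildmaster configuration.

```python
# Hypothetical per-builder notification settings.
BUILDERS = {
    # Fast, unique bot: reports everything.
    "a15-check-all": {"unique": True, "mail_on": {"svn", "compile", "make", "test"}},
    # Repeats the A15 results a few minutes later, so it stays silent.
    "a9-check-all": {"unique": False, "mail_on": set()},
    # Unique test-suite bot: only reports its own kind of failure.
    "a15-test-suite": {"unique": True, "mail_on": {"test-suite"}},
}


def should_send_mail(builder: str, failed_step: str) -> bool:
    """Mail only for unique builders, and only for steps they own."""
    cfg = BUILDERS.get(builder)
    if cfg is None or not cfg["unique"]:
        return False
    return failed_step in cfg["mail_on"]
```

Here a compile failure is mailed once by the A15 check-all bot, not again by the A9 bot or the test-suite bot.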

Finally, I think your comment is also valid for IRC messages, they drive me
crazy... Can we have a separate IRC channel for those messages? Like
llvm-buildbots? Or just stick to email?

cheers,
--renato

My inbox has been filled with llvm.buildmaster@lab.llvm.org build
failure notifications lately.

The two problems appear to be:

  1. Getting notifications for breakage that was introduced by an
    unrelated commit, often in a module I don’t work on. Usually the
    original committer is working on or has already landed the necessary fix.

  2. A cascade of dozens of notifications from various build servers
    that continue to flood in over the course of 24 hours after the issue
    was fixed.

These two conflate and produce a high signal-to-noise ratio, and in
practice you have to filter them out which means you no longer get a
ping on your phone when you need it.

Presumably a full fix is a non-trivial CI engineering problem, but are
there simple measures to get the situation back under control?

Doesn’t have to be perfect as long as it reduces the dozens of mails
every day to something more manageable. Ideas:

  1. Only send direct mail when the recipient is the single name in the
    blame list.

  2. Set an In-Reply-To header in order to thread all failure
    notifications related to a specific SVN revision. Most email clients
    will let you silence the thread once you’ve confirmed the issue has been
    resolved.

  3. Or even simpler, don’t send failure mail from any builders outside
    the “fast” set? Otherwise the important failures blocking everyone’s
    work get drowned out in the noise.

Sorry to send a feature request without patches but I’m not familiar
with the CI infrastructure and this looks like a fairly recent
development (or is it just me?)

This isn’t new. Just how the bots have always worked.

The biggest thing would be to move bots over to the phased builder infrastructure pioneered by Apple (they use it internally and I believe most of it has been upstreamed by Daniel Dunbar and David Tweed) that sets up dependencies (eg: testing debug info depends on the compiler passing the basic check first) and reuse/caching of build products (eg: use the output of the basic checks to test the debug info, rather than rebuilding the compiler on every builder).

This would reduce noise and increase build slave efficiency and granularity to produce smaller blame lists.
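As an illustration of the dependency idea only (this is not the actual phased builder code), the sketch below runs phases in dependency order and skips any phase whose prerequisite did not pass, so one broken basic check produces one failure rather than one per downstream builder. The phase names and the `run` callback are hypothetical; `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical phases: each maps to the set of phases it depends on.
PHASES = {
    "basic-check": set(),
    "debug-info": {"basic-check"},
    "test-suite": {"basic-check"},
}


def run_phases(phases, run):
    """Run phases in dependency order; run(name) returns True on success."""
    results = {}
    for name in TopologicalSorter(phases).static_order():
        if all(results.get(dep) == "passed" for dep in phases[name]):
            results[name] = "passed" if run(name) else "failed"
        else:
            # A prerequisite failed or was skipped, so do not build or mail.
            results[name] = "skipped"
    return results
```

Only phases marked "failed" would need to send mail; "skipped" phases stay quiet because their failure is already explained upstream.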

Agreed. This is perhaps the best way to deal with the problem and still have committers catch important failures.

My personal views (by which I always mean that I'm speaking as one of
the compiler engineers employed by ARM but not officially on behalf of
ARM):

My inbox has been filled with llvm.buildmaster@lab.llvm.org build
failure notifications lately.

The two problems appear to be:

  1) Getting notifications for breakage that was introduced by an
unrelated commit, often in a module I don't work on. Usually the
original committer is working on or has already landed the necessary fix.

  2) A cascade of dozens of notifications from various build servers
that continue to flood in over the course of 24 hours after the issue
was fixed.

These two conflate and produce a high signal-to-noise ratio, and in
practice you have to filter them out which means you no longer get a
ping on your phone when you need it.

Presumably a full fix is a non-trivial CI engineering problem, but are
there simple measures to get the situation back under control?

Doesn't have to be perfect as long as it reduces the dozens of mails
every day to something more manageable. Ideas:

  1) Only send direct mail when the recipient is the single name in the
blame list.

I think this would mean less-high-performance builders would never
signal their failures, which as explained below would be unfortunate.

  2) Set an In-Reply-To header in order to thread all failure
notifications related to a specific SVN revision. Most email clients
will let you silence the thread once you've confirmed the issue has been
resolved.

This sounds like a reasonable solution.

3) Or even simpler, don't send failure mail from any builders outside
the "fast" set? Otherwise the important failures blocking everyone's
work get drowned out in the noise.

I think it would certainly be helpful to separate out the builders into
a set which are sufficiently maintained and reliable to get an email
from when something breaks their build/tests, and a more "advisory"
set of builders (eg, there are some builders that appear to have
borderline stability, often throwing up errors unrelated to the issues
under test). I think declaring only fast builders get to send emails would
have unfortunate effects in terms of testing native builds on low-power
architectures (which will have a slower turn-around but are otherwise
quite reliable).
(ARM, my employer, spent quite a bit of effort fixing the ARM issues that
had crept in, work which for various reasons has transitioned to Linaro now.)
Modified in that sense, this also seems a reasonable solution.

This isn't new. Just how the bots have always worked.

The biggest thing would be to move bots over to the phased builder
infrastructure pioneered by Apple (they use it internally and I believe most
of it has been upstreamed by Daniel Dunbar and David Tweed) that sets up
dependencies (eg: testing debug info depends on the compiler passing the
basic check first) and reuse/caching of build products (eg: use the output
of the basic checks to test the debug info, rather than rebuilding the
compiler on every builder).

Just to note that I suspect it's someone else you're thinking of
regarding the phased builder. (Although I did quite a bit of work on the
ARM buildbots late last year, I haven't been involved in the phased
builder work.)

That would be David Dean.

Right - thanks for the correction & apologies to any & all Davids harmed by that mistake :)

  • David