Buildbot Noise

Folks,

David has been particularly militant about broken buildbots recently,
so to make sure we don't throw the baby out with the bath water, I'd
like to propose some changes to how we deal with the emails on our
*current* buildmaster, since there are no concrete plans to move it to
anything else at the moment.

The main issue is that managing the buildbots is not a simple task. It
requires bot owners to disable them on the slave side, or specific
people to do so on the master side. The former can take as long as the
owner wants (which is not nice), and the latter refreshes all active
bots (triggering exceptions) and is harder to revert.

We need to be pragmatic without re-writing the BuildBot product.

Grab some popcorn...

There are two main fronts on which we need to discuss the noise: bot
stability and test stability.

   1. Bot stability issues

We need to distinguish between four classes of buildbots:

  1.1. Fast && stable && green

These buildbots normally finish under one hour, but most of the time
under half an hour, and should be kept green as much as possible.
Therefore, any reasonable noise from these bots is welcome, since we
want them to go back to green as soon as possible.

They're normally the front-line, and usually catch most of the silly
bugs. But we need some kind of policy that allows us to revert patches
that break them for more than a few hours. We have an agreement
already, and for me that's good enough. People might think
differently.

With the items 2.x below taken care of, we should keep the current
state of our bots for this group.

  1.2. One of: Slow || Unstable || Often Red

These bots are special. They're normally *very* important, but they
have some issues, like slow hardware, too few available boards, or
bugs that take a long time to bisect and fix.

These bots catch the *serious* bugs, like self-hosted Clang
mis-compiling a long-running test which sometimes fails. They can
produce noise, but when the noise is correct, we really need to listen
to it. Writing software to understand that is non-trivial.

So, the idea here is to have a few special treatments for each type of
problem. For example, slow bots need more hardware to reduce the blame
list. Unstable bots need more work to reduce spurious noise to a
minimum (see 2.x below), and red bots *must* remain *silent* until
they come back to green (see 2.x below).

What we *don't* want is to keep them disabled or silenced once they're
back to green. Most of the bugs they find are hard to debug, so the
longer we take to fix them, the harder it is to find out what happened.
We need to know as soon as possible when they break.

  1.3. Two of: Slow || Unstable || Often Red

These bots are normally only important to their owners, and they are
on the verge of being disabled. The only way to cope with those bots
is to completely disable their emails / IRC messages, so that no one
gets flooded with noise from broken bots.

However, some bots in the 1.2 category fall into this one for short
periods of time (~1 week), so we need to be careful with what we
disable here. That's the key baby/bathwater issue.

Any hard policy here will be wrong for some bots some of the time, so
I'd love it if we could all just trust the bot owners a bit when they
say they're fixing the issue. However, if a bot falls here for more
than a month, or more often than a few times over a few months (I'm
being vague on purpose), then we collectively decide to disable it.

What I *don't* want is any two or three guys deciding to disable
someone else's buildbot because they can't stand the noise. Remember,
people do take holidays once in a while, and they may be in the Amazon
or the Sahara having a well deserved rest. Returning to work and
learning that all your bots have been disabled for a week is not nice.

So far, we have coped with the noise, and the result is that people
tend to ignore those bots, which means more work for the bot owner.
This is not a good situation, and we want to move away from it, but we
shouldn't flip all switches off by default. We can still be pragmatic
about this as long as we improve the quality overall (see 2.x below)
over time.

In summary, bots that fall here for too long will have their emails
disabled and become candidates for removal in the next spring
clean-up, but not immediately.

  1.4. Slow && Unstable && Red

These bots don't belong here. They should be moved elsewhere,
preferably to a local buildmaster that you can control and that will
never email people or upset our master if you need changes. I have
such a local master myself and it's very easy to set up and maintain.
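
For anyone curious, a local master really is just a short master.cfg.
Below is a minimal sketch from memory of the 0.8.x sample config; the
names, password and module paths are placeholders and may need
adjusting for your buildbot version:

    # Minimal local master.cfg, sketched from memory of the buildbot 0.8.x
    # sample config; names, passwords and module paths are placeholders.
    from buildbot.buildslave import BuildSlave
    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory
    from buildbot.steps.shell import ShellCommand
    from buildbot.schedulers.forcesched import ForceScheduler

    c = BuildmasterConfig = {}
    c['slaves'] = [BuildSlave("local-arm", "password")]
    c['slavePortnum'] = 9989

    factory = BuildFactory()
    factory.addStep(ShellCommand(command=["ninja", "check-all"], timeout=3600))

    c['builders'] = [BuilderConfig(name="local-clang",
                                   slavenames=["local-arm"],
                                   factory=factory)]
    c['schedulers'] = [ForceScheduler(name="force", builderNames=["local-clang"])]
    c['status'] = []          # no email, no IRC: it only bothers you
    c['db'] = {'db_url': "sqlite:///state.sqlite"}
    c['title'] = "Local LLVM bots"
    c['buildbotURL'] = "http://localhost:8010/"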

They *do* have value to *you*, for example to track the progress of
your features in cleaning up the failures, or to generate some
benchmark numbers, but that's very specific to your project and should
remain separate.

Any of these bots in the LLVM Lab should be moved away / removed, but
by consensus, including the bot owner if he/she is still reachable on
the list.

   2. Test stability issues

These issues, as you may have noticed from the links above, apply to
*all* bots. The less noise we have overall, the lower will be our
threshold for kicking bots out of the critical pool, and the higher
the value of the not-so-perfect buildbots to the rest of the
community.

  2.1 Failed vs Exception

The most critical issue we have to fix is the "red -> exception ->
red" issue. Basically, a bot is red (because you're still
investigating the problem), then someone restarts the master, so you
get an exception. The next build will be a failure, and the
buildmaster recognises the status change and emails everyone. That's
just wrong.

We need to add an extra check to that logic so that it walks back to
the most recent non-exception status and compares against that, not
just the immediately previous result.

This is a no-brainer and I don't think anyone would be against it. I
just don't know where this is done, so I welcome the knowledge of more
experienced folks.
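
To make the intent concrete, here's a rough sketch of the check I have
in mind (the result constants and the list of previous builds are
stand-ins for whatever the real notifier has access to, not buildbot's
actual API):

    # Sketch only: the result constants and the previous-builds plumbing are
    # stand-ins, not buildbot's actual API.
    SUCCESS, FAILURE, EXCEPTION = range(3)

    def previous_meaningful_result(previous_builds):
        """Walk back past exceptions to the last genuine success/failure."""
        for build in previous_builds:  # newest first, current build excluded
            if build.result != EXCEPTION:
                return build.result
        return None

    def should_email(current_result, previous_builds):
        # Only mail on a genuine success -> failure transition; a run of
        # exceptions in between must not be treated as "success".
        previous = previous_meaningful_result(previous_builds)
        return current_result == FAILURE and previous == SUCCESS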

  2.2 Failure types

The next obvious thing is to detect what the error is. If it's an SVN
error, we *really* don't need to get an email. But this raises the
problem that an SVN failure followed by a genuine failure will not be
reported. So, the reporting mechanism also has to know what the
previously *reported* failure was, not just the previous failure.

Other failures, like timeouts, can be caused by either flaky hardware
or broken codegen. A way to be conservative and low-noise would be to
only warn on a timeout iff it's the *second* in a row.

For all these adjustments, we'll need some form of walk-back on the
history to find the previous genuine result, and we'll need to mark
results with some metadata. This may involve some patches to buildbot.
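
Something along these lines is what I mean; the categories and the
"last reported" bookkeeping are made up for illustration, they're not
fields buildbot has today:

    # Illustrative only: the categories and last_reported are made-up names,
    # not fields that exist in buildbot today.
    SVN_ERROR, TIMEOUT, TEST_FAILURE = "svn", "timeout", "test"

    def should_report(current, previous_failures, last_reported):
        """Decide whether a failure deserves an email.

        current           -- category of the current failing build
        previous_failures -- categories of earlier failures, newest first
        last_reported     -- category of the last failure we actually mailed
        """
        if current == SVN_ERROR:
            return False                 # infrastructure noise, never mail
        if current == TIMEOUT:
            # Only warn on the second timeout in a row.
            return bool(previous_failures) and previous_failures[0] == TIMEOUT
        # A genuine failure right after an unreported SVN hiccup must still
        # be reported, so compare against what was last *reported*.
        return current != last_reported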

  2.3 Detecting fixed bots

Another interesting feature, present in the "GreenBot", is a warning
when a bot you broke has been fixed. That, per se, is not a good idea
if the noise levels are high, since it would probably double them.

So, this feature can only be introduced *after* we've done the clean
ups above. But once it's clean, having a "green" email will put the
minds of everyone who hasn't seen the "red" email yet to rest, as
they now know they don't even need to look at it at all, just delete
the email.

For those using fetchmail, I'm sure you could create a rule to do that
automatically, but that's optional. :)

  2.4 Detecting new failures

This is a wish-list item of mine, for the case where the bots are slow
and hard to debug and are still red. Assuming everything above is
fixed, they will emit no noise until they go green again; however,
while I'm debugging the first problem, others can appear. If that
happens, *I* want to know, but not necessarily everyone else.

So, a list of already-reported problems could be attached to the
failure report, and if the new failure is different, the bot owner gets
an email. This would have to play nicely with exception statuses, as
well as with spurious failures like SVN errors or timeouts, so it's not
an easy patch.
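
In pseudo-Python, I'm thinking of something like the sketch below,
where the failing-test sets and the send_mail() helper are hypothetical
names, not existing buildbot code:

    # Hypothetical sketch: the failing-test sets and send_mail() are made up.
    def notify_owner_of_new_failures(current_failures, known_failures, owner):
        """Mail only the bot owner when failures show up that were not part
        of the breakage already under investigation."""
        new = set(current_failures) - set(known_failures)
        if new:
            send_mail(to=[owner],
                      subject="New failures while the bot is red",
                      body="\n".join(sorted(new)))
        return set(known_failures) | set(current_failures)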

The community at large would already be happy with all the changes
minus this one, but folks like me who have to maintain slow hardware
would appreciate this feature. :)

Does anyone have more concerns?

AFAICS, we should figure out where the walk-back code needs to be
inserted, and that would get us 90% of the way. The other 10% will be
to list all the buildbots, check their statuses and owners, map them
into those categories, and take the appropriate action.

Maybe we should also reduce the noise in the IRC channel further (like
only first red, first green), but that's not my primary concern right
now. Feel free to look into it if it is for you.

cheers,
--renato

Hi Renato,

Very useful thoughts, thanks. I need to think about what could be done about these.

I will add a few comments from my side.

The buildmaster as configured now should send notifications on status change only for
‘successToFailure’ and ‘failureToSuccess’ events, so always-red bots should be quiet.
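
For illustration, such a notifier looks roughly like this (written from
memory of the 0.8 MailNotifier API, so the address and exact arguments
below are approximations, not a copy of the zorg configuration):

    # Rough sketch from memory of the buildbot 0.8.x API; the address and the
    # exact keyword arguments are approximations, not the real zorg config.
    from buildbot.status.mail import MailNotifier

    c['status'].append(MailNotifier(
        fromaddr="buildbot@example.org",   # placeholder address
        mode="change",                     # mail only when the status changes
        sendToInterestedUsers=True))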

Also, we have a group of builders (experimental_scheduled_builders) in the configuration file builders.py which should also be quiet. This is the place for noisy unstable bots.

If these features are not working properly, please let me know, and I will also try to keep an eye on them.

Unfortunately, buildbot currently does not distinguish between test and build failures.

I am going to be away on vacation all of next week, but will keep an eye on the buildbot.

Thanks

Galina

I haven't looked at LLVM's configs or buildbot setup at all, but with
the buildbot I ran previously it was possible to have the master reload
its configs without restarting the master or interrupting any of the
unmodified builders.

Might be worth looking into why that doesn't work (if it doesn't)?

For many changes, restarting the master is not necessary, but not for all of them.

But there is room for improvement here as well.

Thanks

Galina

I agree with almost everything you said. A couple of comments inline.

Folks,

David has been particularly militant about broken buildbots recently,
so to make sure we don't throw the baby out with the bath water, I'd
like to propose some changes to how we deal with the emails on our
*current* buildmaster, since there are no concrete plans to move it to
anything else at the moment.

The main issue is that managing the buildbots is not a simple task. It
requires bot owners to disable them on the slave side, or specific
people to do so on the master side. The former can take as long as the
owner wants (which is not nice), and the latter refreshes all active
bots (triggering exceptions) and is harder to revert.

We need to be pragmatic without re-writing the BuildBot product.

Grab some popcorn...

There are two main fronts that we need to discuss the noise: Bot and
test stability.

    1. Bot stability issues

We need to distinguish between four classes of buildbots:

   1.1. Fast && stable && green

These buildbots normally finish under one hour, but most of the time
under 1/2 hour and should be kept green as much as possible.
Therefore, any reasonable noise from these bots is welcome, since we
want them to go back to green as soon as possible.

They're normally the front-line, and usually catch most of the silly
bugs. But we need some kind of policy that allows us to revert patches
that break them for more than a few hours. We have an agreement
already, and for me that's good enough. People might think
differently.

With the items 2.x below taken care of, we should keep the current
state of our bots for this group.

   1.2. One of: Slow || Unstable || Often Red

These bots are special. They're normally *very* important, but have
some issues, like slow hardware, not too many available boards, or
they take long times to bisect and fix the bugs.

These bots catch the *serious* bugs, like self-hosted Clang
mis-compiling a long-running test which sometimes fails. They can
produce noise, but when the noise is correct, we really need to listen
to it. Writing software to understand that is non-trivial.

So, the idea here is to have a few special treatments for each type of
problem. For example, slow bots need more hardware to reduce the blame
list. Unstable bots need more work to reduce spurious noise to a
minimum (see 2.x below), and red bots *must* remain *silent* until
they come back to green (see 2.x below).

What we *don't* want is to disable or silence them after they're
green. Most of the bugs they find are hard to debug, so the longer we
take to fix it the harder it is to find out what happened. We need to
know as soon as possible when they break.

I view the three conditions as warranting somewhat different treatment. Specifically:

"slow" these are tolerable if annoying

"unstable" these should be removed immediately. If the failure rate is more than 1 in 5 builds of a known clean revision, that's far too much noise to be notifying. To be clear, I'm specifically referring to spurious *failures* not environmental factors which are global to all bots.

"often red" these are extremely valuable (msan, etc..). Assuming we only notify on green->red, the only rule we should likely enforce is that each bot has been green "recently". I'd suggest a threshold of 2 months. If it hasn't been green in 2 months, it's not really a build bot.

   1.3. Two of: Slow || Unstable || Often Red

These bots are normally only important to their owners, and they are
on the verge of being disabled. The only way to cope with those bots
is to completely disable their emails / IRC messages, so that no one
gets flooded with noise from broken bots.

However, some bots in the 1.2 category fall into this one for short
periods of time (~1 week), so we need to be careful with what we
disable here. That's the key baby/bathwater issue.

+1. Any reasonable threshold is fine. We just need to have one.

Any hard policy here will be wrong for some bots some of the time, so
I'd love it if we could all just trust the bot owners a bit when they
say they're fixing the issue. However, if a bot falls here for more
than a month, or more often than a few times over a few months (I'm
being vague on purpose), then we collectively decide to disable it.

What I *don't* want is any two or three guys deciding to disable the
buildbot of someone else because they can't stand the noise. Remember,
people do take holidays once in a while, and they may be in the Amazon
or the Sahara having well deserved rest. Returning to work and
learning that all your bots are disabled for a week is not nice.

So, maybe I'm missing something, but: why is it any harder to bring a silenced bot green than an emailing one?

"unstable" these should be removed immediately. If the failure rate is more
than 1 in 5 builds of a known clean revision, that's far too much noise to
be notifying. To be clear, I'm specifically referring to spurious
*failures* not environmental factors which are global to all bots.

There are some bugs that introduce intermittent behaviour, and it
would be very bad if we just disabled the bots that warned us about
them. Some genuine bugs in Clang or the sanitizers can come and go if
they depend on where the objects are stored in memory, or if the block
happens to be aligned or not.

One example is Clang's inability to cope with alignment when using its
own version of placement new for derived classes. Our ARM bots have
been warning about these for more than a year and we have fixed most of
them. If we had disabled the ARM bots the first time they became
"unstable", we would still have those problems and we wouldn't be
testing on ARM any more. Two very bad outcomes.

We have to protect ourselves from assuming too much, too early.

"often red" these are extremely valuable (msan, etc..). Assuming we only
notify on green->red, the only rule we should likely enforce is that each
bot has been green "recently". I'd suggest a threshold of 2 months. If it
hasn't been green in 2 months, it's not really a build bot.

This is the case I make in 1.4. If the bot is assumed to be red
because someone is slowly fixing its problems, then this bot belongs
on a separate buildmaster.

However, slow bots tend to be red for longer periods, not necessarily
for a larger number of builds. OTOH, fast bots can be red for a very
large number of builds, but go green immediately when a revert is
applied. So we need to be careful about timings here.

So, maybe I'm missing something, but: why is it any harder to bring a
silenced bot green than an emailing one?

It's not. But keeping it green later is, because it takes time to
change the buildmaster. For obvious reasons, not all of us have access
to the buildmaster, meaning we depend on the few people that work on
it directly to move things around.

By adding the uncertainty of commits breaking the build to the
uncertainty of when the master will be updated, you can easily fall
into a deadlock. I have been in situations where, in the space of two
weeks, I had to bring one bot from red to green 5 times. If in between
someone had set that bot not to warn, it could have taken me more time
to realise, and every new failure on top of the original makes the
process non-linearly more complex, especially if whoever broke the bot
is committing loads of patches to try to fix the mess. Reverting two
sequences of interleaved patches independently is more than twice as
hard as reverting one sequence, and so on.

I think if we had different public masters, and if the bot owner had
the responsibility to move between them, that could work well, since
moving masters is in the owners' power, while moving groups in the
master is not.

We can then leave disabling the bot on the master as a more radical
solution for when the bot owner is not responsive or uncooperative.

cheers,
--renato

Buildmaster as is configured now should send notifications on status
change only for 'successToFailure' and 'failureToSuccess' events, so
always red bots should be quiet.

Hi Galina,

This is true, but for bots that go from red to exception (when the
master is restarted, for instance) and back to red, we do get emails.

Maybe the master is treating exception as success, because it doesn't
want to warn on its own exceptions, but then it creates the problem
described above.

We need a richer logic than just success<->failure.

Also we have group of builders (experimental_scheduled_builders) in
configuration file builders.py which also should be quiet. This is place for
noisy unstable bots.

But moving between them needs configuration changes. These can not
only take a while to happen, but they also depend on the buildbot admins.

My other proposal is to have different buildmasters, with the same
configurations for all bots, but one that emails, and the other that
doesn't.

As a bot owner, I could easily move between them by just updating one
line in my bot's config, without needing anyone else's help.
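
Concretely, on the slave side that's just the buildmaster_host line in
the slave's buildbot.tac (the values here are placeholders):

    # Excerpt from a slave's buildbot.tac (placeholder values); pointing the
    # bot at a different master is a one-line change plus a slave restart.
    buildmaster_host = 'quiet-master.example.org'   # was the emailing master
    port = 9990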

cheers,
--renato

We need a richer logic than just success<->failure.
Agree. I will look at this.

My other proposal is to have different buildmasters, with the same
configurations for all bots, but one that emails, and the other that
doesn’t
I think this is useful also.

I am going to be away next week, and will get back to these after that.

Thanks

Galina

Folks,

David has been particularly militant about broken buildbots recently,
so to make sure we don't throw the baby out with the bath water, I'd
like to propose some changes to how we deal with the emails on our
*current* buildmaster, since there are no concrete plans to move it to
anything else at the moment.

The main issue is that managing the buildbots is not a simple task. It
requires bot owners to disable them on the slave side, or specific
people to do so on the master side. The former can take as long as the
owner wants (which is not nice), and the latter refreshes all active
bots (triggering exceptions) and is harder to revert.

We need to be pragmatic without re-writing the BuildBot product.

Grab some popcorn...

There are two main fronts that we need to discuss the noise: Bot and
test stability.

   1. Bot stability issues

We need to distinguish between four classes of buildbots:

  1.1. Fast && stable && green

These buildbots normally finish under one hour, but most of the time
under 1/2 hour and should be kept green as much as possible.
Therefore, any reasonable noise

Not sure what kind of noise you're referring to here. Flaky fast builders
would be a bad thing, still - so that sort of noise should still be
questioned.

from these bots is welcome, since we
want them to go back to green as soon as possible.

They're normally the front-line, and usually catch most of the silly
bugs. But we need some kind of policy that allows us to revert patches
that break them for more than a few hours.

I'm not sure if we need extra policy here - but I don't mind documenting
the common community behavior here to make it more clear.

Essentially: if you've provided a contributor with a way to reproduce the
issue, and it seems to clearly be a valid issue, revert to green & let them
look at the reproduction when they have time. We do this pretty regularly
(especially outside office hours when we don't expect someone will be
around to revert it themselves - but honestly, I don't see that as a
requirement - if you've provided the evidence for them to investigate,
revert first & they can investigate whenever they get to it, sooner or
later)

We have an agreement
already, and for me that's good enough. People might think
differently.

With the items 2.x below taken care of, we should keep the current
state of our bots for this group.

  1.2. One of: Slow || Unstable || Often Red

These bots are special. They're normally *very* important, but have
some issues, like slow hardware, not too many available boards, or
they take long times to bisect and fix the bugs.

Long bisection is a function of not enough boards (producing large revision
ranges for each run), generally - no? (or is there some other reason?)

These bots catch the *serious* bugs,

Generally all bots catch serious bugs - it's just a long tail: fast,
easy-to-find bugs first, then longer tests find the harder-to-find
bugs, and so on and so forth (until we get below the value/bug
threshold where it's not worth expending the CPU cycles to find the
next bug).

like self-hosted Clang
mis-compiling a long-running test which sometimes fails. They can
produce noise, but when the noise is correct, we really need to listen
to it. Writing software to understand that is non-trivial.

Again, not sure which kind of noise you're referring to here - it'd be
helpful to clarify/disambiguate. Flaky or often-red results on slow
buildbots without enough resources (long blame lists) are pretty easily
ignored ("oh, it could be any of those 20 other people's patches, I'll just
ignore it - someone else will do the work & tell me if it's my fault").

So, the idea here is to have a few special treatments for each type of
problem.

But the key is that these are problems that need to be addressed - and
arguably, until they are addressed, these bots should only report to the owner, not
to contributors. (as above - if people generally ignore them because
they're not accurate enough to believe that it's 'your' fault, then they
essentially are already leaving it to the owner to do the investigation -
they just have extra email they have to ignore too, let's remove the email
so that we can make those we send more valuable by not getting lost in the
noise)

For example, slow bots need more hardware to reduce the blame
list.

Definitely ^.

Unstable bots need more work to reduce spurious noise to a
minimum (see 2.x below), and red bots *must* remain *silent* until
they come back to green (see 2.x below).

As I mentioned on IRC/other threads - having red bots, even if they don't
send email, does come at some cost. It makes dashboards hard to read. So
for those trying to get a sense of the overall state of the project (what's
on fire/what needs to be investigated) this can be problematic. Having
issues XFAILed (with a bug filed, or someone otherwise owning the issue
until the XFAIL is removed) or reverted aggressively or having bots moved
into a separate group so that there's a clear "this is the stuff we should
expect to be green all the time" group that can be eyeballed quickly, is
nice.

What we *don't* want is to disable or silence them after they're
green. Most of the bugs they find are hard to debug, so the longer we
take to fix it the harder it is to find out what happened. We need to
know as soon as possible when they break.

I still question whether these bots provide value to the community as a
whole when they send email. If the investigation usually falls to the
owners rather than the contributors, then the emails they send (& their
presence on a broader dashboard) may not be beneficial.

So to be actionable they need to have small blame lists and be reliable
(low false positive rate). If either of those is compromised, investigation
will fall to the owner and ideally they should not be present in email/core
dashboard groups.

  1.3. Two of: Slow || Unstable || Often Red

These bots are normally only important to their owners, and they are
on the verge of being disabled.

I don't think they have to be on the verge of being disabled - so long as
they don't send email and are in a separate group, I don't see any problem
with them being on the main llvm buildbot. (no particular benefit either, I
suppose - other than saving the owner the hassle of running their own
master, which is fine)

The only way to cope with those bots
is to completely disable their emails / IRC messages, so that no one
gets flooded with noise from broken bots.

Yep

However, some bots in the 1.2 category fall into this one for short
periods of time (~1 week), so we need to be careful with what we
disable here. That's the key baby/bathwater issue.

Any hard policy here will be wrong for some bots some of the time, so
I'd love it if we could all just trust the bot owners a bit when they say
they're fixing the issue.

It's not a question of trust, from my perspective - regardless of whether
they will address the issue or not, the emails add noise and decrease the
overall trust developers have in the signal (via email, dashboards and IRC)
from the buildbots.

If an issue is being investigated we have tools to deal with that: XFAIL,
revert, and buildbot reconfig (we could/should check if the reconfig for
email configuration can be done without a restart - yes, it still relies on
a buildbot admin to be available (perhaps we should have more people
empowered to reconfig the buildmaster to make this cheaper/easier) but
without the interruption to all builds).

If there's enough hardware that blame lists are small and the bot is
reliable, then reverts can happen aggressively. If not, XFAIL is always an
option too.

However, if a bot falls here for more than a month, or more often than
a few times over a few months (I'm being vague on purpose), then we
collectively decide to disable it.

What I *don't* want is any two or three guys deciding to disable the
buildbot of someone else because they can't stand the noise. Remember,
people do take holidays once in a while, and they may be in the Amazon
or the Sahara having well deserved rest. Returning to work and
learning that all your bots are disabled for a week is not nice.

I disagree here - if the bots remain red, they should be addressed. This is
akin to committing a problematic patch before you leave - you should
expect/hope it is reverted quickly so that you're not interrupting
everyone's work for a week.

If your bot is not flakey and has short blame lists, I think it's possibly
reasonable to expect that people should revert their patches rather than
disable the bot or XFAIL the test on that platform. But without access to
hardware it may be hard for them to investigate the failure - XFAIL is
probably the right tool, then when the owner is back they can provide a
reproduction, extra logs, help remote-debug it, etc.

So far, we have coped with noise, and the result is that people tend
to ignore those bots, which means more work to the bot owner.

The problem is, that work doesn't only fall on the owners of the bots which
produce the noise. It falls on all bot owners because developers become
immune/numb to bot failure mail to a large degree.

This is
not a good situation, and we want to move away from it, but we
shouldn't flip all switches off by default. We can still be pragmatic
about this as long as we improve the quality overall (see 2.x below)
with time.

In summary, bots that fall here for too long will have their emails
disabled and candidates for removal in the next spring clean-up, but
not immediately.

  1.4. Slow && Unstable && Red

These bots don't belong here. They should be moved elsewhere,
preferably to a local buildmaster that you can control and that will
never email people or upset our master if you need changes. I have
such a local master myself and it's very easy to setup and maintain.

Yep - bots that are only useful to the owner (some of the situations above
I think constitute this situation, but anyway) shouldn't email/show up in
the main buildbot group. But I wouldn't mind if we had a separate grouping
in the dashboards for these bots (I think we have an experimental group
which is somewhat like this). No big deal either way to me. If they're not
sending mail/IRC messages, and they're not in the main group on the
dashboard, I'm OK with it.

They *do* have value to *you*, for example to show the progress of
your features cleaning up the failures, or generating some benchmark
numbers, but that's something that is very specific to your project
and should remain separated.

Any of these bots in LLVM Lab should be moved away / removed, but on
consensus, including the bot owner if he/she is still available in the
list.

   2. Test stability issues

These issues, as you may have noticed from the links above, apply to
*all* bots. The less noise we have overall, the lower will be our
threshold for kicking bots out of the critical pool, and the higher
the value of the not-so-perfect buildbots to the rest of the
community.

I'm not quite sure I follow this comment. The less noise we have, the
/more/ problematic any remaining noise will be (because it'll be costing us
more relative to no-noise - when we have lots of noise, any one specific
source of noise isn't critical, we can remove it but it won't change much -
when there's a little noise, removing any one source substantially
decreases our false positive rate, etc)

  2.1 Failed vs Exception

The most critical issue we have to fix is the "red -> exception ->
red" issue. Basically, a bot is red (because you're still
investigating the problem), then someone restarts the master, so you
get an exception. The next build will be a failure, and the
buildmaster recognises the status change and emails everyone. That's
just wrong.

We need to add an extra check to that logic where it searches down for
the next non-exceptional status and compares to that, not just the
immediately previous result.

This is a no-brainer and I don't think anyone would be against it. I
just don't know where this is done, I welcome the knowledge of more
experienced folks.

Yep, sounds like we might be able to have Galina look into that. I have no
context there about where that particular behavior might be (whether it's
in the buildbot code itself, or in the user-provided buildbot
configuration, etc).

  2.2 Failure types

The next obvious thing is to detect what the error is. If it's an SVN
error, we *really* don't need to get an email.

Depends on the error - if it's transient, then this is flakiness as always
& should be addressed as such (by trying to remove/address the flakes).
Though, yes, this sort of failure should, ideally, probably, go to the
buildbot owner but not to users.

But this raises the
problem that an SVN failure followed by a genuine failure will not be
reported. So, the reporting mechanism also has to know what's the
previously *reported* failure, not just the previous failure.

Other failures, like timeout, can be either flaky hardware or broken
codegen. A way to be conservative and low noise would be to only warn
on timeouts IFF it's the *second* in a row.

I don't think this helps - this reduces the incidence, but isn't a real
solution. We should reduce the flakiness of hardware. If hardware is this
unreliable, why would we be building a compiler for it? No user could rely
on it to produce the right answer. (& again, if the flakiness is bad enough
- I think that goes back to an owner-triaged bot, one that doesn't send
mail, etc)

For all these adjustments, we'll need some form of walk-back on the
history to find the previous genuine result, and we'll need to mark
results with some metadata. This may involve some patches to buildbot.

Yeah, having temporally related buildbot results seems dubious/something
I'd be really cautious about.

  2.3 Detecting fixed bots

Another interesting feature, that is present in the "GreenBot" is a
warning when a bot you broke was fixed. That, per se, is not a good
idea if the noise levels are high, since this will probably double it.

So, this feature can only be introduced *after* we've done the clean
ups above. But once it's clean, having a "green" email will put the
minds of everyone that haven't seen the "red" email yet to rest, as
they now know they don't even need to look at it at all, just delete
the email.

For those using fetchmail, I'm sure you could create a rule to do that
automatically, but that's optional. :)

Yeah, I don't know what the right solution is here at all - but it
certainly would be handy if there were an easier way to tell if an issue
has been resolved since your commit.

I imagine one of the better options would be some live embedded HTML that
would just show a green square/some indicator that the bot has been green
at least once since this commit.

(that doesn't help if you introduced a flaky test, though... - that's
harder to deal with/convey to users, repeated test execution may be
necessary in that case - that's when temporal information may be useful)

  2.4 Detecting new failures

This is a wish-list that I have, for the case where the bots are slow
and hard to debug and are still red. Assuming everything above is
fixed, they will emit no noise until they go green again, however,
while I'm debugging the first problem, others can appear. If that
happens, *I* want to know, but not necessarily everyone else.

This seems like the place where XFAIL would help you and everyone else. If
the original test failure was XFAILed immediately, the bot would go green,
then red again if a new failure was introduced. Not only would you know,
but so would the author of the change.

These buildbots normally finish under one hour, but most of the time
under 1/2 hour and should be kept green as much as possible.
Therefore, any reasonable noise

Not sure what kind of noise you're referring to here. Flaky fast builders
would be a bad thing, still - so that sort of noise should still be
questioned.

Sorry, I meant "noise" as in "sound", not as opposed to "signal".

These bots are assumed stable, otherwise they would be in another
category below.

I'm not sure if we need extra policy here - but I don't mind documenting the
common community behavior here to make it more clear.

Some people in the community behave very differently from others. I
sent this email because I felt we disagree on some fundamental
properties of the buildbots, and until we agree on a common
strategy, there is no consensus or "common behaviour" to be
documented.

However, I agree, we don't need "policy", just "documented behaviour"
as usual. That was my intention when I said "policy".

Long bisection is a function of not enough boards (producing large revision
ranges for each run), generally - no? (or is there some other reason?)

It's not that simple. Some bugs appear after several iterations of
green results. It may sound odd, but I had at least three this year.

These are the hardest bugs to find and usually standard regression
scripts can't find them automatically, so I have to do most of the
investigation manually. This takes *a lot* of time.

Generally all bots catch serious bugs.

That's not what I meant. Quick bots catch bad new tests (over-assuming
on CHECK lines, forgetting to specify the triple on RUN lines) as well
as simple code issues (32 vs 64 bits, new vs old compiler errors,
etc), just because they're the first to run on a different environment
than the developer uses. Slow bots are most of the time buffered
against those, since patches and fixes (or reverts) tend to come in
bundles, while the slow bot is building.

like self-hosted Clang
mis-compiling a long-running test which sometimes fails. They can
produce noise, but when the noise is correct, we really need to listen
to it. Writing software to understand that is non-trivial.

Again, not sure which kind of noise you're referring to here - it'd be
helpful to clarify/disambiguate.

Noise here is less "sound" and more "noisy signal". Some of the
"noise" from these bots is just noise; some of it is signal
masquerading as noise.

Of course, the higher the noise level, the harder it is to interpret
the signal, but as is usual in science, sometimes the only signal we
have is a noisy one.

It's common for mathematicians to scoff at the physicists' lack of
precision, as it is for physicists to do the same to chemists, then
biologists, etc. When you're on top, it seems folly that some people
endure large amounts of noise in their signal, but when you're at the
bottom and your only signal has a lot of noise, you have to work with
it and make do with what you have.

As I said above, it's not uncommon for a failure to "pass" the tests
for a few iterations before failing. So, we're not talking *only*
about hardware noise, but also about noise at the code level, from
assumptions based on the host architecture that might not be valid on
other architectures. Most of us develop on x86 machines, so it's only
logical that PPC, MIPS and ARM buildbots will fail more often than x86
ones. But that's precisely the point of having those bots in the first
place.

Requesting to disable those bots because they generate noise is the
same as asking people to give their opinion about a product, show the
positive reviews, and sue the rest.

But they are problems that need to be addressed, is the key - and arguably,
until they are addressed, these bots should only report to the owner, not to
contributors.

If we hadn't already had those bots for many years, and if we had
another way of testing on those architectures, I'd agree with you. But
we don't.

I agree we need to improve. I agree it's the architecture specific
community's responsibility to do so. I just don't agree that we should
disable all noise (with signal, baby/bath) until we do so.

By the time we get there, all sorts of problems will have crept in,
and we'll enter a vicious cycle. Been there, done that.

I still question whether these bots provide value to the community as a
whole when they send email. If the investigation usually falls to the owners
rather than the contributors, then the emails they send (& their presence on
a broader dashboard) may not be beneficial.

Benefit is a spectrum. People have different thresholds. Your
threshold is tougher than mine because I'm used to working in an
environment where the noise is almost as loud as the signal.

I don't think we should be bound to either of our thresholds; that's
why I'm opening the discussion to come up with a migration plan to
produce less noise. But that plan doesn't include killing bots just
because they annoy people.

If you plot a function of value ~ noise OP benefit, you have a surface
with maxima and minima. Your proposal is to set a threshold and cut
all the bots that fall on those minima that are below that line. My
proposal is to move all those bots as high as we can and only then,
cut the bots that didn't make it past the threshold.

So to be actionable they need to have small blame lists and be reliable (low
false positive rate). If either of those is compromised, investigation will
fall to the owner and ideally they should not be present in email/core
dashboard groups.

Ideally, this is where both of us want to be. Realistically, it'll
take a while to get there.

We need changes in the buildbot area, but there are also inherent
problems that cannot be solved.

Any new architecture (like AArch64) will have only experimental
hardware for years, and later on, experimental kernel, then
experimental tools, etc. When developing a new back-end for a
compiler, those unstable and rapidly evolving environments are the
*only* thing you have to test on.

You normally only have one or two (experimental means either *very*
expensive or priceless), so having multiple boxes per bot is highly
unlikely. It can also mean that the experimental device you got last
month is not supported any more because a new one is coming, so you'll
have to live with those bugs until you get the new one, which will
come with its own bugs.

For older ARM cores (v7), this is less of a problem, but since old ARM
hardware was never designed for production use, its flakiness is
inherent in its form factor. It is possible to get these boards into a
stable-enough configuration, but it takes time, resources, excess
hardware and people constantly monitoring the infrastructure. We're
getting there, but we're not there yet.

I agree that this is mostly *my* problem and *I* should fix it, and
believe me I *want* to fix it, I just need a bit more time. I suspect
that the other platform folks feel the same way, so I'd appreciate a
little more respect when we talk about acceptable levels of noise and
effort.

I disagree here - if the bots remain red, they should be addressed. This is
akin to committing a problematic patch before you leave - you should
expect/hope it is reverted quickly so that you're not interrupting
everyone's work for a week.

Absolutely not!

Committing a patch and going on holidays is a disrespectful act. Bot
maintainers going on holidays is an inescapable fact.

Silencing a bot while the maintainer is away is a possible way around
it, but disabling it is most disrespectful.

However, I'd like to remind you of the confirmation bias problem,
where people will look at the bot, think it's noise, and silence it
when they could have easily fixed it. Later on, when the owner gets
back to work, surprise new bugs that weren't caught will fill the first
weeks. We have to be extra careful when taking actions without the bot
owners' knowledge.

I'm not quite sure I follow this comment. The less noise we have, the /more/
problematic any remaining noise will be

Yes, I meant what you said. :)

Less noise, higher bar to meet.

Depends on the error - if it's transient, then this is flakiness as always &
should be addressed as such (by trying to remove/address the flakes).
Though, yes, this sort of failure should, ideally, probably, go to the
buildbot owner but not to users.

Ideally, SVN errors should go to the site admins, but let's not get
ahead of ourselves. :)

Other failures, like timeout, can be either flaky hardware or broken
codegen. A way to be conservative and low noise would be to only warn
on timeouts IFF it's the *second* in a row.

I don't think this helps - this reduces the incidence, but isn't a real
solution.

I agree.

We should reduce the flakiness of hardware. If hardware is this
unreliable, why would we be building a compiler for it?

Because that's the only hardware that exists.

No user could rely on it to produce the right answer.

No user is building trunk at every commit (ish). Buildbots are not
meant to be as stable as a user (including distros) would require.
That's why we have extra validation on releases.

Buildbots build potentially unstable compilers, otherwise we wouldn't
need buildbots in the first place.

For all these adjustments, we'll need some form of walk-back on the
history to find the previous genuine result, and we'll need to mark
results with some metadata. This may involve some patches to buildbot.

Yeah, having temporally related buildbot results seems dubious/something I'd
be really cautious about.

This is not temporal; it's just a matter of regarding an exception as
no-change instead of success.

The only reason it's treated as success right now is that, since we're
set up to email on every failure, we don't want to spam people when the
master is reloaded.

That's the wrong meaning for the wrong reason.

I imagine one of the better options would be some live embedded HTML that
would just show a green square/some indicator that the bot has been green at
least once since this commit.

That would be cool! But I suspect at the cost of a big change in the
buildbots. Maybe not...

This is a wish-list that I have, for the case where the bots are slow
and hard to debug and are still red. Assuming everything above is
fixed, they will emit no noise until they go green again, however,
while I'm debugging the first problem, others can appear. If that
happens, *I* want to know, but not necessarily everyone else.

This seems like the place where XFAIL would help you and everyone else. If
the original test failure was XFAILed immediately, the bot would go green,
then red again if a new failure was introduced. Not only would you know, but
so would the author of the change.

I agree in principle. I just worry that it's a lot easier to add an
XFAIL than to remove it later.

Though, it might be just a matter of documenting the common behaviour
and expecting people to follow through.

cheers,
--renato

>> These buildbots normally finish under one hour, but most of the time
>> under 1/2 hour and should be kept green as much as possible.
>> Therefore, any reasonable noise
>
> Not sure what kind of noise you're referring to here. Flaky fast builders
> would be a bad thing, still - so that sort of noise should still be
> questioned.

Sorry, I meant "noise" as in "sound", not as opposed to "signal".

These bots are assumed stable, otherwise they would be in another
category below.

> I'm not sure if we need extra policy here - but I don't mind documenting
> the common community behavior here to make it more clear.

Some people in the community behave very differently from others. I
sent this email because I felt we disagree on some fundamental
properties of the buildbots, and until we agree on a common
strategy, there is no consensus or "common behaviour" to be
documented.

However, I agree, we don't need "policy", just "documented behaviour"
as usual. That was my intention when I said "policy".

> Long bisection is a function of not enough boards (producing large
> revision ranges for each run), generally - no? (or is there some other
> reason?)

It's not that simple. Some bugs appear after several iterations of
green results. It may sound odd, but I had at least three this year.

These are the hardest bugs to find and usually standard regression
scripts can't find them automatically, so I have to do most of the
investigation manually. This takes *a lot* of time.

Flakey failures, yes. I'd expect an XFAIL while it's under investigation,
for sure (or notifications forcibly disabled, if that works better/is
necessary). Because flakey failures produce un-suppressable noise on the
bot (because they're not a continuous run of red).

> Generally all bots catch serious bugs.

That's not what I meant. Quick bots catch bad new tests (over-assuming
on CHECK lines, forgetting to specify the triple on RUN lines) as well
as simple code issues (32 vs 64 bits, new vs old compiler errors,
etc), just because they're the first to run on a different environment
than the developer uses. Slow bots are most of the time buffered
against those, since patches and fixes (or reverts) tend to come in
bundles, while the slow bot is building.

>> like self-hosted Clang
>> mis-compiling a long-running test which sometimes fails. They can
>> produce noise, but when the noise is correct, we really need to listen
>> to it. Writing software to understand that is non-trivial.
>
> Again, not sure which kind of noise you're referring to here - it'd be
> helpful to clarify/disambiguate.

Noise here is less "sound" and more "noisy signal". Some of the
"noise" from these bots is just noise; some of it is signal
masquerading as noise.

Of course, the higher the noise level, the harder it is to interpret
the signal, but as it's usual in science, sometimes the only signal we
have is a noisy one.

It's common for mathematicians to scoff at the physicists' lack of
precision, as it is for physicists to do the same to chemists, then
biologists, etc. When you're on top, it seems folly that some people
endure large amounts of noise in their signal, but when you're at the
bottom and your only signal has a lot of noise, you have to work with
it and make do with what you have.

As I said above, it's not uncommon for a failure to "pass" the tests
for a few iterations before failing.

That would be flakey - yes, there are many sources of flakeyness (& if it
passed a few times before it failed, it probably won't fail regularly now -
it'll continue to oscillate back and forth between passing and failing,
producing a lot of notification noise/spam). Flakeyness should be
addressed, for sure - XFAIL or suppress bot notifications while
investigating, etc.

So, we're not talking
*only* at hardware noise, but also at the code level, which had
assumptions based on the host architecture that might not be valid on
other architectures. Most of us develop on x86 machines, so it's only
logical that PPC, MIPS and ARM buildbots will fail more often than x86
ones. But that's precisely the point of having those bots in the first
place.

Requesting to disable those bots because they generate noise is the
same as asking people to give their opinion about a product, show the
positive reviews, and sue the rest.

When I suggest someone disable notifications from a bot, it's because
those notifications aren't actionable by those receiving them. It's not
a suggestion that the platform is unsupported, that the bot should be
turned off, or that the issues are not real. It is a suggestion that
the notifications are only relevant to the owner/persons invested in
that platform, and that a level of triage from them may be necessary or
otherwise appropriate.

> But the key is that these are problems that need to be addressed - and
> arguably, until they are addressed, these bots should only report to the
> owner, not to contributors.

If we didn't have those bots already for many years, and if we had
another way of testing on those architectures, I'd agree with you. But
we don't.

I'm not suggesting removing the testing. Merely placing the onus of
responding to/investigating notifications on the parties with the
context to do so. Long blame lists and flakey results on inaccessible
hardware generally amount to unactionable results except for the person
who owns/is invested in the architecture.

I agree we need to improve. I agree it's the architecture specific
community's responsibility to do so. I just don't agree that we should
disable all noise (with signal, baby/bath) until we do so.

If the results aren't actionable by the people receiving them, that's a
bug & we should fix it pretty much immediately. If the
architecture-specific community can then produce automated actionable
results, great. Until then I don't think it's a huge cost to say that
that community can do the first-level triage.

In cases where the triage is cheap, this shouldn't be a big deal for
the bot owner to do - and when the triage is expensive, well, that's
the point: imposing that triage on the community at large (especially
with large blame lists) doesn't seem to work.

By the time we get there, all sorts of problems will have crept in,
and we'll enter a vicious cycle. Been there, done that.

Why would that happen? All I'd expect is that you/others watch the
negative bot results, and forward on any that look like actionable true
positives. If that's too expensive, then I don't know how you can
expect community members to incur that cost instead of bot owners.

> I still question whether these bots provide value to the community as a
> whole when they send email. If the investigation usually falls to the
owners
> rather than the contributors, then the emails they send (& their
presence on
> a broader dashboard) may not be beneficial.

Benefit is a spectrum. People have different thresholds. Your
threshold is tougher than mine because I'm used to working in an
environment where the noise is almost as loud as the signal.

I don't think we should be bound to either of our thresholds, that's
why I'm opening the discussion to have a migration plan to produce
less noise. But that plan doesn't include killing bots just because
they annoy people.

Again: if the notifications are going to people who can't act on them,
we should disable them; otherwise people will not have confidence in
the positive signals and we lose value overall.

Disabling the bot is not the only solution - we can simply disable the
notifications for contributors and have them go to the bot owner first
for triage; the owner can then forward the notification on to the
contributor if it's a good true positive. I assume bot owners are
already doing this work - if they care about their platform, presumably
they're watching for platform-specific failures and often having to
follow up with contributors, because so many of them ignore the mails
today due to their unactionable nature. So I'm not really asking for
any more work from the owners of these bots, if they care about the
results & people are already ignoring them.

I'm just asking to remove the un-actioned notifications to increase
confidence in our notifications.

If you plot a function of value ~ noise OP benefit, you have a surface
with maxima and minima. Your proposal is to set a threshold and cut
all the bots that fall on those minima that are below that line. My
proposal is to move all those bots as high as we can and only then,
cut the bots that didn't make it past the threshold.

The problem with that is that people continue to lose confidence in
the bots (especially new contributors) the longer we maintain the
current state (& I don't have a great deal of confidence in the
timespan this will take to get automated high quality results from all
the current bots). Once people lose confidence in the bots, they're not
likely to /gain/ confidence again - they'll start ignoring them and not
have any reason to re-evaluate that situation in the future. That's my
usual approach too, but I recently decided to re-evaluate it & be
verbose about it. Most people aren't verbose (or militant, as you put
it) because they're already ignoring all of this. That's a /bad/ thing.

I would like to set the bar high: that bot notifications must be high
quality, and if they aren't, that we disable them aggressively. This places
the onus on the owner to improve the quality before turning the
notifications (back) on. Rather than incurring a distributed cost over the
whole project (that may have a long term effect) while we wait for the
quality to improve.

> So to be actionable they need to have small blame lists and be
> reliable (low false positive rate). If either of those is compromised,
> investigation will fall to the owner and ideally they should not be
> present in email/core dashboard groups.

Ideally, this is where both of us want to be. Realistically, it'll
take a while to get there.

We need changes in the buildbot area, but there are also inherent
problems that cannot be solved.

Any new architecture (like AArch64) will have only experimental
hardware for years, and later on, experimental kernel, then
experimental tools, etc. When developing a new back-end for a
compiler, those unstable and rapidly evolving environments are the
*only* thing you have to test on.

You normally only have one or two (experimental means either *very*
expensive or priceless), so having multiple boxes per bot is highly
unlikely. It can also mean that the experimental device you got last
month is not supported any more because a new one is coming, so you'll
have to live with those bugs until you get the new one, which will
come with its own bugs.

For older ARM cores (v7), this is less of a problem, but since old ARM
hardware was never designed as production machines, their flakiness is
inherent of their form factor. It is possible to get them on a
stable-enough configuration, but it takes time, resources, excess
hardware and people constantly monitoring the infrastructure. We're
getting there, but we're not there yet.

I agree that this is mostly *my* problem and *I* should fix it, and
believe me I *want* to fix it, I just need a bit more time. I suspect
that the other platform folks feel the same way, so I'd appreciate a
little more respect when we talk about acceptable levels of noise and
effort.

I'm sorry if I've come across as disrespectful. I do appreciate that a
whole bunch of people care about a whole bunch of different things. Even I
fall prey to the same situation - the GDB buildbot is flakey. I let it go,
but I really should fix the flakes - and I wouldn't mind community pressure
to do so. But partly, as I mentioned in my previous reply, existing noise
levels make it less compelling to fix small amounts of noise (tragedy of
the commons, etc.). The lower the noise level, the more important it
becomes, and the more pressure we'll have, to keep the remaining sources
of noise down.

> I disagree here - if the bots remain red, they should be addressed. This is
> akin to committing a problematic patch before you leave - you should
> expect/hope it is reverted quickly so that you're not interrupting
> everyone's work for a week.

Absolutely not!

Committing a patch and going on holidays is a disrespectful act. Bot
maintainers going on holidays is an inescapable fact.

Silencing a bot while the maintainer is away is a possible workaround,
but disabling it is most disrespectful.

However, I'd like to remind you of the confirmation bias problem, where
people will look at the bot, think it's noise, and silence the bot when
they could have easily fixed it. Later on, when the owner gets back to
work, new bugs that weren't caught will fill the first weeks. We have to
be extra careful when taking actions without the bot owners' knowledge.

I'm looking at the existing behavior of the community - if people are
generally ignoring the result of a bot anyway (& if it's red for weeks at a
time, I think they are) then the notifications are providing no value. The
bot only provides value to the owner - who triages the failures, then
reaches out to the community to provide the reproduction/assist developers
with a fix.

All I want to do is remove notifications people aren't acting on anyway.

> I'm not quite sure I follow this comment. The less noise we have, the /more/
> problematic any remaining noise will be

Yes, I meant what you said. :)

Less noise, higher bar to meet.

> Depends on the error - if it's transient, then this is flakiness as always &
> should be addressed as such (by trying to remove/address the flakes).
> Though, yes, this sort of failure should, ideally, probably, go to the
> buildbot owner but not to users.

Ideally, SVN errors should go to the site admins, but let's not get
ahead of ourselves. :)

>> Other failures, like timeout, can be either flaky hardware or broken
>> codegen. A way to be conservative and low noise would be to only warn
>> on timeouts IFF it's the *second* in a row.
>
> I don't think this helps - this reduces the incidence, but isn't a real
> solution.

I agree.

> We should reduce the flakiness of hardware. If hardware is this
> unreliable, why would we be building a compiler for it?

Because that's the only hardware that exists.

> No user could rely on it to produce the right answer.

No user is building trunk every commit (ish). Buildbots are not meant
to be as stable as a user (including distros) would require.

I disagree with this - I think it's a worthy goal to have continuous
validation that is more robust and comprehensive. At some point the compute
resource cost is not worth the bug-finding rate - and we run those tasks
less frequently. But none of that excuses instability. I'd expect infrequent
validation to be acceptably less stable than frequent validation.

Extra validation before release is for work that takes too long (& has a
lower chance of finding bugs) to run on every change/frequently.

That's why we have extra validation on releases.

Buildbots build potentially unstable compilers,

Potentially - though flakey behavior in the compiler isn't /terribly/
common. It does happen, for sure. More often I see flakey tests/test
infrastructure - which reduces confidence in the quality of the
infrastructure, causing people to ignore true positives due to the high
rate of false positives.

otherwise we wouldn't
need buildbots in the first place.

>> For all these adjustments, we'll need some form of walk-back on the
>> history to find the previous genuine result, and we'll need to mark
>> results with some metadata. This may involve some patches to buildbot.
>
> Yeah, having temporally related buildbot results seems dubious/something I'd
> be really cautious about.

This is not temporal, it's just regarding exception as no-change
instead of success.

red->exception->red I don't mind too much - the "timeout->timeout" example
you gave is one I disagree with.

The only reason it's treated as success right now is that, the way
we're set up to email on every failure, we don't want to spam people
when the master is reloaded.

That's the wrong meaning for the wrong reason.

> I imagine one of the better options would be some live embedded HTML that
> would just show a green square/some indicator that the bot has been green at
> least once since this commit.

That would be cool! But I suspect it would come at the cost of a big
change to the buildbots. Maybe not...

Yeah, not sure how expensive it'd be.

>> This is a wish-list that I have, for the case where the bots are slow
>> and hard to debug and are still red. Assuming everything above is
>> fixed, they will emit no noise until they go green again, however,
>> while I'm debugging the first problem, others can appear. If that
>> happens, *I* want to know, but not necessarily everyone else.
>
> This seems like the place where XFAIL would help you and everyone else. If
> the original test failure was XFAILed immediately, the bot would go green,
> then red again if a new failure was introduced. Not only would you know, but
> so would the author of the change.

I agree in principle. I just worry that it's a lot easier to add an
XFAIL than to remove it later.

How so? If you're actively investigating the issue, and everyone else is
happily ignoring the bot result (& so won't care when it goes green, or red
again) - you're owning the issue to get your bot back to green, and it just
means you have to un-XFAIL it as soon as that happens.

Hi David,

I think we're repeating ourselves here, so I'll reduce to the bare
minimum before replying.

When I suggest someone disable notifications from a bot it's because those
notifications aren't actionable to those receiving them.

This is a very limited view of the utility of buildbots.

I think part of the problem is that you're expecting to get instant
value out of something that cannot provide that to you. If you can't
extract value from it, it's worthless.

Also, it seems, you're associating community buildbots with company
testing infrastructure. When I worked at big companies, there were
validation teams that would test my stuff and deal with *any* noise on
their own, and only the real signal would come to me: 100% actionable.
However, most of the bot owners in open source communities do this as
a secondary task. This has always been the case, and until someone
(LLVM Foundation?) starts investing in a better infrastructure overall
(multi-master, new slaves, admins), there isn't much we can do to
improve things quickly enough.

The alternative is that the less common architectures will always have
noisier bots, because fewer people use them day-to-day during their
development time. Taking a hard line on those means that, in the long
run, we'll disable most testing on all secondary architectures, and
LLVM becomes an Intel compiler. But many companies use LLVM as their
production compiler on their own targets, so the inevitable outcome is
that they will *fork* LLVM. I don't think anyone wants that.

I'm not suggesting removing the testing. Merely placing the onus of
responding to/investigating notifications on the parties with the
context to do so.

You still don't get the point. This would make sense in a world where
all parties are equal.

Most people develop and test on x86, even ARM and MIPS engineers. That
means x86 is almost always stable, no matter who's working.

But some bugs that we had to fix this year show up randomly *only* on
ARM. One of them was a serious misuse of the Itanium C++ ABI, one that
took a long time to fix, and we still don't know if we got them all.

Bugs like that normally only show up in self-hosting builds, sometimes
only in the test-suite compiled by a self-hosted Clang. These bugs have
no hard good/bad line for bisecting, they take hours per cycle, and
they may or may not fail, so automated bisecting won't work.
Furthermore, there is nothing to XFAIL in this case, unless you want to
disable building Clang, which I don't think you do.

While it's taking days, if not weeks, to investigate this bot, the
status may be going from red to green to red. It would be very
simplistic to assume that *any* green->red transition while I'm
bisecting the problem will be due to the current known instability. It
could be anything, and developers still need to be warned if the alarm
goes off.

The result may be that it's still flaky, the developer can't do much,
and life goes on. Or it could be their own test, which they fix
immediately, and I'm eternally grateful, because I still need to
investigate *only one* bug at a time. By silencing the bot, I'd have to
be responsible for debugging the original hard problem plus any other
that would come up while the bot was flaky.

Now, there's the issue of where the responsibility lies...

I'm responsible for the quality of the ARM code, including the
buildbots. What you're suggesting is that *no matter what* gets
committed, it is *my* responsibility to fix any bug that the original
developers can't *action upon*.

That might seem sensible at first, but the biggest problem here is the
term that you're using over and over again: *acting upon*. It can be a
technical limitation that stops you acting upon a bug on an ARM bot,
but it can also be a personal one. I'm not saying *you* would do that,
but we have plenty of people in the community with plenty of their own
problems. You said it yourself: people tend to ignore problems that
they can't understand, but not understanding is *not* the same as not
being able to *act upon* them.

For me, that attitude is what's at the core of the problem here. By
raising the bar faster than we can make it better, you're essentially
just giving people the right not to care. The bar will be raised even
further by peer pressure, and that's the kind of behaviour that leads
to a fork. I'm trying to avoid this at all costs.

All I'd expect is that you/others watch the negative bot results, and
forward on any that look like actionable true positives. If that's too
expensive, then I don't know how you can expect community members to
incur that cost instead of bot owners.

Another example of the assumption that bot owners are validation
engineers and that's their only job. It was never like this in LLVM
and it won't start today just because we want it to.

My expectation of the LLVM Foundation is that they would take our
validation infrastructure to the next level, but so far I haven't seen
much happening. If you want to make it better, instead of forcing your
way onto the existing scenario, why not work with the Foundation to move
this to the next level?

Once people lose
confidence in the bots, they're not likely to /gain/ confidence again -

That's not true. Galina's Panda bots were unstable in 2010, people
lost confidence, she added more boards, and people re-gained confidence
in 2011. Then they became unstable in 2013, people lost confidence, we
fixed the issues, and people re-gained confidence only a few months
later. This year they got unstable again, but because we already have
enough ARM bots elsewhere, she disabled them for good.

You're exaggerating the effects of unstable bots as if people expected
them to be always perfect. I'd love if they could be, but I don't
expect them to be.

I'm looking at the existing behavior of the community - if people are
generally ignoring the result of a bot anyway (& if it's red for weeks at a
time, I think they are) then the notifications are providing no value.

I'm not seeing that, myself. So far, you're the only one that is
shouting out loud that this or that bot is noisy.

Sometimes people ignore bots, but I don't take this as a sign that
everything is doomed, just that people focus on different things at
different times.

No user is building trunk every commit (ish). Buildbots are not meant
to be as stable as a user (including distros) would require.

I disagree with this - I think it's a worthy goal to have continuous
validation that is more robust and comprehensive.

A worthy goal, yes. Doable right now, with the resources that we have,
no. And no amount of shouting will get this done.

If we want quality, we need top-level management, preferably from the
LLVM Foundation, and a bunch of dedicated people working on it, which
could be either funded by the Foundation or agreed between the
interested parties. If anyone ever gets this conversation going (I
tried), please let me know, as I'm very interested in making that
happen.

red->exception->red I don't mind too much - the "timeout->timeout" example
you gave is one I disagree with.

Ah, yes. I mixed them up.

I agree in principle. I just worry that it's a lot easier to add an
XFAIL than to remove it later.

How so? If you're actively investigating the issue, and everyone else is
happily ignoring the bot result (& so won't care when it goes green, or red
again) - you're owning the issue to get your bot back to green, and it just
means you have to un-XFAIL it as soon as that happens.

From my experience, companies put people to work on open source
projects when they need something done and don't want to bear the
costs of maintaining it later.

So, initially, developers have a high pressure of pushing their
patches through, and you see them very excited in addressing the
review comments, adding tests, fixing bugs.

But once the patch is in, the priority of that task, for that company,
is greatly reduced. Most developers consider investigating an XFAIL
from their commit as important as the commit itself, but most companies
don't necessarily share the same passion.

Moreover, once developers implement whatever they needed here, it's
not uncommon for their parent companies to move them away from the
project, in which case they can't even contribute any more due to
license issues, etc.

But we also have the not-so-responsible developers, who could file a
bug, assign it to themselves, and never look back unless someone
complains.

That's why, at Linaro, I have a policy of only marking a test XFAIL
when I can guarantee that either it's not supposed to work, or the
developer will fix it *before* marking the task closed.

cheers,
--renato

I pointed this out as a major issue when I first started with LLVM. But since nobody actually seemed interested in fixing it, I didn't keep making noise about it. I basically just ignore the failure notices from buildbot, because every commit seems to trigger multiple bogus failure notices, no matter what.

It's just so much easier to ignore those notices than to get involved in a debate about why sending out useless failure notices that have nothing to do with any of the commits being blamed is actively harmful -- *worse* than useless.

I don't know what the solution is, but it's got to somehow move towards trying to avoid blaming committers for already-known problems, or for infrastructure issues (e.g.: svn update failed? Why do I care?). It simply does not help to improve the quality of LLVM to have the buildbots send emails to committers of arbitrary patches when a bot that "everyone" already knows is flaky has failed yet again. *I* don't know which bots are "supposed" to be flaky, so if I actually bothered to fully investigate every such notice, that'd just be a massive waste of effort.

So I just ignore the notices. Sorry...

But since nobody actually seemed interested in fixing it, I didn't keep making noise about it. I basically just ignore the failure notices from buildbot, because every commit seems to trigger multiple bogus failure notices, no matter what.

That's not true, either.

We (buildbot owners and admins) are constantly reducing the noise by
adding more boards, investigating stability issues and disabling bots
temporarily when they're too noisy. We may not do it at the speed some
people expect, or to the extent that a fully supported validation team
in a big company would, but we do the best we can.

I don't know what the solution is, but it's got to somehow move towards trying to avoid blaming committers for already-known problems, or for infrastructure issues (e.g.: svn update failed? Why do I care?). It simply does not help to improve the quality of LLVM to have the buildbots send emails to committers of arbitrary patches when a bot that "everyone" already knows is flaky has failed yet again. *I* don't know which bots are "supposed" to be flaky, so if I actually bothered to fully investigate every such notice, that'd just be a massive waste of effort.

The alternative is worse: not testing.

The assumption is wrong: people *do* care, but the problem is harder
than it looks, and needs more than just the bot owner to improve.

I wish I had a magic wand... but I'm not expecting to ever have one.

cheers,
--renato

As a foreword: I haven’t read a lot of the thread here and it’s just a single developer talking here :)

But since nobody actually seemed interested in fixing it, I didn’t keep making noise about it. I basically just ignore the failure notices from buildbot, because every commit seems to trigger multiple bogus failure notices, no matter what.

That’s not true, either.

We (buildbot owners and admins) are constantly reducing the noise by
adding more boards, investigating stability issues and disabling bots
temporarily when they’re too noisy. We may not do it at the speed some
people expect, or to the extent that a fully supported validation team
in a big company would, but we do the best we can.

Any bot that people just ignore because it’s usually flaky isn’t worth having around. It’s basically just making people not pay attention to the ones that are reliable.

I don’t know what the solution is, but it’s got to somehow move towards trying to avoid blaming committers for already-known problems, or for infrastructure issues (e.g.: svn update failed? Why do I care?). It simply does not help to improve the quality of LLVM to have the buildbots send emails to committers of arbitrary patches when a bot that “everyone” already knows is flaky has failed yet again. I don’t know which bots are “supposed” to be flaky, so if I actually bothered to fully investigate every such notice, that’d just be a massive waste of effort.

The alternative is worse: not testing.

Absolutely agreed. That said, a flaky bot’s notifications should only go to the people that care about it until it’s considered stable for general use. I.e. if, when the bot tells us a target broke, I can rely on it being me (or someone in the recent commit history) who broke it, then the bot is useful.

Things that are still ok for people to pay attention to on occasion:

Timeouts
svn update failed

These are fairly rare (or should be), such that the noise caused by them isn’t too bad and falls under “a few false positives are better than a false negative here”.

Solution to a great deal of the failures on the slow bots:

It would be nice if we could get phased builders, so that the fast builders could quickly run native tests on, say, Linux, Darwin and Windows, and only after that would the more board-based testers run - that way they don’t fail so often due to “transient” failures that happen on a regular basis and get fixed in a quick follow-up commit.

I’ve cc’d Chris Matthews, who was originally working on getting the phased builders out in a usable fashion for the general community.

Any builders that Dave (and others) are complaining about whose stability wouldn’t be solved by the above should probably run as private bots until they get fixed; otherwise they’re not providing enough signal for the noise.

-eric

I recommend you do, then. Most of your arguments are similar to
David's and they don't take into account the difficulty of maintaining
non-x86 buildbots.

What you're both saying is basically the same as: We want all the cars
we care about in our garage, but only as long as they can race in F1.
However, you care about the whole range, from beetles to McLarens, but
are only willing to cope with the speed and reliability of the latter.
You'll end up with only McLarens in your garage. It just doesn't make
sense.

Also, very briefly, I want the same as both of you: reliability. But I
alone cannot guarantee that. And even with help, I can only get there
in months, not days. To get there, we need to *slowly* move towards
it, not drastically throw away everything that is not a McLaren and
only put them back when they're as fast as a McLaren. It just won't
happen, and the risk of a fork becomes non-trivial.

--renato

As a foreword: I haven’t read a lot of the thread here and it’s just a
single developer talking here :)

I recommend you do, then. Most of your arguments are similar to
David’s and they don’t take into account the difficulty of maintaining
non-x86 buildbots.

OK. I’ve now read the rest of the thread and don’t find any of the arguments compelling for keeping flaky bots around for notifications. I also don’t think that the x86-ness of it matters here. The powerpc64 and hexagon bots are very reliable.

What you’re both saying is basically the same as: We want all the cars
we care about in our garage, but only as long as they can race in F1.
However, you care about the whole range, from beetles to McLarens, but
are only willing to cope with the speed and reliability of the latter.
You’ll end up with only McLarens in your garage. It just doesn’t make
sense.

I think this is a poor analogy. You’re also ignoring the solution I gave you in my previous mail for slow bots.

Also, very briefly, I want the same as both of you: reliability. But I
alone cannot guarantee that. And even with help, I can only get there
in months, not days. To get there, we need to slowly move towards
it, not drastically throw away everything that is not a McLaren and
only put them back when they’re as fast as a McLaren. It just won’t
happen, and the risk of a fork becomes non-trivial.

I think this is a completely ridiculous statement. I mean, feel free if that’s the direction you think you need to go, but I’m not going to continue down that thread with you.

Basically what I’m saying is that if you want a bot to be public and people to pay attention to it then you need to have some basic stability guarantees. If you can’t give some basic stability guarantees then the bot is only harming the entire testing infrastructure. That said, having your own internal bots is entirely useful; it just means that it’s up to you to notice failures and provide some sort of test case to the community. We could even have a “beta” bot site if something is reliable enough for that, but not reliable enough for general consumption. I believe you mentioned having a separate bot master before; we have other bot masters as well - see the Green Dragon stuff with Jenkins.

-eric

P.S. I’ve actually added Chris Matthews to talk about the buildbot staging work. Or even about moving the rest of the bots to something staged, or anything. :)

I think this is a poor analogy. You're also ignoring the solution I gave you
in my previous mail for slow bots.

I'm not ignoring it, I'm acting upon it. But it takes time. I don't
have infinite resources.

If you can't give some basic stability guarantees then the bot
is only harming the entire testing infrastructure.

Define stability. Daniel was talking about "things I can act upon".
That's so vague it means nothing. "Basic stability guarantees" is
similarly vague.

Any universal rule you try to make will either be too lax for fast and
reliable bots, or too hard on slow and less used bots.

That's what I'm finding hard to understand. All you guys are saying is
that things are bad and need to get better. I agree completely. But
your solution is to turn off everything you don't understand or assume
is flaky, and that's just wrong.

We had two flaky bots: Pandas and a Juno. Pandas were disabled, the
Juno was fixed. Some of our bots, however, are still slow, and we have
been asked to disable them because they were red for too long.

Most of the problems we find are bad tests from people that (obviously)
didn't test on ARM. The second most common is code that doesn't
take into account 32-bit platforms. The third most common breakage
is the sanitizer tests, which pop in and out on many platforms. The
most common long breakage is due to self-hosted Clang breaking and
making it hard to find what commit to revert or even which developer
to warn.

None of those are due to instability of my buildbots. But I got
shouted at many times to disable the bot because it was "red for too
long". I find this behaviour disrespectful.

I'm now trying to get 8 more ARM boards and 3 AArch64 ones, and I plan
to use them as redundant builders. But it takes time. Weeks to make
them work reliably, more weeks to make sure they won't fall over under
pressure, more weeks to put them into production and stabilise.
Meanwhile, I'd appreciate it if people stopped trying to kill the
others.

What else do you want us to do?

cheers,
--renato

One strategy I use for our flaky bots is to have them email me only. If the failure is real, then I forward the email to whoever I find on the blame list. For a flaky build, this is the least you can do. For our flaky builds, I know how and why they are flaky; some person that gets an email does not. This is also a great motivator to help me know what is wrong, and how to fix it. By default, I do this for all new builds I create, until I decide the SNR is appropriate for the community. Yes, I have to triage builds sometimes, but I have an interest in them working, and in people always acting on Green Dragon emails, so I think it is worth it. Beyond that, we have regexes which identify common failures and highlight them in the build page and log. For instance, a build that fails with a ninja error will say so, same with an svn failure or a Jenkins exception.
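
Roughly, the classification is nothing more sophisticated than the sketch below (simplified Python with made-up patterns and names, not the actual Green Dragon configuration), where infrastructure noise only bothers the owner/admins and genuine build or test failures are worth forwarding to the blame list:

    import re

    # Illustrative patterns only; the real setup has its own, larger list.
    FAILURE_PATTERNS = [
        ("infrastructure", re.compile(r"svn: E\d+|Connection timed out")),
        ("infrastructure", re.compile(r"java\.io\.IOException")),  # e.g. a Jenkins exception
        ("build",          re.compile(r"ninja: build stopped|^FAILED:", re.MULTILINE)),
        ("test",           re.compile(r"^FAIL(ED)?: ", re.MULTILINE)),
    ]

    def classify_failure(log_text):
        """Return the first matching category, or 'unknown' if nothing matches."""
        for category, pattern in FAILURE_PATTERNS:
            if pattern.search(log_text):
                return category
        return "unknown"

    def recipients(log_text, owner, blame_list):
        """Infrastructure (or unclassified) failures only go to the owner;
        build/test failures are worth forwarding to the blame list."""
        if classify_failure(log_text) in ("infrastructure", "unknown"):
            return [owner]
        return blame_list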

We also have a few policies on email: only email on first failure, don’t email on exception and abort, and don’t email long blame lists (more than 10 people). These require some manual intervention sometimes. But there is no point in emailing the wrong people, or too many people. We also track the failure span for all of our builds; if any fail for more than a day, I get an email to go shake things up. We also keep a cluster-wide health metric, which is the total number of hours of currently failing builds; I use this as an overall indicator of how the builds are doing.
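
As a sketch of how those rules combine (again with invented field names, not our actual job configuration), the email gating and the health metric boil down to something like:

    from datetime import datetime, timezone

    MAX_BLAME_LIST = 10  # don't email more than 10 people

    def should_email(result, previous_result, blame_list):
        """Email only on the first failure, never on exception/abort, and
        never when the blame list is too long to be meaningful."""
        if result in ("exception", "aborted"):
            return False                      # infrastructure hiccup, not code
        if result != "failure":
            return False                      # nothing to report
        if previous_result == "failure":
            return False                      # not the *first* failure
        if len(blame_list) > MAX_BLAME_LIST:
            return False                      # needs manual triage instead
        return True

    def cluster_health(failing_builds):
        """Total hours the currently failing builds have been red: one number
        to watch for the overall state of the cluster."""
        now = datetime.now(timezone.utc)
        return sum((now - build["first_failure"]).total_seconds() / 3600
                   for build in failing_builds)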

In all our CI cluster we use phased builds. Phase 1 is a fast incremental builder and a no-bootstrap release+asserts build. If those build, we trigger a release-with-LTO build; if that works, we trigger all the rest of our compilers and tests. It is a waste to queue long builds on revisions that have not been vetted in some way. In some places the tree of builds is 4 deep, and the turnaround time can be upwards of 12 hours after commit, BUT failures in those bots are rare, because so much other testing has gone on beforehand.

Mechanically, staging works by uploading the build artifacts to a central web server, then passing a URL to the next set of builds so they can download the compiler. This also speeds up builds that would otherwise have to build a compiler to run a test. For the lab, I think that won’t work as well because of the diversity of platforms and configurations, but a known good revision could be passed around. Some of the fast reliable builds can run first, and publish all the builds that work.
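
As a sketch of that hand-off (placeholder URLs and script names; the real lab would use its own server and jobs), a phase boundary is little more than publishing the artifacts and letting the next phase fetch the vetted compiler instead of rebuilding it:

    import subprocess
    import urllib.request

    ARTIFACT_SERVER = "https://artifacts.example.org"  # placeholder address

    def publish_artifact(revision, path):
        """Upload the phase-1 compiler so later phases can download it."""
        url = f"{ARTIFACT_SERVER}/clang-{revision}.tar.xz"
        with open(path, "rb") as f:
            request = urllib.request.Request(url, data=f.read(), method="PUT")
            urllib.request.urlopen(request)
        return url

    def run_next_phase(revision, artifact_url):
        """A downstream phase: fetch the vetted compiler, then run the slower
        validation (LTO, test-suite, boards, ...) against it."""
        urllib.request.urlretrieve(artifact_url, "clang.tar.xz")
        subprocess.run(["tar", "xf", "clang.tar.xz"], check=True)
        # 'run-slow-validation.sh' is a stand-in for whatever the phase runs.
        subprocess.run(["./run-slow-validation.sh", revision], check=True)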

I do think flaky bots should only email their owners. I also think we should nominate some reliable fast builds to produce vetted revisions, and trigger most other builds from those.

For instance, a build that fails with a ninja error will say so, same with an
svn failure or a Jenkins exception.

Will these get mailed to developers? Or admins?

We also have a few policies on email: only email on first failure, don’t
email on exception and abort, and don’t email long blame lists (more than 10
people).

That's the same as what we do. The only problem we found is that
exception is treated as "success" because we don't want to email when
the master is reloaded. But the sequence red->exception->red emails,
since exception->red is treated as good->bad.
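
To illustrate the semantics I'd prefer (a sketch with invented names,
not buildbot's actual notifier code): walk back over exceptions so that
red->exception->red compares red against red and stays silent, while
green->exception->red still warns the blame list:

    def previous_genuine_result(history):
        """Walk back over the build history, skipping exceptions/aborts,
        to find the last result that says something about the code."""
        for result in reversed(history):
            if result not in ("exception", "aborted"):
                return result
        return None

    def should_notify(history, current):
        """Notify only on a genuine good->bad transition, treating
        exceptions as no-change rather than as success."""
        if current in ("exception", "aborted"):
            return False  # master reloads etc. never email anyone
        previous = previous_genuine_result(history)
        return current == "failure" and previous != "failure"

    # red -> exception -> red: no email, we were already red.
    assert should_notify(["failure", "exception"], "failure") is False
    # green -> exception -> red: a genuine transition, so it still warns.
    assert should_notify(["success", "exception"], "failure") is True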

In all our CI cluster we use phased builds.

I'd love to have that! :D

I do think flaky bots should only email their owners.

I agree. The problem is defining flaky. A lot of flaky behaviour can
be mapped back to the compiler (like Clang abusing the C++ ABI, or
code assuming 64-bit types in odd ways).

I also think we
should nominate some reliable fast builds to produce vetted revisions, and
trigger most other builds from those.

This would be the perfect world.

When Apple moved to GreenBots, I was expecting that we'd be moving too
not long after. I was also expecting the LLVM Foundation to be driving
this change, and I'd have dived head first to have what you have.

But it makes no sense for me to do that on my own, locally. Nor do I
have the bandwidth or resources to do that for everyone else.

I don't really care if it's buildbot, Jenkins or orc slaves; as long
as I can spend my time doing something else, I'm happy.

cheers,
--renato