LLVM IRC channel flooded?

Folks,

I know it's a reasonably valuable thing to have the buildbot IRC bot
publishing results, but the channel is kind of flooded with the
messages, and the more bots we put up, the worse it will be.

I think we still need the NOC warnings, but not over IRC. The Buildbot
NOC page is horrible and useless, since it doesn't know the difference
between "it's red and I know it" and "it's broken".

For that reason, I have built my own NOC page:

http://people.linaro.org/~renato.golin/llvm/arm-bots/

But that machine is too slow to cope with all bots. We may need a
project to build such a system on a larger scale.

However, for now, I think not printing the green results in IRC would
go a long way toward cleaning the channel up.

Any thoughts?

cheers,
--renato

Folks,

I know it's a reasonably valuable thing to have the buildbot IRC bot
publishing results, but the channel is kind of flooded with the
messages, and the more bots we put up, the worse it will be.

I think we still need the NOC warnings, but not over IRC. The Buildbot
NOC page is horrible and useless, since it doesn't know the difference
between "it's red and I know it" and "it's broken".

For that reason, I have built my own NOC page:

http://people.linaro.org/~renato.golin/llvm/arm-bots/

I like it!

But that machine is too slow to cope with all bots. We may need a
project to build such a system on a larger scale.

However, for now, I think not printing the green results in IRC would
go a long way toward cleaning the channel up.

Any thoughts?

Even shortening up the messages from them would go a long way...

For example (one example of failure and success for each):

4:45:58 AM - green-dragon-bot: Project Clang Stage 1: cmake, incremental RA, using system compiler (Build) build r237678 (#9856): FAILURE in 41 sec: http://lab.llvm.org:8080/green/job/clang-stage1-cmake-RA-incremental_build/9856/ - blamelist: chfast

5:13:03 AM - green-dragon-bot: Project Clang Stage 1: cmake, incremental RA, using system compiler (Build) build r237680 (#9857): FIXED in 12 min: http://lab.llvm.org:8080/green/job/clang-stage1-cmake-RA-incremental_build/9857/

7:18:45 AM - bb-chapuni: build #6916 of ninja-clang-x64-mingw64-RA is complete: Failure [failed test-llvm] Build details are at http://bb.pgr.jp/builders/ninja-clang-x64-mingw64-RA/builds/6916 blamelist: Zoran Jovanovic <zoran.jovanovic@imgtec.com>, NAKAMURA Takumi <geek4civic@gmail.com>, Kostya Serebryany <kcc@google.com>, Yaron Keren <yaron.keren@gmail.com>, Tim Northover
7:18:45 AM - bb-chapuni: <tnorthover@apple.com>, David Majnemer <david.majnemer@gmail.com>, Daniel Jasper <djasper@google.com>, Pete Cooper <peter_cooper@apple.com>, Eric Christopher <echristo@gmail.com>, Jozef Kolek <jozef.kolek@imgtec.com>, Michael Kuperstein <michael.m.kuperstein@intel.com>, Pawel Bylica <chfast@gmail.com>, Richard Smith <richard-llvm@metafoo.co.uk>, Alexey Bataev
7:18:45 AM - bb-chapuni: <a.bataev@hotmail.com>, Matthias Braun <matze@braunis.de>, David Blaikie <dblaikie@gmail.com>, Reid Kleckner <reid@kleckner.net>, Tobias Grosser <tobias@grosser.es>, Filipe Cabecinhas <me@filcab.net>

7:43:47 AM - bb-chapuni: build #6917 of ninja-clang-x64-mingw64-RA is complete: Success [build successful] Build details are at http://bb.pgr.jp/builders/ninja-clang-x64-mingw64-RA/builds/6917

8:20:46 AM - llvmbb___: build #3947 of sanitizer-x86_64-linux-autoconf is complete: Failure [failed annotate failed tsan output_tests] Build details are at http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/3947 blamelist: yrnkrn, zjovanovic

8:32:33 AM - llvmbb___: build #3948 of sanitizer-x86_64-linux-autoconf is complete: Success [build successful] Build details are at http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/3948

In each of those three, the builder's name is repeated, as well as the build #, and there's no mention of the svn revision that they built. In the case of Chapuni's buildbot, the email addresses are not useful in comparison to the svn usernames from the other bots. This is compounded by the slow bots that blame the world when there's a failure.

Also, perhaps the URLs should be shortened?

The format could be: [botname]: [buildername] [short_url] ("Passed"|"Failed:" [usernames])

e.g.:
llvmbb__: sanitizer-x86_64-linux-autoconf http://bit.ly/1R0hPbR Failed: yrnkrn, zjovanovic
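
A minimal Python sketch of that proposed format, only to make it concrete. The function name `format_status` and its parameters are illustrative and not part of any existing bot; the sample values are copied from the example line above.

```python
def format_status(bot, builder, short_url, passed, usernames=()):
    """Render one IRC line in the proposed compact format.

    [botname]: [buildername] [short_url] ("Passed"|"Failed:" [usernames])
    """
    if passed:
        return f"{bot}: {builder} {short_url} Passed"
    # Only failures carry a blamelist in this proposal.
    return f"{bot}: {builder} {short_url} Failed: {', '.join(usernames)}"


line = format_status(
    "llvmbb__",
    "sanitizer-x86_64-linux-autoconf",
    "http://bit.ly/1R0hPbR",
    passed=False,
    usernames=["yrnkrn", "zjovanovic"],
)
print(line)
```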

Cheers,

Jon

Folks,

I know it's a reasonably valuable thing to have the buildbot IRC bot
publishing results, but the channel is kind of flooded with the
messages, and the more bots we put up, the worse it will be.

Are you more concerned about the average size of the bot message, or
the number of times the bots speak up in the channel? I find the
quantity manageable, but the size can sometimes be obnoxious and
distracting.

I think we still need the NOC warnings, but not over IRC. The Buildbot
NOC page is horrible and useless, since it doesn't know the difference
between "it's red and I know it" and "it's broken".

For that reason, I have built my own NOC page:

http://people.linaro.org/~renato.golin/llvm/arm-bots/

I like it!

I like this as well, but do miss having the revision number as part of
the immediate information (that's how I usually tell whether a red bot
is an actual problem for one of my commits -- if the revision is too
old, then I assume it's building my fix and I'm okay).

But that machine is too slow to cope with all bots. We may need a
project to build such a system on a larger scale.

However, for now, I think not printing the green results in IRC would
go a long way toward cleaning the channel up.

Any thoughts?

Even shortening up the messages from them would go a long way...

For example (one example of failure and success for each):

4:45:58 AM - green-dragon-bot: Project Clang Stage 1: cmake, incremental RA,
using system compiler (Build) build r237678 (#9856): FAILURE in 41 sec:
http://lab.llvm.org:8080/green/job/clang-stage1-cmake-RA-incremental_build/9856/
- blamelist: chfast

5:13:03 AM - green-dragon-bot: Project Clang Stage 1: cmake, incremental RA,
using system compiler (Build) build r237680 (#9857): FIXED in 12 min:
http://lab.llvm.org:8080/green/job/clang-stage1-cmake-RA-incremental_build/9857/

7:18:45 AM - bb-chapuni: build #6916 of ninja-clang-x64-mingw64-RA is
complete: Failure [failed test-llvm] Build details are at
http://bb.pgr.jp/builders/ninja-clang-x64-mingw64-RA/builds/6916 blamelist:
Zoran Jovanovic <zoran.jovanovic@imgtec.com>, NAKAMURA Takumi
<geek4civic@gmail.com>, Kostya Serebryany <kcc@google.com>, Yaron Keren
<yaron.keren@gmail.com>, Tim Northover
7:18:45 AM - bb-chapuni: <tnorthover@apple.com>, David Majnemer
<david.majnemer@gmail.com>, Daniel Jasper <djasper@google.com>, Pete Cooper
<peter_cooper@apple.com>, Eric Christopher <echristo@gmail.com>, Jozef Kolek
<jozef.kolek@imgtec.com>, Michael Kuperstein
<michael.m.kuperstein@intel.com>, Pawel Bylica <chfast@gmail.com>, Richard
Smith <richard-llvm@metafoo.co.uk>, Alexey Bataev
7:18:45 AM - bb-chapuni: <a.bataev@hotmail.com>, Matthias Braun
<matze@braunis.de>, David Blaikie <dblaikie@gmail.com>, Reid Kleckner
<reid@kleckner.net>, Tobias Grosser <tobias@grosser.es>, Filipe Cabecinhas
<me@filcab.net>

7:43:47 AM - bb-chapuni: build #6917 of ninja-clang-x64-mingw64-RA is
complete: Success [build successful] Build details are at
http://bb.pgr.jp/builders/ninja-clang-x64-mingw64-RA/builds/6917

8:20:46 AM - llvmbb___: build #3947 of sanitizer-x86_64-linux-autoconf is
complete: Failure [failed annotate failed tsan output_tests] Build details
are at
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/3947
blamelist: yrnkrn, zjovanovic

8:32:33 AM - llvmbb___: build #3948 of sanitizer-x86_64-linux-autoconf is
complete: Success [build successful] Build details are at
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-autoconf/builds/3948

In each of those three, the builder's name is repeated, as well as the build
#, and there's no mention of the svn revision that they built. In the case
of Chapuni's buildbot, the email addresses are not useful in comparison to
the svn usernames from the other bots. This is compounded by the slow bots
that blame the world when there's a failure.

Also, perhaps the URLs should be shortened?

The format could be: [botname]: [buildername] [short_url]
("Passed"|"Failed:" [usernames])

e.g.:
llvmbb__: sanitizer-x86_64-linux-autoconf http://bit.ly/1R0hPbR Failed:
yrnkrn, zjovanovic

I very much like this format for displaying the information in IRC.
It's succinct, but useful.

Removing the "passed" messages from IRC might reduce the spam, but at
the expense of making it more difficult for the person who broke the
bots to track down when each bot is fixed (unless we aggregate that
data across bots). If we reduce the size of the messages, I think that
may go a long way towards being less distracting.

~Aaron

Also, perhaps the URLs should be shortened?

That's a good idea.

The format could be: [botname]: [buildername] [short_url]
("Passed"|"Failed:" [usernames])

The only reason to show "Passed" results is if they were failing
before, as a confirmation that whatever you did to fix, worked.
Otherwise, they're just noise.

cheers,
--renato

Also, perhaps the URLs should be shortened?

That's a good idea.

The format could be: [botname]: [buildername] [short_url]
("Passed"|"Failed:" [usernames])

The only reason to show "Passed" results is if they were failing
before, as a confirmation that whatever you did to fix, worked.
Otherwise, they're just noise.

Good point.

[botname]: [buildername] [short_url] [revision] ("Fixed:"|"FAILED:") [usernames]
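
A rough Python sketch of how a bot might apply this format together with the "only report a pass if it was failing before" rule. Everything here is hypothetical: the function name `irc_line`, the `was_passing`/`is_passing` flags, and rendering the revision as `r<number>` are assumptions. Reporting every failure, including repeated ones, is also an assumption, since the thread does not settle that point.

```python
def irc_line(bot, builder, short_url, revision, was_passing, is_passing, usernames):
    """Return the one-line notification, or None when the bot should stay quiet.

    was_passing is None when there is no previous build to compare against.
    """
    if is_passing:
        if was_passing is not False:
            # Green after green (or no history at all): treated as noise.
            return None
        status = "Fixed:"
    else:
        # Assumption: every failing build is reported, not just the first.
        status = "FAILED:"
    return f"{bot}: {builder} {short_url} r{revision} {status} {', '.join(usernames)}"
```

With this shape, a builder that stays green never speaks, and a red-to-green transition produces a single "Fixed:" line.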

Should the bots blame people for fixes?

Jon

I think so. If I commit something that I hope will fix a bot, I'd like
to see my name on a "fixed" bot that I broke.

cheers,
--renato

From: "Renato Golin" <renato.golin@linaro.org>
To: "Jonathan Roelofs" <jonathan@codesourcery.com>
Cc: "Clang Dev" <cfe-dev@cs.uiuc.edu>
Sent: Tuesday, May 19, 2015 10:47:18 AM
Subject: Re: [cfe-dev] LLVM IRC channel flooded?

> [botname]: [buildername] [short_url] [revision] ("Fixed:"|"FAILED:") [usernames]
>
> Should the bots blame people for fixes?

I think so. If I commit something that I hope will fix a bot, I'd like
to see my name on a "fixed" bot that I broke.

+1

-Hal

Are you more concerned about the average size of the bot message, or
the number of times the bots speak up in the channel?

Both. Too many messages are not friendly to IRC bots logging the lines
until I connect. Long messages break the conversation, and make it
really hard to read anything else.

I like this as well, but do miss having the revision number as part of
the immediate information (that's how I usually tell whether a red bot
is an actual problem for one of my commits -- if the revision is too
old, then I assume it's building my fix and I'm okay).

I could improve it by adding the SVN revision, but that list contains
the bots that should never be red or outdated.

If the bot is red before your commit, you will want to know whether
your commit added new failures. I wish buildbot could spot different
breakages and email the blamed users again. If that information is
available in the JSON interface, I can grab it with that script.

Removing the "passed" messages from IRC might reduce the spam, but at
the expense of making it more difficult for the person who broke the
bots to track down when each bot is fixed (unless we aggregate that
data across bots). If we reduce the size of the messages, I think that
may go a long way towards being less distracting.

By having "failed" and "fixed" messages, we work around that problem
nicely, while still being short.

cheers,
--renato

From: "Renato Golin" <renato.golin@linaro.org>
To: "Jonathan Roelofs" <jonathan@codesourcery.com>
Cc: "Clang Dev" <cfe-dev@cs.uiuc.edu>
Sent: Tuesday, May 19, 2015 10:47:18 AM
Subject: Re: [cfe-dev] LLVM IRC channel flooded?

> [botname]: [buildername] [short_url] [revision] ("Fixed:"|"FAILED:") [usernames]
>
> Should the bots blame people for fixes?

I think so. If I commit something that I hope will fix a bot, I'd like
to see my name on a "fixed" bot that I broke.

+1

I'll second this +1 (I like the idea of the fixed messages instead of
passed messages).

~Aaron

Jonathan Roelofs <jonathan@codesourcery.com> writes:

Also, perhaps the URLs should be shortened?

That's a good idea.

The format could be: [botname]: [buildername] [short_url]
("Passed"|"Failed:" [usernames])

The only reason to show "Passed" results is if they were failing
before, as a confirmation that whatever you did to fix, worked.
Otherwise, they're just noise.

Good point.

[botname]: [buildername] [short_url] [revision] ("Fixed:"|"FAILED:")
[usernames]

Most (I thought all) of the bots already do the fixed/failed thing.
There's something to be said for a consistent format though, and it'd be
nice if more of the bots mentioned the revision (green-dragon does, but
llvmbb and bb-chapuni don't).

The other thing that would help is limiting the number of names that can
show up on a blamelist. If there are a hundred names on the blamelist
the message is way too long and the notifications don't help much. The
green-dragon bot just says "[n] people on blamelist" instead in these
cases.
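
A small Python sketch of that blamelist cap. The cutoff of 5 names and the function name `format_blamelist` are made up for illustration; only the "[n] people on blamelist" wording comes from the green-dragon behaviour described above.

```python
def format_blamelist(usernames, limit=5):
    """Render a blamelist, collapsing it to a count when it gets too long.

    `limit` is a hypothetical threshold, not a value any bot is known to use.
    """
    if len(usernames) > limit:
        return f"{len(usernames)} people on blamelist"
    return "blamelist: " + ", ".join(usernames)
```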

Should the bots blame people for fixes?

I don't think so - if you fixed it intentionally you usually know it,
and if not it's just another beeping window to ignore.

If “flooding” is the issue, the only long term solution is to A) have a bot channel that can be flooded or B) curate the list of bots which notify IRC.

A more fun solution would be to have the bots implement a first failure policy for notifications. We move the bots to their own channel, then have an IRC bot watch that channel for failures, parse the rev, and only notify if it is the first to find a failure at that rev.
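
To make the first-failure idea concrete, here is a hedged Python sketch of the filtering step only (no IRC connection). The class name, the `r<digits>` revision pattern, and the "contains FAIL" test are assumptions based on the log lines quoted earlier in the thread; bots that do not print a revision cannot be deduplicated this way, so the sketch simply forwards them.

```python
import re

# Assumption: revisions appear as "r237678", as in the green-dragon messages.
REV_RE = re.compile(r"\br(\d+)\b")


class FirstFailureFilter:
    """Forward a failure only if no other bot already reported that revision."""

    def __init__(self):
        self.reported = set()

    def should_notify(self, message):
        if "FAIL" not in message.upper():
            return False  # not a failure message at all
        match = REV_RE.search(message)
        if match is None:
            return True  # no revision to deduplicate on; forward it
        rev = int(match.group(1))
        if rev in self.reported:
            return False  # some other bot got there first
        self.reported.add(rev)
        return True
```

Feeding it two failure lines that mention the same revision forwards only the first; a "FIXED" line is never forwarded by this filter.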

I agree that long blame lists are useless and should be removed. URL shortening would add a dependence on an external service which is a new source of failure and trust.

Folks,

I know it's a reasonably valuable thing to have the buildbot IRC bot
publishing results, but the channel is kind of flooded with the
messages, and the more bots we put up, the worse it will be.

I think we still need the NOC warnings,

NOC?

but not over IRC. The Buildbot
NOC page is horrible and useless, since it doesn't know the difference
between "it's red and I know it" and "it's broken".

What distinction are you drawing there? The difference between freshly red
and previously red?

For that reason, I have built my own NOC page:

http://people.linaro.org/~renato.golin/llvm/arm-bots/

What does this do differently from the main buildbot page? (other than only
show arm bots?) Is it something we could do to the buildbot page (remove
always-red builders, recategorize flaky/problematic builders so at least
they're off in the "experimental" section, etc)?

Yes, I also find the amount of bot spam in #llvm basically intolerable. It makes it difficult to see actual people talking. At first, I just put all the bots on /ignore. Now I have an xchat script to move the botspam to another tab (tabify-004.pl). I’d recommend just moving the bots to #llvm-bots, which would fix the problem for everyone. Those who are committing changes can join that channel, too, and others don’t care.

While we’re on this subject, I also find the official buildbot page (lab.llvm.org:8011) almost unusable, since so many columns are either always red, or else are so flaky that they basically randomly alternate between passing and failing. So, at a glance, it’s impossible to tell whether the current state of the tree is good. (I certainly haven’t memorized which ones are “supposed” to be red, and which are not. Maybe others have). Having flaky and always-failing builds show up on the buildbot pages, and notifying IRC, really has negative utility, since it not only is not providing useful information, but is serving to obscure the actual important failures, and causing people to spend time investigating non-problems.

Someone gave me the hint to use the http://bb.pgr.jp/ buildbot page instead, which was a great recommendation – that page shows problems much more clearly. But it’s unfortunate that there needs to be a separate “sane builders only” buildmaster.

E.g. (and not to pick on this particular bot, this is just one example of many): http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/27655 – passed, while the previous failed. But, it’s not caused by the commit, it’s just arbitrary.

Or, yesterday, on #llvm: “Anyone want to give me a clue as to why this bot failed? http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/18017” – answer: because it’s randomly broken. Wasted the questioner’s time trying to investigate the failure.

If all the flaky or always-broken builder configurations got hidden from the main pages of buildbot, and stopped sending emails/IRC notifications to anyone but their “owner”, that would be a substantial improvement.

If “flooding” is the issue, the only long term solution is to A) have a
bot channel that can be flooded or B) curate the list of bots which notify
IRC.

Alternatively: The phased builder approach Apple was upstreaming for a
while that seemed to get stalled at some point.

Or a pseudo-phased system: Have builders that do "interesting" things not
fail (but simply warn/no-result) when early steps fail (I've wanted/meant
to do this with the GDB buildbot for example: I don't care if LLVM fails to
compile on that buildbot, I know other bots will catch it - so I'd be happy
to mark that as a warning/do-not-continue rather than
error/do-not-continue, then just carry on with the next run). You miss the
efficiency gains of a phased builder, but you get the same reduced
noise/spam.
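
The pseudo-phased rule can be sketched in plain Python as below. This is not Buildbot configuration; the function name, the step tuples, and the result strings are all invented for illustration. The only idea taken from the text is that a failing early ("uninteresting") step stops the build with a warning instead of an error.

```python
def build_result(steps):
    """Compute an overall result from steps run in order.

    steps: iterable of (name, passed, is_interesting) tuples.
    A failing step always stops the build; it counts as a real failure only
    when the step is one this builder exists to test.
    """
    for name, passed, is_interesting in steps:
        if passed:
            continue
        return "failure" if is_interesting else "warning"
    return "success"
```

For a GDB-style bot, a broken LLVM compile would yield "warning" (no blame mail), while a failing GDB test step would still yield "failure".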

- David

If “flooding” is the issue, the only long term solution is to A) have
a bot channel that can be flooded or B) curate the list of bots which
notify IRC.

Realistically, who is going to subscribe to such a bot channel in (A)? I probably wouldn't. The point is to tame the noise, not hide it.

(B) sounds useful regardless.

A more fun solution would be to have the bots implement a first
failure policy for notifications. We move the bots to their own
channel, then have an IRC bot watch that channel for failures, parse
the rev, and only notify if it is the first to find a failure at that
rev.

Ack.

I agree that long blame lists are useless and should be removed. URL
shortening would add a dependence on an external service which is a
new source of failure and trust.

It doesn't have to be an external service...

http://lab.llvm.org:<port>/<hash>

Isn't /so/ bad, and is leaps & bounds better than the existing URLs. Granted, maintaining that service then becomes our problem as a community.
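
A hedged Python sketch of what such a self-hosted shortener could look like at its core. The class name, the use of a truncated SHA-1 digest as the hash, the 8-character length, and the `http://short.example` base URL are all assumptions for illustration; the port above is left unspecified on purpose. Collisions and persistence are ignored here.

```python
import hashlib


class ShortLinks:
    """In-memory map from a short hash to the full build URL."""

    def __init__(self, base, length=8):
        self.base = base.rstrip("/")
        self.length = length
        self.table = {}

    def shorten(self, long_url):
        # Same input always gives the same key, so repeated builds of the
        # same URL do not grow the table.
        digest = hashlib.sha1(long_url.encode("utf-8")).hexdigest()
        key = digest[: self.length]
        self.table[key] = long_url
        return f"{self.base}/{key}"

    def resolve(self, key):
        """Return the original URL, or None for an unknown key."""
        return self.table.get(key)
```

A real service would need to handle hash collisions and store the table somewhere durable, which is exactly the maintenance cost mentioned above.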

Jon

+1 for hiding flaky bots. I routinely see some bot randomly failing after an unrelated commit.
sanitizer-x86_64-linux may be the worst one. This wastes time and hides real problems.

NOC?

Sorry, NOC is "network operations centre": the room with big screens
showing the status of a data centre, where operators sit and fix
problems, always looking at the big screens on the wall, in case they
go red.

What distinction are you drawing there? The difference between freshly red
and previously red?

Basically, yes.

What does this do differently from the main buildbot page? (other than only
show arm bots?) Is it something we could do to the buildbot page (remove
always-red builders, recategorize flaky/problematic builders so at least
they're off in the "experimental" section, etc)?

Separating ARM from the rest is the most important thing to me, but
classifying them and only showing the information I want is also
important.

James Knight has summarised well the problems I have with the current
buildbot page.

cheers,
--renato

When we built green dragon we tried to be really accountable for this sort of cruft, with a goal of 99% useful notifications, or nothing.

On green dragon we curate both which builds notify IRC and which builds show up on the main page. Anything that fails for reasons unrelated to the commit is not allowed to do either. We use the phased build approach to make sure we notify only once per failure. Builds that are red for more than a week are disabled; if we can’t fix it in a week, it’s not worth building anymore. Because of that, the libcxx, LLDB, and performance builds do not notify, and some are disabled. When we email the blamelist, I am CCed on every email, and they are not filtered from my inbox. If the blame list is long, it only emails me, and I track down who broke it.
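
Two of those rules are simple enough to sketch in Python. This is not green dragon's actual code: the function names are invented, and the "long blame list" cutoff of 10 is a placeholder since no number is given above. Only the one-week limit comes from the description.

```python
from datetime import datetime, timedelta


def should_disable(red_since, now, limit=timedelta(days=7)):
    """True when a build has been continuously red for longer than the limit.

    red_since is None for a build that is currently green.
    """
    return red_since is not None and (now - red_since) > limit


def recipients(blamelist, maintainer, long_list=10):
    """Pick who gets the failure email.

    `long_list` is a hypothetical threshold. The maintainer is always
    included; a long blamelist goes to the maintainer alone.
    """
    if len(blamelist) > long_list:
        return [maintainer]
    return list(blamelist) + [maintainer]
```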

Of course green dragon only runs a small proportion of the total builds.

If you can’t look at the build page and know that everything that is red is a real problem, we have a real problem. Even within builds, if most of the steps are marked as failures, you don’t know what went wrong.

Yes, I also find the amount of bot spam in #llvm basically intolerable.
It makes it difficult to see actual people talking. At first, I just put
all the bots on /ignore. Now I have an xchat script to move the botspam to
another tab (tabify-004.pl). I'd recommend just moving the bots to
#llvm-bots, which would fix the problem for everyone. Those who are
committing changes can join that channel, too, and others don't care.

While we're on this subject, I also find the official buildbot page (
lab.llvm.org:8011) almost unusable, since so many columns are either
always red, or else are so flaky that they basically randomly alternate
between passing and failing. So, at a glance, it's impossible to tell
whether the current state of the tree is good. (I certainly haven't
memorized which ones are "supposed" to be red, and which are not. Maybe
others have). Having flaky and always-failing builds show up on the
buildbot pages, and notifying IRC, really has negative utility, since it
not only is not providing useful information, but is serving to obscure the
actual important failures, and causing people to spend time investigating
non-problems.

Someone gave me the hint to use the http://bb.pgr.jp/ buildbot page
instead, which was a great recommendation -- that page shows problems much
more clearly. But it's unfortunate that there *needs* to be a separate
"sane builders only" buildmaster.

E.g. (and not to pick on this particular bot, this is just one example of
many):
http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a9/builds/27655 --
passed, while the previous failed. But, it's not caused by the commit, it's
just arbitrary.

Or, yesterday, on #llvm: "Anyone want to give me a clue as to why this bot
failed?
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/18017" --
answer: because it's randomly broken. Wasted the questioner's time trying
to investigate the failure.

Whenever you get crappy fail-mail, please forward it to llvm-dev, cc'ing
the bot owner and request the issue be addressed or the bot be removed.
Yeah, I know it's not an ideal process, but it's something to keep issues
visible/pushed on.

But, yes, having some more formal process to deal with this sort of thing
would be nice (I can imagine some process along the lines of "bots start in
experimental and need a track record of low flake/false positive results
for some period of time before being promoted out of experimental so they
can send mail to blame lists and IRC, etc" coupled with some mechanism for
demoting a buildbot back into experimental if it starts behaving poorly)
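
One way to picture that promotion/demotion process is the Python sketch below. Everything in it is an assumption: the tier names, the result labels ("ok", "real_failure", "flake"), the 50-build window, and the 2% flake threshold are placeholders, and deciding which failures count as flakes is left to whoever labels the history.

```python
def next_tier(tier, results, window=50, max_flake_rate=0.02):
    """Decide whether a bot should be 'experimental' or 'stable'.

    results: chronological list of "ok", "real_failure", or "flake".
    Promotion needs a full window of history with few flakes; demotion
    happens as soon as the recent flake rate exceeds the threshold.
    """
    recent = results[-window:]
    if not recent:
        return tier  # no history yet: leave the bot where it is
    flake_rate = recent.count("flake") / len(recent)
    if tier == "experimental":
        if len(recent) >= window and flake_rate <= max_flake_rate:
            return "stable"
        return "experimental"
    return "experimental" if flake_rate > max_flake_rate else "stable"
```

Only bots in the "stable" tier would then be allowed to mail blame lists or speak on IRC.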

- David

+1 I completely agree with everything said in this thread. I wish I had a “this error had nothing to do with the commits” link in the bot mails, and if a certain number of people clicked that link, the bot would be stopped from sending mails or spamming the IRC channel…

I know setting up buildbots and keeping them running in a stable fashion is a hard task (I have done that for other projects) and we have to thank the people doing that thankless job, but wasting everyone’s time with too many false positives isn’t good either.

- Matthias