LLVM IRC channel flooded?

Just some stats, after looking through lab.llvm.org:8011

Maybe these should be marked as experimental, and removed from the builders link on the main page.

Never passed at all:
libcxx-libcxxabi-x86_64-linux-ubuntu-cxx03
libcxx-libcxxabi-x86_64-linux-ubuntu-ubsan
libcxx-libcxxabi-x86_64-linux-ubuntu-tsan
libcxx-libcxxabi-x86_64-linux-ubuntu-gcc
libcxx-libcxxabi-x86_64-apple-darwin14-system-lib
lldb-x86_64-ubuntu-14.04-android
llgo-x86_64-linux

Haven't passed in at least a month:

llvm-clang-lld-x86_64-debian-fast
clang-native-mingw32-win7
clang-x86_64-linux-selfhost-abi-test
clang-x64-ninja-win7-debug
perf-x86_64-penryn-O3-polly-detect-only
sanitizer-x86_64-linux-bootstrap
sanitizer_x86_64-freebsd
sanitizer-windows
libcxx-libcxxabi-x86_64-apple-darwin14-tot-clang
clang-amd64-openbsd
lldb-x86_64-debian-clang
lldb-x86_64-freebsd
lldb-x86_64-ubuntu-14.10

Chris, thanks for going through this!

I am all in favor of removing/disabling these bots (and would be OK with being even more aggressive).

The one Polly bot listed is a performance buildbot, for which both email and IRC notifications are disabled. I have now removed it completely from the buildbot list, to also keep the web interface clean.

Best,
Tobias

Just some stats, after looking through lab.llvm.org:8011

Maybe these should be marked as experimental, and removed from the
builders link on the main page.

Never passed at all:
libcxx-libcxxabi-x86_64-linux-ubuntu-cxx03
libcxx-libcxxabi-x86_64-linux-ubuntu-ubsan
libcxx-libcxxabi-x86_64-linux-ubuntu-tsan
libcxx-libcxxabi-x86_64-linux-ubuntu-gcc
libcxx-libcxxabi-x86_64-apple-darwin14-system-lib
lldb-x86_64-ubuntu-14.04-android
llgo-x86_64-linux

Haven't passed in at least a month:

llvm-clang-lld-x86_64-debian-fast
clang-native-mingw32-win7
clang-x86_64-linux-selfhost-abi-test
clang-x64-ninja-win7-debug
perf-x86_64-penryn-O3-polly-detect-only
sanitizer-x86_64-linux-bootstrap
sanitizer_x86_64-freebsd
sanitizer-windows
libcxx-libcxxabi-x86_64-apple-darwin14-tot-clang
clang-amd64-openbsd
lldb-x86_64-debian-clang
lldb-x86_64-freebsd
lldb-x86_64-ubuntu-14.10

Chris, thanks for going through this!

Yep, totally - is there an easy way to gather this data on an ongoing
basis? (perhaps every month or so?)
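
For instance, a report like this could be produced from cron by a small script (a rough sketch only: it assumes the buildbot 0.8-era JSON API served at lab.llvm.org:8011, i.e. /json/builders and /json/builders/<name>/builds/<n>; the exact endpoints and field names may differ on the real master):

#!/usr/bin/env python3
# Sketch: flag builders that have never been green, or not green in a month,
# using the buildbot JSON status API described above.
import json
import time
from urllib.request import urlopen

MASTER = "http://lab.llvm.org:8011"
WINDOW = 30 * 24 * 3600  # "at least a month", in seconds
SUCCESS = 0              # buildbot result code for a passing build

def get(path):
    with urlopen(MASTER + path) as resp:
        return json.load(resp)

never_passed, stale = [], []
for name in sorted(get("/json/builders")):
    last_green = None
    # Walk back through the builds the master still remembers.
    for num in range(-1, -50, -1):
        try:
            build = get("/json/builders/%s/builds/%d" % (name, num))
        except Exception:
            break  # ran off the end of the recorded history
        end_time = build.get("times", [None, None])[1]
        if build.get("results") == SUCCESS and end_time:
            last_green = end_time
            break
    if last_green is None:
        never_passed.append(name)  # never green within recorded history
    elif last_green < time.time() - WINDOW:
        stale.append(name)

print("Never passed at all:", *never_passed, sep="\n  ")
print("Haven't passed in at least a month:", *stale, sep="\n  ")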

I am all in favor of removing/disabling these bots (and would be OK with
being even more aggressive).

Yep, agreed - I'd say removing the buildbots that've never passed and at
least moving the ones that haven't passed recently to "experimental" is a
totally reasonable approach.

So, I tried to track down what went wrong here, and the oldest build I can
find is:
http://lab.llvm.org:8011/builders/sanitizer-windows/builds/3916

This raises a different problem: the buildmaster doesn't hold onto enough
logs. That build is from five days ago, and already I can't find the
relevant blamelist causing the breakage. =/
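
For what it's worth, the amount of history the master keeps is configurable; a sketch of the relevant knobs, assuming the buildbot 0.8-era horizon settings in master.cfg (values purely illustrative, not what the LLVM master actually uses):

# master.cfg (buildbot 0.8.x) -- keep more history so blamelists and logs
# for older builds stay reachable from the web UI.
c['changeHorizon'] = 10000   # how many changes to remember
c['buildHorizon'] = 2000     # how many builds to keep per builder
c['eventHorizon'] = 2000     # how many status events to keep
c['logHorizon'] = 1000       # how many builds keep their full step logs
c['buildCacheSize'] = 300    # builds cached in memory for the web status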

+1!

--renato

Right now, I have two "failing" LNT bots. One of them is a known LNT
server instability, and I brought the bot down myself. If you shut
down the bot gracefully, no one gets an email, so if you fix it and
bring it back, no one gets annoyed.

If the bot owners are not willing to do that kind of management, or
are unresponsive, we should take the bots out of the "official" list
and not report anything from them. If anyone wants to put a bot up and
not care about it, they can also put up their own buildmaster, so that
we don't have to mix lost bots with production bots. No emails, no
reds on the production page.

cheers,
--renato

> Maybe these should be marked as experimental, and removed from the
> builders link on the main page.

Yep, having a limited lifetime in the "experimental" category would be
useful too - if you're not working to get it out of there, just remove it
entirely.

I agree. It's very hard to keep track of real conversations sometimes.

I would prefer to have a known central web page where bots publish
their status instead of spamming a conversation channel. It serves no
useful purpose.

Diego.

Either it's important and useful to you to see the notifications, so you'll
want to subscribe, or it's not, and you won't. Right? I went out of my way
to filter the bots to a separate window rather than /ignore'ing them,
because it seemed useful, but was in the way of the actual use of the
channel as a communication medium. I would join the bot channel. I'm sure
other people wouldn't. Either way it seems like an improvement to have it
separated.

Just some stats, after looking through lab.llvm.org:8011

Maybe these should be marked as experimental, and removed from the builders link on the main page.

Never passed at all:
libcxx-libcxxabi-x86_64-linux-ubuntu-cxx03
libcxx-libcxxabi-x86_64-linux-ubuntu-ubsan
libcxx-libcxxabi-x86_64-linux-ubuntu-tsan
libcxx-libcxxabi-x86_64-linux-ubuntu-gcc
libcxx-libcxxabi-x86_64-apple-darwin14-system-lib
lldb-x86_64-ubuntu-14.04-android
llgo-x86_64-linux

Hi,

llgo-x86_64-linux is mine. Sorry, I had disabled the slave agent to avoid spurious emails, but hadn’t considered its impact on the status pages. I’m fine with disabling it altogether for now; I’m waiting on a fix to Ninja to be merged.

Cheers,
Andrew

Maybe these should be marked as experimental, and removed from the builders link on the main page.

Never passed at all:
libcxx-libcxxabi-x86_64-linux-ubuntu-cxx03
libcxx-libcxxabi-x86_64-linux-ubuntu-ubsan
libcxx-libcxxabi-x86_64-linux-ubuntu-tsan
libcxx-libcxxabi-x86_64-linux-ubuntu-gcc
libcxx-libcxxabi-x86_64-apple-darwin14-system-lib

Sorry about these bots. When I originally put them up I had hoped to
deal with the failures in short order. However that hasn't happened.

I'm happy to move these bots to an experimental section and off the main page.

/Eric

+1, this is a really good summary of the issues. I'm in full support of any and all efforts to reduce noise here. I've gotten to the point where I only watch a small handful of bots. Anything other than that, I pretty much ignore unless someone emails me directly or replies to a commit. I'm not quite at the point of marking buildbot emails as spam, but I'm definitely not paying them much attention either. One particular irritant is getting emails 12-24 hours later about someone else's breakage that has already been fixed. The long cycling bots are really irritating in that respect.

That's not that easy to fix, and I think we'll have to cope with that
forever. Not all machines are fast, and some buildbots do a full
self-host, with compiler-rt and running all tests. Others do a full
benchmark run of LNT, running it 5-8 times, which can take several
hours on an ARM box.

The benchmark bots should be marked not to spam, since they're not
there to pick up errors, but the full self-hosting ones do need to
warn on errors. For example, right now I have a bug only on a thumbv7a
self-hosting bot, and not on others. I'm now bisecting it to find the
culprit, but this is not always clear, as the longer it takes for me
to realise, the harder it will be to fix it.

The only way out of it is for people to look at the fast bots, and if
they're fixed, check the commit that did it and see if the slow bot
has been fixed by the same commit later.

Buildbot owners will eventually pick those problems up, but as I said,
the longer it takes, the harder it is to get to the bottom of it, and
the higher the probability of more regressions getting introduced
because the bot is red and won't warn.

cheers,
--renato

Also, perhaps the URLs should be shortened?

That's a good idea.

The format could be: [botname]: [buildername] [short_url]
("Passed"|"Failed:" [usernames])

The only reason to show "Passed" results is if they were failing
before, as confirmation that whatever you did to fix it worked.
Otherwise, they're just noise.

Good point.

[botname]: [buildername] [short_url] [revision] ("Fixed:"|"FAILED:")
[usernames]
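
Roughly, producing that line could look like this (illustrative sketch only; the function name, the shortener domain, and the sample values below are made up, not the existing IRC-status code):

# Tiny helper in the shape of the proposed format; all values are examples.
def irc_status_line(botname, builder, short_url, revision, fixed, usernames):
    verdict = "Fixed:" if fixed else "FAILED:"
    return "%s: %s %s r%s %s %s" % (
        botname, builder, short_url, revision, verdict, " ".join(usernames))

# >>> irc_status_line("llvmbb", "sanitizer-windows", "http://sho.rt/abc12",
# ...                 "239000", False, ["someuser"])
# 'llvmbb: sanitizer-windows http://sho.rt/abc12 r239000 FAILED: someuser'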

Should the bots blame people for fixes?

I filed a feature request ticket for this with the buildbot team, here:

http://trac.buildbot.net/ticket/3261#ticket

Cheers,

Jon

I did a quick data mine on May’s IRC log. Bot traffic is 24% of the total. Top IRC posters (by line count):

llvmbb 1434
echristo 334
majnemer 320
Fiora 286
bb-chapuni 237
chandlerc 236
jroelofs 199
myeisha 160
green-dragon-bot 140
compnerd 124
nlewycky 120
EricWF 117
AaronBallman 114
jyknight 113
nbjoerg 113

Just out of curiosity, what’s the percentage in volume (chars), not just lines?

By character count, bot traffic is 37% of the total:

chandlerc 358
BenL 455
r4nt 479
majnemer 509
green-dragon-bot 879
jyknight 965
nlewycky 1002
bb-chapuni 1090
echristo 2469
llvmbb 6381

total 22145

Percentage of total characters per sender:

chandlerc 1.616618%
BenL 2.054640%
r4nt 2.163016%
majnemer 2.298487%
green-dragon-bot 3.969293%
jyknight 4.357643%
nlewycky 4.524723%
bb-chapuni 4.922104%
echristo 11.149244%
llvmbb 28.814631%
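
For reference, one way such a tally could be computed (a sketch only; it assumes a plain-text IRC log where each message line looks like "12:34 <nick> message text", which may not match the actual log format):

# Count lines and characters per nick from a plain-text IRC log.
import re
import sys
from collections import Counter

LINE = re.compile(r"^\S+\s+<([^>]+)>\s*(.*)$")

lines, chars = Counter(), Counter()
with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for raw in log:
        m = LINE.match(raw)
        if not m:
            continue  # joins, parts, topic changes, etc.
        nick, msg = m.groups()
        lines[nick] += 1
        chars[nick] += len(msg)

total_chars = sum(chars.values()) or 1
for nick, n in lines.most_common(15):
    print("%-20s %5d lines  %6.2f%% of chars"
          % (nick, n, 100.0 * chars[nick] / total_chars))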

My only other stat for the day. Bot success ratios for builds in the last 30 days:

One particular irritant is getting emails 12-24 hours later about someone else's
breakage that has *already been fixed*. The long cycling bots are really
irritating in that respect.

That's not that easy to fix, and I think we'll have to cope with that
forever. Not all machines are fast, and some buildbots do a full
self-host, with compiler-rt and running all tests. Others do a full
benchmark run of LNT, running it 5-8 times, which can take several
hours on an ARM box.

I agree it's not easy, but it's not something we should just live with either. There are ways to address the problem and we should consider them.

As a randomly chosen example, one thing we could do would be to have the notion of a "last good commit". Fast builders would cycle off ToT; whenever any one (or some subset) of them passed, that would advance the last good commit. Slower builders would cycle off the last good commit, not ToT. We have all the mechanisms to implement this today. It could be as simple as parsing the JSON output of buildbot in the script that runs the slower build bots and syncing to that particular revision rather than ToT.
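
A minimal sketch of that idea, assuming the same buildbot JSON API as in the earlier sketch, SVN-style revision numbers, and a 'got_revision' build property (the builder names below are hypothetical placeholders):

#!/usr/bin/env python3
# Sketch: derive a "last good commit" from a set of fast builders so that
# slow bots can check out that revision instead of ToT.
import json
from urllib.request import urlopen

MASTER = "http://lab.llvm.org:8011"
FAST_BUILDERS = ["fast-builder-1", "fast-builder-2"]  # hypothetical names
SUCCESS = 0

def get(path):
    with urlopen(MASTER + path) as resp:
        return json.load(resp)

def last_green_revision(builder):
    """Revision of the builder's most recent successful build, or None."""
    for num in range(-1, -20, -1):
        try:
            build = get("/json/builders/%s/builds/%d" % (builder, num))
        except Exception:
            return None
        if build.get("results") != SUCCESS:
            continue
        props = dict((p[0], p[1]) for p in build.get("properties", []))
        if "got_revision" in props:
            return int(props["got_revision"])  # SVN-era revision assumed
    return None

greens = [r for r in map(last_green_revision, FAST_BUILDERS) if r is not None]
if greens:
    # Approximation: take the oldest of the builders' latest green revisions.
    # The slow bot's run script would sync to this instead of ToT.
    print("last good commit: r%d" % min(greens))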

The benchmark bots should be marked not to spam, since they're not
there to pick up errors, but the full self-hosting ones do need to
warn on errors. For example, right now I have a bug only on a thumbv7a
self-hosting bot, and not on others. I'm now bisecting it to find the
culprit, but this is not always clear, as the longer it takes for me
to realise, the harder it will be to fix it.

At this point, you're long past the point I was grousing about. I'm not arguing that long-running bots shouldn't notify; I'm arguing they shouldn't report *obvious* false positives.

Also, the bisect step really should be automated... :)

The only way out of it is for people to look at the fast bots, and if
they're fixed, check the commit that did it and see if the slow bot
has been fixed by the same commit later.

You've now wasted 10 minutes or more of my time per slow, noisy bot. When I routinely get 10+ builder failure emails for changes that are clean, that's not a worthwhile investment.

Buildbot owners will eventually pick those problems up, but as I said,
the longer it takes, the harder it is to get to the bottom of it, and
the higher the probability of more regressions getting introduced
because the bot is red and won't warn.

I agree. All I'm suggesting is reducing noise so that real failures are likely to be noticed quickly.

sanitizer-windows

So, I tried to track down what went wrong here, and the oldest build I can
find is:
http://lab.llvm.org:8011/builders/sanitizer-windows/builds/3916

This raises a different problem: the buildmaster doesn't hold onto enough
logs. That build is from five days ago, and already I can't find the
relevant blamelist causing the breakage. =/

+1; in this day and age, it's senseless not to just keep infinite logs.

-- Sean Silva