Generally sounds good, though I’m not sure how much structure is needed for many of these changes. In the past I’ve made a point of pushing back on red bots, or bots sending unhelpful fail mail (noisy, frequent unactionable failures, especially long blame lists, etc.), and removing bots if needed. I encourage everyone to do more of that, whether as a small dedicated group or the community at large.
During the LLVM conf we had a roundtable discussing the state of buildbot. Here are the notes of what we discussed.
The summary is that there’s lots of appetite for improving the state of LLVM’s infra, with lots of good shorter and longer term ideas (see below).
Several people were interested in starting an (open for all who are interested) “LLVM infra team”, with possibly a dedicated mailing list, and with possibly the powers to make infra changes with just consensus from people on that infra team.
FWIW, we did have a lab/infrastructure mailing list many years ago & not much happened with it - perhaps this one will be different, but I don’t think it will necessarily create more authority. Invested parties, I think, would do as well to propose things on the usual mailing lists & do the work to make these changes - changes made without community buy-in will reasonably be pushed back against either way.
But if it helps to have a common group to get involved in these discussions, that’s good - so they don’t die on the mailing list with no discussion/progress (we do this in the debug info area by having an informal grouping - most mailings go out to the list + the usual folks who are interested in that area).
(Sorry for the delayed email, I wrote this up right after the meeting but forgot to hit “Send”.)
The actual notes:
Problems with buildbot
- console view loads slowly
- many bots take a long time to cycle
- many bots are perma-red
Some bots are configured as “Do not send email”, left for the bot maintainers to maintain, triage, etc., which I encourage. If that’s insufficiently called out in the UI and makes the console hard to read, yeah, it’d be great to make that grouping clearer. If that’s impractical, perhaps we just say that use case is not supported on the primary buildbot instance & those folks can run their own CI infrastructure entirely?
- test output on some bots is huge due to the bots printing all tests, not just failing ones, making it difficult to see failing tests
- it’s sometimes difficult to reproduce failures on the bots locally
- display better machine info on all builders (OS, host compiler with detailed version, binutils version, cmake version, ninja version, kernel/userspace bitness)
- require bots to use a cmake cache file, for easy local matching execution?
- remove perma-red bots
- remove slow bots
-- or put on faster hw, llvm foundation has funds
Does it have enough funds for significant investment here? What would that look like (what are the current gaps? How much would they cost to fill? In what sort of priority ordering?).
--- what about slow boards?
---- decouple build and test phases?
---- shard tests over multiple devices?
Yep - Apple internally (maybe externally on green dragon) had/has some form of tiered buildbot infrastructure - eg: the stage 1 build result is a separate “builder”, but its output is used by stage 2/bootstrapping builders, test-suite builders, etc. So there’s less redundant work and less redundant bot spam. Something like that would be lovely to have - but someone’s got to invest in building and maintaining it, and so far no one has. That’s what this has mostly come down to in the past: lots of things people would like, but no one signing up to build them, maintain them, etc. If you’ve got folks with the time/resources, yeah - lots of things on this list & general community desire for build infrastructure can be done.
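To make the tiering idea concrete, here is a minimal sketch of "downstream builders only run when their upstream builder succeeded". This is not how buildbot or green dragon actually implement it; the builder names, the `deps` mapping and the `run` callback are all hypothetical, chosen only for illustration.

```python
from collections import deque


def run_tiered(builders, deps, run):
    """Run builders in dependency order (assumes the graph has no cycles).

    builders: list of builder names (hypothetical names)
    deps:     dict mapping a builder to the upstream builders it consumes
    run:      callable(name) -> bool, True when the build succeeded
    Returns a dict: name -> "ok" | "failed" | "skipped".
    """
    downstream = {b: [] for b in builders}
    pending = {b: len(deps.get(b, [])) for b in builders}
    for b in builders:
        for up in deps.get(b, []):
            downstream[up].append(b)

    status = {}
    ready = deque(b for b in builders if pending[b] == 0)
    while ready:
        b = ready.popleft()
        if any(status[up] != "ok" for up in deps.get(b, [])):
            # Upstream is broken: do no redundant work, send no redundant mail.
            status[b] = "skipped"
        else:
            status[b] = "ok" if run(b) else "failed"
        for d in downstream[b]:
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)
    return status


if __name__ == "__main__":
    builders = ["stage1", "stage2", "test-suite"]
    deps = {"stage2": ["stage1"], "test-suite": ["stage1"]}
    # Pretend stage1 is broken: only stage1 reports a failure.
    print(run_tiered(builders, deps, lambda name: name != "stage1"))
```

With a broken stage 1, only the stage 1 builder reports "failed", and the other two are "skipped" rather than each sending their own fail mail for the same breakage.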
My ideal would be a tiered build flow as described above, with a time window threshold goal. For example, any bot that sends mail must have a way to keep its cycle period under an hour, say. If it necessarily runs 2 hours of testing, that might mean it has twice the infrastructure, so hourly snapshots can be taken and tested without falling behind. That could be achieved with either more hardware or narrower testing, as suited to the particular scenario. Tiering keeps the redundant noise down, and the window threshold keeps the mails targeted and relatively actionable by the recipient. Flakiness tolerance should also be low. If any of those 3 criteria can’t be met, it should be up to the party interested in that workload to maintain the bots, triage the failures, and manually send mail that conforms to those 3 criteria. Even if it takes a human 3 days to investigate a failure, if that failure has a low false positive rate and is actionable by the person it goes to (narrow blame list & good reproduction steps), then I think that’s golden. It does mean that 3 days later a revert might not be immediately viable (though it often is).
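The "2 hours of testing needs twice the infrastructure" arithmetic generalises as a ceiling division. A rough sketch, assuming identical workers that each take one snapshot in turn; the function name and the 60 minute default are only illustrative:

```python
import math


def workers_needed(run_minutes, window_minutes=60):
    """Workers required so a new snapshot can start every `window_minutes`
    when one full build+test run takes `run_minutes`.

    Assumes identical workers and no scheduling overhead.
    """
    if run_minutes <= 0 or window_minutes <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(run_minutes / window_minutes)


if __name__ == "__main__":
    for minutes in (45, 120, 150):
        print(minutes, "min run ->", workers_needed(minutes), "worker(s)")
```

The other lever mentioned above, narrower testing, corresponds to shrinking `run_minutes` instead of adding workers.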
- make fast bots trigger slow bots, only when fast builds are successful, for fewer emails
- have support tier lists?
-- e.g. tier 0 pledges bots that cycle in < 15 min, in return are on tier 0 waterfall and can revert breakages after ~ 15 min
-- tier 1 pledges bots that cycle in < 1 day, can revert breakages after 1 day
I think the general rule is if you have reproduction steps & it’s a supported scenario, you can revert immediately. I’m not sure of the value in waiting a day to revert because your bot takes a while to cycle (though I don’t think we have any bots that have a 24 hour cycle time, do we? I guess if it’s a 12 hour cycle you could end up, at worst, 24 hours from patch submission to result)
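The "12 hour cycle, 24 hours at worst" estimate can be spelled out as follows. This is a back-of-the-envelope sketch, assuming a single bot that builds back to back and always picks up the latest revision when it starts; the function name is made up for illustration:

```python
def worst_case_latency(cycle_hours):
    """Worst-case hours from a commit landing to its result being reported.

    Assumes one bot building back to back. A commit that lands just after
    a run starts waits almost a full cycle for that run to finish, then
    needs one more full cycle to be built and tested itself.
    """
    if cycle_hours <= 0:
        raise ValueError("cycle time must be positive")
    return 2 * cycle_hours


if __name__ == "__main__":
    print(worst_case_latency(12))    # the 12 hour bot from the example above
    print(worst_case_latency(0.25))  # a hypothetical 15 minute tier 0 bot
```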
- update buildbot to current version?
-- lots of api changes
- have pre-commit tests
-- kuhnel has prototype for this on linux, will send separate announcement, positive reception
--- several requests to have the same for win
- move build off buildbot to github actions?
-- jyknight has prototype, works great, except that custom hardware isn’t (yet?) supported, so cycle times are prohibitively long
- have a dedicated llvm infra team
-- dedicated llvm-infra mailing list
-- and group of deciders with llvm foundation’s blessing?
- have a buildbot view that shows only red bots?
- have an in-tree script for setting up a build + prereqs (eg. new-enough host gcc, gnuwin tools on win, new enough cmake, etc)?
Can be handy - but also can have a large support surface (what platforms would that be supported on?).
- current bots don’t cover multi-config cmake generators (ie Xcode, msvc before 2019)
-- explicitly say we don’t support those? would allow some cleanups
--- msvc 2019 cmake support generates ninja builds for both debug and release and calls ninja for the actual build
--- maybe do something similar for xcode?
That seems a bit orthogonal to the rest of the discussion, maybe more suited to the cmake update conversation/thread (though tangential there too, perhaps).