What follows are my notes from the Buildbots roundtable at LLVM Dev US 2022.
I managed to be late for my own roundtable, so these notes pick up about 15 minutes in. Lucky for me, there were many in attendance and the conversation was already going by the time I arrived.
Some points have been shuffled around to make thematic sense. If you have more notes or want to correct me, feel free to reply here.
- Buildkite being used for libcxx precommit.
- Google donated time on Google Cloud for this.
- Fragmentation of CI.
- Buildbot
- Buildkite
- Downstream bots
- Should we move away from Buildbot?
- Is it appropriate for precommit?
- Or is the lack of policies the real issue?
- The role of CI:
- Catching simple broad errors.
- Unusual configurations, custom configurations.
- Not all bots will fit in precommit but most could.
- Without cultural change, a lot of commits will still skip review and CI.
- Process change to gate merge on precommit testing?
- Current Phabricator precommit:
- Failures for no discernible reason.
- People are ignoring it, and being advised to ignore it.
- Too flaky to rely on.
- Base revision not being set is a big problem here.
- “Failed or ongoing build” keeps cropping up when landing a change.
- Buildkite rebases patches onto the top of main, or at least it tries to.
- Is there a common config we can agree on as the starting point for precommit?
- Later this came down to “the fastest config”.
- When running in parallel, the slowest bot sets the response time (vs. post commit where builds have more freedom).
- Should have the ability to request that certain bots are run on precommit (to supplement the small set of defaults).
- This needs to share infrastructure between pre and post.
- Potential RFC to make a single bot the gating bot for merging.
- Who has fast builders?
- Linaro
- Sony
- For gating we want the fastest builder; it doesn’t really matter what it builds.
- Should the foundation fund the gating bot?
- Patch may exist for turning a bot silent if the blamelist is > a certain length.
- Long build times = giant blame list.
- Notify when short list.
- Otherwise send only to the maintainer.
- More subtle than just moving them to staging.
- Linaro has seen this issue with our armv7 bots.
- How about starting a bisection when a slow bot fails?
- Needs spare hardware for that build, and if you had that, you’d just use it to do more builds, right?
- Better option may be to run slower, niche builds after faster, common builds, so that the result is more interesting (even without a bisection).
- Release branches are using Github actions already.
- Cost of this overall is not known.
- Also using pull requests for release merges.
- Best practices for fast builds:
- Ccache
- Documentation has improved recently.
- Still some best practices that could be collected.
- A lot of builders are doing clean builds because of incremental build issues.
- An attendee said that they were doing incremental builds just fine, so some of this may be paranoia.
- Current Buildbot will do a clean build if there are cmake changes.
- Use ninja, lld, …etc.
- Some buildbots are on stale configs, need to reach out to maintainers.
- Minimum tools version bots? (Getting Started with the LLVM System — LLVM 16.0.0git documentation)
- Can’t be the gating bots because by definition they won’t be the fastest.
- Requires that someone care about it, as someone needs machines to host them.
- Buildbot web UI speed (or lack of it)
- Could be per-commit builders taking resources
- I also heard from others running separate buildmasters that UI responsiveness is not great. Specific cause unknown.
- Incentives problem when moving bots around
- Attempts have been made to move some to staging, there were public objections.
- Do we need a minimum quality level for a buildbot, to reduce case by case debate?
- (Linaro is happy to help design and live by these)
- Attempts have been made to do that.
- Needs someone to push the effort overall.
- Galina shouldn’t have to be the enforcer. There should be a community set policy.
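
To make the blamelist idea from the notes more concrete (notify when the list is short, otherwise send only to the maintainer), here is a minimal sketch of how that routing could look. This is not the patch mentioned at the roundtable, which I have not seen. The `FailedBuild` type, the `recipients` function and the threshold of 10 are all hypothetical names and values chosen for illustration; no specific number was agreed in the room.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical threshold: the roundtable did not settle on a number.
MAX_AUTHORS_FOR_BROADCAST = 10


@dataclass
class FailedBuild:
    builder: str          # name of the bot that failed
    maintainer: str       # contact address of the bot's owner
    blamelist: list[str]  # authors of the commits included in the failing build


def recipients(build: FailedBuild, limit: int = MAX_AUTHORS_FOR_BROADCAST) -> list[str]:
    """Decide who gets the failure email for one failed build."""
    # De-duplicate authors while keeping their original order.
    authors = list(dict.fromkeys(build.blamelist))
    if len(authors) <= limit:
        # Short blamelist: each author is a plausible culprit, so notify them.
        return authors
    # Long blamelist (typical for slow bots): stay silent towards authors
    # and send the report only to the bot maintainer.
    return [build.maintainer]
```

With this shape, a slow bot that accumulates dozens of commits per build only emails its maintainer, while a fast bot with a handful of commits per build still emails the authors directly.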