Buildbots roundtable notes US LLVM Dev 2022

What follows are my notes from the Buildbots roundtable at LLVM Dev US 2022.

I managed to be late for my own roundtable, so these notes pick up about 15 minutes in. Lucky for me, there were many in attendance and the conversation was already going by the time I arrived.

Some points have been shuffled around to make thematic sense. If you have more notes or want to correct me feel free to reply here.

  • Buildkite being used for libcxx precommit.
    • Google donated time on Google Cloud for this.
  • Fragmentation of CI.
    • Buildbot
    • Buildkite
    • Downstream bots
  • Should we move away from Buildbot?
    • Is it appropriate for precommit?
    • Or is it lack of policies that is the real issue.
  • The role of CI:
    • Catching simple broad errors.
    • Unusual configurations, custom configurations.
  • Not all bots will fit in precommit but most could.
  • Without cultural change, a lot of commits will still skip review and CI.
    • Process change to gate merge on precommit testing?
  • Current Phabricator precommit:
    • Failures for no discernible reason.
    • People are ignoring it, and being advised to ignore it.
    • Too flaky to rely on.
  • Base revision not being set is a big problem here.
  • “Failed or ongoing build” keeps cropping up when landing a change.
  • Patches rebased onto top of main by Buildkite, at least it tries to.
  • Is there a common config we can agree on as the starting point for precommit?
    • Later this came down to “the fastest config”.
  • When running in parallel, the slowest bot sets the response time (vs. post commit where builds have more freedom).
  • Should have the ability to request certain bots are run on precommit (to supplement the small set of defaults).
    • This needs to share infrastructure between pre and post.
  • Potential RFC to make a single bot the gating bot for merging.
    • Who has fast builders?
      • Linaro
      • Sony
    • For gating we want the fastest builder, it doesn’t matter really what it builds.
    • Should the foundation fund the gating bot?
  • Patch may exist for turning a bit silent if the blamelist is > a certain length.
    • Long build times = giant blame list.
    • Notify when short list.
    • Otherwise send only to the maintainer.
    • More subtle than just moving them to staging.
    • Linaro has seen this issue with our armv7 bots.
  • How about starting a bisection when a slow bot fails?
    • Needs spare hardware for that build, and if you had that, you’d just use it to do more builds, right?
    • Better option may be to run slower, niche builds after faster, common builds. So that the result is more interesting (even without a bisection).
  • Release branches are using Github actions already.
    • Cost of this overall is not known.
    • Also using pull requests for release merges.
  • Best practices for fast builds:
    • Ccache
    • Documentation has improved recently.
    • Still some best practices that could be collected.
    • A lot of builders are doing clean builds because of incremental build issues.
    • Attendee said that they were doing incremental builds just fine, so there may be some paranoia to this.
    • Current Buildbot will do a clean build if there are cmake changes.
    • Use ninja, lld, …etc.
    • Some buildbots are on stale configs, need to reach out to maintainers.
  • Minimum tools version bots? (Getting Started with the LLVM System — LLVM 16.0.0git documentation)
    • Can’t be the gating bots because by definition they won’t be the fastest.
    • Requires that someone care about it, as someone needs machines to host them.
  • Buildbot Web UI speed (or lack of)
    • Could be per-commit builders taking resources
    • I also heard from others running separate buildmasters that UI responsiveness is not great. Specific cause unknown.
  • Incentives problem when moving bots around
    • Attempts have been made to move some to staging, there were public objections.
    • Do we need a minimum quality level for a buildbot, to reduce case by case debate?
    • (Linaro is happy to help design and live by these)
    • Attempts have been made to do that.
    • Needs someone to push the effort overall.
    • Galina shouldn’t have to be the enforcer. There should be a community set policy.
3 Likes

thank you for the write up, David! I was not at the table, much appreciated.

I believe someone mentioned that after the (expected but not officially proposed) move from Phab to GitHub PRs, this could be automated. I don’t know details of how that automation would work, presumably using actions or hooks or something.

1 Like

Indeed, this is a GitHub native feature: protected branches:
https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/about-protected-branches#require-status-checks-before-merging

1 Like

Our buildbot config has a lot of history, our infra is sketchy and our integration with precommit is zilch. Jenkins, Buildkite, etc won’t change some of those things, but it may renovate our approach to CI.

Since we’ll probably move to Github sometime in the future, we could move our main driver to Github Actions, and then pre-commit tests can be chosen from a list of fast bots easily. The bots can still be in whatever they are now, just as long as we trigger them all from the same place.

I don’t think we should be spending the little time we have on infrastructure to move all bots (up/downstream) to the technology-of-the-day. We had enough of those backfiring already.

Use Github PR.

Not quite. “The fastest config” that adds coverage over other fast configs.

Having 10 x86 fast builders will add nothing. We need at least one builder per major arch with the core projects being built and tested. Projects can have different builders if that makes more sense, for example clang+llvm, mlir+llvm, lld+llvm, or all-in-one if the builder is fast enough. We also need runtime libraries (RT, libc++, etc).

Splitting the builders actually makes a lot of sense, since we can only trigger clang builds if the patch touches clang, for example.

This is a losing strategy. If the builders are independent, and stable, we can have as many as we want. All is needed is to have a single aggregating place that can do that, and Github Actions can.

This is dangerous if the maintainer is on holidays / unavailable. Things can escalate.

If those are still the TK1 builders, do we really need them still?

This would be awesome, but it’s really hard to do automatically, as you need to capture the error at the right level and reproducing it may be tricky for a shell script.

An option I’ve seen done is to only rebuild clean when a dirty build goes red, like an automatic retrial. This keeps builds slow only when they fail. Another thing is to automatically detect infrastructure errors (most are easy to spot with grep) and re-trigger after a delay.

Buildbot is perhaps the worst UI I’ve seen do far of all builders.

Not all hardware enjoys the speed that the major vendors have. I think it’s easier to think about core targets versus side targets, like the support policy we have. If you can provide a fast builder, you can cope with the volume of changes, support for your target will be greater.

I agree it would be much easier to set constraints and force hardware vendors to adhere to those to be in the noisy bots list. Same for pre-commit (but much more strict here).

Absolutely! No one should be in that position alone. Even if we create a policy, anyone should be able to call it out and push for changes, like we do for all our other policies.

I think the point here was to avoid 24 hours of build-fail notices from slow bots, after a fast bot has alerted the committer and the problem is already fixed. If the slower bots build only revisions that are known to pass on a fast bot, we avoid a lot of irrelevant noise.

Agreed, I should clarify that the context was more what would the very first builder(s) be.

And the first instinct was oh what’s the most applicable, the most common. But for the very first, speed is going to trump most of that.

Overall you are right though. We’d be aiming to keep the total time to some bar, unless it added a worthwhile amount of coverage.

Again, long term you’re right but the context was in the short term. More like can we get people to accept even a single gating bot.

They are and we’re not quite ready to let go of armv7 yet. We have been meaning to try out qemu-system as a replacement and I spoke to a few other folks at llvm dev who had the same issues.

We’re going to try qemu-system instead. Which I’ve no doubt will beat the decade or more old “real” v7.

I find it pretty easy to work with, bar the responsiveness. Maybe I’ve been burned by too many systems trying to show me all the information all at once.

Sure, this discussion was more about the basics though of this bot has been red forever, is holding up changes, is spamming, etc. The real basics.

I agree that some sort of tiers makes sense overall as well. Not all bots will fit into the pre-commit window, so that’s one obvious difference between potential support levels.

The problem I called out in my talk was “All builders build every commit - even if it’s known to be bad.”.

I wasn’t thinking of this idea when I wrote that, but similar sentiment. Maintainer might have a fast bot as well that they can look at and say “oh, this will be fixed I just have to wait another ”.

(then the follow up of should it even build it at all, but that’s what pre-commit would filter out)

That’s what I thought the policy was meant to fix. My guess is somethinglike:

  • Pre-merge builds can’t be slower than X minutes
  • Pre-merge builds that fail for spurious reasons more than T times per week are removed
  • Core target builds can’t be slower than Y minutes (Y >= X)
  • Core builds that fail for spurious reasons more than S (S >= T) times per week are removed
  • Stable target (noisy) builds can’t accumulate (on avg) more than C commits
  • All other builders are silent

If the bots follow that policy, it doesn’t matter how many of them we have in each category.

1 Like

At the very least, something that builds all projects in a release state would catch build errors, which are easy to fix and can catch stuff that developers may not (for instance, if they don’t build those projects).

So, an LLVM ADT change that was tested on Clang but not MLIR will be caught, if it changes the API.

I’m guessing QEMU on AArch64, so faster emulation than on an x86. Honestly, creating and maintaining those bots was a major pain, whatever you can do to deprecate them will make the current maintainers (you?) very happy.

I’d say three bot tiers:

  • Pre-commit: really fast, really stable, reasonable coverage
  • Post-commit core: Core targets/projects, check-all, test-suite, fast, stable, wider coverage
  • Everything else: Silent bots
2 Likes

Thanks for the notes! I wanted to give some brief high level input:

  • GitHub Pull Requests: We should prioritize this migration first to avoid wasted work integrating with Phabricator.
  • Pre-commit testing: It is time to make the cultural change to move towards mandatory testing in the cloud.
  • Test reliability, speed, and flakiness policies: We should absolutely have and enforce a policy of reliability for the tests and bots we run in this configuration. Renato’s suggested tiers make sense, and we have some existing docs about this.

The rest of the issues are important and can’t be ignored, but they are details, and I would encourage anyone working here to focus on the high level vision of reliable test infrastructure that is accessible to new community members so we can sustain the long-term viability of the developer community.

I repeatedly hear from newcomers to the project that our infrastructure is a major barrier to onboarding. Buildbot fail mail is scary. The Phabricator Harbormaster UI is hard to discover and understand. The LLVM project will not continue to thrive if we don’t invest in basics like infrastructure. As a foundation board member and a manager, I don’t have time to do this work myself, but this is the message that I try to send as often as I can to underscore the importance of this work.

1 Like

I believe this can be achieved quite naturally by making the system appealing to users instead of mandating anything. That is the mandate should come after we reach the point where people are already using it “almost always” because it provides them value and does not add too much friction.
Now it requires first to build the infrastructure, tooling, and integration to actually make it a low-friction / high-value feature (GitHub pull-request will help with this drastically I believe). I’d be concerned about trying to force anything that would regress the development fluidity (having to fight flakiness, baby-sit a pull-request for days before being able to merge it, …).

On another topic: while buildkite and GitHub action seems very nice for pre-merge testing, I haven’t seen anything like buildbot (other than Jenkins and some proprietary solutions) to handle post-merge CI: the most basic feature being blame-list for example, but also some sort dashboard that allows to track the state of the main branch across build configuration as well as tracking the state of a single configuration (these features are present in most solution, but not all at the same time, for example buildkite didn’t have blamelist and git commit author notification last I checked).

I think I agree on all points, but this sounds like a matter of “how to get there” not “where we are going”. Any precommit CI system will have to start off as advisory before it can graduate to mandatory after we gain experience and improve it.

1 Like

We are using the free tier runners, so there is no cost. There are more powerful runners available now (up to 64-core) and we are discussing with GitHub about pricing.

+1

If, say, a GitHub PR-based process actually works well enough that folks feel comfortable using it for essentially all their changes, then requiring a mandatory pre-commit check that adds very little friction is not an issue.

In fact, I think (lightweight) mandatory pre-commit checking is a much easier pill to swallow than getting mandatory pre-commit (human) reviews for all changes.

1 Like

Hi, I would like to chime in on this discussion and add my thoughts that I’ve already shared with my team (@nikic , @tstellar , @serge-sans-paille , @tbaeder , @cuviper ) and with the LLVM IWG (@mehdi_amini ) and @gkistanova .

First I think we must salute to the buildbot fleet maintainers and @gkistanova for providing such a nice variety of build and test flavours. This is great and I think we have a pretty decent post-merge setup right now and lots of issues can be found with it!

I heard from my colleagues that sometimes builders fail that you don’t know about or expected them to fail. They mentioned that it would be nice if you could re-run tests on those builder on demand.

I don’t want to worry so much about the time a pre-merge test takes to complete as long as it is not enforced.

People before me have mentioned that the current pre-commit tests in Phabricator are too flaky. They can find good errors but have their problems and they are completely distinct to the vast fleet of buildbot configurations out there. They represent a distilled set of test configurations, sort of like a lowest common denominator. There’s nothing wrong with it per se.

Nomatter what technology (Phabricator or Github PRs) we use, I think we can borrow from the post-merge buildbot fleet that we already have.

There’s a concept in builtbot called trybot. Given that you have the username and password this concept essentially lets you run a build on a builder of your choice. This sounds scary at first, not very secure and vulnerable to DoS attacks.

But what if you added a layer in between the developer and the trybot feature?

  • This would not only make it more secure and protect our buildbot infra for the post-merge tests.
  • It would also allow us to federate access to buildbots on a whole different level.
    • Not only could developers have a say “where” to test their patch.
    • We could also decide which buildbots want to participate in the trybot scenario at all (on an opt-in basis for the buildbot maintainers).
    • We could have buildbot owners setup a list of paths that, if touched in a patch, their buildbot is triggered.
  • We could collect statistics about which buildbots are often requested or fail the most.
    • Based on these stats we could invest in specific upgrades rather than coming up with a test config that doesn’t meet the real needs and is only worth the paper is is written on.
  • The testing infra gets more transparent to developers.
    • The buildbots not only bug you after you’ve merged your patch but they’ll become a team-player to help you get the best.
  • If you decide to run tests for your patch on a slow but important buildbot from your perspective, you can. Nobody would prohibit this. Afterall it is your choice.
  • Running tests with the same buildbots that we have for post-merge, does make the testing infra less hard to maintain (no buildkite and alike is needed, but it is not excluded).
    • No contractor needs to maintain it but sure can, while the bulk of work is done by buildbots.
  • We maintain the independence of companies/communities to provide buildbots that are important to them and even if they are slow they can still use them in pre-merge tests.
  • We could move to this approach on a very granular level and improve it over time.
    • In gitlab you’ve so called “quick actions” that are text-based controls for issues and merge requests. I’ve a something similar in mind (and in my drawer) that works with Github PRs.
      • You could start with commands like /test-on-builder XY, /re-run-failed-builder, /test-on-recommended-builder.
      • Depending on what you need this can be extended over time incrementally.
      • You can limit whose allowed to run those commands and for whom they have no effect.
      • Github play the role of federating the access to the trybot feature.
      • FYI, the PoC in my drawer even allows you develop the /commands using a locally isolated buildbot setup that somewhat mimics what we have in LLVM (a buildbot master and several builders). It works by running several containers one of which is a github runner that ties up any upstream github repo with your locally running buildbot setup. I found this was needed because we certainly need to tweak the buildbot setup in some ways.
  • The post-merge builtbot fleet can stay intact
    • Buildbot maintainers can toggle a switch to open their buildbot bot a selected list of users/groups for pre-commit testing. No additional effort is need from their end.
  • I think there’s no debate that cleaner patches to the main branch help to improve the quality. Once we start to see people begin to value this, the cultural change comes naturally but it is still not enforced.

Given that all of this is opt-in, from a developer’s and buildbot owner’s perspective, IMHO this setup provides what @mehdi_amini called “appealing to users”. I find it non-invasive compared to having a non-debatable pre-merge testing setup, that only runs the lowest common denominator.

I’d love to hear how this sounds to all the skeptics of pre-merge testing and to other critiques.

1 Like

Yes! An important detail easy to overlook. We don’t (and won’t) have a simple multiplier for CPU time vs. post commit only, and that makes budgeting difficult. Better it’s a gradual opt in.

If we assume that in the near term, some configurations will not fit into precommit and long term, may never, being able to choose to run a specific bot in precommit is desirable.

Regardless of whether there is also a set of non-debatable pre-commit bots.

I have seen this over in compiler-explorer, where new contributors don’t get automatically get checks run until a maintainer allows it. Always a good option to have.

I am very interested to see what that looks like. One issue with some of our current setup is hacking on it takes an age to get set up, this sounds great.

Definitely. I think there are issues that it doesn’t address, such as what to run and how to interpret the results, but it doesn’t have to. It would be a good first step, a proving ground for future directions, and has value in isolation.

This also fits well with a concern which is that to get value out of mandatory pre-commit, it actually has to be mandatory. We can’t have changes bypassing review.

If we do the pre-commit bots will either be in advisory mode (confusing people as to whether they should care or not) or constantly be being maintained (and blocking people’s work when they are broken).

What you’ve set out, being opt-in, bypasses that concern and get us pre-commit experience without making the normal developer experience worse.

Thanks for setting this out so clearly!

1 Like

I think we are trading off a few goals right now:

  1. Developer experience: Tests should be fast, and failures should be actionable
  2. Stability: Detect bugs earlier and prevent them from landing, reduce the time spent on them downstream
  3. Portability: Ensure that LLVM tested and correct on a diverse set of platforms

I think we need more emphasis on goals 1 and 2, without completely losing goal 3 in the process. I think buildbot is designed to serve goal 3 well, but it compromises goals 1 and 2. It allows us to federate testing resources and cover diverse configurations.

However, nobody really knows what to do when tests start failing on non-standard buildbot configurations, and it’s not clear who to escalate the issue to, other than the buildbot admin, who often redirects back to the test suite owners, leaving final responsibility for flakiness or test quality unclear. We also suffer from notification fatigue, and new developers find it particularly discouraging. @dblaikie and @preames both mentioned this to me.

I think, in the long term interest of the project, we should prioritize the goals above in the order that they appear, and I think that means moving more in the direction of distilling, curating, and blessing lowest-common denominator configurations.

What I have in mind is inspired by the Rust platform support tiers. We should write down the set of platforms for which LLVM is guaranteed to build, and have a set of requirements for adding to that list. Compilation should not be flaky. We should be able to guarantee that those failures are actionable. Once we have that, that is what I would like to see become mandatory, not advisory. Once that works, we can start expanding validation to include the test suites which are known to be reliable (lit, FileCheck, minimal execution testing) on platforms with resources to run them.

I think perhaps the most important part of the next steps here isn’t so much which approach or technology we choose, but that we agree to standards of test actionability and actually enforce them. I think if I had confidence that folks would follow up on your approach to gather stats on bot failures and address them, I’d be OK with that direction, but I personally think a better way forward is to drastically reduce testing scope, commit to a flakiness error budget, and add things to that set over time. Either way requires work, and if you are ready to put resources into the approach you outlined, I could be on board, as long as we are really committed to raising the bar for buildbot reliability.

1 Like