Notes from the Presubmit Testing roundtable 11 October 2023

This is an attempt to summarize my digital scribbles from the Presubmit Testing roundtable at the dev meeting on 11 October 2023.

The participants were obviously a selection of people especially interested in presubmit testing. It would be interesting to hear thoughts from the rest of the community too!

Kick-off

  • In Chromium, about 50% of the issues we hit when trying to update the toolchain are basic issues which could have been caught by better presubmit testing
  • Hans’s dream: whenever someone uploads a patch for code review, the system would automatically tell them whether it will break the basic build or lit tests on Linux, Windows, or Mac

Should presubmit tests be blocking?

  • It’s generally easy to get people to agree to having presubmit testing that catches lit failures on the basic platforms
  • But it’s hard to get people to agree on blocking tests, as there are concerns about velocity etc.
  • There is a sense that blocking presubmit tests would not go over so well in LLVM
  • People at the table seemed to mostly favor a system where presubmit tests are mandatory (failures would block merging), but with a bypass mechanism
  • Step 1 might not be to make it mandatory, but that could be a goal
  • Not making it mandatory can give a negative signal to the community: tests always being red due to failures on trunk, etc.
    • You have to stand down and say that some checks are required: start with the smallest possible set, so people buy in, then add to it from there.
    • You can always start with “does it compile”. Even that catches lots of issues.
  • If tests are mandatory, or even strongly advised, what does it mean if I bypass and get it wrong? Social stigma?
  • We’ve heard many people say extensive mandatory testing is good. Two folks said efficiency is a concern.

Policy

  • Is there an official LLVM policy to revert to green? Yes.
  • Then a presubmit failure is strictly better than a post-submit failure followed by a revert to green
  • But “green” is ambiguous: what platforms, etc…

Scope

  • We have a lot of buildbots. What if all of those became the presubmit testing? What’s the concern? Cost?
  • It’s expensive and labor intensive to maintain
  • Also commonly agreed problems with the buildbots: they take a long time, and batch changes — you get an email hours or days later about a breakage and it’s not clear how it relates to your change.
  • Suppose the buildbots are divided into “official” and “non-official”, where the official ones are non-batching and made part of presubmit testing?
  • There’s a risk that everyone wants to add their tests. Where do we draw the line? Who owns that? Area team?

Phases of Development

  • Maybe testing should be different in different phases of development
  • GCC does phases where breaking stuff in the early ones is not so problematic, but then in later stages you’re not allowed to break anything
  • LLVM does stabilization on release branches
  • Maybe we should say that when we start looking at things going into the release, that’s when we tie things down?
  • We have the technology, this is not rocket science, but at what stage it’s done affects the velocity of development.

GCC

  • Supposedly GCC now has presubmit testing, implemented by Linaro (not sure where we heard this)
  • Release process notwithstanding, projects seem to identify a basic bar that they want to check before committing changes
  • It doesn’t have to be everything

Velocity

  • We have developers who want to be able to get patches in quickly. People may urgently be waiting for something to get merged, to unblock them. Having to wait 2-3 hours will set people on fire.

  • There are always overrides

  • Letting breaking changes into trunk can also slow velocity: while I’m working on fixing something, someone breaks something else and it’s harder for everyone to make progress

  • For feature work, it takes weeks to get a patch reviewed, so another six hours to get it merged may not be a big issue, so long as there is an override for emergencies.

  • Getting an email two days later saying that you broke something is very annoying. Would love an option to say “please submit my patch once you’ve run all the tests”. Don’t care what everyone else does, but it would save me trouble.

  • Related to the keynote on compassion in open source: Presubmit testing may cause my velocity to go down, but everyone else’s may go up since they are not blocked on red bots — helps the community, perhaps at a small expense to the individual

  • “One or two hours is long enough that I’d say absolutely not and go around it”. 10 minutes absolute maximum.

    • LLVM has strongly pushed people towards incremental workflow
    • We often split out unrelated changes: NFCs, pre-submitting test baselines
    • This needs a tight iteration loop
  • At the same time, those NFC one-liners are exactly what one would want presubmit testing for

  • Having a minimum test bar with a fast cycle, and an opt-in “test the heck out of this”, would make total sense

Experience from libc++

  • Presubmit testing on every PR
  • Very extensive, almost 60 jobs
  • All supported configs — having presubmit tests is a requirement for being a supported platform
  • Very costly, but invaluable for quality, they love it
  • Almost never have to revert changes
  • Better experience for contributors, especially new ones: don’t get their patches reverted
  • The main problem is bandwidth: the testing is slow.
  • One solution is setting up a merge queue: only run a small set of tests on PR upload, then after review, press a button and let the merge queue run full tests before merging if they pass
  • There’s a github feature for this (it’s per-project, but maybe it could be set up to only run tests for libc++ changes for now)
  • Would greatly reduce strain on main CI, give faster turnaround, and maintain exhaustive testing on every PR
  • We like this so much, we think most libc++ devs would be happy to make the merge queue mandatory
  • Rolling this out saw little pushback from the libc++ community, but it was also very small at the time.
  • Some have heard that developers complain libc++ premerge testing is expensive and slow, though.
  • Merge queue should solve that.
  • One of the big benefits is having this configured in the main repository
    • People are eager to add coverage for new platforms: can just send patch
    • Not dependent on buildbot master restarts, config changes are just a patch

Experience from Swift

  • We’ve had presubmit testing for 9+ years in public
  • Testing is mandatory before merging anything
  • Build LLVM, Swift, and run all the tests
  • Do people complain? They complain about the speed, but see the benefits. And it’s not optional.
  • It was slow originally, but using caching sped it up.
  • The infrastructure is costly, but the benefit outweighs the engineering cost.
  • You have to add a comment to ask for testing (a rough sketch of this comment-triggered model follows after this list)
    • You can ask for smoke tests, that’s what’s required to merge
    • You can ask for full tests, that gives full coverage of each platform
    • You can ask for benchmarks, which prints out what you regressed on the PR itself
    • It’s all about the waterfall, giving people the ability to do the testing they think is right for their change
    • But there needs to be a minimum set of mandatory tests
  • Running on Jenkins backed by a cloud service provider
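
As a rough illustration of that comment-triggered model (purely a sketch: the trigger phrases and job names below are hypothetical, not Swift’s actual CI commands):

```python
# Hypothetical sketch of a comment-triggered test dispatcher: contributors can
# opt into heavier testing via PR comments, but a minimum "smoke" set is always
# required before merging.
TEST_SETS = {
    "smoke test": {"build", "lit-subset"},                 # required to merge
    "full test": {"build", "lit-all", "multi-platform"},   # full coverage per platform
    "benchmark": {"build", "perf-suite"},                  # results posted on the PR
}
REQUIRED = TEST_SETS["smoke test"]

def jobs_for_comments(comments: list[str]) -> set[str]:
    """Return the set of CI jobs to run for a PR, given its trigger comments."""
    jobs = set(REQUIRED)  # the mandatory minimum, regardless of what was asked
    for comment in comments:
        for trigger, extra in TEST_SETS.items():
            if trigger in comment.lower():
                jobs |= extra
    return jobs

print(sorted(jobs_for_comments(["please run the full test and a benchmark"])))
# ['build', 'lit-all', 'lit-subset', 'multi-platform', 'perf-suite']
```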

Make it easy to test

  • One benefit of presubmit testing is that it makes it easy for developers to test
  • Especially for platforms that they don’t have access to
  • Especially for newcomers

What we have today (outside of libc++)

  • If this is such an obviously good idea, why haven’t we done it yet? Some of us had this exact same conversation last year.
  • We mostly do have it already: presubmit testing existed in Phabricator, and we moved it to GitHub PRs.
  • There’s some sense that it doesn’t work great, though:
    • Often fails for unrelated reasons “red all the time”
    • There’s a huge backlog, takes like 24 hours to get a result (people had mixed experiences)
    • Can we get some data on how often it fails?
    • Testing is Linux only, or did we also have Windows? Definitely no Mac
  • Maybe there are two reasons:
    • Nobody has stepped up and said they’re going to make this great for their project (like libc++)
    • Machines. It’s hard to get enough machines.
  • We run self-hosted on a mix of machines, mostly very powerful.

Limited testing based on patch contents

  • Let’s say you touch the X86 backend; you’re probably not going to break ARM
  • Maybe then a smaller set of tests could be run? (a rough sketch of the idea follows after this list)
  • Seems hard to implement with any accuracy in our current build
  • Also watch out for implicit interactions, e.g. a libc++ change can break lldb data formatters
  • The fact that LLVM is inherently a cross-compiler is in large part tied to our testing
    • For example, RISC-V tests mostly run on x86, because that’s where most of our testing runs
    • If I could use presubmit checks running on RISC-V, maybe I only care about my tests running on RISC-V
    • If we move to a model where presubmit bots run a bunch of things on different platforms, the developer loses the value of writing platform-agnostic lit checks
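
A rough sketch of the idea (the path-to-target mapping below is purely illustrative, not LLVM’s actual CI configuration, and real dependency edges like the libc++/lldb one make an accurate mapping much harder):

```python
# Hypothetical sketch: derive a reduced set of check targets from the paths a
# patch touches, falling back to a broad set for anything unrecognized.
PATH_TO_CHECKS = {
    "llvm/lib/Target/X86/": {"check-llvm-codegen-x86"},
    "llvm/lib/Target/AArch64/": {"check-llvm-codegen-aarch64"},
    "clang/": {"check-clang"},
    "libcxx/": {"check-cxx", "check-lldb"},   # implicit interaction: data formatters
}
DEFAULT_CHECKS = {"check-llvm", "check-clang"}  # unknown area: test broadly

def checks_for_patch(changed_files: list[str]) -> set[str]:
    """Return the check targets to run for a patch, based on touched paths."""
    checks: set[str] = set()
    for path in changed_files:
        matching = [t for prefix, t in PATH_TO_CHECKS.items() if path.startswith(prefix)]
        if matching:
            for targets in matching:
                checks |= targets
        else:
            checks |= DEFAULT_CHECKS
    return checks

print(sorted(checks_for_patch(["llvm/lib/Target/X86/X86ISelLowering.cpp"])))
# ['check-llvm-codegen-x86']
```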

Buildbot

  • Seem pretty unloved
  • libc++ thinks we should ditch buildbots and just go with what they have instead
  • Do we have statistics? It feels like finding a good LLVM revision was easier a few years ago. Teams could get weekly releases out, now they’re happy if they can do it monthly.
  • We don’t have organizational power to say no to a failing buildbot.

How to make progress

  • If we’re concerned about libc++ presubmit speed, maybe the first thing we should do is help make that better
  • If we get libc++ presubmit testing looking really good, we have a showcase, and then other projects can start adopting the same system
  • To start incrementally, maybe we should start with the runtimes?
  • Put this through the RFC process? Mandatory tests seems like a big change.

Oh, I missed that there was a roundtable on this topic; it’s one that is super important! Thanks for taking the time to write notes and organize it, @hansw2000: highly valuable.

This is what I call the “fire and forget” mode: I don’t mind that my commit does not land immediately and will take 2h. However I care about the overhead for myself:

  • “Fire and forget” means I don’t get more overhead landing the commit, assuming the test passes.
  • I only get to do some work if there is a failure: but if there was a post-merge failure I would also have to do some work, so this isn’t overhead! (on the contrary actually, it removes the urgency to fix).

Caveat: this requires high confidence that a CI failure means “my patch is broken” (and not: flaky test, infra failure, trunk broken, etc.).
→ I need actual zero overhead for my NFC changes (and am willing to wait 2h for these to land).

Other aspects:

  • Reverts, which I haven’t seen addressed here: a post-merge buildbot is broken and I need to revert; can I push directly? Are we delaying getting back to “green”? The revert could also fix the bot but break the premerge config for another reason
  • “mid-air collision”: unless you serialize everything (is it realistic?), trunk will break because of conflicting changes landing at the same time: do we block everyone’s contribution until trunk is fixed?

Testing time seems quite OK actually; it likely depends on what you’re working on. Here is a run on a recent PR: Github pull requests #8565
Because it touches MLIR, it’ll also run the flang tests, but it won’t run the clang or LLVM tests (just build what’s needed as dependencies).

  • Linux build/test: “Ran in 7m 34s”, “Waited 8m 47s”
  • Windows build/test: “Ran in 24m 34s”, “Waited 45m 55s”

The main issue historically was keeping this config green: without a post-merge buildbot, there is no signal that “trunk is broken” and so “revert back to green” does not happen, or not quickly enough.
A bot was recently added: Add builders for premerge-like configurations by metaflow · Pull Request #44 · llvm/llvm-zorg · GitHub; but the worker has been dead for a while: Buildbot (@goncharov).
Without this we get in the mode where folks ignore the CI all the time because it’s always red anyway for various reasons.

I actually like the service the buildbots provide; in particular, I don’t quite see what the alternative would be.
But regardless, we should discuss requirements, that’s the best way to figure out what actually solves the problem!
In particular I believe that a post-merge system has different constraints than just “running the build and reporting a red/green signal” which can be enough for pre-merge testing.

For example, post-merge systems have to be able to decide when to batch commits if a bot can’t keep up. They need to keep track of the history, if only to build a blame list. Then there is a need to notify the blame list (and the build cop) when something fails. A post-merge bot can also do incremental builds differently.
Investigating failures also puts pressure on dashboarding: when did this bot start failing? Was this bot flaky? For a given change, where has it been tested? (This last point is not straightforward when you include commit batching! GitHub build statuses don’t help, AFAIK…)
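
For the blame-list point, here is a minimal sketch of what a batching post-merge system needs to reconstruct when a batch fails (the names and data are hypothetical):

```python
# Hypothetical sketch: when a post-merge bot batches commits and the batched
# build fails, the blame list is every commit after the last known-green build,
# up to and including the head of the failed batch.
def blame_list(history: list[str], last_green: str, failed_head: str) -> list[str]:
    """history is the branch's commit hashes, ordered oldest to newest."""
    start = history.index(last_green) + 1
    end = history.index(failed_head) + 1
    return history[start:end]

commits = ["a1", "b2", "c3", "d4", "e5"]
print(blame_list(commits, last_green="b2", failed_head="e5"))
# ['c3', 'd4', 'e5']  -> these authors (and the build cop) get notified
```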

Oh, and for roll-out: my take is that if you have a “fire and forget” system that provides enough value, people will naturally use it and the transition to “mandatory” becomes easier.
Over time, the few people who “bypass” and break the CI will get peer pressure (“please run the CI next time” comment on their reverted PR…).

Step 1 is “proving” that a robust system is possible (and not for a small isolated component, rather for at least LLVM+Clang+CompilerRT), and actually can be “zero overhead”.

Going from “opt-in” to “mandatory blocking” should be a non-controversial decision based on existing practices, we have yet to get there IMO.

I think the first thing we need is some community compute resources that anyone can easily use. GitHub Actions has this, and just in the last 2 months we already have a few pre-commit jobs that check code formatting and test build the documentation.

However, the free tier GitHub Actions machines are not really powerful enough to implement the kind of CI we are talking about here, so only people with access to large compute resources are able to experiment and develop the more resource intensive CI systems that we need. I really would like to change this so that anyone can contribute to a new CI system.

I have been looking recently at how to set up a cluster of self-hosted GitHub Actions runners. I found this blog post which describes a very efficient way to manage the self-hosted runners. If anyone has experience with Kubernetes, Terraform, or any of the cloud services and is interested in working on this, please let me know and we can try to get something up and running.

We talked about setting this sort of thing up a little bit at EuroLLVM last year (details start around RFC: Prototyping Pre-commit Testing Using Buildbot - #27 by kwk). One of the main things that was brought up was the cost of running pre-commit CI. I’m not sure who’s interested in footing the bill for this sort of thing.

Also, the pre-merge CI right now does somewhat work. I know there’s been some interest in moving the infrastructure over to GitHub Actions, but I’m not sure what the progress has been like there. I think that would be a great opportunity to consolidate everything (rather than having stuff in some google/ GitHub repo) while also making it easier for third parties to hack on. Converting that over would also somewhat solve the problem of who’s paying for the compute and would prevent duplication. I’m not too familiar with the cloud setup that they use currently. I think it might even be Kubernetes on top of GCP, but I’m not entirely sure.

I have some experience with self-hosted Kubernetes, but none with Terraform/OpenTofu on any of the major cloud providers. I’d be interested in contributing to these efforts once we have consensus on how we want to set everything up, but I probably won’t have sufficient amounts of time to really get things working until I’m out for winter break in December. Can probably contribute sporadically between now and then though.

Does anyone know about how many cores are needed to build llvm+clang+compiler-rt in 10 minutes?

Seems to me that while the question is interesting, the answer does not directly apply to a CI environment where it’s all about the caching infrastructure: when a large part of the build is cached, the actual time is dominated by running the tests.

Edit: for example, here is one of the MLIR buildbots (a GCP VM with 16 cores and 32 GB) which does a clean build but uses ccache:

  • 32s to run cmake
  • 37s for the build (thanks ccache)
  • 1min16s for the tests (and MLIR’s tests are much, much smaller than LLVM’s).

Another example, a build from clang-x86_64-debian-fast (Buildbot) (96 CPUs), which also does a clean build with ccache but runs the llvm/clang tests:

  • 27s cmake
  • 55s build
  • 3min56 for the tests

This is already done I believe.
(who pays and provides workers is slightly orthogonal)

There’s integration with GitHub Actions (job statuses get reported in GitHub Actions), but from what I can tell all the pipeline definitions and general infrastructure are managed through Buildkite. https://github.com/llvm/llvm-premerge-checks is still actively maintained and the premerge checks link to Buildkite from within GitHub. @goncharov can probably provide more info on that though.


I quoted too much; I was referring to the fact that the scripts migrated from a “google” repo to LLVM (here, I think?)

Does anyone know about how many cores are needed to build llvm+clang+compiler-rt in 10 minutes?

A 32-core virtual machine does this just fine as a daily driver.

From a large branch rebase (basically a clean build, check target):
real 14m36.972s
user 382m25.218s
sys 66m30.832s

From an incremental rebuild/test (check-llvm target) on the same commit:
real 2m10.509s
user 38m50.537s
sys 22m13.222s

The latter may seem unfair as it’s incremental, but then, why is it unreasonable for CI to be incremental? And this case is the normal one for iterative development - add a new test, run check, submit, rebase, patch, run check, update review.
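
A rough back-of-envelope from the clean-build numbers above (an estimate only, assuming near-perfect parallel scaling, which real builds don’t achieve because of serial phases like configuring and linking):

```python
# Estimate the core count needed for a ~10-minute clean build+check, using the
# reported CPU time (user + sys) from the clean-build run above.
user_minutes = 382 + 25.2 / 60   # 382m25.218s
sys_minutes = 66 + 30.8 / 60     # 66m30.832s
cpu_minutes = user_minutes + sys_minutes

target_wall_minutes = 10         # the "10 minutes" from the original question
print(f"lower bound: ~{cpu_minutes / target_wall_minutes:.0f} cores")
# ~45 cores assuming perfect scaling; the 32-core run above took ~14.6 minutes
# of wall clock (about 31 "effective" cores), so the real number is higher.
```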

Hi Tom, as I replied somewhere already, I am down to do that. But probably after moving off Phabricator completely; then we will be able to move VMs from serving Buildkite to self-hosted GitHub runners.

https://github.com/llvm/llvm-premerge-checks is still here but it’s not really orchestrating anything. It only contains the code for Docker images and a small server for Phabricator integration. All the “interesting” parts are now in the llvm-project/.ci folder that generates the pipelines.

On this note: I don’t see much interest in contributing to CI even after it got moved to the llvm-project sources. Probably that’s because there is no good documentation on how it works and how to modify / test things. What would be a good place for it? Just a README.md in this folder? Maybe something in the official docs on the web?

10 minutes for a basic build is really hard to get; I think it’s only reasonable to expect this from Linux builds. Windows might be able to complete the build, but not the tests for sure.
I wonder if there is a way to have an “optional commit queue” - so I can give away my CL for submission if I am not in a hurry, but otherwise break glass and submit as today. I think that introducing even small bumps for the blind NFC commits to main will have a positive impact on how people land changes - e.g. I submit straight to main as it’s just so easy. One step forward here would be to only allow commits that have a PR (save bots, of course).

One thing I learned about roundtables is that it’s impossible to take extensive notes and participate in the discussion at the same time :slight_smile:

So here are some more details of where I’m coming from. These are a sample of issues I wish had been caught before reaching main:

The failures may seem small, but tracking them down can take a lot of time, which we’d rather spend on other things.

LLVM also gets lots of commits like “try to fix X on platform Y”. For common platforms it would be much nicer if the “try” could happen before merging rather than after.

This example is especially interesting:

Again, seems like a great use case for presubmit testing.

So what’s missing?

Thanks to Mikhail, Christian and others we already have some presubmit testing.
But that’s been focused on Linux, which seems like the least important platform since most developers seem to test there locally anyway.
Windows gets tested sometimes, but not always? E.g. on my latest PR ([Driver] Ignore non-clang pch files when -include a.h probes a.h.gch by zmodem · Pull Request #69204 · llvm/llvm-project · GitHub) I only see Linux testing.

And we’ve never had Mac presubmit testing, which seems almost ironic given the project’s history.

So rather than big questions about how to revamp our workflow, I’m mostly interested in two things:

  1. How can we get Windows presubmit testing to run reliably?
  2. How do we get testing on Mac? Would Apple be interested in providing machines? Could we rent VMs somewhere (I think AWS has Macs)?

@goncharov Do you have one spare instance that we could use to set up a prototype self-hosted runner?

That’s a great idea. I don’t recall discussing this before, and I feel like it could be a clear path forward to get where we want to go, one that creates value for the community along the way.

I think empowering the community is broadly good, but I want to push back a little on the idea that we as a project are going to liberally allow folks to add new resource intensive premerge checks. We are currently undergoing a period of rapid development with the new PR system and I don’t want to slow that down by imposing new process, but in the long run, as a community, we need to maintain a high quality bar for new premerge checks.


One of the questions I tried to ask at the roundtable, and would like to get a clear answer to from the community, is: what is our latency tolerance for premerge testing in a fire & forget system? Suppose we have a command, like “@llvmbot /testandmerge”; what should our latency service level objective (SLO) be, in minutes? That budget should guide our curation of tests, build configurations, compute resource allocation, etc.

I want to emphasize that there is no right answer to this question, everyone I ask gives me a different answer. Some folks want rapid iteration, and some folks want extremely reliable landing, with little to no risk of reverts and follow-on communication of issues after landing. I think the best way to get a clear answer is probably to start a new thread with a survey using buckets for time. So, let’s table this question for now, and look forward to a new thread in the future.

I don’t think we should let anyone add any premerge checks that they want either. I just want to make sure there are resources available so anyone can design and prototype a CI system. Without community resources, the number of people who can contribute to this is very limited.

I’ve seen studios take exactly this approach. There’s an optional build queue that will run build and CI before merging. If it fails, it gets bounced back to the submitter to try again. People who go direct and break the build will take heat for it. Adoption tends to go up because it helps keep things (at least the main things) green.

Independently (IMO), it would be nice to have a way to request a specific bot to build/test my fork, so I can diagnose/repair the failure after someone reverts my (non-CI-breaking but other-config-breaking) commit. But that’s separate from Presubmit Testing.

The build queueing system I’ve seen does indeed serialize everything. The build might be parallelized, but the queue is not (this is inherently how git works). If two people queue conflicting changes, the first one wins and the loser needs to do the accommodation.


sure, I can create one