RFC: Future of Windows pre-commit CI

For the past few weeks, the Buildkite Windows CI jobs have been intermittently failing with the error:

So far there has been no progress made on getting this fixed. We’re not even sure if the buildkite maintainers are aware of this issue (@goncharov).

It’s hard to get exact numbers on how often this is happening, but I just counted 5 of the last 15 failures had this error. The current pass rate for the full pull request pipeline (which includes the Linux builds) is 40%, so a rough estimate would be that (6/10) * (1/3) = ~20% of the CI builds are failing with this error.

There have been recent disscussions about potentially migrating the Windows CI to GitHub Actions, and I would like to propose that we start that transition as soon as possible.

My proposal is to shut off the Buildkite Windows testing in a week if we haven’t resolved the failure mentioned above, and in the mean time immediately enable our existing GitHub Actions based CI for PRs targeting the main branch. The CI tests I’m proposing would:

  • Run on Windows only.
  • Test LLVM, clang, and lld sub-projects only
  • Run in 1-4 hours depending on caching.

There would be room for future improvements either by adopting this proposal or by adding some self-hosted runners to GitHub, but for now, this is probably the best we can do with GitHub Actions.

Since the GitHub Actions based tests are limited in what they can test and how fast they can run, the decision comes down to: Is having less thorough, but more accurate tests better than the current status quo with buildkite?

7 Likes

PRs I put up and updated for the past two weeks seem to have 100% hit rate by this issue. I don’t hold a strong opinion on shutting down BuildKite Windows pipeline, as we can’t repurpose its resources for anything else at the moment (it seems), but you say it still works sometimes. But I’m in full support of testing on Windows using GitHub Actions as soon as we can.

1 Like

I’m curious how did you come up with this list?
Do we have statistics about the cost of adding each of the subprojects on top of LLVM?
If the tests are slow (comparatively speaking to the build), have we considered a “build-only” CI? (the “build” is cacheable, the “test” aren’t with our system: making the tests a likely bottleneck).

These are the sub-projects that we are currently building and testing on the release branch, so I have confidence we can start testing them on the main branch immediately.

We chose these projects for testing on the release branch, because at the time, these were the most the we could build without running into the disk space limitations or the 6 hour build timeout.

The default github runners have more disk space and CPUs now, so it’s possible we could add more, but I have not tested this.

This is an option. It just depends on if we value the lit tests more or less than build testing other sub-projects. I, personally, don’t have a strong opinion on this.

Providing a build-only CI looks good to me if we cannot get a low-flakiness bot.

If the bot has more bandwidth, lit provides a feature to shard tests and run just one shard:
LIT_NUM_SHARDS=100 LIT_RUN_SHARD=1 ninja check-llvm

Ideally, the bot infers the to-be-tested directories from the modified files. We can even restrict the tests to a few directories that more likely exhibit Windows-specific failures, e.g.

// incomplete list
clang/test/Driver
compiler-rt/test
llvm/test/Support

Some directories, such as llvm/test/CodeGen, are slow to test but rarely exhibit Windows-specific failures. Note, we are only interesting in Windows-specific failures. If a change causing a failure Windows very likely causes a failure on Linux, testing it on Windows does not give much return-on-invest.

compiler-rt/test contains runtime tests, which frequently exhibit Windows-specific failures. If check-compiler-rt is too slow, run just check-asan for compiler-rt/lib/asan changes.

The goal of pre-merge testing IMO should be to avoid breaking buildbots, and limiting the risk of reverts. So I don’t quite understand the rational of excluding other subprojects on the account of the release branch, seems totally unrelated to me.

It doesn’t really have anything to do with the release branch specifically, it’s just that we have data from testing that branch which tells us what works within the constraints of the GitHub hosted runners.

If we want to enable something ASAP, then I think our best option is to start with those sub-projects.

I am concerned about creating a discrepancy of support in the project right now across subprojects, so this proposal to move forward under pressure is a bit concerning to me right now.
What about first trying to enable all projects in a build only fashion for example?

If people still feel they are getting value out of the existing Windows CI, then I don’t think there is much urgency here, but my impression is that the false positive rate has become too high recently for the CI to be useful. This is what I was alluding to at the end of my proposal when I asked:

If it’s not better than the status quo, then we can take our time and what to transition once we have something better.

If someone wants to send a patch in for that, I’ll take a look.

Any pointer about where is the setup?

Trying to understand the path filtering logic?
How do we avoid to run LLVM tests even when LLVM itself isn’t touched?

As a reminder the current pre-merge script implements cross-project dependencies (a PR touching clang won’t run LLVM tests, a PR touching MLIR won’t build or run LLVM or Clang tests, but will build Flang and run the MLIR+Flang tests, etc.).

There is a different file for each set of path filters, so if you look at the clang-tests.yml file, you’ll see how it only runs the clang tests when the clang paths are modified.

1 Like

Hi! I am aware of this issue but unfortunatelly I don’t develop the premerges anymore, please reach @lnihlen for that matters.

Is there a way that we could get community members access to the premerge machines? Especially as we work on transitioning to Github actions, it would be good to have access for those working on that transition. If not, that transition should hopefully reduce the maintenance that needs to be done on the machines (or at least that is one of the goals), but it would be quite convenient if the community could have some access to these machines.

Are we confident that switching to the GitHub actions version will actually fix the issue we’re seeing? It would be unfortunate to switch because of the problem only to find that the problem still exists… (to be clear, I’m not opposed to the switch in general).

I’m concerned like others that we’re at risk of losing a substantial amount of coverage by making this switch as things stand due to the limitations of the runners, so what I want to know is, what’s the plan to regain this coverage once we’ve switched? Windows shouldn’t be a second-class citizen.

1 Like

Using Tom’s suggestion above, yes. The hosted Github runners do not suffer from the same issue and the windows job running on the release branch is reasonably reliable from what I understand. The workflow definitions would also live completely within the monorepo, so they can be adjusted without depending on anyone external.

Moving to self-hosted Github runners might muddy the waters a little bit. I would think that running everything in a containerized environment (which is the proposal in [RFC] LLVM Precommit CI through Github Actions) would alleviate the problem, but I’m not completely certain.

The windows checks would run as a Github action. Instead of running through a Buildkite pipeline combined with Linux (with two seprate jobs on the Buildkite side), the windows checks would show up as a separate CI job on each PR.

This alone would be a major improvement. The frequently failing Windows builds causing the buildkite check to show a failure really decreases the value of pre-commit checks in my view (you become used to seeing the red cross all the time, and clicking through to verify it really is a false failure gets tedious).

3 Likes

If we can enable testing of subset of projects now, adding the rest of projects later, it is a net win over everyone waiting for all the projects, right? With GitHub Actions, it’s possible for anyone to step up and add other subprojects to an already working pipeline, if they don’t want to wait for our “usual” infrastructure people to do that.

I consider running Clang tests under Windows much more important than building for Windows itself. So while it’s helpful over status quo, it doesn’t give me confidence to merge.

If we can just enable the exact same testing, it is a net win for everyone isn’t it?

This was exactly the same Without GitHub Action. The script is in the repo…
Not sure what you’re talking about