Asking for help with Windows CI resources

Hello LLVM community,

I’m posting this to ask for help with a pressing issue we’re currently facing with our LLVM pre-commit CI, specifically concerning the Windows runners.

Over the past few weeks, we’ve observed increasing delays in job execution in our Windows CI pipeline. These delays directly affect the development process: jobs take significantly longer than usual, so pull requests wait for extended periods to be processed by the Windows runners. The CI is currently hosted on Buildkite, but it has proven pretty cumbersome to maintain because the community has trouble accessing its configuration. We want to move it over to GitHub, but we need resources.

To address this challenge, we are actively seeking solutions to optimize and expedite our Windows CI pipeline. One promising avenue is the acquisition of additional GitHub credits to allocate toward faster Windows runners. These credits would help alleviate the current delays, enabling us to provide timely feedback on PRs and ensuring a more responsive development environment for all contributors.

There are other ways to help, such as adding custom runners to the LLVM organization, but that requires a bit more investment.

We understand the significance of Windows compatibility within LLVM, and we believe that by addressing these CI challenges, we can collectively enhance the stability and performance of LLVM on the Windows platform.

If your company is invested in LLVM development and wishes to contribute to resolving this issue, we seek your support through the donation of GitHub credits or any other resources that can help expedite our Windows runners.

Thanks.

13 Likes

@goncharov

Hi @tobiashieta ! We have a few machines running in GCP for Buildkite builds. Do you want to set up GitHub runners? I guess we can look into donating some of the machines to GitHub Actions (as we should move to them anyway). Those would be custom runners, not something that is provided by GitHub.

Or is the current problem exactly that those Buildkite Windows runners don’t have enough capacity?

As others and I observed this weekend, our existing runners had a hard time catching up with the queue. It appears that they started idling sometime in the second half of Sunday. This was achieved with the help of a machine @philnik added to the pool.

We have a related thread from December:

I think I agree with Anton that the queue times are inactionably long, and this goes back to the comments at the round table about establishing a premerge testing time budget. Clearly, 8 hours exceeds any reasonable time budget.

If anyone else can add runners to the pool, that would help.

I also think we need to explore ways to run the premerge checks less often. Currently, I believe we build and test on every push, which wastes scarce cloud compute resources. We could trigger the Windows tests manually, or only on initial upload, or only after approval. This would reduce the number of jobs clogging the queue.

2 Likes

Manually means people remembering to ask for it; I wouldn’t count on that happening.
After approval seems kind of late in the process; this is supposed to be pre-merge testing, after all.
Automatic when the PR is first created, with a way to rerun, seems like the best option.
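
To make the trade-off between these trigger options a bit more concrete, here is a rough Python sketch that counts how many Windows jobs each option would queue over the life of one PR. It is illustrative only: the `Event` and `Policy` names are made up for this example and do not correspond to any Buildkite or GitHub Actions setting, and it assumes every option keeps an explicit way to rerun.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical PR events, named for this example only."""
    OPENED = "opened"            # PR first created (initial upload)
    PUSHED = "pushed"            # new commits pushed to an open PR
    APPROVED = "approved"        # a reviewer approved the PR
    RERUN_REQUESTED = "rerun"    # someone explicitly asked for a run


class Policy(Enum):
    """The trigger options discussed in this thread."""
    EVERY_PUSH = "every_push"
    MANUAL = "manual"
    INITIAL_UPLOAD = "initial_upload"
    AFTER_APPROVAL = "after_approval"


# Which events start a Windows job under each policy.
# Assumption: an explicit rerun request is always honoured.
TRIGGERS = {
    Policy.EVERY_PUSH: {Event.OPENED, Event.PUSHED, Event.RERUN_REQUESTED},
    Policy.MANUAL: {Event.RERUN_REQUESTED},
    Policy.INITIAL_UPLOAD: {Event.OPENED, Event.RERUN_REQUESTED},
    Policy.AFTER_APPROVAL: {Event.APPROVED, Event.RERUN_REQUESTED},
}


def should_run_windows_checks(policy, event):
    """Return True if this event should queue a Windows job."""
    return event in TRIGGERS[policy]


def count_jobs(policy, events):
    """Count the Windows jobs queued for a sequence of PR events."""
    return sum(1 for e in events if should_run_windows_checks(policy, e))


# Example PR history: opened, two follow-up pushes, then approved.
history = [Event.OPENED, Event.PUSHED, Event.PUSHED, Event.APPROVED]
for policy in Policy:
    print(policy.value, count_jobs(policy, history))
```

For this made-up history, running on every push queues three Windows jobs, while running on initial upload or after approval queues one each, and manual-only queues none unless someone asks.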

Given that Sony does care about the toolchain running on Windows (although not targeting Windows), I had relayed this request to an internal channel. I’ll poke again, but our decision-making process is not the fastest I’ve ever seen.

I believe this is actually how many “merge queue” systems work: after approval but right before merging, the CI “merges” the PR with the current main branch and tests this candidate merge before pushing it to main.

Speaking of which, would merge queues be a solution for Windows builds?
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/incorporating-changes-from-a-pull-request/merging-a-pull-request-with-a-merge-queue

1 Like

I understand that, but I’m not aware of any active proposals to move the project to merge queues.

When this was discussed on Discord, we saw two problems with the current pre-commit CI configuration:

  1. We don’t have enough runners, leading to very long wait times.
  2. The community doesn’t have full access to the Buildkite configuration, and it’s not clear how we could add more resources even if we had them.

Everyone I have talked to is of course very grateful that you got this configuration working and have maintained it. But our idea of migrating to GitHub was just to make sure that we don’t have a bus factor of 1 and to increase our ability to help out.

Maybe this can be solved with Buildkite, with more resources added to the pool and access shared (and explained; I haven’t worked with BK before).

I think some of this has been raised in the other thread here: [RFC] LLVM Precommit CI through Github Actions.

I am open to hearing suggestions on how we can reduce the wait for Windows runners and how to improve access to the pre-commit CI. I think this can be handled separately from the discussion about moving to GitHub for now.

3 Likes

Agree! Moving from Buildkite to GitHub Actions should be a nice step. I have already experimented with using custom runners, and we set up one machine with @tstellar for Linux; I’m not sure what its status is now.

Moving forward: the next step should be to convert the Windows build to be based on GitHub Actions. I can definitely help with moving the Windows machines and converting them to “custom runners” in GitHub terms; from that point they should be a ~black box that anyone can use. It would be great if someone is ready to step in to create and configure GitHub Actions for that. @boomanaiden154-1 do you want to participate?

1 Like

Yes. I would be interested in helping to set up configurations and with any other tasks related to moving the existing infrastructure over to GitHub Actions.

I’m probably not that qualified to help with the Windows CI specifically as I don’t do any development on that platform, but I’m willing to help figure out how to set it up if there isn’t anyone else interested in doing that part of the pipeline.

I can help with Windows to some extent, as I have experience working with it on Azure Pipelines. Probably not writing from scratch, but reviewing.

2 Likes

Great! Thank you @boomanaiden154-1 and @Endill ! I will try to set up one Windows worker this week so we can start running some tests on Windows.

Could we also start looking into moving out of a Google-owned GCP project so that community members can contribute to VM maintenance? We should also reduce the “bus factor”, or the dependency on the “goodwill of Google employees”, for unblocking ourselves when a machine runs out of disk space or hits similar problems.

Ideally we’d have foundation-owned projects on GCP/AWS/Azure, and sponsors could donate credits (or funding) for these.

I don’t mean to derail setting up the self-hosted GitHub runner; please proceed with this, of course!

1 Like

So far there are no such credits available. And no sponsors as far as I know…

1 Like

Yes, that is something I am trying to arrange. Does your company have any interest in supporting premerge? Mac machines seem to be in great demand…

1 Like

I have added one more machine to the Windows queue, making it 7 in total.

3 Likes

Some of the problems we’ve seen recently with Windows CI go beyond the availability of the machines: there are “errno 22” errors when opening some source files (in general when running lit): Windows `github-pull-request` pre-merge CI fails with "OSError: [Errno 22] Invalid argument" on various tests · Issue #77086 · llvm/llvm-project · GitHub
(I saw more instances of this just yesterday again)