I’m posting this to ask for help with a pressing issue we’re currently facing with our LLVM pre-commit CI, specifically concerning the Windows runners.
Over the past few weeks, we’ve observed increasing delays in job execution within our Windows CI pipeline. These delays have directly impacted the overall development process: jobs take significantly longer than usual, and pull requests wait for extended periods to be processed by the Windows runners. Currently, this is hosted on Buildkite, but it has proven pretty cumbersome to maintain because the community has trouble accessing the configuration. We want to move this over to GitHub, but we need resources.
To address this challenge, we are actively seeking solutions to optimize and expedite our Windows CI pipeline. One promising avenue is the acquisition of additional GitHub credits to allocate toward faster Windows runners. These credits would help alleviate the current delays, enabling us to provide timely feedback on PRs and ensuring a more responsive development environment for all contributors.
There are other ways to help, including adding custom runners to the LLVM organization, but that requires a bit more investment.
We understand the significance of Windows compatibility within LLVM, and we believe that by addressing these CI challenges, we can collectively enhance the stability and performance of LLVM on the Windows platform.
If your company is invested in LLVM development and wishes to contribute to resolving this issue, we seek your support through the donation of GitHub credits or any other resources that can help expedite our Windows runners.
Hi @tobiashieta! We have a few machines running in GCP for Buildkite builds. Do you want to set up GitHub runners? I guess we can look into donating some of the machines to GitHub Actions (as we should move to them anyway). Those would be custom runners, not something that is provided by GitHub.
Or is the current problem exactly that those Buildkite Windows runners don’t have enough capacity?
As others and I observed this weekend, our existing runners had a hard time catching up with the queue. It appears that they started idling sometime in the second half of Sunday. This was achieved with the help of a machine @philnik added to the pool.
I think I agree with Anton that the queue times are unworkably long, and this goes back to the comments at the round table about establishing a premerge testing time budget. Clearly, 8 hours exceeds any reasonable time budget.
If anyone else can add runners to the pool, that would help.
I also think we need to explore ways to run the premerge checks less often. Currently, I believe we build and test on every push, which wastes scarce cloud compute resources. We could trigger the Windows tests manually, or only on initial upload, or only after approval. This would reduce the number of jobs clogging the queue.
Manually means people remembering to ask for it; I wouldn’t count on that happening.
After approval seems kind of late in the process; this is supposed to be pre-merge testing, after all.
Automatic when the PR is first created, with a way to rerun, seems like the best option.
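To make that option concrete, here is a minimal sketch of the trigger policy in Python. Everything in it is hypothetical: the `PREvent` names and the `should_run_windows_premerge` function are made up for illustration and are not part of the existing Buildkite setup or any GitHub API.

```python
from enum import Enum, auto


class PREvent(Enum):
    """Hypothetical PR events; not tied to any real CI system's event names."""
    OPENED = auto()           # PR first created
    PUSHED = auto()           # new commits pushed to an existing PR
    RERUN_REQUESTED = auto()  # someone explicitly asked for a rerun
    APPROVED = auto()         # a reviewer approved the PR


def should_run_windows_premerge(event: PREvent) -> bool:
    """Run on PR creation or on an explicit rerun request; skip everything else."""
    return event in (PREvent.OPENED, PREvent.RERUN_REQUESTED)


if __name__ == "__main__":
    # Show the decision for every event type.
    for event in PREvent:
        print(f"{event.name}: {should_run_windows_premerge(event)}")
```

Under this policy, ordinary pushes no longer queue a Windows job, which is where the savings would come from; the cost is that a later push can break Windows without anyone noticing until a rerun is requested.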
Given that Sony does care about the toolchain running on Windows (although not targeting Windows), I had relayed this request to an internal channel. I’ll poke again, but our decision-making process is not the fastest I’ve ever seen.
I believe this is actually how many “merge queue” systems work: after approval but right before merging, the CI merges the PR with the current main branch and tests this candidate merge before pushing it to main.
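As a rough illustration of that flow, here is a toy Python sketch. It assumes a branch can be modeled as a plain list of change names and that `run_tests` is a caller-supplied stand-in for the CI; `process_merge_queue` is a hypothetical name, not any real merge queue API.

```python
from typing import Callable, List, Tuple


def process_merge_queue(
    main: List[str],
    approved_prs: List[str],
    run_tests: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Test each approved PR merged onto the current main; land it only if tests pass."""
    merged = []
    for pr in approved_prs:
        # Toy model of a merge: append the PR's change to main's history.
        candidate = main + [pr]
        if run_tests(candidate):
            main = candidate  # the candidate merge becomes the new main
            merged.append(pr)
        # On failure, main is left untouched and the PR stays unmerged.
    return main, merged


if __name__ == "__main__":
    # Stand-in for CI: any history containing "bad" fails.
    passes = lambda history: "bad" not in history
    print(process_merge_queue(["a"], ["b", "bad", "c"], passes))
```

The point of the design is that each PR is tested against the main branch it will actually land on, so CI runs once per approved PR rather than once per push.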
When this was discussed on Discord we saw two problems with the current pre-commit CI configuration:
We don’t have enough runners, leading to very long wait times.
The community doesn’t have full access to the Buildkite configuration, and it’s not clear how we could add more resources even if we had them.
Everyone I have talked to is of course very grateful that you managed to get this configuration working and have maintained it. But our idea to migrate to GitHub was just to make sure that we don’t have a bus factor of 1 and to increase the ability for us to help out.
Maybe this can be solved with Buildkite, more resources added to the pool, and access shared (and explained, I haven’t worked with BK before).
I am open to hearing suggestions on how we can reduce the wait for Windows runners and how to improve access to the pre-commit CI. I think this can be handled separately from the discussion about moving to GitHub for now.
Agree! Moving from Buildkite to GitHub Actions should be a nice step. I have already experimented with using custom runners, and we set up one machine with @tstellar for Linux; I’m not sure what its status is now.
Moving forward: the next step should be to convert the Windows build to be based on GitHub Actions. I can definitely help with moving the Windows machines and converting them to “custom runners” in GitHub terms; from that point they should be a ~black box that anyone can use. It would be great if someone is ready to step in to create and configure GitHub Actions for that. @boomanaiden154-1 do you want to participate?
Yes. I would be interested in helping to set up configurations and any other tasks related to moving the existing infrastructure over to GitHub Actions.
I’m probably not that qualified to help with the Windows CI specifically as I don’t do any development on that platform, but I’m willing to help figure out how to set it up if there isn’t anyone else interested in doing that part of the pipeline.
Could we also start looking into moving out of a Google-owned GCP project so that community members can contribute to VM maintenance? We should also reduce the “bus factor”, or the dependency on the “goodwill of Google employees”, for unblocking ourselves when a machine runs out of disk space or hits similar problems.
Ideally we’d have foundation-owned projects on GCP/AWS/Azure, and sponsors could donate credits (or funding) for these.
I don’t mean to derail setting up self-hosted GitHub runners; please proceed with this of course!