[RFC] LLVM Precommit CI through GitHub Actions

Thank you everyone for your feedback so far. Here are a couple of adjustments to the original proposal based on that feedback:

Motivation:

Moving to GitHub Actions will allow more people to hack on the precommit CI (including fixing bugs and extending it to cover new cases), and it is much better integrated with our current tooling for precommit code review (GitHub PRs). In addition, the general consensus within the community (from what I can gather) is that we should go in this direction, and there is a lot of talk about moving the current infrastructure over to GitHub Actions. Recently, the Windows CI also started throwing a variety of errors, some of which are seemingly related to the OS configuration (my best guess for the cause of the Windows Defender warnings), and this gap in coverage has been detrimental to some subprojects like Clang.

Design:

Scaling back the original proposal, this revision proposes the following:

  1. A single job per operating system.
  2. Initially testing a limited number of subprojects, starting with a set like LLVM/Clang and then iterating from there as we fix reliability issues and better evaluate the throughput of the system.
  3. The addition of Windows to the original proposal to help remediate the gap in the current CI.
  4. Build things in such a way that it will be easy to move over to the self-hosted runners from the Buildkite infrastructure once they are in a state to migrate. In particular, this means using containers on Linux and Windows so that we get a consistent environment, defined in the monorepo, that is used on all runners, taking much of the environment maintenance burden away from the underlying host. Once we migrate to these machines, we can scale testing back up and adequately test dependent projects.

Scalability:

We believe that we can probably get this to work within the 60-job GitHub limit, but only if we are careful with this initial rollout. This would mean more limited testing at first (probably path-based filtering and only testing projects where changes were made) to work within throughput limitations, but we might be able to enable more over time. We are currently unsure of the exact throughput limitations at scale. Throughput issues are further compounded by the fact that libc++ has moved its Windows testing to the free GitHub runners due to issues with Buildkite, but based on current data, there still seems to be quite a bit of spare capacity.
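To make the path-based filtering idea concrete, here is a minimal sketch in Python. It assumes that a subproject is identified by the top-level monorepo directory a changed file lives in, and that the initial rollout covers only `llvm` and `clang`. The names `ENABLED_PROJECTS` and `projects_to_test` are hypothetical, made up for illustration, and not part of any existing script. It deliberately ignores dependent projects, matching the "only testing projects where changes were made" restriction above.

```python
from pathlib import PurePosixPath

# Assumption: the subprojects enabled in the initial rollout.
ENABLED_PROJECTS = {"llvm", "clang"}


def projects_to_test(changed_files):
    """Return the enabled subprojects directly touched by the changed paths.

    Paths are repo-relative with forward slashes, as git reports them.
    Dependent projects are intentionally not added here.
    """
    touched = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # The first path component is treated as the subprojects's directory.
        if parts and parts[0] in ENABLED_PROJECTS:
            touched.add(parts[0])
    return sorted(touched)


if __name__ == "__main__":
    changed = [
        "llvm/lib/IR/Value.cpp",
        "clang/lib/Sema/Sema.cpp",
        "libcxx/include/vector",  # not enabled in this sketch, so skipped
    ]
    print(projects_to_test(changed))  # ['clang', 'llvm']
```

A change that touches no enabled subproject yields an empty list, so no jobs would need to be scheduled for it, which is where the throughput savings would come from.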

Some other thoughts:

Ultimately, this is planned as a transition period. We want to get something working with the resources that we have available while projects are struggling (especially with Windows builds), to hopefully help alleviate some of those issues. We also want to get everything ready and in a working state so that we can easily transition to self-hosted runners, where we automatically benefit from massively reduced test latency.

We don’t expect this to replace the current CI immediately (this is key for keeping Linux coverage for the other projects we don’t plan on enabling initially), but we do plan on having this infrastructure replace the current precommit CI once we can utilize self-hosted runners and provide the same test coverage as the current pipeline.
