Motivation
LLVM already has precommit CI in the form of a Buildkite pipeline that supports both Linux and Windows. However, this pipeline is not well integrated with GitHub, and it has sometimes had reliability issues. Moving to GitHub Actions allows for better integration with GitHub, makes it easier for people to hack on the CI pipeline since many more are familiar with GitHub Actions than with Buildkite, and should allow some of the reliability issues to be fixed. This is much more feasible now that the GitHub Actions runner specs have been bumped for open source projects (GitHub-hosted runners: Double the power for open source - The GitHub Blog).
Design
We propose using the default GitHub Actions runners, as they have now been bumped to 4 vCPUs for open source projects. These machines are not particularly fast, but they can build LLVM with all targets enabled and no cache in about an hour (with the standard GitHub Actions toolchain), and can build all of LLVM with a warm cache in under ten minutes. We expect to bring the cold-cache build time down significantly by using an optimized Clang-based toolchain (PGO+ThinLTO+BOLT). Running the full test suite is not especially fast at about 15 minutes, but we believe that a total latency for testing LLVM of about an hour with a cold cache (and less than half that with a warm cache) should be reasonable for precommit CI.
With regard to the workflow setup, we want to split jobs into separate build and test phases to decrease latency when jobs touch multiple projects (e.g. LLVM+clang) and to make artifacts easily reusable by other workflows (e.g. testing a project's Python bindings might be a separate test target that can reuse artifacts from that project's build rather than starting from scratch). This does cause some overhead, but the overhead is quite low. Uploading artifacts from an LLVM build takes slightly over ten seconds without compression, and about two and a half minutes with a moderate level of compression (gzip level 6). We believe this is low enough not to outweigh the benefits of this approach. Moving artifacts around does require some slight manipulation of the build system, as timestamps change when artifacts are moved between jobs. However, this is a relatively simple operation, and if this approach ends up scaling well, adding support to upstream ninja (to update .ninja_log with recent timestamps) should be relatively straightforward. Jobs would be run or skipped using a label-based system to overcome the fact that workflows that trigger with path-based filtering can't be. This separation should also allow pulling in versions of the build from the main branch when only a subproject is changed. For example, if someone writes a patch only touching lld/, we could pull in the artifacts from main for llvm instead of having to rebuild them completely (although this probably won't be implemented in the initial version).
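To make the timestamp issue concrete, here is a minimal sketch of one possible workaround: record file modification times before the artifacts are uploaded and reapply them after download. This is an illustration only, not necessarily how the prototype handles it (the text above mentions updating .ninja_log instead); the function names `save_mtimes` and `restore_mtimes` and the JSON manifest format are hypothetical.

```python
import json
import os
from pathlib import Path


def save_mtimes(build_dir: str, manifest: str) -> int:
    """Record the mtime (in nanoseconds) of every regular file under build_dir."""
    root = Path(build_dir)
    times = {
        p.relative_to(root).as_posix(): p.stat().st_mtime_ns
        for p in root.rglob("*")
        if p.is_file()
    }
    Path(manifest).write_text(json.dumps(times))
    return len(times)


def restore_mtimes(build_dir: str, manifest: str) -> int:
    """Reapply recorded mtimes; files missing after the transfer are skipped."""
    root = Path(build_dir)
    times = json.loads(Path(manifest).read_text())
    restored = 0
    for rel, mtime_ns in times.items():
        path = root / rel
        if path.is_file():
            # Set both access and modification time to the recorded value.
            os.utime(path, ns=(mtime_ns, mtime_ns))
            restored += 1
    return restored
```

In this sketch the build job would call `save_mtimes` just before uploading, and the test job would call `restore_mtimes` right after downloading, so that the build system sees the same timestamps it recorded during the build.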
Splitting the workflow into separate jobs also makes it easier for a user to determine exactly what failed. For example, if code formatting is the issue, the code formatting job (which already exists) fails while the builds run independently of it, so the user can easily diagnose the problem rather than having to read the logs of a monolithic pipeline.
With regard to the build configuration that we want to test, we would try to keep it as close to "vanilla" as possible. A release build with assertions enabled is the current plan. This should be one of the most commonly used configurations; other, more specific configurations are left to post-commit testing. The idea is that precommit CI tests the baseline configuration and filters out the easy-to-catch errors, leaving specifics to post-commit testing, where we already have a lot of infrastructure set up to catch such issues. We plan on also running this configuration post-commit (through GitHub Actions) so that it is easy to determine whether tip of tree is broken in this configuration.
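As a rough illustration of what a release build with assertions enabled could look like, the sketch below assembles a CMake configure command. `CMAKE_BUILD_TYPE` and `LLVM_ENABLE_ASSERTIONS` correspond to the configuration described above; the Ninja generator, the directory layout, the project list, and the helper name `cmake_configure_args` are assumptions made for the example, not settings taken from the proposal.

```python
def cmake_configure_args(source_dir: str, build_dir: str, projects: list[str]) -> list[str]:
    """Build the argv for configuring LLVM as Release with assertions enabled."""
    return [
        "cmake",
        "-G", "Ninja",                       # assumed generator
        "-S", f"{source_dir}/llvm",          # LLVM's top-level CMake project
        "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DLLVM_ENABLE_ASSERTIONS=ON",
        f"-DLLVM_ENABLE_PROJECTS={';'.join(projects)}",  # assumed project list
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(cmake_configure_args("llvm-project", "build", ["clang", "lld"])))
```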
Currently, this proposal is not aimed at addressing precommit Windows builds. However, the ideas presented and iterated on as part of developing this proposal should be easily extendable to Windows builds. Adding Windows support might not be trivial, though, depending on the amount of time needed for each build, which might require using self-hosted runners.
Scalability
We believe this approach should be reasonably scalable. We are able to run up to 60 concurrent GitHub Actions jobs. Suppose we need to run the pipeline approximately 500 times a day (most likely an overestimate, as not all of these runs require full tests). That gives us a maximum of about three hours per run with the current numbers (60 jobs × 24 hours / 500 runs ≈ 2.9 hours), not including the Actions needs of other workflows such as the code formatting action, which should be negligible in comparison.
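The budget above can be reproduced with a short calculation. This is a back-of-the-envelope sketch that assumes runs are spread evenly over the day, which real traffic will not be; the function name is illustrative.

```python
def max_hours_per_run(concurrent_jobs: int, runs_per_day: int) -> float:
    """Runner-hours available per pipeline run, assuming an even load all day."""
    job_hours_per_day = concurrent_jobs * 24
    return job_hours_per_day / runs_per_day


budget = max_hours_per_run(concurrent_jobs=60, runs_per_day=500)
print(f"{budget:.2f} runner-hours per pipeline run")  # prints 2.88
```

Because this counts total runner time, a pipeline that fans out into several parallel jobs spends from the same budget with each of those jobs.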
This proposal also relies heavily on artifact storage within GitHub. GitHub does not impose any restrictions on artifact storage usage for public projects, so there shouldn't be any issues here. Using a reasonably short retention time should also keep our overall storage usage at a manageable level.
Timeline
I already have a very basic working prototype that validates the ideas in this proposal. I'm hoping to start incrementally adding functionality soon (maybe even in the next couple of days) and to finish out the entirety of the proposal. However, I do have other LLVM-related commitments and am currently a full-time student, so things might not move as fast as I would like.