[RFC] LLVM Precommit CI through Github Actions

I don’t think any of these problems are new. I am also not aware of anyone having spent significant time working on fixing them, which suggests one of two things: either people don’t care about the CI, or something else is blocking people from working on it, such as unfamiliar tooling or barriers like only a few people having admin access on the Buildkite side. (Or maybe I’m just not in tune with current efforts in the community.) Based on the amount of activity we have already seen since the transition to Github PRs, I don’t think it’s that people don’t care about the CI; it’s that the current pipeline is esoteric enough that it’s not easy (or at least easy enough) to work on.

Sure, but that goes back to the same point as above: no one seems particularly interested in fixing it there, especially given that the current consensus within the community (or at least what I thought was the consensus before starting this thread; maybe this plan of action isn’t as popular as I thought) is to move from Buildkite to Github Actions.

I mentioned the statistics for google/llvm-premerge-checks briefly in my post above; my apologies for potentially misrepresenting them. I still think the conclusion is similar. Only 13 contributors to that repository have submitted more than one patch over its entire life (almost five years at this point, from what I can tell). We have had 22 in the past four months on the Github side, and that’s without the core of the premerge CI even living there. I think these numbers demonstrate significantly higher community buy-in on the Github Actions side than in the current premerge checks.

I think the claim “the pipeline is unreliable” reasonably follows from seeing the pipeline fail for spurious reasons on 70% of an (admittedly small) sample of my PRs. You’re right, that doesn’t amount to any root cause analysis, but I think that’s beside the point when making a statement purely about reliability (true positives plus true negatives over all runs).

I believe the specific details of this proposal would eliminate a lot of the variables that could currently be contributing to the perceived lack of reliability. The machines are all run and managed by Github, so we don’t have to worry about them, and we’d be starting off with testing an extremely simple configuration and then building up from there.

Just moving over to Github Actions might not help much in and of itself if we’re using the exact same pipeline script. But I think a lot more people are willing to hack on a Github Actions based pipeline than on what we have currently, which will end up improving reliability as people fix the issues they come across. We’ve already seen this with other workflows: the CI jobs that do run through Github Actions have become significantly more reliable thanks to the community’s efforts.

Sure, a lot of the reliability issues do seem to be related to testing bad commits from main. However, there are quite a few solutions that don’t involve any policy changes around requiring/suggesting the precommit CI or enabling something like a merge queue. We could have the CI build only on top of a known-good main commit. We could selectively disable subprojects that are known broken on main (quite a few of the failures I saw were related to MLIR and BOLT). Or we could skip the job, or still run it and report only a warning, when main is known to be broken. Again, however, I don’t see any efforts with the existing pipeline to mitigate these issues.
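
To illustrate the first of those options, here is a minimal sketch of a checkout step that builds the PR on top of a pinned known-good commit instead of current main. The `KNOWN_GOOD_SHA` repository variable and the mechanism that would update it are assumptions for illustration, not part of any existing workflow:

```yaml
steps:
  # Check out the last known-good main commit instead of current main.
  # KNOWN_GOOD_SHA is hypothetical; it could be published by a job that
  # watches the post-merge bots and records the newest green revision.
  - uses: actions/checkout@v4
    with:
      ref: ${{ vars.KNOWN_GOOD_SHA }}
      fetch-depth: 0
  # Merge the PR head onto that base so the proposed change is still tested.
  - name: Merge PR onto known-good base
    run: |
      git config user.name "premerge-ci"
      git config user.email "premerge-ci@example.invalid"
      git fetch origin pull/${{ github.event.pull_request.number }}/head
      git merge --no-edit FETCH_HEAD
```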

Blocking the merge requests (with a manual override) on not having criteria met seems to have consensus in the thread. I’m not convinced that will end up solving the problem, though. If people still have, or develop new, alert fatigue from the CI being unreliable, they will just begin to automatically click past the extra button. I think we need to build reliable CI from the start through other methods rather than starting to suggest/require it more. Once we have truly reliable CI, I think we can begin to get community buy-in for efforts like merge queues or even mandatory pre-merge checks.

At this point I’m struggling to understand what your point is. That thread explicitly talks about moving the current builders over to Github Actions. That course of action is extremely similar to this proposal except with different implementation details to optimize things for the self-hosted runners.

I certainly don’t deny that there’s value in Linux pre-submit tests. My point is that I think Windows and Mac testing would be even more valuable to the community.

I don’t remember when I last fixed or reverted a commit due to basic build or LIT test failures on Linux, but for Mac and Windows this is a regular occurrence.

1 Like

That is assuming that

  1. You can’t get access to buildkite (did you ask?)
  2. Buildkite is actually the problem (do you have data about this? I have asked many times, but you haven’t pointed to a problem root-caused there).

You seem to be making wild assumptions here. For example, I invested a significant amount of time in the pre-merge checks (before the move to GitHub PRs), to the point where I felt the script was good enough. The maintainers of the pre-merge were always responsive when I had questions, and reviewed and merged the patches I sent.
Did you try to fix any issue you found with the pre-merge?

No they don’t. They just demonstrate that the move to GitHub PRs led to the need for new tooling (formatting, triaging issues, etc.).
Basically, the move to GitHub created problems that broke everyone’s workflow and annoyed a lot of people, and that created the need to fix things.

Sorry, but that does not make any sense here: if my house is cold during the winter, I’m not gonna move to another house before closing the windows or trying to turn on the heater!
This is how your approach comes across to me: something is wrong but you’re not trying to understand what is going on and you’re jumping to a “solution” which may not solve the actual problem.

I looked at all the (Linux) failures you provided, and I didn’t see a single one that can be root-caused to the machine used.
Can you show me at least one of your recent PRs where the Linux build failed but would have passed using GitHub-provided machines?

Are you claiming that a pipeline involving multiple machines with artifacts across them is an “extremely simple configuration”?

As a reminder: the baseline (the current config) is a single cmake invocation and a single ninja execution.

And I don’t see anything in your proposal that is going in this direction either, otherwise we’d have a more productive discussion.

Your proposal involves a convoluted and hacky build/test pipeline: that concerns me greatly. Also, you haven’t been able to substantiate any of your quite vague claims about unreliability, or connect them to your “solution”.

I agree with you about Mac/Windows being very valuable in pre-merge.
But just to provide some data points as a maintainer of a couple of post-merge Linux bots: I still revert a couple of patches every week.
One reason is that folks merge when the CI is broken (either main was already broken and that hid the new failure, or they didn’t pay attention), but that shows that relying on folks building/testing locally isn’t enough and that pre-merge on Linux is really valuable.
Other reasons for breakage often include:

  • tests that are excluded from running for the user but run on the bot (because of REQUIRES: … lines for example, or the need for Python bindings, or the need for CUDA, or …),
  • missing cmake dependencies that the bot exposes through the use of -DBUILD_SHARED_LIBS=ON but that are hidden in static builds,
  • failures specific to the toolchain (a specific version of an older gcc, for example).

I could almost certainly get access to the Buildkite side of things. And buildkite itself probably isn’t the source of these problems. But it significantly increases the amount of friction required for someone to just drop a patch in unless they’re already familiar with the system. A lot of people are already familiar with how Github actions works and there is a plethora of documentation available for people to get up to speed quickly.

I was also under the impression that there was a consensus to move the current precommit over to Github actions soon (from discussion with the current maintainers and others involved). I’m not interested in working on fixing something that is just going to be moved over soon afterwards. But apparently I might be mistaken that this was the plan?

I have fixed plenty of issues on the Github actions side of things. I have not looked into fixing this on the Buildkite side of things, assuming that we will be moving away from that sometime soon.

Do you have data to support that claim? I have seen a lot of new additions that wouldn’t have been possible with how things were set up within Phabricator, like the new contributor comments. I think the code formatting action that we have now is a significant improvement over what we had previously for checking code formatting; it’s user friendly and reliable in a way that wouldn’t have been easy with the current precommit pipeline. I would think the majority of commits enable new functionality that was not possible before rather than fix workflows that broke due to the move, although there definitely was some of that.

If I see everyone else moving out of the house (libc++ moving to Github Actions) and also believe the house is going to be torn down eventually (eventual move to Github Actions) while the heater is broken (patches that need to be made to the current system to fix these issues), I would probably be reasonably interested in moving to another house.

Now maybe I’m not understanding the current state of things in regards to moving the current runners over to Github runners. But if that is the plan, I’m not interested in getting the current infrastructure working when it is going to get moved shortly.

My point there wasn’t that it would fix those issues. It helps eliminate another possible variable. I guess that was not particularly relevant and probably shouldn’t have been brought up. Sorry.

Moving artifacts between Github runners is pretty well tested. The plan was to use actions/upload-artifact and actions/download-artifact which are used all over Github and thus quite well tested. Changing timestamps on builds is definitely not the well-trodden path however.
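
For reference, a rough sketch of what passing a build tree between two jobs with those actions could look like; the job layout, artifact name, and build commands are illustrative only:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure and build
        run: |
          cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang
          ninja -C build
      # Upload the build tree so a downstream job can reuse it.
      - uses: actions/upload-artifact@v4
        with:
          name: llvm-build
          path: build/

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: llvm-build
          path: build/
      # Preserving timestamps so ninja does not rebuild everything in the
      # downloaded tree is exactly the less well-trodden part mentioned above.
      - name: Run tests
        run: ninja -C build check-llvm check-clang
```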

That’s also an implementation detail that I’m perfectly fine with changing for now. If we want to just see how this scales with a single job that just uses a larger cache and exchanges the latency advantages from job sharding for simplicity, sure. That seems like a reasonable tradeoff at this point to me.
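
For the larger-cache variant, the caching piece might look something like this, assuming ccache together with actions/cache; the cache key scheme and directory are illustrative:

```yaml
      - name: Restore compiler cache
        uses: actions/cache@v4
        with:
          # Default ccache directory varies by version; adjust or set CCACHE_DIR.
          path: ~/.cache/ccache
          key: ccache-${{ runner.os }}-${{ github.sha }}
          restore-keys: ccache-${{ runner.os }}-
      # Point CMake at ccache so cache hits carry across runs.
      - name: Configure with ccache
        run: >
          cmake -G Ninja -S llvm -B build
          -DLLVM_ENABLE_PROJECTS=clang
          -DCMAKE_C_COMPILER_LAUNCHER=ccache
          -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
```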

I think this proposal moves us into a state where we begin to really start investigating and fixing such issues (if we end up carrying some of them over into the new implementation).

But if we’re not deprecating the current pipeline and I’ve understood that incorrectly, then I guess moving everything over isn’t as important to start working on these issues.

Then can we discuss the implementation details within this proposal specifically rather than whether or not we should move to Github actions at all?

If the current infrastructure is going to be migrated over anyway, there isn’t much interest in fixing those issues within the current framework.

I also think moving to Github Actions will increase community involvement, which will end up creating a more reliable and useful CI system. We don’t seem to agree on the specific metrics used to back up the claim that Github Actions even has more community involvement in the first place, however, so maybe this isn’t a good point to continue with. I do know that others and I are (or at least I was) interested in doing CI work in various parts of the monorepo, but this work seems to center around Github Actions, as more people are familiar with it and, theoretically, the Buildkite pipeline is going away.

Please allow a few opinions (even if ultimately, the system needs to suit the needs of people doing the thankless job of maintaining it <3)

  • I think having the pre-commit infra on Github might improve visibility and make it easier for more people to contribute improvements; of course, I do not know how that would work out financially

  • Given limited resources, using the merge queue features to do some additional builds before merging (including Windows/Mac) would reduce the risk of breaking main, could reduce the pressure on Windows runners, and would probably reduce code reviewers’ workload. It of course would not replace the large post-merge CI, which I don’t think anyone is suggesting. Seems worth exploring (a rough sketch follows after this list).

  • If the solution to not enough runners is “we need people to run runners”, then it would be lovely to have community-maintained containers and associated documentation so that the burden of setting those up and maintaining them is minimal, which would, I think, increase the likelihood of companies being willing to provide them. Even if the simplest solution for orgs might be to be able to throw money at the problem. Alas, runner minutes seem expensive in the long term :frowning:
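
On the merge queue point above, Github’s merge queue can trigger workflows through the merge_group event, so an extra pre-merge gate could look roughly like this; the job content is purely illustrative and toolchain setup is omitted:

```yaml
# Hypothetical extra checks that run only for PRs queued to merge.
on:
  merge_group:

jobs:
  extra-premerge-windows:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        # MSVC/Ninja environment setup omitted for brevity.
        run: |
          cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang
          ninja -C build check-llvm check-clang
```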

2 Likes

I agree. Opt-in premerge testing for platforms where we have some resources but not many could definitely help a lot in the areas that you mentioned.

This is a good point. I have it on my TODO list to write up some documentation about the current Github Actions and how things interact, plus best practices. If we end up having other organizations run self-hosted runners, documentation is definitely something I would like to focus on. Tom has already started working on containerizing some of the workflows, but that is currently intended for use on the hosted runners rather than self-hosted runners. We definitely want to make this as easy as possible so we can get increased community buy-in. Runner minutes are definitely quite expensive, which is why just using the hosted runners isn’t really a viable option at all at this point. :pensive:

1 Like

I do not share this opinion in terms of how things are going in Clang. I’m still routinely seeing PRs wait for numerous hours before starting, especially for Windows runners. For example:

clang-ci #10282 (Windows waited 21hrs)
clang-ci #10245 (Windows waited 18hrs 39min)
clang-ci #10832 (Windows waited almost 7hrs)
clang-ci #10580 (Windows waited 10hrs)

I realize folks are working to improve this situation and those efforts are very appreciated, but things are in a pretty disruptive state currently for Clang rather than being pretty close to optimal IMO.

1 Like

Thank you everyone for your feedback so far. A couple adjustments to the original proposal based on feedback:

Motivation:

Moving to Github Actions will allow more people to hack on the precommit CI (including fixing bugs and extending it to cover new cases), on top of being much better integrated with our current tooling for precommit code review (Github PRs). In addition, the general consensus within the community (from what I can gather) is that we should go in this direction, and there is a lot of talk about moving the current infrastructure over to Github Actions. Recently, the Windows CI also started throwing a variety of errors, some of which seem related to the OS configuration (my best guess, given the Windows Defender warnings), and this gap in coverage has been detrimental to some subprojects like clang.

Design:

Scaling back the original proposal, this revision proposes the following:

  1. A single job per operating system (a rough workflow sketch follows this list).
  2. Initially testing a limited number of subprojects. Starting off with a set like LLVM/Clang and then iterating from there as we fix reliability issues/better evaluate the throughput of the system.
  3. The addition of Windows to the original proposal to help remediate the gap in the current CI.
  4. Build things in such a way that it will be easy to move them over to the self-hosted runners currently behind the Buildkite infrastructure once those are in a state to migrate. In particular, this means utilizing containers on Linux and Windows so that we get a consistent environment, defined in the monorepo, that is used on all runners, taking much of the environment maintenance burden away from the underlying host. Once we migrate to those machines, we can scale testing back up and adequately test dependent projects.
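
As a rough illustration of points 1 and 4, the skeleton below shows one job per operating system with the Linux job running in a container; the image name, project list, and build commands are placeholders rather than settled details:

```yaml
name: Premerge checks

on:
  pull_request:

jobs:
  premerge-linux:
    runs-on: ubuntu-latest
    # Hypothetical image: the point is that the environment is defined by a
    # container built from the monorepo rather than by the host runner.
    container: ghcr.io/llvm/premerge-linux:latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test LLVM/Clang
        run: |
          cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang
          ninja -C build check-llvm check-clang

  premerge-windows:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test LLVM/Clang
        # Container support on hosted Windows runners has more caveats than
        # on Linux, so toolchain/container setup is omitted here.
        run: |
          cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang
          ninja -C build check-llvm check-clang
```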

Scalability

We believe that we can probably get this to work within the 60 concurrent job limit on Github-hosted runners, but only if we are careful with this initial rollout. This would mean more limited testing at first (probably path-based filtering, only testing projects where changes were made) to work within throughput limitations, but we might be able to enable more over time. We’re currently unsure of the exact throughput limitations at scale. Throughput issues are also compounded by the fact that libc++ has moved their Windows testing to the free Github runners due to issues with Buildkite, but based on current data, it seems like there is still quite a bit of spare capacity.
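
For example, path-based filtering at the workflow level could be as simple as the snippet below; the path list and workflow file name are only examples, and mapping changed paths to the precise set of projects to test would still need logic along the lines of what the current script does:

```yaml
on:
  pull_request:
    paths:
      - 'llvm/**'
      - 'clang/**'
      # Also re-run when the workflow definition itself changes
      # (hypothetical file name).
      - '.github/workflows/premerge.yaml'
```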

Some other thoughts

Ultimately, this is planned as a transition period. We want to get something working with the resources we have available while projects are struggling (especially with Windows builds), to hopefully alleviate some of those issues and to get everything ready and in a working state so we can easily transition to self-hosted runners, where we automatically benefit from massively reduced test latency.

We don’t expect this to replace the current CI immediately (which is key for keeping Linux coverage for the projects we don’t plan on enabling initially), but we do plan on having this infrastructure replace the current precommit CI once we can utilize self-hosted runners and provide the same test coverage as the current pipeline.

3 Likes

This is Windows; the assumption in the thread, which I made explicit before, was that Windows and Windows-specific issues were out of scope for this discussion: so basically you’re taking my point out of context :slight_smile:
(My point was about the Linux setup.)

Otherwise I agree with you: the Windows machine availability, and “LLVM on Windows” in general, isn’t satisfactory. I share the pain; I see it as a community issue overall, and a lack of Windows-specific investment.

1 Like

This is what the current script is doing, I believe: shall I assume we’ll build on top of the existing knowledge and logic we encoded for this?

Yes. It will end up getting encoded in a different form due to how Github Actions expects the paths to be presented, but I am going to try to keep elements like that consistent. There was work put in to figure out what jobs to run, and it would be good to leverage that.

We’ll only be supporting a subset initially, but replicating the behavior of the current script seems like the most reasonable starting point to me.

1 Like

I’m not sure whether this was implied or not, but it would be nice to avoid encoding too much in YAML, so that folks at home have less trouble replicating locally what CI is doing. It’s not a hard requirement, but it would be nice to incorporate early in the implementation.

Definitely. We want to make it as easy as possible to recreate the build configuration. I’m not sure moving everything into a shell script or outside YAML will help significantly, as I think the biggest issue is checking out the ref that is being tested. One of my ideas was to have the job leave a comment on the PR, similar to the code formatting action, with a git command to check out the exact ref that was tested (the PR merged into main, since we’re using the pull_request event) and the CMake/Ninja invocation, including OS details and other links to reproduce the build. I want to make sure the tests are reliable before we start dumping comments on a bunch of PRs, however, so initially I was thinking of a job step that runs on failure with the reproduction information. Not as discoverable, but it would be something, and then we can move to a comment once we have proven reliability.
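
A rough sketch of what that failure-only step could look like; the configure and test commands here are illustrative, and the real step would echo whatever the job actually ran:

```yaml
      - name: Print reproduction instructions
        if: failure()
        run: |
          echo "This job tested your PR merged into main. To reproduce locally:"
          echo "  git fetch origin main"
          echo "  git checkout ${{ github.event.pull_request.base.sha }}"
          echo "  git fetch origin pull/${{ github.event.pull_request.number }}/head"
          echo "  git merge --no-edit FETCH_HEAD"
          echo "  cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=clang"
          echo "  ninja -C build check-llvm check-clang"
```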

Open to other ideas to help with this though. I’ve definitely had trouble reproducing buildbot failures before due to inaccurate/missing reproduction info, and I don’t want that to happen here.