I don’t think any of these problems are new, and I am not aware of anyone having spent significant time fixing them. To me that implies one of two things: either people don’t care about the CI, or something else is blocking people from working on it, such as unfamiliar tooling or only a few people having admin access to the Buildkite side of things. (It is also possible that I’m simply not in tune with current efforts in the community.) Based on the amount of activity we have already seen since the transition to GitHub PRs, I don’t think the issue is that people don’t care about the CI. Rather, the current pipeline is esoteric enough that it’s not easy (or at least not easy enough) to work on.
Sure, but that goes back to the same point as above: no one seems to be that interested in fixing it there, especially given that the current consensus within the community (or at least what I thought was the consensus before starting this thread; maybe this plan of action isn’t as popular as I thought) is to move from Buildkite to GitHub Actions.
I mentioned the statistics for google/llvm-premerge-checks briefly in my post above. My apologies for potentially misrepresenting them, but I still think the conclusion is similar. Only 13 contributors have submitted more than one patch to that repository over its entire life (almost five years at this point, from what I can tell). We have had 22 in the past four months on the GitHub side, and that’s without the core of the premerge CI even living there. I think these numbers demonstrate significantly higher community buy-in for the GitHub Actions side of things than for the current premerge checks.
I think the claim “the pipeline is unreliable” reasonably follows from seeing the pipeline fail for spurious reasons in 70% of an (admittedly small) sample of my PRs. You’re right that this doesn’t contribute to any root cause analysis, but I think that’s beside the point when making a statement purely about reliability (true positives + true negatives over all runs).
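To make that definition concrete, here is a minimal sketch of the reliability figure as (true positives + true negatives) / all runs. It assumes that "positive" means the CI reported a failure, so a spurious failure counts as a false positive. The function name and the counts are hypothetical placeholders, not measurements.

```python
def reliability(true_pos: int, true_neg: int, false_pos: int, false_neg: int) -> float:
    """Fraction of CI runs whose verdict matched the real state of the patch.

    Assumed convention: "positive" = CI reported a failure.
    A spurious failure is therefore a false positive.
    """
    total = true_pos + true_neg + false_pos + false_neg
    if total == 0:
        raise ValueError("no runs recorded")
    return (true_pos + true_neg) / total


# Hypothetical counts for illustration only: 10 runs, 7 of them spurious failures.
score = reliability(true_pos=1, true_neg=2, false_pos=7, false_neg=0)
print(f"reliability = {score:.2f}")
```

Note that this number says nothing about why runs fail, which is exactly the distinction between a reliability statement and a root cause analysis.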
I believe the specifics of this proposal would eliminate a lot of the variables that could currently be contributing to the perceived lack of reliability. The machines are all run and managed by GitHub, so we don’t have to worry about them. We’d be starting off with testing on an extremely simple configuration and then building up from there.
Just moving over to GitHub Actions might not help much in and of itself if we’re using the exact same pipeline script. But I think there are a lot more people willing to hack on a GitHub Actions based pipeline than on what we have currently, which will end up improving reliability as people are able to fix the issues they come across. We’ve already seen this with other workflows: the CI jobs that do run through GitHub Actions have become significantly more reliable thanks to the community’s efforts.
Sure, a lot of the reliability issues do seem to be related to testing bad commits from main. However, there are quite a few solutions that don’t involve any policy changes, such as requiring or suggesting use of the precommit CI or enabling something like a merge queue. We could have the CI build only on top of a known-good main commit. Or we could selectively disable subprojects that are known to be broken on main (quite a few of the failures I saw were related to MLIR and BOLT). Or we could skip the job, or still run it and report a warning, if main is known to be broken. Again, however, I don’t see any efforts to mitigate these issues in the existing pipeline.
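As a rough illustration of the "known-good main commit" idea, here is a minimal sketch that picks the newest commit whose post-commit CI passed and warns when main HEAD itself is broken. The data shape, function names, and commit IDs are all hypothetical; how the pass/fail history would actually be obtained is left open.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record: (commit_sha, post_commit_ci_passed), ordered newest first.
Commit = Tuple[str, bool]


def pick_base_commit(history: Sequence[Commit]) -> Optional[str]:
    """Return the newest commit on main whose post-commit CI passed, or None."""
    for sha, passed in history:
        if passed:
            return sha
    return None


def plan_premerge_run(history: Sequence[Commit]) -> str:
    """Decide what the premerge job should do given the state of main."""
    base = pick_base_commit(history)
    if base is None:
        # No known-good commit: still run, but flag the result as untrustworthy.
        return "warn: no known-good commit on main"
    head_sha, head_ok = history[0]
    if not head_ok:
        return f"build on {base} (main HEAD {head_sha} is known broken)"
    return f"build on {base}"


# Made-up history: HEAD is broken, the previous commit is good.
print(plan_premerge_run([("c3", False), ("c2", True), ("c1", True)]))
```

The same pass/fail record could also drive the other two options, such as disabling a subproject that is known to be broken, without any change in merge policy.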
Blocking merge requests (with a manual override) when the criteria are not met seems to have consensus in the thread. I’m not convinced that will end up solving the problem, though. If people still have, or newly develop, alert fatigue from the CI being unreliable, they will just begin to automatically click past the extra button. I think we need to work on building reliable CI from the start through other methods rather than starting to suggest or require it more. Once we have truly reliable CI, I think we can begin to get community buy-in for efforts like merge queues or even mandatory pre-merge checks.
At this point I’m struggling to understand what your point is. That thread explicitly talks about moving the current builders over to GitHub Actions. That course of action is extremely similar to this proposal, except with different implementation details to optimize things for the self-hosted runners.