[RFC] LLVM Precommit CI through Github Actions

Motivation

LLVM already has precommit CI in the form of the Buildkite pipeline that supports both Linux and Windows. However, this pipeline is not well integrated with GitHub, and there have sometimes been reliability issues. Moving over to GitHub Actions gives better integration within GitHub, makes it easier for people to hack on the CI pipeline since many more contributors are familiar with GitHub Actions than with Buildkite, and should allow some reliability issues to be fixed. This is much more feasible now that the GitHub Actions runner specs have been bumped for open source projects (GitHub-hosted runners: Double the power for open source - The GitHub Blog).

Design

We propose here to use the default GitHub Actions runners, as they have now been bumped to 4 vCPUs for open source projects. These machines are not particularly fast, but they can do a build of LLVM with all targets enabled, without any cache, in about an hour (with the standard GitHub Actions toolchain), and can build all of LLVM with a warm cache in under ten minutes. We expect to get the cold-cache build time down significantly through the use of an optimized clang-based toolchain (PGO+ThinLTO+BOLT). Running the full test suite is not super fast at about 15 minutes, but we believe an end-to-end latency of about an hour for testing LLVM (with a cold cache; less than half that with a warm cache) is reasonable for precommit CI.

In regards to the workflow setup, we want to split jobs up into separate build and test phases to decrease latency when jobs touch multiple projects (e.g., LLVM+clang) and to make artifacts easily reusable for other workflows (e.g., testing a project's Python bindings might be a separate test target that can reuse artifacts from that project's build rather than needing to start from scratch). This does cause some overhead, but the overhead is quite low. Uploading artifacts from an LLVM build takes slightly over ten seconds without compression, and about two and a half minutes with a moderate level of compression (gzip level 6). We believe this is low enough not to outweigh the benefits of this approach. Moving artifacts around does require some slight manipulation of the build system, as timestamps get changed when artifacts move between jobs. However, this is a relatively simple operation, and if this approach ends up scaling well, adding support to upstream ninja (to update .ninja_log with recent timestamps) should be relatively straightforward. Jobs would be run or skipped using a label-based system to work around the limitations of triggering workflows with path-based filtering. This separation should also allow pulling in builds from the main branch when only a subproject is changed. For example, if someone writes a patch only touching lld/, we could pull in the artifacts from main for llvm instead of having to rebuild them completely (although this probably won't be implemented in the initial version).
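
To make the timestamp manipulation a bit more concrete, here is a minimal sketch of one possible workaround, assuming that marking the extracted build tree newer than the freshly checked-out sources is enough to keep ninja from rebuilding everything. The script name and directory layout are hypothetical, and the eventual CI may solve this differently (e.g., inside ninja itself, as mentioned above).

```python
#!/usr/bin/env python3
"""Hypothetical helper for a test job: after downloading the build artifacts,
bump mtimes in the extracted build directory so ninja does not consider the
outputs stale relative to the freshly checked-out sources."""
import os
import sys
import time

def bump_mtimes(build_dir: str) -> None:
    now = time.time()  # newer than the checkout done earlier in the job
    for root, _, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                os.utime(path, (now, now))
            except OSError:
                pass  # ignore files we cannot touch (e.g., dangling symlinks)

if __name__ == "__main__":
    bump_mtimes(sys.argv[1] if len(sys.argv) > 1 else "build")
```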

Separating parts of the workflow into separate jobs also makes it easier for a user to determine exactly what failed. For example, if code formatting is the issue, the code formatting job (which already exists) fails while the builds run independently of it, allowing the user to easily diagnose the issue rather than having to read the logs of a monolithic pipeline.

In regards to the build configuration that we want to test, we would try to make it as close to "vanilla" as possible. A release build with assertions enabled is the current plan. This should be one of the most common configurations in use; other, more specific configurations are left to post-commit testing. The idea is that the precommit CI tests the baseline configuration and filters out the easy-to-catch errors, leaving specifics to post-commit, where we already have a lot of infrastructure set up to catch such issues. We plan on having this configuration run post-commit (through GitHub Actions) so that it is easy to determine whether tip of tree is broken in this configuration.
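
For reference, the configure and build step for that baseline could look roughly like the sketch below. The CMake options (CMAKE_BUILD_TYPE=Release, LLVM_ENABLE_ASSERTIONS=ON, LLVM_ENABLE_PROJECTS) are the standard LLVM ones, but the project list, directory names, and the idea of wrapping this in a small Python driver are placeholders rather than the final CI script.

```python
#!/usr/bin/env python3
"""Sketch of the baseline pre-commit build step: a Release build with
assertions enabled, built with ninja. Not the final CI script."""
import subprocess

# Placeholder: the real job would derive the project list from the changed paths.
projects = "llvm;clang"

subprocess.run(
    [
        "cmake", "-S", "llvm", "-B", "build", "-G", "Ninja",
        "-DCMAKE_BUILD_TYPE=Release",        # "vanilla" release build...
        "-DLLVM_ENABLE_ASSERTIONS=ON",       # ...with assertions enabled
        f"-DLLVM_ENABLE_PROJECTS={projects}",
    ],
    check=True,
)
subprocess.run(["ninja", "-C", "build"], check=True)
# The test phase would run as a separate job, e.g. `ninja -C build check-llvm`.
```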

Currently this proposal is not aimed at addressing precommit Windows builds. However, the ideas presented and iterated on as part of the development of this proposal should be easily extendable to Windows builds. Adding Windows support might not be as trivial, though, depending on how much time each build needs, which might require using self-hosted runners.

Scalability

We believe this approach should be reasonably scalable. We are able to run up to 60 concurrent GitHub Actions jobs. Say we need to run the pipeline approximately 500 times a day (most likely an overestimate, as not all of these runs require running full tests). That gives us a budget of about three machine-hours per run with the current numbers, not including the Actions usage of other workflows like the code formatting action, but those should be negligible in comparison.
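
For concreteness, the back-of-the-envelope math behind the "about three hours" figure, using the assumed numbers above, is:

```python
# Capacity estimate with the numbers assumed above.
concurrent_runners = 60        # maximum concurrent GitHub Actions jobs
runs_per_day = 500             # assumed upper bound on daily pipeline runs

machine_hours_per_day = concurrent_runners * 24       # 1440
budget_per_run = machine_hours_per_day / runs_per_day
print(f"~{budget_per_run:.1f} machine-hours per pipeline run")  # -> ~2.9
```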

This proposal also relies heavily on artifact storage within GitHub. GitHub does not impose any restrictions on artifact storage usage for public projects, so there shouldn't be any issues here. Using a reasonably short retention time should also keep our overall storage usage at a somewhat manageable level.

Timeline

I have a very basic prototype validating the ideas in this proposal already working. I’m hoping to start incrementally adding on functionality soon (maybe even in the next couple days) and finish out the entirety of the proposal. However, I do have other LLVM-related commitments and am currently a full-time student, so things might not move as fast as I would like.

6 Likes

I have strong concerns with starting to use GitHub Actions for post-merge CI: there is no dashboarding, notifications, or other reporting mechanism that I know of. Diverging from our buildbot infra likely requires careful consideration IMO, and is not something we should just enable lightly.

It does not change the underlying fact that the only configurations we can reliably keep "green" right now are the ones checked through the buildbot infrastructure, via the various blamelist notification mechanisms (and buildbot owners who revert breakage quickly).

Since a pre-merge CI is only useful when it is validated on the main branch, it seems critical to me that we have a consistent way to ensure that a buildbot config is set up and well supported to test the exact same configuration (with the same OS, machines, etc.) as the pre-merge config. Otherwise this config will break and become noisy (which has been the experience with the pre-merge setup in the past, exactly for this reason).

Can you clarify how the separation of build and test phases decreases the latency?

Is this expected to provide significant improvements over CCache?
I see very fast build times with CCache on the buildbot I maintain right now.

IMO a SHARED_LIBS build is appropriate for CI: it catches a lot of the common missing CMake dependencies that a static build does not (these are linker failures in a SHARED_LIBS build, but hidden otherwise). Also, this likely reduces the size of the build artifacts significantly if the tools link against the shared libs.

1 Like

The point of running this post-merge is only to validate that the CI is green on the main branch, so that we avoid the problem where the precommit CI is failing because it's actually main that is red. It does create a divergence in the postcommit infrastructure, and I agree it requires careful consideration. But in this case, I think the cost-benefit analysis works out. Running the post-commit within GitHub Actions allows us to recreate the CI environment exactly rather than creating a (probably close) approximation of it on buildbot. Additionally, doing it on GitHub allows us to more easily query the results through the API (although this should be possible with buildbot), enabling us to signal to the user that the CI failing is not necessarily a problem with their patch. There should also be post-commit configurations already quite close to what we're running that I believe would capture similar issues. We don't want to enable post-commit CI through GitHub Actions anywhere else, just to validate that the precommit pipeline isn't broken on main.

If someone makes a change that touches multiple projects (eg LLVM + clang), we can build LLVM, upload the artifacts, and then spin up two parallel jobs, one that builds clang, and one that runs check-llvm. This should decrease the total latency by a decent amount as opposed to doing everything serially.

Yes. From my testing downloading even quite large artifacts takes under thirty seconds. Doing a build of LLVM (in the default configuration) with a fully warm ccache still takes 8-9 minutes.

That's a good point. Decreasing the artifact size would certainly be nice, and being able to check the dependencies precommit would also be good. It also shouldn't deviate from the standard build in a way that prevents detection of failures, i.e., I don't think there are many issues that happen in the static configuration that don't happen in the dynamic configuration (and it should be the opposite, given that missing dependencies show up as link failures in the shared-libraries build).

Thank you for your feedback!

You’re jumping to a conclusion on the right tradeoff here, and to be honest I’m far from being convinced.

So basically this is "increase the amount of parallelism" during the testing phase by sharding it across multiple separate jobs.
It seems somewhat like an arbitrary split to me: one could also build LLVM and then split the build of the various subprojects into different jobs downstream, for example.
It's also not clear to me how you measure the cost/benefit of sharding the tests, or how you balance the possible latency gain against the overall extra overhead (and thus lower throughput) for the infrastructure?

This is surprising to me: it takes 10s on my machine. What's the spec of the machine you're using? Is this measuring the time to build the dependencies for ninja check-llvm?

I very well might be. What do you suggest would be the optimal configuration here?

Yes. The current version of the proposal does propose building subprojects separately too. So for example, with an LLVM+clang build, the jobs would be as follows:

  1. Run CMake configure + build LLVM
  2. Run LLVM tests (dependent on job 1)
  3. Build clang (dependent on job 1)
  4. Run Clang tests (dependent on job 3)

I don't have any hard numbers for this currently. The latency gain should be significant (on the order of 15+ minutes) given how long it takes to run all the tests. The extra overhead should be under two minutes per job. This does decrease the overall throughput of the infrastructure, but I don't believe we should be running into capacity issues. The scalability section in the top post goes over some of my thoughts on this. In addition, during peak times we get 20-30 diff updates per hour (based on numbers from a prior thread about Phabricator diff updates). Given a maximum of 60 machines, each diff update should have approximately 2 machine-hours available. Granted, these are theoretical calculations. We can estimate where things might fail to scale, but we can't really know the exact failure points until the system actually gets deployed.

These are just the Github actions runners. They’re relatively tiny machines at 4vCPU and 16GB of RAM. That was the time building the default ninja target with no additional projects configured, so just a base (Release) configuration and then running ninja. I don’t think the numbers are completely off. A local build on a machine with really fast storage could probably get significantly higher throughput per thread than a VM with medium-performance storage. I’d guess the thread count would make up the rest of the difference.

1 Like

The current situation seems close to optimal to me (for LLVM and the projects that depend on building LLVM; see the timings below, by the way).
We have almost as low of a latency as we can hope for (I think), and the config is tested by a buildbot. I would just make sure this is all in llvm-zorg and in a GCP project controlled by the LLVM foundation (I think the machines are only accessible by Google right now?).
Using a GitHub Action to drive the pre-merge check process does not seem like a problem, but hopefully it does not imply using their under-powered runners.

Probably worth looking harder at the assumption you're making here, because the 15+ minutes improvement does not line up with where we are today. So we don't seem to be starting from the same point in the discussion. If I take what we have today (the current pre-merge CI) as a baseline, here are some examples:

I guess this is a point where we differ. I (and others I’ve talked to) have found the signal to noise ratio of the existing precommit CI leaves a lot to be desired. I’m not sure if this has improved recently however.

The latency of the current pipeline is definitely a lot better. The current pipeline is relatively opaque, however. Some of it (the core scripts, from what I understand) lives in <llvm-monorepo>/.ci, and other parts live in a Google-owned repository. The infrastructure has not been moved to Zorg, and the GCP instances are not controlled by the foundation. I also don't believe there is a lot of interest in making some of those moves, as the current plan from the Buildkite pipeline maintainers is to move the pipeline to GitHub Actions. I'm not sure what the plan is with the GCP instances.

Adding on to the point about the Buildkite pipeline being opaque, there are already some contributors (~5) hacking on GitHub Actions infrastructure in various places, oftentimes in substantial ways. A lot of people are much more familiar with it, and I believe it is also significantly more discoverable given the plethora of documentation. There hasn't been any similar activity around the premerge checks: a handful of patches over the last six months, almost all from the current maintainers.

Other projects are also already migrating to GitHub Actions or setting up there for the first time. Libc++ runs all of its precommit CI through GitHub Actions. Some other minor users have also popped up, like SPIR-V and the libclang Python bindings.

Having the shared infrastructure proposed here would help enable some of the additions that people want to make, as currently those efforts are somewhat ad hoc despite being useful for specific use cases. It would also help unify some of the existing bits and pieces (although this proposal isn't currently proposing any changes to the libc++ precommit CI, and they are probably the biggest user currently).

This proposal is currently focused on using the free GitHub Actions runners. They're not particularly fast, but they're available, and with aggressive caching the overall latencies are manageable.

This is based on numbers from the free GitHub Actions runners, in particular a check-llvm time of 15 minutes.

Thanks for the additional timing information.

We definitely aren't able to hit these sorts of numbers with the hardware available through the free GitHub Actions runners. However, with the plan to migrate the current precommit CI to GitHub Actions, a lot of hardware should be freed up that we can then use to run these jobs. Some changes might need to be made to get things running efficiently (like where caches/artifacts get stored), but they shouldn't be incredibly substantial. This proposal would pave the way for that transition.

Anecdotal "feel" is always a good starting point for an investigation, but IMO that is just that: a starting point. That is, I would expect you to come back with actual data showing the amount of noise, and some sort of characterization of the source of this noise. I am aware of many issues with the Windows build, but you've declared this out of scope here, and so I assume we're talking about the Linux pre-merge (the one I sent links to example runs of in my previous message).
Are the machines flaky? Is Buildkite itself an issue? Is it that we just genuinely break this config in the main branch too often?

It’s hard to judge how your “solution” would do better than the current state, if we don’t know what the current problem is!

As far as I know, the whole pipeline is open and people can contribute to it. I certainly contributed multiple patches to these scripts in the past. I’m not sure how what you’re proposing would be “less opaque”!

If the issue is about the GCP instances, then we can look into fixing this in particular. If the issue is with Buildkite, we can also look into alternatives (can the same script we have now be executed by a self-hosted GitHub runner?). I'm not sure, however, how you justify throwing away everything for a whole new setup…
(If there is missing documentation about the current script, this also can be fixed instead of starting entirely from scratch.)

I don't see this as a bad thing: there is almost no activity on the config of many buildbots, for example. When something "just works" and requires little maintenance, that seems like a good thing to me.
You can't really compare this to the activity around new GitHub Actions when a lot of those have been set up alongside the migration to pull requests!

Overall we have something that works well (I'm waiting on strong data showing otherwise), and you're not demonstrating at the moment how your proposed setup, which seems overly convoluted for a much inferior result (by your own acknowledgment), would help fix any issue. Instead, it seems to run the risk of being more fragile from the complexity of coordinating multiple workers, with various intermediate artifacts to exchange between machines, and with what seem like "hacks" such as "touching timestamps" in the build artifacts.

I support the direction this RFC is taking, even if I don't necessarily agree with some implementation details.

First of all, my understanding is that this RFC can be implemented as a pure addition to our existing infrastructure, not requiring any additional computational resources from the Foundation or community, because it leverages free GitHub infrastructure. If this RFC gathers enough support but fails to reach consensus, we can always settle on asking the proponents to implement a prototype and gather feedback.

  1. This RFC talks about free GitHub runners, and centers the trade-off discussion around them. But we have an existing fleet of powerful machines backing our existing pre-commit CI, so it would be nice to incorporate them into the discussion in case we settle on this approach and tear down the existing CI. For instance, I'm not convinced that spending one minute or more to compress and decompress artifacts is the right trade-off for 48-64 core machines.

  2. It would be nice to assess how much data GitHub Actions provides us, including historical data. Specifically, for a regular contributor like me without admin access, it seems impossible to gather historical data on our Windows pipeline, making it impossible to analyze whether there was a point when things started to go south much faster than before (which would probably lead me to the point when libc++ moved to GitHub Actions).

  3. As far as I understand, this RFC is aimed at pre-commit CI, with a post-commit job to support it. Given that all this can be implemented in addition to our existing pre-commit infrastructure, I don't agree with Mehdi saying we're throwing away "everything" here, and I think the whole buildbot discussion is off topic.

  4. After living through the Phabricator story, "a GCP project controlled by the LLVM foundation" to me sounds like "one day our pre-commit CI goes away without prior notice". So I definitely want to see an alternative, especially if it can co-exist with existing infrastructure.

  5. Recent Windows pre-commit CI issues and the way they were handled highlight that we can't claim that our CI "just works". As I see it, we failed to acknowledge the problem before it affected many people, and it took us time to realize what caused it (albeit that seems to be a hypothesis at the moment). The opaqueness of Buildkite regarding historical data and the varied performance of the Windows CI (which AFAIK depends on the time of day and the day of the week) certainly didn't help us react in a more timely manner. So I claim there are problems to address.

  6. I wonder how this RFC fits into the recent push to produce release binaries on pipelines instead of producing them manually. CC @tstellar @tobiashieta

  7. I tend to agree with Mehdi that the initially described setup sounds a bit involved, with multiple points of failure. It would be nice if the author could separately describe the basic setup and the improvements we should consider (and the interaction between them, if applicable).

This is concerning to me.

In my opinion, Linux is the least interesting platform for pre-submit testing, since most LLVM developers run Linux and can easily run the tests themselves before submitting.

Windows and Mac are where pre-submit testing brings (or would bring) the most value, since most developers don't have access to those.

I saw GitHub promoting new Mac runners recently: Introducing the new, Apple silicon powered M1 macOS larger runner for GitHub Actions - The GitHub Blog. Maybe we could get some of those? :slight_smile:

2 Likes

Most value for the developers you are referring to maybe. But remember that there are other users who will get value from it in different ways. Even if it is just the confidence of knowing that what they’re doing locally matches up to the “official” results.

Whether one sort of value outweighs another, just like others in this thread I have no numbers :slight_smile: But something to consider at least.

I do agree that Windows/Mac/etc. should be considered in the longer term, but I also understand limiting the scope to Linux right now as the system develops.

  1. You're saying "this can be implemented in addition to our existing pre-commit infrastructure": is the proposal to have multiple pre-merge systems here? I understood this proposal as a replacement.
  2. I was referring to the existing pre-commit infra when I said "throw away"; you're bringing up the buildbots, so are we talking about the same thing? Or is this about the post-merge aspects of it?

The problem with Phab is that it was a GCP project not controlled by the LLVM foundation. Can you elaborate why using machines in a community-controlled cloud environment would fit the pattern of access seen with Phabricator?

This is what I was referring to with "a GCP project controlled by the LLVM foundation". As far as I understand, the machines used in pre-commit CI with Buildkite aren't in such a space right now, and this is kind of the "Phabricator" problem you referred to (unless I missed something).
Using these as self-hosted runners for GitHub Actions could simplify the whole pipeline (preserving the current scripts, and avoiding spending minutes on artifact management for jobs that are O(min) already).
(I actually don't know what Buildkite brings to the equation, pros or cons…)

Per my understanding, this proposal can be either. No need to reject it on the basis that implementing this proposal means we have to tear down the existing pre-commit CI.

First you said that the GCP project is controlled by the LLVM Foundation; now you say it's community-controlled. (Are those entities even the same?) Earlier in this thread @boomanaiden154-1 mentioned that the GCP project is not controlled by the Foundation. In support of this, just recently @akorobeynikov suggested that there is only one person who has access to BuildKite (Discord). So, can you clarify what's going on with our current CI infrastructure before rejecting an RFC that is likely to offer more contributors an opportunity to step up and maintain our infrastructure?

Actually, I would have concerns about not replacing the existing pre-commit CI and instead adding a new flow to maintain. But that was not the point I was making: I was pointing out that saying "there is 'noise' in the current system" alone does not justify throwing away everything.
I have no particular attachment to Buildkite as a system, for example, but moving away from Buildkite does not imply changing everything else.

Right, I am using “community” and “LLVM foundation” interchangeably here: the foundation is the legal entity for the community (that allows us to receive credits or funding to then pay for cloud resources and other things).
That is in opposition to "Google-provided": when Google provides machines (instead of providing GCP credits), then only Google employees can have access to the GCP project and the machines (this was the case for Phabricator). From the community's point of view, Google does not actually provide "machines", but only a "service" (like Phabricator), and that makes us dependent on the goodwill of Google employees for things like "issues with Phabricator" (I was maintaining Phabricator when I was at Google until a year ago: this was a lot of goodwill; this kind of LLVM community resource was not really part of anyone's job at Google).
On the other hand, when the foundation controls the GCP project, any community member could potentially get involved in improving the system. The buildbot VMs I am maintaining, for example, are in such a GCP project (and the scripts and Docker image are in Zorg, where everyone can also contribute).

What I wrote before in the thread was:

which is saying the same thing, I think? I haven't said that the existing premerge CI is in a foundation-controlled GCP project, but that it is a problem that it isn't. (Solving this problem, though, can be done by other means than the current proposal.)

The problem with the RFC as is, IMO, is that it does not spell out what the problems are and jumps to one particular solution (which I see as a bit extreme). Qualifying the existing problems more accurately would help navigate the space of possible solutions, without going directly to "everything is to be thrown away".

Could you elaborate? As far as I know, GitHub runners only support a limited number of targets. On AIX, for example, we are unable to set up GitHub runners.

Right. That is definitely a current limitation of GitHub runners. This proposal is not intending to replace or augment any postcommit infrastructure. That specific element of the proposal was aimed only at adding an additional job through GitHub to make sure the precommit pipeline isn't broken on main (which would only be on 64-bit x86 Linux for this specific proposal), although based on feedback from other commenters in the thread, it seems like we will still want to rely on buildbot for that case.

That's a good point. Having Windows and Mac builders would definitely be a significant value add. I wasn't trying to preclude the addition of Windows runners; quite the opposite. I believe this proposal lays the groundwork for running Windows precommit through GitHub Actions. The main bottleneck there is the availability of Windows machines and throughput on them (at least for the free GitHub runners). I haven't done any testing, but from what I understand, building/testing on Windows takes more time, and running it by default on every PR would (more than) double the number of machines required, which might become an issue during peak times. If the decision is eventually made to move the existing Windows builders over to GitHub Actions (or there are additional machines), this should definitely be doable and something we should focus effort on.

We’re limited to five Mac runners at a time. It should be doable for opt-in testing, but opt-in testing probably won’t catch most of the bugs that we want this sort of system to catch.

Looking at my previous ten PRs (on the Buildkite pipeline overall, not just the Linux premerge), 7/10 had at least one Buildkite pipeline fail, with none of the failures being related to the patches themselves.

https://github.com/llvm/llvm-project/pull/78880 - passed
https://github.com/llvm/llvm-project/pull/78878 - passed
https://github.com/llvm/llvm-project/pull/77900 - https://buildkite.com/llvm-project/github-pull-requests/builds/29622
https://github.com/llvm/llvm-project/pull/77887 - https://buildkite.com/llvm-project/github-pull-requests/builds/28691
https://github.com/llvm/llvm-project/pull/77374 - https://buildkite.com/llvm-project/github-pull-requests/builds/29995
https://github.com/llvm/llvm-project/pull/77283 - https://buildkite.com/llvm-project/github-pull-requests/builds/27267
https://github.com/llvm/llvm-project/pull/77264 - https://buildkite.com/llvm-project/github-pull-requests/builds/27224
https://github.com/llvm/llvm-project/pull/77226 - https://buildkite.com/llvm-project/github-pull-requests/builds/27142
https://github.com/llvm/llvm-project/pull/77224 - https://buildkite.com/llvm-project/github-pull-requests/builds/27140
https://github.com/llvm/llvm-project/pull/76788 - passed

A failure rate that high creates alarm fatigue, and people stop checking the pipeline at all. I know some people who pretty much ignore the pipeline results because the failure rate is so high. This is compounded by the fact that the reporting within GitHub is not good. The entire pipeline is reported as a single unit, so someone just sees that it failed rather than what part failed. This has become especially prominent with the Windows failures, where the pipeline will just show the yellow waiting icon for 8+ hours until a runner finally gets to the other half of the pipeline.

Sure, a lot of failures in that list are probably from commits in main that are causing failures, and the issues that I brought up could theoretically be worked on. But no one has stepped up to do so, and instead quite a few people just end up ignoring the pipeline. Moving to GitHub Actions helps alleviate this problem somewhat, as people are actually willing and able to work on it. Looking at hard stats on the premerge pipeline changes versus GitHub Actions changes:

  • <monorepo>/.ci has sixteen commits in its entirety from four different authors. google/llvm-premerge-checks has a lot more, but that infrastructure is rarely edited by the community. All the recent commits are from the maintainers, and there have only been 21 contributors across its history (most of them one-off patches).
  • .github/workflows has 200+ commits from 33 different authors, 22 of which have submitted more than one patch. I have seen everything from adding new functionality to fixing the reliability of existing checks like the code formatting action. Almost all of these have taken place during and after the transition to Github.

I think moving to Github actions helps empower the community to ultimately make the CI more reliable as they are significantly more likely to be able to fix any reliability issues/missing features.

The current Buildkite maintainers have done a great job with the constraints that they had to work with (especially given it working with Phabricator), but I believe that ultimately a significant amount of community involvement in the CI is enabled by moving to Github Actions, and that is what is going to really get CI to work well within LLVM.

Sure, that can be fixed by means other than this proposal. But there has been a lot of talk about doing this for quite a while at this point, and nothing substantial has happened. I would think that there are a lot of practical constraints on doing this since, like you mention, it ultimately relies on the goodwill of employees at Google.

(Thank you for your tenure maintaining Phabricator BTW).

Sorry. I probably should’ve explained this better in the motivation section. The three main points regarding this are the following:

  1. Ability for the community to hack on the CI
  2. Reliability of the CI
  3. Integration of CI into the review platform

I still think moving to Github actions is the best course of action for addressing those problems. We might disagree on the pipeline reliability, but I and many others find the pipeline unreliable. If we want people to actively use CI, we need them to have some level of trust that the pipeline reports (almost) only true positives and true negatives.

In addition, many others have also been interested in moving the CI to Github for reasons related to those above. One is @goncharov, who maintains the existing premerge setup.

Maybe it's not the specific implementation details here that end up fixing the problem. Those were mainly intended to get something working so we can get the ball rolling without anyone needing to cut large checks or do any complicated setup. But eventually, I do think precommit through GitHub Actions makes the most sense.

This proposal is intended as an eventual replacement. We can’t replace the current premerge system with the current implementation in this proposal, at least to the same level of latency/coverage. I would expect there to be a transition period where we get Github Actions working properly and then eventually switch/potentially move machines over. The immediate implementation would simply add new jobs.

Thanks for looking; however, that's not a root cause:

If the failures are from main actually being broken, then I don't see how what you're proposing in the RFC would improve the situation in any way.
There is already a bot checking the main branch: https://lab.llvm.org/buildbot/#/builders/272

If this is the normal state of the main branch, then this seems like the main problem to solve and I don’t quite understand how your proposal is really touching on the problem here.

Sure, this is something that can be changed independently, though. Which goes along with my request to list the actual problems specifically.

You're comparing apples to oranges here: the .ci workflow that has 16 commits is code that was moved from google/llvm-premerge-checks!

Also this is very mature code in production for years, so of course it won’t get many commits in the few months it has been in the monorepo.

Similarly, google/llvm-premerge-checks (which should only contain the VM definitions by now?) is something that should go in Zorg (if things haven't moved yet). But that should likely be done in conjunction with setting things up in a foundation-owned GCP project. It's likely not difficult to start from scratch:

Please, again, provide data: you're making wild, unsubstantiated claims here. When you say the "pipeline is unreliable" you're not saying much without actually qualifying what the problem is: are the machines unreliable (running out of space, hanging, etc.)? Is Buildkite itself unreliable (web-hook unresponsive, triggers lost, workers lost, etc.)? Is the script we use to build unreliable (does not report errors correctly? does not test what it should)?

It's important because otherwise how can you claim that just "migrating to GitHub Actions" will make it suddenly reliable?
Also it is important because if the machines are a problem, then using self-hosted runners would carry over the problems.

Right now from what I can see: “pipeline is unreliable” means to you “I see unrelated failures on my PR”, which can be solely attributed to the main branch being broken and nothing related to the infrastructure!

Again, my main concern with the way you approached this has nothing to do with ditching buildkite. My main concern is your whole approach of ditching everything!
(compared to looking at carrying over the machines as self-hosted runners for example, preserving the current workflow for the scripts, etc. for which there is momentum here by the way)

Lots of things can be improved; you haven't shown a link between the current problem of "unrelated failures" on PR pre-merge and the infra, though.

Some root causing of mine right now:

So one of the main sources of breaking the pre-merge checks seems to be that folks are merging without having these checks pass. Things that come to mind right now to address this:

  • Enabling this feature: Block Github merge on changes requested? - #25 by tstellar; that way, before merging, there would be a popup if the build is failing and the user wouldn't merge unknowingly.
  • Implement an "auto-revert": when the post-merge job corresponding to the pre-merge is failing, it could automatically open a PR for the revert (leaving it up to a human to click the merge button on it) and post a comment in the original PR (see the sketch after this list).
  • Enable GitHub “merge queues”: if folks can have a fire-and-forget button, they may use this to merge instead of just directly merging (I would use this).
  • Having a "freshness" check: invalidate a pre-merge check after 24h, for example.
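
To make the auto-revert idea slightly more concrete, here is a rough sketch of what such a helper could do, assuming it is handed the SHA of the offending commit and the number of the original PR by whatever detects the post-merge failure. It only uses plain git plus the standard GitHub REST endpoints for creating a pull request and posting an issue comment; everything else (token setup, failure detection, safeguards) is omitted or assumed.

```python
#!/usr/bin/env python3
"""Rough sketch of an auto-revert helper (hypothetical, not an existing tool):
create a revert branch for a bad commit, open a PR for it, and comment on the
original PR so a human can decide whether to merge the revert."""
import os
import subprocess
import requests

REPO = "llvm/llvm-project"
API = f"https://api.github.com/repos/{REPO}"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def open_revert_pr(bad_sha: str, original_pr: int) -> None:
    branch = f"revert-{bad_sha[:12]}"
    # Create the revert commit locally and push it to a branch.
    subprocess.run(["git", "checkout", "-b", branch, "origin/main"], check=True)
    subprocess.run(["git", "revert", "--no-edit", bad_sha], check=True)
    subprocess.run(["git", "push", "origin", branch], check=True)
    # Open a pull request for the revert; merging it is left to a human.
    pr = requests.post(
        f"{API}/pulls",
        headers=HEADERS,
        json={
            "title": f"Revert {bad_sha[:12]} (post-merge CI failure)",
            "head": branch,
            "base": "main",
            "body": f"Automated revert proposal for {bad_sha}.",
        },
    )
    pr.raise_for_status()
    # Leave a note on the original PR pointing at the proposed revert.
    requests.post(
        f"{API}/issues/{original_pr}/comments",
        headers=HEADERS,
        json={"body": f"The post-merge job failed for {bad_sha}; "
                      f"proposed revert: {pr.json()['html_url']}"},
    ).raise_for_status()
```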