Investigating CI workflow improvements and potential changes to arc

Hi,

In response to CI-related issues we’ve seen recently, we are currently investigating some improvements to the pre-commit CI flow, in particular with how CI jobs are triggered when code reviews are created in Phabricator.

This may evolve into requiring changes to how we use arc, such as requiring commit access in order to get access to the pre-commit CI. However, for the time being, this is just a heads up that we’re investigating solutions to improve the overall pre-commit CI experience, and we’ll keep folks informed when we have more findings to share.

Thanks,
Louis

CC @tstellar

1 Like

Is this referring to how pre-commit tests are run and displayed in Phabricator, e.g. highlighting potential test failures? If so, will this result in users who don’t use arc to upload their patches in losing the functionality of pre-commit testing?

Could you please share at least some of the ideas?

I was recently thinking about writing an RFC to allow PR created by bots: essentially to replace the current setup with a) a bot that will take a diff and create a PR to llvm-project b) CI that will run against this PR. Such PRs are going to be marked as “this is not to review the change, go ”. Only bot user will be allowed to create PRs.

That should allow us to have some PRs and figure any issues with GitHub without affecting existing review process but also improve pre-commit CI.

Should I propose that?

There are a few problems I’ve been trying to address:

  1. Instability of the Phabricator <=> Buildkite bridge. We sometimes see spurious failures in the setup steps of the CI pipeline, before the actual CI jobs are triggered. This is an example. From the Phabricator side, this looks like the build is still running but in reality it has failed: Screenshot 2023-05-02 at 8.20.01 AM

  2. Complexity of getting to the actual CI results. When you click on the current pre-merge checks link in the Phabricator UI, you get to a page like this in Harbormaster:


    If you want to see the actual CI being run, you need to copy-paste the buildkite.com link in plain-text and follow it. Then you get to a first Buildkite pipeline and you need to click through 2 other pipelines in order to get to the actual CI pipeline being run for libc++. This is a bit cryptic for folks not familiar with our CI setup and just trying to get stuff done.

  3. The current Phabricator <=> Buildkite bridge uses the unit test reporting feature of Phabricator to report failing tests in Lit. This is nice, however as explained here it doesn’t always report everything. As a result, we’ve sometimes seen CI runs where we thought we had no more failing tests but in reality there were.

  4. We’ve also seen instances of the libc++ CI pipeline being skipped but the Phabricator UI reporting the CI as passing. This happened with ⚙ D143914 [libc++] Clean up pair's constructors and assignment operators and this CI run (you won’t be able to see the green check mark on the review anymore because I shipped it).

  5. Because large chunks of the CI setup don’t live in the LLVM monorepo and are run externally, it can be challenging to fix these issues by ourselves. While the folks who support us have been extremely resourceful and I am eternally thankful for their support, I feel like we would benefit from having more control over our CI infrastructure, since it has grown to become a very important part of some projects. For example, when the CI infrastructure starts failing, libc++ simply can’t make any progress because we rely almost entirely on our CI to test our ~60 different configurations.

Things have been working mostly well for us for the past two years, but recently we started seeing more and more of these issues. In particular, issues (1) and (4) were non-existent and they started happening recently, which is basically a dealbreaker for libc++ development. As a result, I started investigating solutions to unblock libc++ contributors who were blocked and I discovered that Buildkite had a builtin integration with Phabricator: Phabricator | Buildkite Documentation.

This is what I am looking into right now. It requires using the “staging area” feature of Phabricator. A staging area is basically a remote repository that arc diff will automatically push a tag to, and that can be used by any external system to fetch the exact source code content used by a given Differential Diff. This results in a simpler stack to integrate Phabricator and Buildkite, which means fewer things that can fail between us and the actual CI. For example, this shows how the Phabricator UI presents a failed job in my test review:


Here, clicking on View in Buildkite will take you directly to the CI pipeline being run, which solves (2) as well as (1), (3) and (4). Note that the job is failing due to a legitimate issue in our tests, don’t let it distract you.

The downside is that using arc diff will need to push the commit being reviewed to a staging area, which means that there needs to be some Github access set up to submit a patch and have the CI run on it. If one doesn’t want to set up Github access, arc diff --skip-staging can be used and the staging area will be skipped, and the CI will not function. This also means that uploading diffs via the web UI wouldn’t support pre-commit CI either.

Those are definite downsides, however the current situation is also problematic since it gets in the way of getting stuff done when it fails us. In the short term, what I have is a working proof of concept using the staging area and the builtin Phabricator integration, and I will be onboarding the libc++ contributors onto that workflow to see how well that works. We’ll see how that goes and we might propose doing that for the rest of the project, however my goal here is primarily to solve the small crisis we are facing with the libc++ CI.

That should allow us to have some PRs and figure any issues with GitHub without affecting existing review process but also improve pre-commit CI.

I think that’s great and in fact @tstellar was suggesting something similar. I would support any work in the direction of frontloading the integration of our CI with Github PRs, I think we’ll be extremely thankful for that when we begin switching over to PRs. However, as I tried to explain above, the investigation I am conducting has the main goal of fixing some very concrete ongoing issues that are making it difficult for libc++ to make progress as usual, so it’s a bit orthogonal and narrower in scope.

I’d be extremely loathe to lose the pre-commit CI for patches uploaded via the web UI, because as someone who nowadays almost only ever reviews patches, and rarely, if ever, builds and tests these patches themselves due to resource limitations, I rely heavily on this functionality to spot problems, often in the Windows CI (most developers seem to build and test on non-Windows systems, which is fine, but consequently various platform-specific issues are often missed). I don’t know how many of the patches I review are uploaded via the UI versus arc, but from a quick straw poll of colleagues who post patches on Phabricator, several of them use the UI to upload them, so based on that, I suspect a non-trivial number of patches will be negatively impacted by changing the existing workflow.

Of course, I’m conscious that others will use different workflows and be impacted in different ways, but can we find a solution that doesn’t force people to use arc if they want to have the pre-commit testing?

Oh, just to clarify, is this referring to something specific to do with libc++ or the wider pre-commit CI used by LLVM via Phabricator?

I mentioned a similar idea when discussing this with @ldionne on IRC. I think this is the logical next step, we already have bots creating pull requests for releases, so it should be doable.

nice! @ldionne I hope to have time this week to look into your issues!

This was referring primarily to libc++ because that’s where we have the most concrete issues and we’re probably the project that relies the most heavily on pre-commit CI in our workflow. However, my findings would have potentially applied to the whole project.

It turns out that my findings are basically this:

  1. I am able to set up a direct Phabricator <=> Buildkite integration using their builtin integration, and it works. However, it requires enabling the “staging area” feature of Phabricator, which would require anyone uploading a review to do it using arc diff (otherwise CI won’t work), and also it would require commit access to some repository (not necessarily github.com/llvm/llvm-project, but something).
  2. Job cancellation doesn’t work with the builtin integration. This means that if you upload a patch and then re-upload the patch shortly after (maybe you’ve fixed a typo or something), the first job that was created doesn’t get cancelled. This is a deal breaker for us because our CI wouldn’t be able to bear the load if we don’t have that feature. I guess we could implement that in Harbormaster, but pouring more time into an end-of-lifed technology doesn’t make sense IMO.

I have everything set up with custom Herald rules and all and it works, however the two issues above (in particular #2) are a deal breaker for wider adoption in my opinion. After discussing with some folks on Discord, I believe that the best solution here is to fast-track experimentations with Github PRs – I’ll be writing another post explaining that.

Thank you! The most important issue to solve in the short term would really be the one where jobs sometimes get cancelled for an unknown reason, and then the Phabricator UI reports the job as being green. The other issues we can live with until we’ve moved over to PRs, I think.

IIUC there are people who don’t use arc because of allergies to PHP. An arc-dependent solution would not work for everyone.

This is it: Opening up PRs experimentally for a subset of the LLVM project