There are a few problems I’ve been trying to address:
1. Instability of the Phabricator <=> Buildkite bridge. We sometimes see spurious failures in the setup steps of the CI pipeline, before the actual CI jobs are triggered. This is an example. From the Phabricator side, this looks like the build is still running, but in reality it has failed.
2. Complexity of getting to the actual CI results. When you click on the current pre-merge checks link in the Phabricator UI, you get to a page like this in Harbormaster. If you want to see the actual CI being run, you need to copy-paste the plain-text buildkite.com link and follow it. You then land on a first Buildkite pipeline and need to click through two other pipelines to get to the actual CI pipeline being run for libc++. This is a bit cryptic for folks who are not familiar with our CI setup and are just trying to get stuff done.
3. The current Phabricator <=> Buildkite bridge uses the unit test reporting feature of Phabricator to report failing tests in Lit. This is nice, however as explained here it doesn’t always report everything. As a result, we’ve sometimes seen CI runs where we thought there were no more failing tests when in reality there were.
4. We’ve also seen instances of the libc++ CI pipeline being skipped while the Phabricator UI reported the CI as passing. This happened with ⚙ D143914 [libc++] Clean up pair's constructors and assignment operators and this CI run (you won’t be able to see the green check mark on the review anymore because I shipped it).
Because large chunks of the CI setup don’t live in the LLVM monorepo and are run externally, it can be challenging to fix these issues by ourselves. While the folks who support us have been extremely resourceful and I am eternally thankful for their support, I feel like we would benefit from having more control over our CI infrastructure, since it has grown to become a very important part of some projects. For example, when the CI infrastructure starts failing, libc++ simply can’t make any progress because we rely almost entirely on our CI to test our ~60 different configurations.
Things have been working mostly well for us for the past two years, but recently we started seeing more and more of these issues. In particular, issues (1) and (4) were non-existent and only started happening recently, which is basically a dealbreaker for libc++ development. As a result, I started investigating solutions to unblock libc++ contributors, and I discovered that Buildkite has a built-in integration with Phabricator: Phabricator | Buildkite Documentation.
This is what I am looking into right now. It requires using the “staging area” feature of Phabricator. A staging area is basically a remote repository that `arc diff` automatically pushes a tag to, and that any external system can use to fetch the exact source code content used by a given Differential diff. This results in a simpler stack for integrating Phabricator and Buildkite, which means fewer things that can fail between us and the actual CI. For example, this shows how the Phabricator UI presents a failed job in my test review:
Here, clicking on `View in Buildkite` will take you directly to the CI pipeline being run, which solves (2) as well as (1), (3) and (4). Note that the job is failing due to a legitimate issue in our tests; don’t let that distract you.
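To make the mechanism concrete, here is a rough local simulation of how a staging area works, using only plain git. The tag naming scheme (`phabricator/diff/<DIFF_ID>`), the diff ID, and all repository paths below are assumptions for the sketch, not our actual setup:

```shell
# Simulate the staging-area workflow with local repositories.
set -eu
tmp=$(mktemp -d)

# A bare repository standing in for the remote staging area.
git init -q --bare "$tmp/staging.git"

# The contributor's working copy: commit a patch and push a staging tag,
# as `arc diff` would do automatically.
git init -q "$tmp/work"
cd "$tmp/work"
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "patch under review"
DIFF_ID=123456   # hypothetical Differential diff ID
git push -q "$tmp/staging.git" "HEAD:refs/tags/phabricator/diff/$DIFF_ID"

# A CI agent fetches exactly that commit, without talking to Phabricator.
git init -q "$tmp/ci"
cd "$tmp/ci"
git fetch -q "$tmp/staging.git" "refs/tags/phabricator/diff/$DIFF_ID"
git checkout -q FETCH_HEAD
git log -1 --format=%s   # the exact commit that was sent for review
```

The key point is that the CI side only needs git access to the staging repository, not any Phabricator API calls, which is why this stack has fewer moving parts that can fail.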
The downside is that `arc diff` will need to push the commit being reviewed to a staging area, which means that some GitHub access needs to be set up in order to submit a patch and have the CI run on it. If one doesn’t want to set up GitHub access, `arc diff --skip-staging` can be used to skip the staging area, but then the CI will not run. This also means that uploading diffs via the web UI wouldn’t support pre-commit CI either.
Those are definite downsides; however, the current situation is also problematic, since it gets in the way of getting stuff done when the CI fails us. In the short term, I have a working proof of concept using the staging area and the built-in Phabricator integration, and I will be onboarding libc++ contributors onto that workflow to see how well it works. Depending on how that goes, we might propose doing the same for the rest of the project; however, my goal here is primarily to solve the small crisis we are facing with the libc++ CI.
> That should allow us to have some PRs and figure out any issues with GitHub without affecting the existing review process, but also improve pre-commit CI.
I think that’s great, and in fact @tstellar was suggesting something similar. I would support any work in the direction of frontloading the integration of our CI with GitHub PRs; I think we’ll be extremely thankful for that when we begin switching over to PRs. However, as I tried to explain above, the investigation I am conducting has the main goal of fixing some very concrete ongoing issues that are making it difficult for libc++ to make progress as usual, so it’s somewhat orthogonal and narrower in scope.