Severely degraded pre-commit CI performance for some builds

Yesterday it was brought to my attention that Linux pre-commit CI waiting times have crept up to 3 hours. I don’t consider that too bad, but I decided to look into it so that it doesn’t get worse in the future. I wasn’t able to find the root cause, but here are my findings.

Sometimes the compilation step on both Linux and Windows takes much, much longer than it usually does. Data points are listed below.

Baseline (https://buildkite.com/llvm-project/github-pull-requests/builds/91862): Linux — 51s, Windows — 446s.

Problematic builds I was able to find (Linux/Windows):
Github pull requests #91863 : 1372s/2882s
Github pull requests #91864 : 1383s/2870s
https://buildkite.com/llvm-project/github-pull-requests/builds/91865 : 1376s/2804s
Github pull requests #91887 : 1089s/2961s
Github pull requests #91892 : 798s/1072s

Windows numbers caveat

(Windows numbers are not entirely accurate, because I measured the time between the ninja invocation and the start of the MLIR tests, but on Windows the MLIR tests start before Clang and various LLVM tools are compiled. This behavior is repeatable, so I consider those numbers to still tell us a story.)

Looking closely at the compilation log, I can’t see any outliers. The slowdown is spread evenly across the compilation of individual translation units, as if the whole machine were slowed down.
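In case anyone wants to double-check this, here is a rough sketch of how per-target times from two builds could be compared. It assumes access to the .ninja_log files of a baseline and a problematic build, and that they use ninja’s tab-separated v5 log layout; the paths are placeholders, not our actual pipeline artifacts:

```python
#!/usr/bin/env python3
"""Sketch: compare per-target build times from two ninja logs.

Assumes ninja's tab-separated .ninja_log v5 layout
(start_ms, end_ms, mtime, output, command_hash).
"""
import sys


def load_times(path):
    """Map output file -> build time in seconds."""
    times = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip the "# ninja log v5" header and blanks
            start_ms, end_ms, _mtime, output, _hash = line.rstrip("\n").split("\t")
            times[output] = (int(end_ms) - int(start_ms)) / 1000.0
    return times


def main():
    baseline = load_times(sys.argv[1])  # e.g. .ninja_log from the baseline build
    slow = load_times(sys.argv[2])      # e.g. .ninja_log from a problematic build
    common = sorted(
        set(baseline) & set(slow),
        key=lambda o: slow[o] / max(baseline[o], 0.001),
        reverse=True,
    )
    # Print the 20 targets with the largest relative slowdown.
    for output in common[:20]:
        ratio = slow[output] / max(baseline[output], 0.001)
        print(f"{ratio:6.1f}x  {baseline[output]:7.1f}s -> {slow[output]:7.1f}s  {output}")


if __name__ == "__main__":
    main()
```

If the slowdown is uniform, the ratios printed here should all be roughly the same rather than dominated by a few targets.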

I don’t think it’s attributable to the subprojects a change touches. For example, baseline 91862 touches compiler-rt and LLVM, while problematic 91863 touches only LLVM.

I also don’t think it’s attributable to a particular agent. The five problematic builds above were each executed on a different Linux agent (our Linux pool has 5 agents), and on 4 different Windows agents. The same agents handle other builds just fine, within the usual time window.

An interesting fact is that I haven’t encountered a build where only Linux or only Windows was affected. This suggests that there could be something wrong with the PRs themselves, like a bad commit from main having made its way into their respective branches. But could a single commit slow down our pre-commit CI this much? That doesn’t seem likely to me.


Independently of the issue above, I also found out that one of the LLDB tests has been timing out after 20 minutes (Github pull requests #91810). I don’t think it’s related, and it seems that LLDB is already working on it in Fix single thread stepping timeout race condition by jeffreytan81 · Pull Request #104195 · llvm/llvm-project · GitHub.


I don’t understand why this is happening, but I’d love to hear what people think about it.

CC @boomanaiden154-1 @lnihlen @gchatelet @Keenuts @tstellar

At least one pattern I can see in all those PRs is that they’re old. Just looking at the numbering, new PRs are at about 105000, whereas these are at ~92000, and that is consistent for all of them.

If I had to take a guess, this would be due to cache thrashing. The times seem like they could be about right for building everything with a cold cache. If we aren’t already reporting ccache statistics in the build, we probably should start doing that so we can diagnose issues like this (assuming this is caused by caching issues).
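Reporting the statistics could be as simple as wrapping the build step like the sketch below. The ccache flags are standard, but the ninja invocation and build directory are placeholders; how this actually hooks into our premerge scripts would need to be worked out:

```python
#!/usr/bin/env python3
"""Sketch: wrap the build so ccache statistics end up in the CI log.

Assumes ccache is on PATH; the ninja command below is a placeholder,
not the real premerge pipeline invocation.
"""
import subprocess


def run(cmd):
    print(f"$ {' '.join(cmd)}", flush=True)
    subprocess.run(cmd, check=True)


def main():
    run(["ccache", "--zero-stats"])               # start from clean counters
    run(["ninja", "-C", "build", "check-llvm"])   # placeholder build/test step
    run(["ccache", "--show-stats"])               # hit/miss numbers go to the log


if __name__ == "__main__":
    main()
```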

I’m not sure what the solution is here. GitHub Actions by default tests the PR merged into main. That would fix the problem here (assuming I diagnosed it correctly; more investigation is needed) and let caches stay warm, but it would make everyone’s CI more dependent on whether or not main is broken.
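For context, “the PR merged into main” refers to the temporary merge commit GitHub creates for each mergeable PR. A rough sketch of fetching that state locally looks like this (the PR number and remote name are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: check out the state GitHub Actions tests by default for a PR,
i.e. the PR merged into its target branch.

Assumes a GitHub remote named "origin"; GitHub exposes the merge commit
as refs/pull/<N>/merge while the PR is mergeable.
"""
import subprocess

PR_NUMBER = 12345  # placeholder


def git(*args):
    subprocess.run(["git", *args], check=True)


def main():
    # Fetch GitHub's pre-made "PR merged into target branch" commit...
    git("fetch", "origin", f"pull/{PR_NUMBER}/merge:pr-{PR_NUMBER}-merged")
    # ...and build/test that instead of the raw PR head.
    git("checkout", f"pr-{PR_NUMBER}-merged")


if __name__ == "__main__":
    main()
```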


You are being misled by the way Discourse formatted those links. Those are Buildkite build IDs, not PR IDs. The PRs seem to be recent enough.

We don’t, but we probably should. Thank you for bringing this up; I totally forgot about this part of our CI.

I think the status quo is good, because it’s predictable. Testing changes against main implies some kind of automatic rebasing on top of main, which raises the question of what is actually being tested, especially if there are merge conflicts.

Moreover, I don’t consider main to be reliable. More than once I’ve had to merge main into my PR multiple times because of unrelated failures. I believe that the only way to address this is to merge only when CI is green, but that requires a set of cultural and technical changes.

Ah. Thanks for pointing that out. That definitely changes my analysis. :sweat_smile:

I’ll see if I can write up a patch today to further the investigation. Given the above, though, it’s reasonably likely it’s something else.

Yeah, I’m not interested in changing that currently if we don’t need to. GitHub handles merge conflicts by simply not running tests when a conflict is present. That’s not ideal, but maybe not the worst behavior, given that the goal of presubmit checks is to ensure that things still work before landing in main, and one cannot land anything in main without resolving merge conflicts. More discussion would need to happen before any changes are made there, though.

I think we’re in agreement here.

Merge conflicts aren’t really a problem as far as I can tell (one can’t merge without fixing them anyway). However, a PR that hasn’t been rebased for weeks and has no merge conflicts has a higher chance of a false “green” CI result if CI does not test the current merged state, and that seems to go against the goal of presubmit checks.

I agree now that testing PRs against main is a viable alternative to the status quo. Thank you for your comments.