Yesterday it was brought to my attention that Linux pre-commit CI waiting times have crept up to 3 hours. I don’t consider it to be too bad, but decided to look into it, so that it wouldn’t get worse in the future. I wasn’t able to find the reason, but here are my findings.
Sometimes the compilation step on both Linux and Windows takes far longer than it usually does. The data points are listed below.
Baseline (https://buildkite.com/llvm-project/github-pull-requests/builds/91862): Linux — 51s, Windows — 446s.
Problematic builds I was able to find (Linux/Windows):
Github pull requests #91863 : 1372s/2882s
Github pull requests #91864 : 1383s/2870s
https://buildkite.com/llvm-project/github-pull-requests/builds/91865 : 1376s/2804s
Github pull requests #91887 : 1089s/2961s
Github pull requests #91892 : 798s/1072s
Windows numbers caveat: the Windows numbers are not entirely accurate, because I measured the time between the ninja invocation and the start of MLIR tests, but on Windows MLIR tests start before Clang and various LLVM tools are compiled. This behavior is repeatable, so I consider those numbers to still tell us a story.
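To put the data points above in perspective, here is a small sketch that expresses each problematic build as a slowdown factor relative to baseline build 91862. The durations are copied from the list above; the function and variable names are just illustrative.

```python
# Compile-step durations in seconds, copied from the data points above.
BASELINE = {"linux": 51, "windows": 446}  # build 91862
SLOW_BUILDS = {
    91863: {"linux": 1372, "windows": 2882},
    91864: {"linux": 1383, "windows": 2870},
    91865: {"linux": 1376, "windows": 2804},
    91887: {"linux": 1089, "windows": 2961},
    91892: {"linux": 798, "windows": 1072},
}


def slowdown(build_times, baseline=BASELINE):
    """Return the per-platform slowdown factor of one build versus the baseline."""
    return {platform: build_times[platform] / baseline[platform] for platform in baseline}


def slowdown_table():
    """Return {build number: {platform: slowdown factor}} for all slow builds."""
    return {build: slowdown(times) for build, times in SLOW_BUILDS.items()}


if __name__ == "__main__":
    for build, ratios in slowdown_table().items():
        print(f"#{build}: Linux {ratios['linux']:.1f}x, Windows {ratios['windows']:.1f}x")
```

With these numbers, Linux comes out roughly 16x to 27x slower than the baseline and Windows roughly 2.4x to 6.6x slower, keeping in mind that the Windows measurements are approximate.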
Looking closely at the compilation log, I can’t see any outliers. The slowdown is evenly spread across the compilation of individual translation units, as if the whole machine was slowed down.
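One way to check the "evenly spread" impression more rigorously would be to compare per-target durations between a normal and a slow build. The sketch below assumes that the `.ninja_log` files from both builds can be retrieved from the agents, which I haven't verified for our CI. It relies on the tab-separated `.ninja_log` layout (start ms, end ms, mtime, output, command hash), and the helper names are hypothetical.

```python
import statistics


def parse_ninja_log(text):
    """Map each output path to its build duration in seconds.

    Expects .ninja_log contents: a '# ninja log vN' header followed by
    tab-separated lines of start_ms, end_ms, mtime, output, command hash.
    """
    durations = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the header and blank lines
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        start_ms, end_ms, output = int(fields[0]), int(fields[1]), fields[3]
        durations[output] = (end_ms - start_ms) / 1000.0
    return durations


def slowdown_spread(baseline, slow):
    """Return (median, min, max) slowdown ratio over targets present in both builds."""
    ratios = [
        slow[target] / baseline[target]
        for target in baseline
        if target in slow and baseline[target] > 0
    ]
    if not ratios:
        raise ValueError("no common targets with non-zero baseline duration")
    return statistics.median(ratios), min(ratios), max(ratios)
```

If the minimum and maximum ratios stay close to the median, the slowdown is uniform and points at the machine rather than at specific translation units. A few targets with much larger ratios would instead point at particular files or steps.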
I don’t think it’s attributable to the subprojects that a change touches. For example, baseline build 91862 touches compiler-rt and LLVM, while problematic build 91863 touches only LLVM.
I also don’t think it’s attributable to a particular agent. Each of the five problematic builds above was executed on a different Linux agent (our Linux pool has 5 agents), and together they ran on 4 different Windows agents. The same agents handle other builds just fine, in the usual time window.
An interesting fact is that I haven’t encountered a build where only Linux or only Windows was affected. This suggests that there could be something wrong with the PRs themselves, such as a bad commit from main that made its way into their respective branches. But could a single commit slow down our pre-commit CI this much? Doesn’t seem likely to me.
Independent of the issue above, I also found out that one of the LLDB tests has been timing out after 20 minutes (Github pull requests #91810). I don’t think it’s related, and it seems that LLDB developers are already working on it in llvm/llvm-project pull request #104195, “Fix single thread stepping timeout race condition” by jeffreytan81.
I don’t understand why this is happening, but I’d love to hear what people think about this.