Unfortunately, we did not realize from the beginning how detrimental false positives (reported failures when the change is actually fine) can be to the perception of our results. In hindsight, it would have been better to start with something basic and reliable, gradually expanding tests (thanks to Augie Fackler for the insight). However, we also wanted to experiment and explore different modes.
Nevertheless, I believe the situation can still be mended.
Here are some ideas:
-
(Easy) Significantly restrict the number of tests/projects that are tested. For example, we can limit testing to LLVM and clang on Linux only. For other subprojects, we can recommend defining a stable subset of build/test or following the libcxx approach by defining their own custom pipeline. We can still provide a way for developers to request a Windows build or broader testing if desired, perhaps by adding a specific keyword to the review description.
-
(Hard) Select a “golden green” commit as a baseline for comparison.
-
(Hard) Identify new test failures introduced by the change compared to the main branch. This approach is not ideal and does not address flaky tests.
-
(Unknown) Introduce a way to fine-tune the tests and exclude long-running or flaky ones. The difficulty and policy around disabling tests are uncertain.
Considering the options, I believe implementing (a) would provide a net benefit in terms of feedback time (as we would have fewer initial tests to run) and significantly reduce the number of false positives.
Thoughts?
IMO each project should define what it wants to run as CI. It really is different for each project, and trying to have one CI job that satisfies everyone ends up satisfying no one.
What I would suggest here is that we could simply call a script like generate-buildkite-pipeline
(llvm-project/generate-buildkite-pipeline at main · llvm/llvm-project · GitHub) which would be expanded by projects that want to with their own testing pipeline. For example we could move generate-buildkite-pipeline
to utils/ci
and then it would be entirely transparent what CI happens for what project, and it would be trivial for folks to onboard their project by just editing that file.
Essentially this is a stripped down version of your first suggestion above where we don’t even try to run anything unless the project has opted into it and define what CI they want to run. But we’re empowering the projects to do it by making it super easy and handling the infrastructure side of things.
IMO each project should define what it wants to run as CI.
This is already the case to some extent: there are various configs in buildbot, and even in the pre-merge it is quite customizable. For example, phabricator changes to MLIR are for a while running ninja check-mlir
and won’t build or test any of clang, lld, etc. (it’ll build the minimum amount of LLVM needed, but not run any test there).
More importantly, “project should define” does not seem very well defined to me, because projects are composed of various stakeholder which each have their own interest and priorities: right now it is up to a stakeholder to step up and provide CI for something (a config, platform, etc.) they care about.
One more thing is that there are things we can do in a “unified” pre-merge build as today that become less efficient when you treat projects independently: the projects mostly form a DAG organized around LLVM itself, and the unified build means building LLVM once instead of “once per project a patch is touching” (That may be fine since most patch are not cross-projects?)
I don’t object to the principle of this discussion, but please don’t disable Windows tests in the pre-merge CI, at least for core LLVM (possibly clang and lld too). A significant number of patches I see have Windows problems in them, because most developers develop on non-Windows systems, so miss simple things that don’t work on Windows for whatever reason. Build failures that make it into main are a pain to deal with, even if LLVM’s post-commit build bots pick them up, because it blocks our (Sony’s) auto-merging process, if nothing else. Pre-commit testing has ensured many of these breakages haven’t made it to the main branch.
I’ve got no objection to different projects defining what sets of tests should be run though (including e.g. Windows/Linux only).