It’s been a while since any updates to premerges: I was mostly maintaining them so they would not fall apart too badly. But this is clearly not sustainable and causes frustration both for me and for all of you. So here are the three main things I think we should do; please comment on each in the relevant thread.
One of the biggest issues I have is demonstrating that having premerges actually improves the quality of development. For example, I have tried looking at the number of reverts in the project, and it does not seem to budge. One idea is to run the checks on only a random subset of reviews and see whether the number of submitted breakages in group A differs from group B. I am not sure if you are up for such experiments.
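Such an experiment could be evaluated with a standard two-proportion z-test on breakage rates. A minimal sketch, using only the standard library; the group sizes and breakage counts below are purely made-up illustrations, not real project data:

```python
from math import sqrt, erf

def two_proportion_z(breaks_a, total_a, breaks_b, total_b):
    """Two-proportion z-test: does the breakage rate of group A
    (reviews with premerge checks) differ from group B (without)?"""
    p_a = breaks_a / total_a
    p_b = breaks_b / total_b
    # Pooled proportion under the null hypothesis of equal rates.
    p = (breaks_a + breaks_b) / (total_a + total_b)
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value via the normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical numbers: 40 breakages out of 1000 checked reviews,
# 70 out of 1000 unchecked ones.
z, p = two_proportion_z(40, 1000, 70, 1000)
print(round(z, 2), round(p, 4))
```

With these invented numbers the difference would be statistically significant (p well below 0.05); the real question is whether the project generates enough submissions per month for the test to have adequate power.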
As usual, if you are interested in improving CI quality, please get in touch.
I think the existing pre-merge checks have a reputation for being not especially robust, i.e., they tend to fail for unrelated reasons. For example, my D150114 had a pre-merge check failure, yet that patch has zero code changes: all it does is add comments. Following the links, I can’t tell what the problem is.
Pre-merge CI is not helpful unless it is robust and the reports are actionable.
+1 to this, I’ve been advocating for this since the beginning: “no false positives” is a must for a pre-merge CI, and having the exact configuration tracking the main branch in buildbot seems like the very first step! (next is pruning flaky tests and reducing the surface of testing based on what is changed in the patch).
This is a high-priority config to keep green all the time.
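Reducing the testing surface based on what a patch touches could start from something as simple as a prefix map from changed paths to check targets. A sketch of that idea; the mapping below is an illustrative guess, not the actual premerge configuration (e.g., a real mapping would also know that clang changes can affect clangd):

```python
# Map source-tree prefixes to the ninja check targets that cover them.
# Illustrative only; a real table would live alongside the CI config.
PATH_TO_TARGETS = {
    "llvm/": {"check-llvm"},
    "clang/": {"check-clang"},
    "clang-tools-extra/clangd/": {"check-clangd"},
    "mlir/": {"check-mlir"},
}

def targets_for_patch(changed_files):
    """Return the set of check targets a patch should run,
    based on which prefixes its changed files fall under."""
    targets = set()
    for path in changed_files:
        for prefix, tgts in PATH_TO_TARGETS.items():
            if path.startswith(prefix):
                targets |= tgts
    # Fall back to everything if no file matched a known prefix.
    return targets or {"check-all"}

print(targets_for_patch(["clang/lib/Sema/SemaDecl.cpp"]))  # → {'check-clang'}
```

Even this naive version would have skipped the clangd tests for a patch that only touches, say, MLIR, which is most of what “reducing the surface” buys.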
Pre-merge checks are not readable and therefore not actionable
Speaking to readability, I took a look at D150114, and I see this in the logs:
```
FAIL: Clangd Unit Tests :: ./ClangdTests/26/37 (1229 of 1229)
******************** TEST 'Clangd Unit Tests :: ./ClangdTests/26/37' FAILED ********************
Value of: Server.blockUntilIdleForTest()
Failed Tests (2):
  Clangd Unit Tests :: ./ClangdTests/ClangdServer/MemoryUsageTest
  Clangd Unit Tests :: ./ClangdTests/ClangdServer/TidyOverrideTest
```
Perhaps that’s not very readable, and maybe the accessibility can be improved at the margin (e.g., cleaning up the logs by running the tests with llvm-lit instead of through ninja), but I don’t think there’s much to do here on the premerge infrastructure side. I think most of the burden falls on folks writing LLVM tests and test infrastructure to design tests so that failures are readable and actionable. The test logs are there, and that’s what you’d get if you ran the tests locally.
Speaking to failures unrelated to the patch, yes, LLVM isn’t one project, it’s an umbrella, and this is an issue that will require more thought. I highly doubt that your comment-only change caused these clangd test failures, so they must be flaky or pre-existing failures. I can think of two ways to solve this problem:
1. Enforce rigorous, high standards of pre-commit testing across the entire LLVM project, to the point that all of us start to care about these clangd tests, for example. If we find that they don’t provide value for the average developer, we should remove them from the LLVM-wide presubmit checks.
2. Invest heavily in test infrastructure. Add things like A/B comparisons of test results before and after the patch to ignore pre-existing failures. Add test re-runs to deflake tests.
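To make the scope of approach 2 concrete, the core of a baseline comparison plus a deflaking re-run might look roughly like this. This is a sketch only: the failure sets and the `rerun` callable are hypothetical stand-ins for real CI plumbing:

```python
def triage(base_failures, patch_failures, rerun):
    """Classify failing tests by comparing against a baseline run
    and re-running the new failures to filter out flakes.

    base_failures / patch_failures: sets of failing test names from
    runs without and with the patch applied.
    rerun(test) -> True if the test fails again on a second attempt
    (hypothetical hook into the test runner).
    """
    pre_existing = patch_failures & base_failures
    new = patch_failures - base_failures
    flaky = {t for t in new if not rerun(t)}
    real = new - flaky
    return {"pre-existing": pre_existing, "flaky": flaky, "real": real}

report = triage(
    base_failures={"old :: test"},
    patch_failures={"old :: test", "new :: test", "flaky :: test"},
    rerun=lambda t: t == "new :: test",
)
print(report["real"])  # → {'new :: test'}
```

The hard (and expensive) parts are everything around this function: keeping a fresh baseline run per configuration, scheduling the re-runs, and storing the results, which is exactly the eternal maintenance burden mentioned below.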
I want to emphasize that approach 2 is really expensive, and it comes with an eternal maintenance burden, even if it sounds appealing to the average developer. It’s a big increase in scope. The main problem with approach 1 is that it requires discipline, and nobody likes to spend time arguing over the perceived value of somebody else’s tests.
My view is that we should start with the disciplined approach 1, and we should reduce the set of tests that we run on pre-merge checks so that we get something with a high signal-to-noise ratio and thereby earn more community approval. We should prioritize developers’ trust in the system overall, even if that means compromising on complete test coverage.
Thanks for all the work you have been putting into the precommit testing infrastructure!
From my personal experience, the precommit tests have helped me multiple times by highlighting test files I had missed updating in a patch (e.g., due to having some backend/sub-project disabled). I suspect that in many such cases, people (myself included) just push a follow-up commit to fix the test failure rather than revert the original commit, so those cases would be missed by looking at reverts alone.
Now that you’ve identified the failures, I can find them too. And once I get that far (and if it seems likely to be my fault), I know how to go about trying to reproduce it. It’s as actionable as running the tests myself, once I find the report.
But to get there takes fumbling around with links on two or three intervening pages, scrolling to the bottom of an 11000-line log, and working backwards 100 or so lines to find what actually went wrong. This is not an especially friendly user experience. When I run the tests myself there is a summary right at the end, and if I care about details I can work back to where those are reported. This pre-commit log blorps out a bunch of other stuff after the important bit and makes it that much harder to navigate.
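Surfacing that buried summary is mostly string processing. A sketch that pulls the trailing `Failed Tests` section out of a lit log; the summary format matches what lit prints, but the surrounding noise is assumed:

```python
def failed_tests_summary(log_text):
    """Extract the test names listed under lit's 'Failed Tests' summary,
    which can sit thousands of lines deep in a CI log."""
    failures = []
    collecting = False
    for line in log_text.splitlines():
        if line.startswith("Failed Tests"):
            collecting = True
            continue
        if collecting:
            if line.startswith("  "):   # lit indents each failing test
                failures.append(line.strip())
            elif line.strip():          # first non-indented line ends the list
                break
    return failures

log = "\n".join([
    "... 11000 lines of build output ...",
    "Failed Tests (2):",
    "  Clangd Unit Tests :: ./ClangdTests/ClangdServer/MemoryUsageTest",
    "  Clangd Unit Tests :: ./ClangdTests/ClangdServer/TidyOverrideTest",
    "Total Discovered Tests: 1229",
])
print(failed_tests_summary(log))
```

Putting even just this list at the top of the pre-commit report, above the raw log link, would remove most of the fumbling described above.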
(I have similar issues with the Jenkins jobs we run internally, where it seems that most of the information Jenkins is willing to provide is oriented toward debugging the Jenkins job itself, not reporting the software-under-test problem. Maybe it’s just the nature of automated testing to make that stuff not simple to find, but I have to think we can do better.)
So, while we could have a separate debate about the way individual LLVM tests work and whether they provide actionable reports, I maintain that the LLVM test results that we are all used to are unnecessarily hard to find in this pre-commit report.
As with anything, once I know how to navigate to where pre-commit is hiding the important info, yes I can find it. But this is definitely not a good experience for the newcomer, and unnecessarily burdensome on everyone.
FWIW, I have found these tests useful (as posted on other threads); they have prevented several patches from landing that would otherwise have broken one or more build bots (in some cases, virtually all build bots!). However, I am aware that I have a higher tolerance for false positives and hard-to-navigate UIs than perhaps your average developer.
I do agree that the usability of the reports is less-than-ideal. In my ideal world, you’d get reports along the lines of the following:
Configuration and build failures (i.e. failures during cmake, or failures due to files failing to compile/link): a clear statement that the code failed to build, with a link to the part of the log that describes the build failure (not just the start/end of that log, since quite often there’ll be plenty of unrelated noise).
Test failures (i.e. lit and unit test failures): a list of tests that failed, with links to the relevant parts of the test output that describes the individual test failure.
Patch failures (i.e. where Phabricator/GitHub can’t apply the patch): ideally, these wouldn’t occur, but given that they do, I don’t think there’s a specific need to improve this, beyond perhaps being able to specify somewhere which base commit the patch should be applied to (I doubt this is easily possible, admittedly).
Where multiple configurations fail, the failure report should list these separately (like it does already, to be fair).
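A first cut at that categorization could classify each configuration’s log by its telltale markers. A sketch; the marker strings are assumptions about what git, cmake, ninja, and lit typically emit, and a real implementation would also record byte offsets for the deep links:

```python
def classify_failure(log_text):
    """Bucket a CI log into one of the report categories:
    patch, configure/build, or test failure."""
    if "error: patch failed" in log_text:   # git apply's failure message
        return "patch"
    if "CMake Error" in log_text:           # configure-stage failure
        return "configure/build"
    if "FAILED:" in log_text:               # ninja's compile/link failure marker
        return "configure/build"
    if "Failed Tests" in log_text:          # lit's summary header
        return "test"
    return "unknown"

print(classify_failure("CMake Error at CMakeLists.txt:10 (message):"))
```

The ordering matters: a build failure usually means lit never ran, so the build markers are checked before the test ones.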