[RFC] Add a warning when bypassing the premerge testing

mehdi_amini · March 15, 2024, 9:51am

Here is a first attempt to document this: [Doc] Add a section on CI to the GitHub documentation by joker-eph · Pull Request #85376 · llvm/llvm-project · GitHub

marbre · March 15, 2024, 10:39am

I strongly support this proposal and think it is a first necessary step into the right direction. The proposed changes to the documentation look good me.

AaronBallman · March 15, 2024, 12:23pm

Windows is a first-class target for our project and is a frequent source of test failures in Clang, so “let’s just skip it” is solving the wrong problem. IMO, we should not regress behavior for folks who develop on Windows or do post-commit triage.

AaronBallman · March 15, 2024, 12:52pm

This is circular logic and I disagree with it. The project is not broken at HEAD because we have no requirement that it be formatted. The check is wrong, and the changes you’d like to make will make it more obvious how wrong the check is (so it’s not orthogonal – I’m calling out the impact your proposed changes will have due to having a bad check).

Regardless, I still think it’s reasonable to move forward with your changes because it’s still possible to continue to ignore the faulty CI behavior. However, I think folks in Clang community are being trained to ignore CI failures because of the poor quality of the format check and I hope the folks working on infrastructure can solve that issue.

rengolin · March 15, 2024, 2:20pm

I’d like to echo @AaronBallman point here: this is not about Windows or any particular target, it’s about stability of our CI. I see that Mehdi’s doc update does not mention any of that, just “let’s all be reasonable”, which is very reasonable.

We also have problems with Flang tests that fail randomly, so there’s more than the infrastructure group can do to improve the state.

This is not about first-class, it’s about the role of pre-commit bots.

Back in the old days, when I was trying to make Arm a first class citizen, I had a requirement to not only fix the Arm backend, fix all tests and make the test-suite run flawlessly on Arm, but also to keep the buildbots clean, fast and actionable. Those were Panda boards, so not fast. It was a pain to make them clean and actionable, but it was my pain, not the community’s. But these were post-commit bots.

When we finally agree with pre-commit tests, the (very reasonable) requirement is that they actually be fast and stable. So, Windows can still be a first-class citizen while only having post-commit bots, like many other first-class supported features that do not have pre-commit.

To me, the obvious solution is on the infrastructure side: if we want Clang/Windows to be a pre-commit target, we need fast, clean and actionable bots available. And that means getting new machines, improve build times and stability, and not making everyone else wait hours for a bot to come back to merge.

mehdi_amini · March 15, 2024, 11:04pm

Seems like you argue about the definition of “broken”: if I commit a test which is undefined behavior, and your clang commit break the bot because of this test, is the project broken? After all it’s a faulty test…

I don’t see it any different: someone added a “test” here, you’re claiming that this is a bad test, well then it’s no different from any other test to me: delete it! (or fix it).

It seems that you’re thinking that there is some sort of organized group dedicated to working on the infrastructure, when in practice there are two parts:

making the machines available and churning the commits and PR and reporting the result, this is indeed in general done by “infra” oriented people.
the sets of tests to perform on this infra. This is not done by any infra people, this as much your responsibility as mine and everyone else. There is no difference to me between maintaining a lit test, a C++ test or fixing this script ; which is why I told you before: please send a patch! (by the way it seems to me you reviewed this? It was copied from libcxx checks at the time)

jh7370 · March 18, 2024, 9:32am

I don’t think we should be documenting that it’s okay to ignore the Windows one if it’s still pending. There are a number of reasons:

Many (most?) developers develop only on Linux, yet there are a significant number who do use Windows. Breaking Windows in main is a pain for these people, just as much as it is when the Linux build is broken. Because many people only develop on Linux, it’s fairly common for only the Windows build to be broken by these people. Pre-commit CI should minimise this.
I would hope that at some point the Windows pre-commit CI job has enough hardware behind it that it finishes reasonably quickly (I accept it will never be as fast as Linux, due to the overheads incurred by running on Windows). I think we can agree that once this is the case, any added point should be removed. However, I can almost guarantee you that this removal will not happen for a significant period of time (if ever) until after the situation has improved enough. There’s also going to be a grey area where it’s unclear whether the situation is good enough.
The point will train people to always ignore the pending Windows CI, which we don’t want long-term. Removing the bullet point will not undo that training, since most people will never be aware of the change, I suspect.
In my opinion, waiting a few hours between creating a PR and landing it is a good idea anyway, as it allows more people to actually see the PR and comment on it. You can continue to develop your next feature on a branch off of your local PR, if you need to do so.

I do think we should make it clear in the docs that it’s okay to ignore both Linux and Windows pending checks, if you’ve just made a minor revision (e.g. formatting), and the previous checks were green.

rengolin · March 18, 2024, 9:51am

Nowhere in the proposed changes we mention Windows or Linux or any other tests by name. Please review the PR, I believe it covers all your points above.

jh7370 · March 18, 2024, 9:58am

Sorry, I was specifically referring to @nikic’s comments above and the ensuing conversation about that. I didn’t even realise there was a PR .

tstellar · April 9, 2024, 11:04pm

What’s the status of this RFC? Do we have consensus?

mehdi_amini · April 9, 2024, 11:11pm

I believe so.
However right now I am concerned with how the windows workers are overloaded: it takes an unreasonable amount of time to get a worker available it seems.
I am concerned about the negative experience for developer if we roll this out in the current condition. @lnihlen is at EuroLLVM, hopefully we’ll manage to chat a bit about it.

lnihlen · April 10, 2024, 7:32am

I’m still working on the queuing delays, and happy to chat with anyone at EuroLLVM about presubmit!

lnihlen · April 10, 2024, 8:16am

To expand a bit on my last, I’m suspicious that the problem is disk bandwidth. I have a planned change in the works to move the machines from “balanced” SSD/HDD configuration to an all SSD configuration. This will require rebooting each instance in the windows build cluster, so I’m going to wait until this weekend to make the move, in order to minimally disrupt the build jobs.

Endill · April 10, 2024, 2:16pm

If that helps, last weekend pipelines have been burning through the queue until roughly Saturday afternoon (UTC). I happened to keep a close eye on them last weekend.

AaronBallman · April 10, 2024, 4:27pm

Agreed. To this end, I spent some time investigating what could have significant impacts (at least from the Clang perspective) and came up with:

We could set Windows CI up to run only after the Linux build and test was successful.
We could run Windows CI on the initial push of the patch but then only run it on demand by requiring some sort of “special comment” on the PR to kick them off, or run Windows CI on the PR only after it’s gotten one “Accept” from a reviewer.
We could drop all tests on Windows aside from the ones added in the PR.
- We could decide to be even smarter and look at which directories were touched by the PR and run tests from the corresponding test directory in addition to the tests added in the PR.

rengolin · April 12, 2024, 4:17pm

This sounds sensible.

We use benchmark labels for this purpose:

github.com

plaidml/tpp-mlir/blob/1c66a2d2986326b6e830a49616d8cd01eb3c2006/.github/workflows/tpp-benchmark.yml#L40


      
          Check_LLVM:
            if: ${{ github.event_name }} == "push" || \
                (${{ github.event_name }} == "pull_request" && ${{ github.event.label.name == 'benchmark' }})
            uses: ./.github/workflows/tpp-llvm.yml
            secrets: inherit
          
          TPP-MLIR-SPR-BASE:
            runs-on: pcl-tiergarten
            if: ${{ github.event_name }} == "push" || \
                ${{ inputs.RUN_SPR_BENCH }} == 1 || \
                (${{ github.event_name }} == "pull_request" && ${{ github.event.label.name == 'benchmark' }})
            needs: Check_LLVM
            steps:
              - uses: actions/checkout@v4
              - name: Sapphire Rapids Base
                run: |-
                  CMD="KIND=Release COMPILER=clang LINKER=lld BUILDKITE_BENCHMARK_NUM_ITER=${{ env.NUM_ITER }} \
                       ${{ github.workspace }}/scripts/buildkite/benchmark.sh -b -p"
                  ${{ env.SRUN }} --partition=spr --time=2:00:00 --constraint=\"notrb\" -- $CMD
          
          TPP-MLIR-SPR-OMP:

Example:

The way I added, it does not trigger on update. You have to remove/add again to trigger the benchmarks again. Not great, but maybe there’s an easier way?

This is a great general direction, but maybe the other two above sound better/easier short term.

pogo59 · April 12, 2024, 4:44pm

That mapping is not always straightforward. For example, Clang has a single CodeGen library, but around a dozen CodeGen* test directories. LLVM has an IR library but there is no llvm\test\IR directory.

Also, I don’t think the time the bot takes once it starts running is a huge factor? I could easily be wrong, but my impression is that the problem is always that it takes a long time to get to the head of the queue.

AaronBallman · April 12, 2024, 5:02pm

Agreed, it’s not trivial and it definitely loses coverage. Might be more work than it’s worth.

My thinking is: if we reduce how long it takes for CI to complete, then the builder becomes available more frequently, which should hopefully reduce the queue times.

pogo59 · April 12, 2024, 5:32pm

There are traditionally two ways to reduce queue time: decrease time per item, and increase parallelism. Both are likely worth pursuing. One problem is that increasing parallelism really means throwing hardware at it, which has visible dollar costs that managers get unhappy about. They are much more tolerant of the invisible dollar costs of slow turnaround times.

I could see looking at what projects are touched, as opposed to what libraries are touched, to decide on test sets. A Clang patch only needs to run Clang tests. (Arguably an LLVM patch would need to run LLVM tests and everything above it in the stack, so there’s debate that could happen there too.) A doc update doesn’t need to run any tests.

I have nebulous thoughts about deriving dependencies in some more automatic way, but they keep tripping over practicalities that will wildly overstate dependencies. Not worth pursuing here.

lnihlen · April 15, 2024, 11:40am

Update on this. I flew home from EuroLLVM on Saturday, and the resulting jetlag on Sunday meant that I slept through that Sunday AM window I was hoping to update the VMs during.

So, given that CPU consumption was comparatively low right now (7:30a Toronto local), I just started the rolling restart process for the Windows VMs to move them to SSDs. Your Windows builds may fail during this time, please restart them after I post back here saying the restart process is finished.

Topic		Replies	Views
Block Github merge on changes requested? GitHub	44	1409	January 25, 2024
Pre-merge checks updates May 2023 Project Infrastructure	6	400	May 15, 2023
Pre-merge testing and transition to PRs Project Infrastructure	0	164	August 31, 2023
Pre-merge testing, running checks on main branch Infrastructure Working Group	5	521	February 8, 2022
Pre-merge testing enabled by default LLVM Dev List Archives	1	81	February 28, 2020

[RFC] Add a warning when bypassing the premerge testing

Related Topics