[RFC] Add a warning when bypassing the premerge testing

Here is a first attempt to document this: [Doc] Add a section on CI to the GitHub documentation by joker-eph · Pull Request #85376 · llvm/llvm-project · GitHub

2 Likes

I strongly support this proposal and think it is a first necessary step into the right direction. The proposed changes to the documentation look good me.

Windows is a first-class target for our project and is a frequent source of test failures in Clang, so “let’s just skip it” is solving the wrong problem. IMO, we should not regress behavior for folks who develop on Windows or do post-commit triage.

This is circular logic and I disagree with it. The project is not broken at HEAD because we have no requirement that it be formatted. The check is wrong, and the changes you’d like to make will make it more obvious how wrong the check is (so it’s not orthogonal – I’m calling out the impact your proposed changes will have due to having a bad check).

Regardless, I still think it’s reasonable to move forward with your changes because it’s still possible to continue to ignore the faulty CI behavior. However, I think folks in Clang community are being trained to ignore CI failures because of the poor quality of the format check and I hope the folks working on infrastructure can solve that issue.

I’d like to echo @AaronBallman point here: this is not about Windows or any particular target, it’s about stability of our CI. I see that Mehdi’s doc update does not mention any of that, just “let’s all be reasonable”, which is very reasonable.

We also have problems with Flang tests that fail randomly, so there’s more than the infrastructure group can do to improve the state.

This is not about first-class, it’s about the role of pre-commit bots.

Back in the old days, when I was trying to make Arm a first class citizen, I had a requirement to not only fix the Arm backend, fix all tests and make the test-suite run flawlessly on Arm, but also to keep the buildbots clean, fast and actionable. Those were Panda boards, so not fast. It was a pain to make them clean and actionable, but it was my pain, not the community’s. But these were post-commit bots.

When we finally agree with pre-commit tests, the (very reasonable) requirement is that they actually be fast and stable. So, Windows can still be a first-class citizen while only having post-commit bots, like many other first-class supported features that do not have pre-commit.

To me, the obvious solution is on the infrastructure side: if we want Clang/Windows to be a pre-commit target, we need fast, clean and actionable bots available. And that means getting new machines, improve build times and stability, and not making everyone else wait hours for a bot to come back to merge.

1 Like

Seems like you argue about the definition of “broken”: if I commit a test which is undefined behavior, and your clang commit break the bot because of this test, is the project broken? After all it’s a faulty test…

I don’t see it any different: someone added a “test” here, you’re claiming that this is a bad test, well then it’s no different from any other test to me: delete it! (or fix it).

It seems that you’re thinking that there is some sort of organized group dedicated to working on the infrastructure, when in practice there are two parts:

  1. making the machines available and churning the commits and PR and reporting the result, this is indeed in general done by “infra” oriented people.
  2. the sets of tests to perform on this infra. This is not done by any infra people, this as much your responsibility as mine and everyone else. There is no difference to me between maintaining a lit test, a C++ test or fixing this script ; which is why I told you before: please send a patch! (by the way it seems to me you reviewed this? It was copied from libcxx checks at the time)

I don’t think we should be documenting that it’s okay to ignore the Windows one if it’s still pending. There are a number of reasons:

  1. Many (most?) developers develop only on Linux, yet there are a significant number who do use Windows. Breaking Windows in main is a pain for these people, just as much as it is when the Linux build is broken. Because many people only develop on Linux, it’s fairly common for only the Windows build to be broken by these people. Pre-commit CI should minimise this.
  2. I would hope that at some point the Windows pre-commit CI job has enough hardware behind it that it finishes reasonably quickly (I accept it will never be as fast as Linux, due to the overheads incurred by running on Windows). I think we can agree that once this is the case, any added point should be removed. However, I can almost guarantee you that this removal will not happen for a significant period of time (if ever) until after the situation has improved enough. There’s also going to be a grey area where it’s unclear whether the situation is good enough.
  3. The point will train people to always ignore the pending Windows CI, which we don’t want long-term. Removing the bullet point will not undo that training, since most people will never be aware of the change, I suspect.
  4. In my opinion, waiting a few hours between creating a PR and landing it is a good idea anyway, as it allows more people to actually see the PR and comment on it. You can continue to develop your next feature on a branch off of your local PR, if you need to do so.

I do think we should make it clear in the docs that it’s okay to ignore both Linux and Windows pending checks, if you’ve just made a minor revision (e.g. formatting), and the previous checks were green.

1 Like

Nowhere in the proposed changes we mention Windows or Linux or any other tests by name. Please review the PR, I believe it covers all your points above.

1 Like

Sorry, I was specifically referring to @nikic’s comments above and the ensuing conversation about that. I didn’t even realise there was a PR :slight_smile:.

2 Likes

What’s the status of this RFC? Do we have consensus?

I believe so.
However right now I am concerned with how the windows workers are overloaded: it takes an unreasonable amount of time to get a worker available it seems.
I am concerned about the negative experience for developer if we roll this out in the current condition. @lnihlen is at EuroLLVM, hopefully we’ll manage to chat a bit about it.

1 Like

I’m still working on the queuing delays, and happy to chat with anyone at EuroLLVM about presubmit!

To expand a bit on my last, I’m suspicious that the problem is disk bandwidth. I have a planned change in the works to move the machines from “balanced” SSD/HDD configuration to an all SSD configuration. This will require rebooting each instance in the windows build cluster, so I’m going to wait until this weekend to make the move, in order to minimally disrupt the build jobs.

If that helps, last weekend pipelines have been burning through the queue until roughly Saturday afternoon (UTC). I happened to keep a close eye on them last weekend.

Agreed. To this end, I spent some time investigating what could have significant impacts (at least from the Clang perspective) and came up with:

  • We could set Windows CI up to run only after the Linux build and test was successful.
  • We could run Windows CI on the initial push of the patch but then only run it on demand by requiring some sort of “special comment” on the PR to kick them off, or run Windows CI on the PR only after it’s gotten one “Accept” from a reviewer.
  • We could drop all tests on Windows aside from the ones added in the PR.
    • We could decide to be even smarter and look at which directories were touched by the PR and run tests from the corresponding test directory in addition to the tests added in the PR.

This sounds sensible.

We use benchmark labels for this purpose:

Example:

The way I added, it does not trigger on update. You have to remove/add again to trigger the benchmarks again. Not great, but maybe there’s an easier way?

This is a great general direction, but maybe the other two above sound better/easier short term.

That mapping is not always straightforward. For example, Clang has a single CodeGen library, but around a dozen CodeGen* test directories. LLVM has an IR library but there is no llvm\test\IR directory.

Also, I don’t think the time the bot takes once it starts running is a huge factor? I could easily be wrong, but my impression is that the problem is always that it takes a long time to get to the head of the queue.

Agreed, it’s not trivial and it definitely loses coverage. Might be more work than it’s worth.

My thinking is: if we reduce how long it takes for CI to complete, then the builder becomes available more frequently, which should hopefully reduce the queue times.

1 Like

There are traditionally two ways to reduce queue time: decrease time per item, and increase parallelism. Both are likely worth pursuing. One problem is that increasing parallelism really means throwing hardware at it, which has visible dollar costs that managers get unhappy about. They are much more tolerant of the invisible dollar costs of slow turnaround times.

I could see looking at what projects are touched, as opposed to what libraries are touched, to decide on test sets. A Clang patch only needs to run Clang tests. (Arguably an LLVM patch would need to run LLVM tests and everything above it in the stack, so there’s debate that could happen there too.) A doc update doesn’t need to run any tests.

I have nebulous thoughts about deriving dependencies in some more automatic way, but they keep tripping over practicalities that will wildly overstate dependencies. Not worth pursuing here.

Update on this. I flew home from EuroLLVM on Saturday, and the resulting jetlag on Sunday meant that I slept through that Sunday AM window I was hoping to update the VMs during.

So, given that CPU consumption was comparatively low right now (7:30a Toronto local), I just started the rolling restart process for the Windows VMs to move them to SSDs. Your Windows builds may fail during this time, please restart them after I post back here saying the restart process is finished.