Flang tests are extremely slow on Windows

See RFC: Future of Windows pre-commit CI - #77 by philnik for context.

Here is an example of a pre-merge run: Github pull requests #56855

The interesting part is the time for lit execution running ninja check-flang on Linux and Windows:

# Linux
Testing Time: 5.98s
Total Discovered Tests: 2992
  Unsupported      :  137 (4.58%)
  Passed           : 2840 (94.92%)
  Expectedly Failed:   15 (0.50%)

# Windows
Testing Time: 1220.35s
Total Discovered Tests: 2983
  Unsupported      :  195 (6.54%)
  Passed           : 2771 (92.89%)
  Expectedly Failed:   17 (0.57%)

Slowest tests:

# Linux
Slowest Tests:
--------------------------------------------------------------------------
4.03s: Flang :: Lower/OpenMP/threadprivate-real-logical-complex-derivedtype.f90
1.90s: Flang :: Lower/OpenMP/FIR/threadprivate-real-logical-complex-derivedtype.f90
1.71s: Flang :: Fir/convert-to-llvm.fir
1.27s: Flang :: Lower/array.f90
1.05s: Flang :: Intrinsics/math-codegen.fir
1.04s: Flang :: Fir/dispatch.f90
0.95s: Flang :: Driver/omp-driver-offload.f90
0.92s: Flang :: Lower/select-type.f90
0.88s: Flang :: Driver/default-optimization-pipelines.f90
0.80s: Flang :: Driver/use-module.f90
0.80s: Flang :: Driver/optimization-remark.f90
0.76s: Flang :: Lower/OpenMP/FIR/rtl-flags.f90
0.74s: Flang :: Lower/OpenMP/rtl-flags.f90
0.72s: flang-OldUnit :: Evaluate/real.test
0.68s: Flang :: Lower/Arm/arm-sve-vector-bits-vscale-range.f90
0.67s: flang-OldUnit :: Evaluate/integer.test
0.66s: Flang :: Lower/RISCV/riscv-vector-bits-vscale-range.f90
0.65s: Flang :: Driver/falias-analysis.f90
0.65s: Flang :: Lower/allocatable-polymorphic.f90
0.62s: Flang :: Driver/fopenmp.f90

# Windows
Slowest Tests:
--------------------------------------------------------------------------
18.49s: Flang :: Driver/omp-driver-offload.f90
8.15s: Flang :: Lower/OpenMP/threadprivate-real-logical-complex-derivedtype.f90
7.39s: Flang :: Driver/fopenmp.f90
6.15s: Flang :: Intrinsics/math-codegen.fir
4.30s: Flang :: Driver/linker-flags.f90
4.26s: Flang :: Driver/fveclib.f90
4.11s: Flang :: Lower/OpenMP/FIR/threadprivate-real-logical-complex-derivedtype.f90
4.10s: Flang :: Driver/default-optimization-pipelines.f90
4.06s: Flang :: Driver/pic-flags.f90
4.04s: Flang :: Driver/aarch64-sve-vector-bits.f90
3.98s: Flang :: Semantics/modfile07.f90
3.83s: Flang :: Driver/gcc-toolchain-install-dir.f90
3.70s: Flang :: Driver/optimization-remark.f90
3.46s: Flang :: Driver/fsave-optimization-record.f90
3.29s: Flang :: Driver/mlir-debug-pass-pipeline.f90
3.18s: Flang :: Driver/use-module.f90
3.16s: Flang :: Driver/fdefault.f90
3.13s: Flang :: Driver/fixed-line-length.f90
3.08s: Flang :: Fir/convert-to-llvm.fir
3.05s: Flang :: Driver/falias-analysis.f90

Here is the cmake invocation for reference (release+asserts):

cmake -S C:/ws/src/llvm -B C:/ws/src/build -D 'LLVM_ENABLE_PROJECTS=clang;flang;llvm;mlir' -G Ninja -D CMAKE_BUILD_TYPE=Release -D LLVM_ENABLE_ASSERTIONS=ON -D LLVM_BUILD_EXAMPLES=ON -D COMPILER_RT_BUILD_LIBFUZZER=OFF -D 'LLVM_LIT_ARGS=-v --xunit-xml-output C:/ws/src/build/test-results.xml --timeout=1200 --time-tests' -D COMPILER_RT_BUILD_ORC=OFF -D CMAKE_C_COMPILER_LAUNCHER=sccache -D CMAKE_CXX_COMPILER_LAUNCHER=sccache -D MLIR_ENABLE_BINDINGS_PYTHON=ON -D CMAKE_EXE_LINKER_FLAGS=/MANIFEST:NO -D CMAKE_MODULE_LINKER_FLAGS=/MANIFEST:NO -D CMAKE_SHARED_LINKER_FLAGS=/MANIFEST:NO

As a comparison, during the same run, MLIR tests ran in 13.13s on Linux and 34.81s on Windows.

What should we do about this right now? Flang seems to become the bottleneck of the pre-merge testing here.

Thanks @mehdi_amini for bringing up this issue.

I had a quick look at one of the tests (the one below) and I could reproduce it on my linux machine. It looks like it is using a lot of CHECK-DAGs. If I change the CHECK-DAGs to CHECKs then the test is much faster. We can probably reduce the number of CHECK-DAGs.

We are looking into the other tests highlighted. We will remove some tests that test deprecated flows.

The problem I’m bringing here isn’t that one particular test is slow, it is that they are all much much slower on Windows than they are on Linux.

Let me recap:

  • MLIR tests ran in 13.13s on Linux and 34.81s on Windows, so 2.65x slower.
  • Flang tests ran in 5.98s on Linux and 1220.35s on Windows, so >200x slower.

There is something pathological here!

2 Likes

It seems to me that the problem is pathological, not necessarily with those tests? CHECK-DAG IS quite a bit slower than CHECK, and we should prefer that whenever possible.

That said, the bigger problem to me is how big the difference is between windows and linux. Perhaps the heavy use of CHECK-DAG is evidence of what we should be looking it, and seeing why IT is so much slower on windows (assuming that is the cause).

What is the specification of the machine used here?

I tried to look, but I couldn’t find it anywhere - I’m not an expert on how buildkite works, so it’s probably me not knowing how/where to look…

I agree, it’s not one or two tests that is the issue, it’s the whole run that takes significantly longer - and from what I can only estimate bease on the top ones lists, ALL of the tests take significantly longer - after all, some of the slow ones on Linux are “only” ~5x faster, which means there must be others that are worse, outside of that top-list.

@lnihlen : how we can we debug this? It there someone on your side that can provide support to run a more detailed profiling of ninja check-flang on one of these instance?

In the past there use to be some anti-virus kicking-in, but that was supposed to be fixed, can it be something similar hitting specifically flang here?

I’ve done a few builds myself on Windows and Linux with different compilers and I can’t reproduce this issue. I get around the same difference in performance between Windows and Linux as you’re seeing for the MLIR tests.

Windows with MSVC: check-flang takes 23s
Linux: check-flang takes 6s

I do also have AV running (Microsoft Defender) and it doesn’t seem to be interfering with my builds.
This would lead me to suspect there’s something else going wrong with the build machine, rather than this necessarily being a problem with flang.

3 Likes

I ran check-flang on a the 4-core Windows GitHub Action runners and got these results:

024-04-30T22:09:26.3263420Z Testing Time: 102.30s
2024-04-30T22:09:26.3263616Z 
2024-04-30T22:09:26.3263742Z Total Discovered Tests: 2980
2024-04-30T22:09:26.3264093Z   Unsupported      :  195 (6.54%)
2024-04-30T22:09:26.3264458Z   Passed           : 2766 (92.82%)
2024-04-30T22:09:26.3264824Z   Expectedly Failed:   17 (0.57%)
2024-04-30T22:09:26.3265172Z   Failed           :    2 (0.07%)
3 Likes

Thanks @tstellar and @DavidTruby !

@lnihlen : seems like something to investigate on the build machine itself, so ping on this:

A datapoint: Running check-flang on Windows, I get:

-- Testing: 2627 tests, 4 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
Slowest Tests:
--------------------------------------------------------------------------
6.11s: Flang :: Driver/fdefault.f90
5.11s: Flang :: Driver/flarge-sizes.f90
3.81s: Flang :: Driver/write-module.f90
1.82s: Flang :: Driver/omp-driver-offload.f90
1.47s: Flang :: Intrinsics/math-codegen.fir
1.06s: Flang :: Fir/convert-to-llvm.fir
0.95s: Flang :: Driver/fopenmp.f90
0.87s: Flang :: Driver/use-module.f90
0.82s: Flang :: Lower/array.f90
0.77s: Flang :: Driver/optimization-remark.f90
0.75s: Flang :: Driver/default-optimization-pipelines.f90
0.74s: Flang :: Driver/multiple-input-files.f90
0.71s: Flang :: Driver/input-output-file.f90
0.70s: Flang :: Lower/OpenMP/rtl-flags.f90
0.69s: Flang :: Driver/falias-analysis.f90
0.64s: Flang :: Fir/dispatch.f90
0.62s: Flang :: Driver/emit-mlir.f90
0.60s: Flang :: Driver/fsave-optimization-record.f90
0.56s: Flang :: Driver/fixed-line-length.f90
0.54s: Flang :: Driver/input-from-stdin/input-from-stdin.f90

Tests Times:
--------------------------------------------------------------------------
[    Range    ] :: [               Percentage               ] :: [  Count  ]
--------------------------------------------------------------------------
[6.00s,6.50s) :: [                                        ] :: [   1/2562]
[5.50s,6.00s) :: [                                        ] :: [   0/2562]
[5.00s,5.50s) :: [                                        ] :: [   1/2562]
[4.50s,5.00s) :: [                                        ] :: [   0/2562]
[4.00s,4.50s) :: [                                        ] :: [   1/2562]
[3.50s,4.00s) :: [                                        ] :: [   0/2562]
[3.00s,3.50s) :: [                                        ] :: [   0/2562]
[2.50s,3.00s) :: [                                        ] :: [   0/2562]
[2.00s,2.50s) :: [                                        ] :: [   0/2562]
[1.50s,2.00s) :: [                                        ] :: [   2/2562]
[1.00s,1.50s) :: [                                        ] :: [   2/2562]
[0.50s,1.00s) :: [                                        ] :: [  18/2562]
[0.00s,0.50s) :: [*************************************** ] :: [2537/2562]
--------------------------------------------------------------------------

Testing Time: 58.80s

Total Discovered Tests: 2939
  Unsupported      :  202 (6.87%)
  Passed           : 2720 (92.55%)
  Expectedly Failed:   17 (0.58%)

The slowest test, Driver/fdefault.f90, does not even show up on Linux running on the same machine. What seems the most slowdown is filesystem walking. If I remove rm -rf %t/dir-flang-new && mkdir -p from the RUN line in Driver/fdefault.f90, I get:

Slowest Tests:
--------------------------------------------------------------------------
0.56s: Flang :: Driver/fdefault.f90

Windows defender is off. Switching off system restore and indexer didn’t help. Disk is an SSD.

Tests running slower on Windows is to be expected. Launching executables isn’t the fastest on Windows; or rather, Linux has optimized it to the brink e.g. with lazy initialization. I didn’t expect directory deletion/recreation to be that slow. And there’s a twist: With “Linux”, I meant WSL1, i.e. still the Windows kernel and NTFS filesystem.

I could not reproduce the pathological test time anywhere close to 1220.35s.

Performance debugging on Windows, including inside kernel, would be done is typically done using XPerf which is comparable to perf on Linux. See here for a guide. Same author used it to debug slow process startup times in the Chromium test suite. However, no idea how to do use it on the BuildKite workers.

1 Like

I took a stab at this on Sunday, basically the only day of the week it’s possible to iterate on anything on pre-commit CI. My findings are the following:

  1. After all the PASS messages are printed for all the flang tests, a 20 minutes pause happens. The time is fairly consistent: 20 minutes minus 7-12 seconds. I believe this is the degradation we’ve been experiencing, or at least a major part of it.
  2. I tried to insert some tracing into lit to understand what it’s spending this time on, but this didn’t lead me anywhere. Maybe I haven’t dug deep enough into lit, which is surprisingly deep, but I was running against the clock.
  3. Out of tests in flang/test, only Semantics tests are triggering the aforementioned pause. I disabled unit tests early on, so can’t speak of those.
  4. I figured one of the tests that are triggering the pause via binary search: flang/test/Semantics/OpenMP/do02.f90 (GitHub). It doesn’t look interesting and hadn’t been touched for a year, so I guess that something in flang changed instead.
  5. There is at least one more OpenMP test that is triggering the pause. Unfortunately, I ran out of time. For future reference, CI is basically free until roughly 6 AM UTC on Monday.
  6. flang/test/Semantics/PowerPC are not triggering the pause.

I’ll leave Add print statements to lit for tracing purposes by Endilll · Pull Request #92694 · llvm/llvm-project · GitHub for reference. Based on Github pull requests #65431, tests under flang/test/SemanticsChecked are known to be good, and one (or more) of taskloop01.f90, taskloop02.f90, taskloop03.f90, and taskwait.f90 should be another test triggering the pause.

I hope flang contributors can take it from here.

7 Likes

Thanks @Endill for the investigation.

We can try disabling these OpenMP semantics tests (do02.f90, taskloop01.f90, taskloop02.f90, taskloop03.f90, taskwait.f90) to begin with and investigate the issue separately. It could be a loop related semantics check that might be having some issues or alternatively it could be that nested OpenMP tests are emitting a lot of messages which on slow terminals might be taking a lot of time.

Thank you!

FWIW, at this point I think two different actions should happen:

  1. Flang tests on Windows should be temporarily disabled. This allows Flang developers to debug the slowdowns without duress.
  2. The logic for determining when to run Flang tests should be updated so that changes to Clang do not run Flang’s tests. I looked through CMake for Flang and don’t see it linking against Clang at all (perhaps I’ve missed something though!), so perhaps Flang tests should not run on Clang changes at all, but at the very least, Flang tests should not run when the only changes in Clang don’t impact Flang to begin with.

#1 should hopefully provide temporary relief for our beleaguered precommit CI on Windows, and #2 should hopefully help reduce the costs associated with precommit CI testing in general.

Is that reasonable and something someone from the Flang community can get done fairly quickly @kiranchandramohan (and @sscalpone)? CC @lnihlen

1 Like

If it is only the tests that @Endill mentioned, then we can specifically disable these tests. [Flang][OpenMP] Disable/Remove a few long running tests by kiranchandramohan · Pull Request #92739 · llvm/llvm-project · GitHub

At the moment, the only portion of Flang that uses Clang code is the Driver. While the long-term plan is to move this code to llvm/Frontend, this work has been slow.

We know for sure that it’s more than one tests, but I ran out of time to find all of them.

If you could give a list of all the tests that are not checked, then we can disable them all on Windows temporarily.

See flang/tests/SemanticsChecked for a list of tests that are known to NOT trigger the issue: Add print statements to lit for tracing purposes by Endilll · Pull Request #92694 · llvm/llvm-project · GitHub. The rest of Semantics tests are to be disabled on Windows if you’re willing to do so.

Are the rest of the Semantics test that are not checked only in the OpenMP flang/test/Semantics/OpenMP directory?

We don’t have an exhaustive list of which tests specifically are problematic, so I would still recommend disabling all Flang tests on Windows in precommit CI as a temporary measure. Once we have more ideas on a better boundary or we fix whatever the issue is causing the major slowdown, we can start bringing more/all of those tests back online. (Because of how disruptive the current situation is, my hope is to get the community CI back to a functional state ASAP and then fix the issues once the pressure is off.)

Good to know! Then it probably makes sense to only run Flang tests if the clangDriver or clangBasic libraries are touched within Clang as that’s the only way for Clang to impact Flang.

No, I wasn’t able to tests which are under flang/test/Semantics directly, so that’s a long list of tests that I wasn’t able to check.