Check-mlir times: `Examples/standalone` testing time

Hello all,

This isn’t a high priority, but can something be done to reduce the testing time for test/Examples/standalone/? It increases the check-mlir time from 1-2 seconds to nearly 19s on a fast 32-core system! See below:

$ time bin/llvm-lit -sv ../mlir/test/

Testing Time: 18.92s
  Unsupported:   46
  Passed     : 1237

real	0m19.088s
user	1m13.569s
sys	0m39.545s

$ time bin/llvm-lit -sv ../mlir/test/Examples/standalone/

Testing Time: 18.13s
  Passed: 1

real	0m18.208s
user	0m32.226s
sys	0m11.770s

Although LLVM_BUILD_EXAMPLES=OFF can turn it off, we’d typically like that to be on for full test coverage (since the toy tutorial uses a lot of passes from the main tree).

The slow thing here is that it’s running CMake… I don’t know that there’s a lot we can do about that.
Independently, are we relying on the toy tutorials for test coverage? I’d expect that buildbots are compiling this, but not that most developers are, I guess. Would it help if it were guarded as an integration test, from your perspective?

No, but updates to the tree can break the toy tutorial build, which is why developers might want to have that covered.

More developers would then have to rely on regularly running the integration tests – so it would be worse if the standalone test were moved there. It’s perhaps fine where it is – but it’s taking 20x more time than any other test.

I agree the time here is excessive and I often resort to turning off examples, which has caused me to miss breakages in the past. But it is also important to keep this sample running (and it is easy to break – I find the test useful).

I wonder if we shouldn’t have multiple check-mlir alias targets. Maybe check-mlir does everything and is what we advertise, but then there is something lighter (check-mlir-unit or something). I find that the CMake switch is not great in terms of development flow: it is easy to forget whether it is on or off, etc.
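
For illustration, such a lighter alias could be registered the same way check-mlir itself is, just scoped to a narrower test directory – a rough sketch, where the check-mlir-unit name and the MLIRUnitTests dependency are placeholders:

  # Hypothetical lighter target: only run the lit suite under test/Unit,
  # reusing the add_lit_testsuite() helper that also defines check-mlir.
  add_lit_testsuite(check-mlir-unit "Running only the MLIR unit tests"
    ${CMAKE_CURRENT_BINARY_DIR}/Unit
    DEPENDS MLIRUnitTests  # placeholder for the unit-test aggregate target
    )

Anything heavier (examples, integration tests) would then stay behind the full check-mlir target.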

That’s right. Is there a way standalone can be built as part of the “build” instead of as part of check?

I’ve been disabling the examples in the past as well to work around the standalone case. That said it does not hit during incremental development: I suspect it only has to fully re-run when the CMake files change?

I think this is a good approach!

I’m afraid, though, that this might be a case of seeking a read_my_mind alias. Also, I suspect we should keep this aligned with the granularity of the CMake options.

That would make me think of running only the C++ unit tests (which are driven by test/Unit/... right now).
Naming is hard…

Something seems to behave differently now; I’m fairly sure that the incremental build was behaving much better before. Right now, re-running the CMake configuration in the standalone test is very fast, but it somehow invalidates the build directory and a few things get rebuilt (though not everything).
I have the impression that this is only the Python bindings and the C API support.

I agree. I too don’t remember seeing such a slow down before on rebuilds.

It comes from the introduction of the CAPI tests; it is likely affecting MLIR itself as well. From add_mlir_aggregate:

  # Unfortunately need to compile at least one source file, which is hard
  # to guarantee, so just always generate one. We generate one vs using the
  # LLVM common dummy.cpp because it works better out of tree.
  set(_empty_src "${CMAKE_CURRENT_BINARY_DIR}/${name}__empty.cpp")
  file(WRITE "${_empty_src}" "typedef int dummy;")

The CMake invocation for add_mlir_aggregate will unconditionally write this source instead of checking whether it needs to be produced, which invalidates all targets that depend on it.
@stellaraccident should we use an intermediate file and something like configure_file or copy_if_different here?
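
For concreteness, a minimal sketch of the copy_if_different idea (the .tmp staging path here is just illustrative, not code from the tree):

  # Sketch: stage the dummy source and only copy it into place when the
  # contents actually differ, so that targets depending on it are not
  # invalidated on every CMake re-run.
  set(_empty_src "${CMAKE_CURRENT_BINARY_DIR}/${name}__empty.cpp")
  file(WRITE "${_empty_src}.tmp" "typedef int dummy;")
  execute_process(COMMAND "${CMAKE_COMMAND}" -E copy_if_different
                  "${_empty_src}.tmp" "${_empty_src}")

copy_if_different leaves the destination untouched when the contents already match, which is what would avoid re-dirtying the dependent targets.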

Oh, that is a good find! For in-tree builds there is a static source file that suffices; this is an issue for out-of-tree builds (which is what this is), where that file is not available. And now that I see it, this explains the incremental build redundancy I’ve been scratching my head over for out-of-tree builds too.

Just thinking of the right way to fix this…

configure_file will at least fix the issue Uday is reporting, but it still has some non-incrementality to it (i.e. on a configure re-run, it will rewrite the file). I haven’t tried copy_if_different, but based on the name, it sounds like that might be the best option.

I think this will work: ⚙ D119069 [mlir] Do not use an empty source file when building aggregate libraries.

On my machine:

All tests enabled:

$ LIT_OPTS="--time-tests" ninja check-mlir
[1/2] Running the MLIR regression tests
Slowest Tests:
--------------------------------------------------------------------------
4.27s: MLIR :: Examples/standalone/test.toy
3.15s: MLIR :: mlir-reduce/crashop-reduction.mlir
3.15s: MLIR :: mlir-reduce/multiple-function.mlir
2.53s: MLIR :: mlir-cpu-runner/simple.mlir
2.29s: MLIR :: Dialect/Linalg/comprehensive-module-bufferize.mlir
2.07s: MLIR :: mlir-tblgen/op-error.td
1.95s: MLIR :: python/dialects/linalg/ops.py
1.93s: MLIR :: python/dialects/linalg/opdsl/shape_maps_iteration.py
1.85s: MLIR :: python/dialects/linalg/opdsl/emit_pooling.py
1.79s: MLIR :: mlir-cpu-runner/utils.mlir
1.77s: MLIR :: Pass/pipeline-options-parsing.mlir
1.75s: MLIR :: python/execution_engine.py
1.74s: MLIR :: python/dialects/linalg/opdsl/test_core_named_ops.py
1.66s: MLIR :: Dialect/Linalg/codegen-strategy.mlir
1.63s: MLIR :: python/dialects/linalg/opdsl/assignments.py
1.62s: MLIR :: Dialect/Linalg/comprehensive-module-bufferize-partial.mlir
1.61s: MLIR :: python/dialects/linalg/opdsl/interfaces.py
1.61s: MLIR :: mlir-tblgen/rewriter-errors.td
1.59s: MLIR :: python/dialects/linalg/opdsl/emit_matmul.py
1.59s: MLIR :: Pass/pipeline-parsing.mlir

Testing Time: 4.40s

Filter out Examples:

$ LIT_OPTS="--time-tests --filter-out=Examples" ninja check-mlir
[1/2] Running the MLIR regression tests
Slowest Tests:
--------------------------------------------------------------------------
2.70s: MLIR :: mlir-reduce/crashop-reduction.mlir
2.60s: MLIR :: mlir-reduce/multiple-function.mlir
2.43s: MLIR :: mlir-tblgen/op-error.td
1.98s: MLIR :: mlir-cpu-runner/utils.mlir
1.97s: MLIR :: mlir-cpu-runner/simple.mlir
1.84s: MLIR :: python/dialects/linalg/opdsl/test_core_named_ops.py
1.82s: MLIR :: Dialect/Linalg/codegen-strategy.mlir
1.75s: MLIR :: python/dialects/linalg/opdsl/assignments.py
1.70s: MLIR :: Dialect/Linalg/comprehensive-module-bufferize-partial.mlir
1.63s: MLIR :: python/execution_engine.py
1.59s: MLIR :: python/dialects/linalg/opdsl/emit_pooling.py
1.59s: MLIR :: python/ir/array_attributes.py
1.58s: MLIR :: Dialect/SCF/loop-unroll.mlir
1.49s: MLIR :: Pass/ir-printing.mlir
1.48s: MLIR :: python/dialects/linalg/opdsl/interfaces.py
1.46s: MLIR :: Dialect/Vector/vector-contract-transforms.mlir
1.45s: MLIR :: Conversion/MemRefToLLVM/memref-to-llvm.mlir
1.42s: MLIR :: Dialect/SparseTensor/sparse_parallel.mlir
1.40s: MLIR :: python/dialects/linalg/opdsl/emit_misc.py
1.38s: MLIR :: python/dialects/linalg/opdsl/shape_maps_iteration.py

Testing Time: 3.10s

Filter out Examples and python (since the python tests dominate the slowest list):

$ LIT_OPTS="--time-tests --filter-out=Examples|python" ninja check-mlir
[1/2] Running the MLIR regression tests
Slowest Tests:
--------------------------------------------------------------------------
3.16s: MLIR :: mlir-reduce/crashop-reduction.mlir
3.15s: MLIR :: mlir-reduce/multiple-function.mlir
3.11s: MLIR :: mlir-tblgen/op-error.td
2.12s: MLIR :: mlir-cpu-runner/utils.mlir
1.62s: MLIR :: mlir-cpu-runner/simple.mlir
1.61s: MLIR :: Dialect/Linalg/comprehensive-module-bufferize.mlir
1.55s: MLIR :: mlir-tblgen/llvm-intrinsics.td
1.45s: MLIR :: Pass/pipeline-parsing.mlir
1.43s: MLIR :: mlir-tblgen/rewriter-errors.td
1.40s: MLIR :: Pass/pass-timing.mlir
1.37s: MLIR :: Dialect/Linalg/comprehensive-module-bufferize-partial.mlir
1.37s: MLIR :: Dialect/Vector/vector-transpose-lowering.mlir
1.32s: MLIR :: Dialect/Affine/loop-tiling.mlir
1.26s: MLIR :: Dialect/Linalg/comprehensive-module-bufferize-analysis.mlir
1.23s: MLIR :: Dialect/Linalg/tile-and-peel-tensors.mlir
1.23s: MLIR :: Dialect/SparseTensor/sparse_lower_col.mlir
1.21s: MLIR :: Transforms/inlining.mlir
1.17s: MLIR :: Target/Cpp/for.mlir
1.12s: MLIR :: Transforms/loop-fusion-2.mlir
1.11s: MLIR :: Dialect/SparseTensor/sparse_lower.mlir

Testing Time: 3.47s

So the examples test is adding ~1.3s to the total runtime. The python tests, which dominate the slowest-tests list, are only adding about ~0.4s. Both are optional features, and this doesn’t seem excessive to me.

Possibly some better documentation on how to control the granularity of tests would help? I use lit standalone in other projects and have gotten used to invoking it in various ways to test just what I want, but the way it is integrated into the LLVM build has always made it hard for me: most of the time I just use it via ninja check-mlir, but then I can’t target that any further. I’ve been using LIT_OPTS= lately and have found it good for iteration because I can just tack additional criteria on. Example:

LIT_OPTS="--filter=python/ir/blocks -a" ninja check-mlir

Maybe a hint here: Getting Started - MLIR
Or: Testing Guide - MLIR

I’m surprised at this … Thanks for measuring it!

Absolutely! I didn’t know about these --filter options to lit; they look fantastic :slight_smile:

I added some documentation. Have a look: https://github.com/llvm/mlir-www/pull/94

Thanks – these are really useful, even more so for out-of-tree dialects, since it isn’t easy to get hold of the right lit command/path to use.

@aartbik @bixia1 These tests are taking too long to run, and by themselves they significantly slow down check-mlir for typical CMake build/test configurations – for everyone. It looks like they were added by folks working on sparse tensor support. Could you consider reducing the problem sizes, or anything else, so that there is no significant impact on test times?

1.78s: MLIR :: Integration/Dialect/SparseTensor/taco/unit_test_tensor_core.py
1.65s: MLIR :: Integration/Dialect/SparseTensor/python/test_stress.py
1.59s: MLIR :: Integration/Dialect/SparseTensor/python/test_SDDMM.py
1.58s: MLIR :: Integration/Dialect/SparseTensor/python/test_output.py
1.37s: MLIR :: python/integration/dialects/linalg/opsrun.py
1.23s: MLIR :: Integration/Dialect/SparseTensor/taco/test_simple_tensor_algebra.py
1.22s: MLIR :: Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir
1.22s: MLIR :: Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
1.21s: MLIR :: Integration/Dialect/Vector/CPU/test-transfer-read-3d.mlir
1.15s: MLIR :: Integration/Dialect/Async/CPU/test-async-parallel-for-2d.mlir
1.07s: MLIR :: Integration/Dialect/Vector/CPU/test-transfer-read-2d.mlir
0.96s: MLIR :: Integration/Dialect/SparseTensor/taco/test_Tensor.py
0.93s: MLIR :: Integration/Dialect/SparseTensor/taco/test_SDDMM.py
0.92s: MLIR :: python/dialects/linalg/opdsl/test_core_named_ops.py
0.92s: MLIR :: Integration/Dialect/SparseTensor/taco/test_MTTKRP.py
0.88s: MLIR :: Integration/Dialect/SparseTensor/python/test_SpMM.py
0.87s: MLIR :: Dialect/LLVMIR/nvvm.mlir
0.87s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_cast.mlir
0.82s: MLIR :: Integration/Dialect/Vector/CPU/test-transfer-read-1d.mlir

I don’t think tests like “test-transfer-read-1d” or unit execution tests should be taking over half a second on a fast machine as part of this CI.

These aren’t enabled by default; one has to opt in with -DMLIR_INCLUDE_INTEGRATION_TESTS=ON. It is expected that they can’t be entirely as fast as the unit tests, and on the order of a second per test does not seem entirely exaggerated to me for end-to-end Python tests.
That said, if there is low-hanging fruit to make them faster, of course we ought to take it :slight_smile: