Check-mlir test suite times: Sparse tensor integration tests

Timing the MLIR test suite (latest git) on a fast 32-core system shows the results pasted below. The sparse tensor integration tests completely dominate the profile, out of all proportion to their footprint in the repo. @aartbik: can these tests be minimized/trimmed/reduced so that they take about a quarter of the time they currently take? They are otherwise slowing things down for everyone – more energy, more CPU cycles, longer waits, etc. I personally always keep integration tests on for better testing/coverage – typically for landing commits as well as for most development. So turning them off isn’t a solution.

LIT_OPTS="--time-tests" ninja check-mlir

Slowest Tests:
3.68s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_binary.mlir
3.20s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conversion_sparse2dense.mlir
3.18s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conv_3d_ndhwc_dhwcf.mlir
3.04s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conversion.mlir
2.99s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conv_3d.mlir
2.90s: MLIR :: Integration/Dialect/SparseTensor/CPU/concatenate_dim_0_permute.mlir
2.86s: MLIR :: Integration/Dialect/SparseTensor/CPU/concatenate_dim_1_permute.mlir
2.85s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conv_2d.mlir
2.85s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_collapse_shape.mlir
2.82s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_matmul.mlir
2.64s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conversion_ptr.mlir
2.59s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conv_2d_nhwc_hwcf.mlir
2.48s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conversion_sparse2sparse.mlir
2.34s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_storage.mlir
2.13s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_flatten.mlir
1.95s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_expand_shape.mlir
1.69s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_codegen_foreach.mlir
1.65s: MLIR :: Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir
1.63s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conversion_dyn.mlir
1.61s: MLIR :: Integration/Dialect/SparseTensor/CPU/sparse_conv_1d_nwc_wcf.mlir
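For a sense of scale, here is a quick sanity check (just a sketch; the figures are the SparseTensor entries copied from the top-20 list above): 19 of the 20 slowest tests are sparse tensor integration tests, and their serial time adds up to roughly 49s.

```python
# Times (in seconds) of the SparseTensor entries in the top-20 list above.
sparse_times = [
    3.68, 3.20, 3.18, 3.04, 2.99, 2.90, 2.86, 2.85, 2.85, 2.82,
    2.64, 2.59, 2.48, 2.34, 2.13, 1.95, 1.69, 1.63, 1.61,
]
total = sum(sparse_times)
print(f"{len(sparse_times)} tests, {total:.2f}s serial time")
# → 19 tests, 49.43s serial time
```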

Tests Times:
[    Range    ] :: [               Percentage               ] :: [  Count  ]
[3.60s,3.80s) :: [                                        ] :: [   1/1696]
[3.40s,3.60s) :: [                                        ] :: [   0/1696]
[3.20s,3.40s) :: [                                        ] :: [   2/1696]
[3.00s,3.20s) :: [                                        ] :: [   2/1696]
[2.80s,3.00s) :: [                                        ] :: [   5/1696]
[2.60s,2.80s) :: [                                        ] :: [   2/1696]
[2.40s,2.60s) :: [                                        ] :: [   2/1696]
[2.20s,2.40s) :: [                                        ] :: [   0/1696]
[2.00s,2.20s) :: [                                        ] :: [   2/1696]
[1.80s,2.00s) :: [                                        ] :: [   0/1696]
[1.60s,1.80s) :: [                                        ] :: [   4/1696]
[1.40s,1.60s) :: [                                        ] :: [   6/1696]
[1.20s,1.40s) :: [                                        ] :: [   4/1696]
[1.00s,1.20s) :: [                                        ] :: [  11/1696]
[0.80s,1.00s) :: [                                        ] :: [  15/1696]
[0.60s,0.80s) :: [                                        ] :: [  12/1696]
[0.40s,0.60s) :: [                                        ] :: [  12/1696]
[0.20s,0.40s) :: [                                        ] :: [  41/1696]
[0.00s,0.20s) :: [*************************************   ] :: [1575/1696]
Failed Tests (1):
  MLIR :: Integration/Dialect/Vector/CPU/test-contraction.mlir

Testing Time: 3.76s
  Unsupported:  139
  Passed     : 1963
  Failed     :    1

(On a relatively minor note, the naming of these test files deviates from the prevailing style: nearly all of MLIR and LLVM use hyphens in test-case names.)

Have you compared the sparse tensor tests to other integration tests, or tried profiling to see what exactly makes them slow? I suspect that one reason these tests dominate ninja check-mlir is simply that there are a lot of them. But that is, IMHO, reflective of how much work has gone into SparseTensor and how mature that area of MLIR is. Folks working on SparseTensor deserve applause for their great work :clap:t2: .

One potential way to trim these tests would be to reduce the number of RUN lines (see e.g. sparse_conv_2d.mlir). In fact, Aart and I have been discussing this recently. For example, most of those integration tests run the vectoriser twice.
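The quoted RUN lines seem to have been lost in this post; purely for illustration (the flags below are assumptions from memory, not the exact contents of sparse_conv_2d.mlir), the pattern being discussed looks roughly like this:

```mlir
// A non-vectorized baseline RUN line, followed by two nearly identical
// vectorized RUN lines that differ only in one or two pipeline options
// (flags are indicative only):
// RUN: mlir-opt %s --sparse-compiler | mlir-cpu-runner ... | FileCheck %s
// RUN: mlir-opt %s --sparse-compiler="vl=2" | mlir-cpu-runner ... | FileCheck %s
// RUN: mlir-opt %s --sparse-compiler="vl=2 enable-simd-index32=true" | mlir-cpu-runner ... | FileCheck %s
```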

We intend to fold these “duplicated” RUN lines into one, but that requires a bit of work.

Once this is complete, I’d hope for a ~25% improvement (based on the fact that on average we’ll be reducing the number of RUN lines from 4 to 3).

One other angle worth investigating would be proper profiling and perhaps switching from JIT to AOT compilation. This is just a loose suggestion, not something I investigated.

Why not rely on our public pre-commit and post-commit CI to run these for you instead? There are public AArch64 bots that run the MLIR integration tests.

Integration tests are bound to be significantly slower than regular regression tests. And we intend to add some new ones soon. From my side, these will primarily be for SME, so I imagine that most folks will keep them disabled anyway (otherwise, they would need to install and set up an SME emulator). My point being: we should accept that we may need to skip some tests in our daily development.

With all this said, I agree that this is something worth closer investigation. I should be able to finish refactoring the “vectorisation” RUN lines in the next few weeks, so hopefully you’ll see some improvement.



Although the request to reduce runtimes of our tests is of course always reasonable, I somehow find the tone of your post, hopefully unintentional, a bit negative. I would like to think that the sparse tensor integration test suite sets the right example by actually thoroughly testing every new feature that we implement. I have often been contacted, in private, alas, by other teams that found bugs in their code because of the seemingly unrelated sparse tensor tests, simply because it covers many features of MLIR untested elsewhere. In fact, having in-tree tests that are slightly more expensive, and thus should run less frequently, but cover more features, was exactly why I proposed the integration tests to start with.

I appreciate you SO much for saying this. This is exactly the kind of contributions I like to think my team is making.

Having said that, of course we can look into trimming tests that perhaps could run a bit quicker. And, sigh, replacing underscores with hyphens (they are known to run a lot faster! :wink: ).

Another take on this may be that the other areas with a large footprint in the repo are drastically under-tested and could benefit from more testing :wink:

3.7s to run our entire suite of tests! Many individual C++ files take longer to compile :slight_smile:
What kind of latency do you actually expect?

We should distinguish between a “comprehensive test suite” and “integration smoke tests”; the ones we have in mlir/test/Integration are meant to be the kind that are fast and runnable locally during development.

I would wonder: how much extra coverage does each of these integration tests add individually (vs. the existing ones), and do they pull their weight here (vs. in separate suites)?
It’s also interesting that there are no Python integration tests in the top here.

To my point above, it is likely that this is getting beyond how we intended the integration tests in this location: the idea was to keep smoke testing part of the normal development flow, while the bots run more comprehensive suites.
We may just have a signal that we need to restructure and split the “comprehensive test suites” out of there?

So far it seemed manageable to me and I didn’t sense any urgency here; on my development config (Release+Asserts+Shared libs) I have:

Testing Time: 9.56s
  Unsupported:  255
  Passed     : 1848

and with the integration tests:

Testing Time: 16.00s
  Unsupported:   46
  Passed     : 2056
  Failed     :    1

Thank you Mehdi. Together with Andrzej, you just made my day!

That was actually not my intention and it shouldn’t be seen that way. The intent is to make things strictly better – to seek a way to reduce those times without compromising on coverage. We’ve done this multiple times with the check-mlir test suite – sometimes by splitting long test cases into multiple files and, in some cases, by dropping tests that aren’t necessary. I’d suggest always looking at things from a technical standpoint. If cutting it down without sacrificing coverage and quality isn’t possible, there isn’t anything to worry about.

This isn’t the right way to look at it – in absolute terms, 3.7s could be a little or a lot; not everyone runs on the kind of fast machine that I tested on. I’m pretty sure it’s 20s on some common modern setups. I suggest looking at it in proportion: opportunities to cut it down should always be welcome given how widely used and frequently run the test suite is. Developer productivity is paramount, and “it seems manageable on my setup” isn’t a great argument without actually auditing things.

This is just pure speculation. If something is undertested, there isn’t anything preventing additions.

That’s a non-option for the use cases I mentioned – those landing patches can’t rely on them after a rebase, and it simply isn’t feasible during development. They are mostly a secondary, pre-merge assurance, and my entire post is about the local/“on terminal” tests people run with ninja check-mlir. That said, all of the things you suggested are exactly some of the directions to follow. The team working on and regularly using these tests is in the best position to look for opportunities, and only they know what’s minimal yet powerful (reducing problem sizes to the minimum is another area).

Most people (not using those specific parts of MLIR) would, but the integration test package is useful to keep enabled even for daily development, and it isn’t really an option for committers to disable it. I’ve often seen test suite breakage in the integration tests because the build config used by the author/committer didn’t have them enabled.

[quote=“bondhugula, post:6, topic:70972”]
This isn’t the right way to look at it – in absolute terms. 3.7s could be too little or a lot – not everyone runs on the kind of fast machine that I tested it on. I’m pretty sure it’s 20s on some common modern setups. I suggest looking at it in “proportion” – opportunities to cut it down should always be welcome given how widely used and run the test suite is. Developer productivity is paramount
[/quote]

Sure, but it was still quite striking that you were complaining about test time and posting this measurement :slight_smile:
Overall, it does not seem pragmatic to me to complain about the time taken here without first estimating what is / isn’t reasonable: I believe we should be able to describe 1) what kind of coverage we’re seeking for the check-mlir target and 2) what kind of “CPU time” we are targeting for day-to-day development (which likely also factors into our build time).

To be clear, because it’s not clear if you’re implying it: I never wrote anything like that.

To your point: unless one has a massive number of cores, this won’t help (and even then it would only improve some latency; I hardly see how it helps the overall CPU user time).

I guess this is what I’m contesting. You don’t need a time target in mind to look for low-hanging opportunities to trim down unnecessary execution in tests (e.g., are problem sizes minimal enough for the thing being tested?). It’s only natural to start from the dominating tests, and it’s ideally the responsibility of the maintainers/active users of those tests to audit that. Build time in the incremental-build scenario isn’t an issue. The post upthread already suggests an effort – nothing more to discuss here.

I don’t have a horse in the sparse tensor race, but I was thinking the same thing.

In general, my feeling is that MLIR is undertested, and I would not vote to remove any tests that are finding bugs and holding things together unless we were talking about an order of magnitude or more of latency.

(Doing some work now to improve the build time, and possibly reworking expensive parts, would be much more desirable to me.)

I wonder if we’re pointed at the right root cause: when people are working on a subset of the project in a tight iteration loop, do we have enough controls or organization for them to iterate efficiently on the part of interest? Pulling on this thread could lead to project reorganization and/or mechanics for working on a subset (like the Testing Guide - MLIR).
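As a concrete example of such a subset workflow, lit already supports filtering, so one can iterate on just the area of interest. A sketch (the filter string and paths here are illustrative and depend on your build layout):

```shell
# Run only tests whose path matches a pattern, via the check-mlir target.
LIT_OPTS="--filter=SparseTensor" ninja check-mlir

# Or invoke lit directly on a subdirectory of the build tree.
./bin/llvm-lit -sv tools/mlir/test/Integration/Dialect/SparseTensor
```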


Just to double-check, and so that we have an additional data point:

  • Do you enable the SVE integration tests and do you run them through an emulator?

That would require you to set the following flags:
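The quoted flag list is missing above; from memory (the option names are assumptions and should be verified against the MLIR CMake documentation), the relevant CMake cache settings look something like:

```cmake
# Build the integration tests at all (off by default).
-DMLIR_INCLUDE_INTEGRATION_TESTS=ON
# Enable the SVE subset and point lit at an emulator (e.g. QEMU).
-DMLIR_RUN_ARM_SVE_TESTS=ON
-DARM_EMULATOR_EXECUTABLE=<path-to-aarch64-emulator>
```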


I suspect that most people will have these set to defaults, i.e. disabled.

I just wanted to make sure that none of this is specific to vector-length-agnostic vectorisation in SparseTensor and/or SVE.