[RFC] Opt-in way to make VerifyDiagnosticConsumer slightly less strict

Summary

Our use case (DR conformance testing) calls for something along the lines of an // expected-maybe-no-diagnostics magic comment, which works like // expected-no-diagnostics but doesn’t trigger an error if diagnostics are actually expected further on in the file. We’d be equally well served by a -cc1 option with the same effect. The main question is which way this functionality should be introduced, or whether it should be introduced in both ways simultaneously.

It should be noted that, even when enabled, this feature does not compromise VerifyDiagnosticConsumer strictness in any way that would introduce false positives or false negatives into tests that rely on it.
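To make the semantics concrete, here is a minimal sketch of how the proposed comment could be used (the test contents below are hypothetical, only the directive name comes from this proposal):

```cpp
// RUN: %clang_cc1 -verify %s
// expected-maybe-no-diagnostics

// Scenario 1: if this file ends up containing no expected-* directives and
// produces no diagnostics, the verifier is satisfied, just as it would be
// with expected-no-diagnostics.

// Scenario 2: if the file does contain directives, they are checked as
// usual, and (unlike with expected-no-diagnostics) their presence is not an
// error. For instance:
namespace drN { // hypothetical test contents
  int i = "oops"; // expected-error {{cannot initialize a variable of type 'int'}}
}
```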

Use case

We’re working on the C++ defect report test suite. Currently, test cases are grouped by 100 in a file (example), which makes sense due to the sheer number of DRs against the C++ Standard (2608 and counting), while a significant portion of them either does not require a test or requires only a couple of lines to test. Namespaces almost provide a sufficient level of isolation for the purposes of grouping those tests under a single file. But recently we came across a situation leading to nasty interference between tests, which is the reason D148433 is stuck. A reduced example is available on godbolt. In short, Clang skips over template instantiation if an unrecoverable error was encountered earlier in the file, which prevents it from issuing diagnostics that would otherwise be issued.
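For illustration only, here is a hypothetical reduction of that kind of interference (not the linked godbolt example; it assumes the first error is one Clang treats as unrecoverable):

```cpp
// RUN: %clang_cc1 -verify %s
namespace test1 {
  // Assume this error is one that Clang considers unrecoverable.
  some_unknown_type x; // expected-error {{unknown type name 'some_unknown_type'}}
}

namespace test2 {
  struct A {};
  template <typename T> void f(T t) {
    t.g(); // expected-error {{no member named 'g'}}
  }
  void h() {
    f(A()); // instantiation may be skipped because of the earlier error,
            // so the expected-error above is never produced and -verify fails
  }
}
```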

While useful for general user experience, such behavior is undesirable and surprising for conformance tests (which DR tests are a part of). So we’d like to improve test isolation while not compromising the succinct notation used to specify DR statuses. We’re going to rely on the split-file utility to split grouped test files into individual files per test, and extend lit to avoid duplicating RUN commands for each test (we have a RUN command for each Standard revision) (patch). Here’s a comparison between the old layout and the new layout: link.
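As a rough sketch (the actual layout in the linked comparison may differ), a grouped file driven through split-file could look something like this, with each //--- part compiled on its own:

```cpp
// RUN: split-file %s %t
// RUN: %clang_cc1 -std=c++98 -verify %t/drN.cpp
// RUN: %clang_cc1 -std=c++98 -verify %t/drM.cpp
// (and so on for each part and each Standard revision; the lit extension
//  mentioned above is meant to avoid spelling all of these out by hand)

//--- drN.cpp
namespace drN { // hypothetical test contents
  int i = "oops"; // expected-error {{cannot initialize a variable of type 'int'}}
}

//--- drM.cpp
// expected-no-diagnostics
namespace drM { /* hypothetical test that compiles cleanly */ }
```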

With test cases individually compiled and checked by VerifyDiagnosticConsumer, tests that don’t expect any diagnostics are no longer grouped together with tests that do. Sprinkling // expected-no-diagnostics seems like an obvious solution, but dr5xx.cpp alone contains 27 tests that would require such marking. Instead of being actually useful, those comments are going to become noise for the sake of taming VerifyDiagnosticConsumer. This doesn’t seem sound, so we want a way to adjust VerifyDiagnosticConsumer a bit for our test suite, and do that in a single place or a handful of places.

Other usages

Non-DR conformance tests could also leverage this infrastructure, e.g. to group tests for individual paragraphs under a single file, simplifying the directory structure.


Thank you for continuing to try to improve the DR conformance testing situation! In general, I think using the split-file approach is going to help get us hermetic tests without adding significant overhead for people writing the tests, so this is heading in a good direction IMO.

In terms of whether to go with a new verifier comment or with a command line switch, my (very slight) preference is for the comment. A cc1 flag would certainly work, but is a bit more heavy-handed of an approach for something that I expect we’d mostly only want to use for DR testing and perhaps modules/serialization testing as opposed to a more general feature. I think having a comment for the verifier suffices.

I was a little bit uncomfortable at first with the idea of having a “maybe this is expected, maybe it’s not” kind of comment. However, the only situation where I think this would come back to bite us is when adding additional tests to an existing file already using this marking and the additional tests should have diagnostics but don’t (for whatever reason). In that case, the test will pass when it would have otherwise failed. But this seems like an edge case that we can live with.
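A sketch of that edge case (hypothetical contents), for concreteness:

```cpp
// expected-maybe-no-diagnostics

namespace existing_test {
  // a conformant construct; genuinely produces no diagnostics
}

namespace newly_added_test {
  // a construct that *should* be diagnosed, but Clang fails to diagnose it,
  // so the author writes no expected-* directives; a plain -verify run with
  // no comment would be rejected for having no expected directives, whereas
  // the maybe-variant lets the file silently pass
}
```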

CC @zygoloid @dblaikie @cjdb for some extra opinions given this is a question of which approach to take.

Thank you for the feedback!

I agree that this is both theoretically possible and something we can live with. Here’s how this situation would look in practice when applied to conformance testing:

  1. There’s no concept of XFAIL tests there. You have tests that pass, and you have some more tests that pass but have FIXME comments inside.
  2. So if your new test doesn’t produce a diagnostic it should, and you forgot about that, what you’re actually going to be missing is a FIXME comment.
  3. This would mean that instead of setting the status to non-conformant, you set it to conformant. At this point I’d question whether the author understands what they’re testing.
  4. Then we have reviewer(s). If they let it through, I’d have the same question for them as well.

I see this happening in two cases:
a) an underqualified author and an underqualified reviewer, though both are interested enough in conformance testing;
b) Standard wording that could be misunderstood easily enough to slip past a qualified person unnoticed.

I don’t see how something purely technical like requiring // expected-no-diagnostics can help us with such severe issues at the organizational level.

But again, this is the conformance testing perspective. Your scenario could be more realistic for other use cases.

Yeah, mixed feelings similar to your own, but nothing stronger. It doesn’t seem like it’d be that big a deal to have an explicit annotation on each micro-test that either shows the expected diagnostic, or an explicit statement that none is expected.

But if you folks who work on this more feel that’d be too noisy, I wouldn’t hold it against you to add this feature to the verify diagnostic consumer.

Patch has been published: D151320

Sorry for the late feedback, but I’m very concerned about the overall direction here.

Process creation is surprisingly expensive – I think around 3ms on Linux and 30ms on Mac and Windows is not uncommon – and Clang’s startup and command-line parsing time adds further cost (30ms or so for a debug build). There are thousands of DRs, and we currently have 7 RUN lines for each of them.

This direction would add somewhere around 3000 * 7 * 33ms = 693s (over 11 minutes) of overhead on Linux, and more like 22 minutes of test overhead on Mac. Even on my 56-core development machine, this is a substantial increase, and doesn’t seem acceptable to me. This is going to get worse over time – both the 3000 and the 7 will grow linearly as C++ continues to evolve. Clang’s test suite has already become unbearably expensive, in part due to the addition of very large and slow tests, but I don’t think that’s an excuse to allow further, substantial regressions.

It seems to me that we could address the test isolation issue by instead adding a mode to Clang to run multiple compilation actions from the same -cc1 invocation. Maybe we still structure that as a split-file pass followed by passing a bunch of files to the same %clang_cc1, or maybe we structure it as having Clang do the splitting itself, but I think we want to avoid generating a bunch of separate RUN: lines and subprocess invocations.
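At the RUN-line level, the first variant could look something like the sketch below (hypothetical; whether -verify state is handled per input in a single invocation today is exactly the part that would need work):

```cpp
// One subprocess per grouped file instead of one per test: split first, then
// hand every part to a single %clang_cc1 invocation that compiles (and
// verifies) each input as a separate compilation.
// RUN: split-file %s %t
// RUN: %clang_cc1 -std=c++20 -verify %t/drN.cpp %t/drM.cpp
```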


The other day I combined all the patches on this topic floating around (I haven’t landed anything yet), and tested what kind of slowdown test isolation via subprocess invocations would bring. I used the fastest lit approach to support this, so lit overhead is minimized. On a quite populated dr6xx.cpp I see a 4.5x slowdown using a debug build of Clang. The test was conducted on an aging 24-core Ivy Bridge platform with a SATA SSD, in case that matters.

Here’s what I was testing: %for-each-file test · GitHub
It depends on D151320 and D150856. It was created from my recollection, so it’s not guaranteed to work, although it should be close. Sorry for not bringing any absolute numbers! I wonder what the absolute numbers would be if the 4.5x estimate is applied to your hardware.

First, I’m glad to see we’re on the same page on the fact that there’s an issue to address.
I had this idea myself, but it was considered too heavy-handed when I discussed it with others. I still wouldn’t mind addressing the issue this way, though.

I’d go for -cc1 after split-file first. I don’t think reimplementing split-file in Clang is a good idea, unless there are compelling reasons to do that.

This was also a concern I had shared with Vlad. Ideally, we’d only need to split out some tests into their own files, but the trouble is: the DR tests are structured in such a way that it’s not easy to tell when the test is missing diagnostics due to unrelated prior failures. Adding some sort of mode to the compiler to work around this would be possible, but then we’re not actually testing what our users run, so it’s not ideal either.

Our test suite is not based on code coverage, so we have a lot of test cases that add to the overhead for potentially little value. However, “tests are really slow to run” is not a reason to have less (accurate) test coverage. If we have to pick between two poisons, I think we should err on the side of too many/slow tests over too few/fast ones, because quality issues will always impact users more than a slow test suite will. That said, we should certainly be trying to optimize our test suite so that developing Clang is less painful as well – but I think those efforts should be put towards making Clang run faster, as we have no control over process creation time and improvements to -cc1 or test infrastructure mostly only help ourselves (which isn’t bad, but if we’re spending scarce QA resources, we should put them to the best use we can).

Multiple compilations from -cc1 might be a good idea (I believe the driver can take advantage of that as well, which reduces compile times for users), but I don’t think having Clang do the splitting itself is worth it unless we can find a reasonable user-facing feature enabled by the new functionality.

Perhaps we should consider splitting our tests up more? We could consider moving DRs to llvm-test-suite – it gets tested in post-commit CI so there’s still benefit to having the tests, though we’d lose local testing for most folks which could mean somewhat more post-commit reverts. Alternatively, we could consider adding testing “profiles” to lit so that you can smoke test an abbreviated set of tests or run the full set of in-tree tests, which leaves it to the developer to decide how much compute time they want to spend on testing – we’d have to figure out a policy as to what tests go into which profile.

Not just that, but it requires a clear guideline for which tests go to which repository. I believe trying to decide that “on a case-by-case basis” and “at reviewer discretion” would make it unpredictable for others to figure out where to look for a particular kind of test, not to mention it would add to the list of things for reviewers to think about.

At the end of the day, it’s more or less kicking the can down the road, because at some point complaints about llvm-test-suite being too slow will arise as well (if they haven’t already).

We already have a set of check-* targets that allows you to run only a subset of tests, and it doesn’t require people to learn new policies to understand which tests they are running, because it’s directory-based. Why can’t we leverage that?

Can you expand on how that could affect test isolation when compared to separate processes?