[RFC?] FileCheck should verify its input (commands, and combinations thereof)

Problem

Over the course of working with the lit test system in LLVM, I have often been frustrated with the way FileCheck behaves. Personally, I find FileCheck to be way too “forgiving” [1], i.e. it does not really sanitize its input. As a result, tests are open to bugs. Recent examples that I’ve encountered:

  • CHEKC - typos are not “caught”, and as a result the written test effectively does nothing. Examples in the main branch of LLVM: #1, #2, #3; there are more. Common typos in my experience are CHEKC, CHECK-SAEM, etc.
  • CHECK-DAG - a very special command that doesn’t play well with anything else. For example, writing a CHECK-SAME after CHECK-DAG produces no error, yet the CHECK-SAME does nothing, because CHECK-SAME support is not implemented for CHECK-DAG (at least as far as I understand the situation). After a brief search, I do not see misuses of CHECK-DAG in LLVM, but I’ve seen them multiple times in our downstream.
  • CHECK-NOT - also very special, as its semantics is “the specified line is not present at this position”. In a large enough test, this can easily become an accidental test bug. Consider:
// Note: this is pseudo-MLIR example
%0 = foo(...)
%1 = bar(%0)

// succeeds:
// CHECK: bar
// CHECK-NOT: foo

// fails:
// CHECK-NOT: foo
// CHECK: bar
  • CHECK-SAME-LITERAL - a “syntax error”, as the intended command is CHECK-SAME{LITERAL}. As with typos, such a line seems to be ignored.
  • I do not recall more misuses off the top of my head, but I imagine they exist.

Proposal

My proposal is to enhance the FileCheck tool by adding sanity checks (such as “is there a typo in the keyword?”), syntax checks, and “workflow” logic checks (such as “for CHECK-SAME, is there a preceding CHECK?”) that run before the tool starts to analyse the given text.

  1. For typos and the like, the process could follow an “is it a duck” test (i.e. “if it looks like a duck and it sounds like a duck, it’s most likely a duck”): if there’s a line that could almost be treated as a command, it is a command. Consider:
// CHEKC: ...
^ ^ ^^^^
| |  the word immediately follows and is "similar" to --check-prefix value
| has a whitespace (" ") right after
starts with "//"
-> decision: IS a misspelled command

// the next line is run through CHEKC:
                                ^^^^^
                        the word doesn't immediately follow
-> decision: is NOT a misspelled command

For the “approximation” algorithm, one may use something like a Levenshtein-distance-based heuristic (actually, could we ask linting experts within LLVM for a good algorithm here? This is a task that’s generally solved by linters/IDEs/etc.).
Thus, if a line is deemed a misspelled command, an error is produced and the test fails.
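A minimal sketch of the “is it a duck” heuristic in Python, using the standard library’s difflib as a stand-in for the edit-distance heuristic (the function name, the comment markers, and the directive list are my assumptions; a real implementation would live in FileCheck’s C++ and use the actual --check-prefix values):

```python
import difflib
import re

# Assumed set of known directives; the real tool would derive this
# from its configured check prefixes.
KNOWN_DIRECTIVES = ["CHECK", "CHECK-NEXT", "CHECK-SAME", "CHECK-NOT",
                    "CHECK-DAG", "CHECK-LABEL", "CHECK-EMPTY"]

def find_misspelled_directive(line, comment_markers=("//", ";", "#")):
    """Return (word, suggestion) if the line looks like a misspelled
    directive, or None otherwise."""
    for marker in comment_markers:
        # A candidate starts with a comment marker, and the first word
        # is immediately followed by ':' (the "looks like a duck" part).
        m = re.match(re.escape(marker) + r"\s*([A-Z][A-Z-]+):", line)
        if not m:
            continue
        word = m.group(1)
        if word in KNOWN_DIRECTIVES:
            return None  # spelled correctly, nothing to report
        # "Sounds like a duck": close enough to a known directive?
        close = difflib.get_close_matches(word, KNOWN_DIRECTIVES,
                                          n=1, cutoff=0.75)
        if close:
            return (word, close[0])
    return None

print(find_misspelled_directive("// CHEKC: foo"))        # flagged
print(find_misspelled_directive("// run through CHEKC")) # not flagged
```

The second call returns None because the word does not immediately follow the comment marker with a trailing colon, matching the decision procedure sketched above.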

  2. The next phase, assuming typos are eliminated by the procedure above, is to ensure there are no syntax errors such as CHECK-LITERAL instead of CHECK{LITERAL}, or CHECK- (with a trailing -).

Given that we “know” the command (it is “parsed” successfully), any later deviation is unexpected and constitutes a failure. In general, this seems to force a full-blown parser, but given that the FileCheck “API” is rather minimalistic, perhaps a hardcoded set of “possible results” would already be sufficient.
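The hardcoded-set idea could look roughly like this (a sketch; the suffix list and function name are assumptions, not FileCheck’s actual internals):

```python
import re

# Assumed hardcoded set of legal suffixes for the base prefix.
VALID_SUFFIXES = {"", "-NEXT", "-SAME", "-NOT", "-DAG", "-LABEL", "-EMPTY"}

def check_syntax(word, prefix="CHECK"):
    """Return None if `word` is a well-formed directive, else an error string."""
    if not word.startswith(prefix):
        return f"'{word}' does not start with prefix '{prefix}'"
    rest = word[len(prefix):]
    # An optional trailing {LITERAL} modifier is legal syntax; strip it.
    rest = re.sub(r"\{LITERAL\}$", "", rest)
    if rest not in VALID_SUFFIXES:
        return f"'{word}': unknown or malformed suffix '{rest}'"
    return None

print(check_syntax("CHECK-SAME{LITERAL}"))  # well-formed
print(check_syntax("CHECK-LITERAL"))        # error: not a valid suffix
print(check_syntax("CHECK-"))               # error: trailing '-'
```

This also catches the CHECK-SAME-LITERAL case from the problem list, since "-SAME-LITERAL" is not in the set.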

  3. To counter logical errors (such as “CHECK-SAME cannot follow CHECK-DAG”), it seems reasonable to hold a stack-based state of commands.

Upon discovery of a command (e.g. CHECK-SAME, CHECK-NEXT) that must have a preceding command, the last value on the stack of commands is verified for correctness. As with the syntax errors, perhaps we could just maintain a hardcoded list of allowed combinations. AFAIR something like this is already done for CHECK-NEXT?
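A sketch of this workflow check (the allowed-predecessor sets below are illustrative assumptions, not FileCheck’s documented rules):

```python
# Assumed table: directives that require a predecessor, mapped to the
# predecessors they may legally follow.
NEEDS_PREDECESSOR = {
    "CHECK-SAME": {"CHECK", "CHECK-NEXT", "CHECK-SAME", "CHECK-LABEL"},
    "CHECK-NEXT": {"CHECK", "CHECK-NEXT", "CHECK-SAME", "CHECK-DAG",
                   "CHECK-LABEL"},
}

def check_workflow(directives):
    """Walk the directives in order; report combinations whose
    predecessor makes them meaningless."""
    errors = []
    stack = []  # history of directives seen so far
    for d in directives:
        allowed = NEEDS_PREDECESSOR.get(d)
        if allowed is not None:
            if not stack:
                errors.append(f"{d} cannot be the first directive")
            elif stack[-1] not in allowed:
                errors.append(f"{d} cannot follow {stack[-1]}")
        stack.append(d)
    return errors

print(check_workflow(["CHECK", "CHECK-SAME"]))      # ok, no errors
print(check_workflow(["CHECK-DAG", "CHECK-SAME"]))  # flagged
```

Only the top of the stack is consulted here; keeping the full history around would let richer rules (e.g. about CHECK-DAG groups) be added later.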

  4. Additional commands: I think it is reasonable to either extend the functionality of existing commands or introduce new ones. For example,
    • CHECK-NOT-DAG (or CHECK-DAG-NOT) to check the whole text for the presence of a line and fail if it is found - in many cases this could be preferred over CHECK-NOT (e.g. when the compiler is supposed to completely eliminate a certain command/operation/variable/etc.)
    • CHECK-DAG-SAME (analogous to CHECK-SAME) as the command that analyses the text right after a CHECK-DAG match. With additional verification of CHECK-SAME, the error could then suggest: “CHECK-SAME cannot be used after CHECK-DAG; did you mean CHECK-DAG-SAME?”
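To illustrate the proposed command (purely hypothetical syntax - neither CHECK-DAG-SAME nor the attribute shown exists today):

```
// CHECK-DAG: %[[RES:.*]] = foo
// CHECK-DAG-SAME: {attr = 42}
```

Here CHECK-DAG-SAME would continue matching on the same line as the CHECK-DAG match above it, just as CHECK-SAME does for a plain CHECK.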

Conclusion

Would this be something that the community is interested in?

I am not sure I would have sufficient time to implement all of this, but our downstream periodically has to deal with FileCheck inconveniences, so I think I could spare time for certain items on this list.

In any case, I felt it was worth starting a discussion and perhaps learning something new. Could it be that I missed some recent developments on this topic? If so, please feel free to point me to a similar conversation / documentation / etc.


[1]: Our downstream project is based on LLVM 19.x. Perhaps there’s some improvement already in 20.x and/or beyond.

If it’s possible to detect common FileCheck usage bugs without many false positives, sure, let’s do it.

However, the way FileCheck is used in our lit tests, this tends to be a lot harder than it may look on the surface. For example, if you have a FileCheck invocation with prefix CHECK and you see ; CHECK-FOO:, you might think that you can report an error for an unknown/misspelled directive. However, what’s actually going on is that there is a second FileCheck invocation on the same file using --check-prefix=CHECK-FOO. But the first FileCheck invocation does not know anything about the second one. So we have to conservatively assume that this is actually a valid directive for a different FileCheck invocation, and can’t report an error. (This is not a hypothetical situation, this is actually very common.)
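For readers unfamiliar with the pattern, it looks roughly like this (the pass names and patterns are made up):

```
; RUN: opt -S -passes=instcombine %s | FileCheck %s
; RUN: opt -S -passes=dce %s | FileCheck %s --check-prefix=CHECK-FOO

; CHECK: pattern for the first invocation
; CHECK-FOO: pattern only the second invocation knows about
```

From the first invocation’s point of view, the CHECK-FOO lines are just comments.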


In addition to the example situation @nikic has described, another related issue is that lit and FileCheck have similar command syntaxes. For example, the following is an extremely common, more-or-less standard lit layout:

# RUN: some command | FileCheck %s
# CHECK: some pattern

However, this is equally legitimate:

# RUN: some command | FileCheck %s --check-prefix=RUNS
# RUNS: some pattern

(To be clear, I’m not suggesting this is a GOOD style to follow, but it is legitimate, and it would be hard to specify how similar/dissimilar FileCheck prefixes should be to lit directives.)

There are plenty of other examples which I think would make it hard to come up with a practical heuristic that doesn’t just get in the way of the developer doing what they want. I’m not opposed to the principle, but just don’t see how you could make it work in practice.

An alternative approach might be to make it a soft check done on PRs that generates a warning if a potential typo is detected, a bit like the clang-format warning that is generated when something isn’t correctly formatted. By making it this sort of style, the checker could analyse the whole file to get the whole context and report unused prefixes and the like, whilst also ignoring known lit directives etc.

Isn’t this what --implicit-check-not is for?
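For reference, --implicit-check-not=<pattern> makes FileCheck add a negative check for the pattern between the positive matches, which covers the “must be absent everywhere” use case (illustrative invocation):

```
; RUN: opt -S -passes=dce %s | FileCheck %s --implicit-check-not=foo
; Any occurrence of "foo" outside the positive matches now fails the test.
; CHECK: bar
```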


Thanks for your attention to our testing tools! What you’ve identified is the “Rotten Green Tests” problem (there’s an ICSE paper with this exact title, and I tried to leverage the idea for googletest although that never got accepted). We’ve done a bit of work on FileCheck in the past to address these problems; for example, you can’t apply multiple suffixes to a directive and we do detect that.

Some of us have had thoughts about improving FileCheck’s detection of ill-formed tests in the past, and occasionally even poked at implementing some (particularly @jdenny-ornl IIRC). @nikic pointed out the main problem, which is that a given test can run FileCheck multiple times with different prefixes. Addressing that situation really requires that FileCheck record information about what directives it used in each run, and then we need some sort of llvm-cov-like aggregator post-processing to see if there’s anything that got missed. (If we don’t address this multiple-run situation, then error detection will be full of false positives, and people will just turn off or ignore the detection. It has to be pretty robust to be acceptable.)

I do think that CHECK-SAME with a -DAG predecessor could be detected relatively easily. It is clear that CHECK-DAG followed by CHECK-SAME is unlikely to have a useful meaning and should probably be considered an error. (Given my understanding of FileCheck behavior, I’d expect it to do a match on the same line as the last match of the preceding DAG set, which is probably not what you actually wanted. If FileCheck actually does something else with it, that’s an even stronger argument that the combination is an error.)

CHECK-NOT works exactly as specified. If you’re finding tests that are using it incorrectly, please refer to the documentation.

Perhaps we can create a separate linter tool / mode to warn users about these issues. Since such a tool can read the entire test script, it’s easy to tell whether CHEKC or CHECK-FOO is intended, as it sees every single --check-prefix.

I think we can just reuse whatever algorithm Clang and llvm::cl use for suggesting command line flags.

I remember there being some work done in the past to fix CHEKC and similar typos in the tests. I don’t recall if there was a tool/linter created for it or if it was mostly manual.

It looks like there is this filecheck_lint.py tool in the repository. Tested it locally and it seems to be able to find typos:

Found potentially misspelled directive “CHEKC”. Did you mean “CHECK”?

Found potentially misspelled directive “CHCEK-SAME”. Did you mean “CHECK-SAME”?
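For anyone who wants to try it, the invocation is along these lines (path assumed relative to the llvm-project root; check the tool’s own help for the exact interface):

```
python3 llvm/utils/filecheck_lint/filecheck_lint.py path/to/test.ll
```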

FileCheck really is very bug-prone. It’s true that there are a lot of common patterns for using it that make diagnosing mistakes difficult, but we can do something about that. If we can make a significant dent in the bug-proneness problem by hardening idiomatic FileCheck use, we should.

Like, we could just require files to list all their expected prefixes at the start of the file, and then we could diagnose things that look like check lines but aren’t in the list. We’d need to update a bunch of tests first, but that should be highly automatable. Deciding what “looks like a check line” might be hard, though.


Here are some more examples of lit-related linters that I found useful in our project (only for inspiration, not particularly copyable to other projects):

Both of these caught dozens of tests in our project, many of which silently weren’t running.

That second tool you mentioned looks like it checks for a relatively small set of lit feature keywords. I’ve had a change in mind for lit for a long time, to make sure REQUIRES / UNSUPPORTED / XFAIL keywords aren’t misspelled, for example. Guaranteeing that every lit test is actually run somewhere is a harder problem.