[RFC] Improving FileCheck

I’d like to (re)start a discussion on a few gotchas in FileCheck that I’ve noticed working on various tests in llvm’s suites. This begain in a review [1], but I’ll try to summarize here so it gets the right audience before decisions are made on it (so to speak).

1: https://reviews.llvm.org/D77227

The main sticking point is the abundance of checks in FileCheck tests that appear to be checking something, but are in fact silently hiding failures. The biggest class of this bug appears to be CHECK lines that omit the trailing colon, though there are a few others.

CHECK: legitimate test
CHECK gotcha A
CHECK : gotcha B
CHECKNEXT: gotcha C
CHECKDAG: gotcha D
CHECK_NOT: gotcha E
CHECK-LABEL-NOT: ??? F
CHECK-SAME-DAG: ??? G

Gotcha A

First of all, thanks for working on this. In my opinion, filecheck needs help, and anybody trying to make it better is a hero.

That said, from skimming the review thread, it seems to me that perhaps automatically detecting bad CHECKs might not be feasible. Perhaps better debugging tools might be a solution. If I could ask filecheck to list all directives in a file, I could easily find this stuff manually. If I used this hypothetical filecheck --list-directives, and I don’t see the one I just added, then I know it’s bad. This would be much easier to catch in code review in the future.

Such a debugging framework could be expanded in the future to do more fancy things. I’d love it if there were some sort of REPL like functionality where I can throw inputs at a directive and see what it finds.

(Repost, remembering to cc llvm-dev this time)

Thanks Jon!

I’ve done some grepping for a few of these gotchas in llvm/test and clang/test. Of course this isn’t looking for the check prefixes that are actually used, but CHECK would be by far the most common, so I think we can take it as a non-definitive proxy for other cases.

Gotcha A (missing colon): This has a fair number of examples in the wild. Usually the colon is just missing, although I’ve seen a few examples that had a semicolon instead of colon, which is an easy typo to make. The real problem here is coming up with a heuristic that will distinguish “things that are likely a mistake” from “things that are likely just a comment” so that the diagnostic has a reasonably low false-positive rate.

For example, in the wild I’ve seen // TEMPORARY CHECK: X which FileCheck will treat as a real directive. On the other hand we don’t want it to look at // These CHECK lines are here on purpose and diagnose that as a missing colon.

I’ve been poking at this over the past day or so and I now think the most reasonable heuristic is: If the directive is preceded by an alphanumeric (+ whitespace) or an = or ‘,’ then it’s likely a comment or a RUN: line that we don’t want to diagnose. This means we would not diagnose a missing colon in // TEMPORARY CHECK; X but that sort of case is extremely rare. We would diagnose // (TEMPORARY) CHECK; X however.

Another question was whether we should require FileCheck directives to be preceded by a punctuation mark of some kind. I think that ought to be its own separate discussion.

Gotcha B (space between the directive and the colon): Some tests have this bug, so it would be worth catching.

James Henderson observed that legalizing it could help prettify some cases where we’re matching whitespace or the entire line. I don’t think it’s that valuable personally.

If we implement a reasonable diagnostic heuristic for the missing-colon case (Gotcha A), then we’ll catch this mistake in the same net.

Gotchas C,D (missing hyphen): I found exactly one case in the wild. I’d say the value is debatable.

(It’s a CHECKNEXT in llvm/test/CodeGen/PowerPC/testCompareslleqsi.ll if someone wants to fix it.)

Gotcha E (underscore instead of hyphen): I found 40 examples across clang/test and llvm/test. I am certain I have caught a few cases in review and pretty sure I’ve had to fix some of these that I typo’d myself. I’d say this is worth doing.

Multiple suffixes: I believe there are NO multiple-suffix combinations that FileCheck currently supports. The tool should detect any multiple suffix combinations and report them as errors. Currently it looks for a limited set (basically, -NOT in combination with almost anything else), but it’s easy for someone to infer that if FileCheck doesn’t complain, then it will Do The Right Thing™ with other combinations. We should not be that user-unfriendly; we should complain about all multiple-suffix combinations.

–paulr

(Repost, remembering to cc llvm-dev this time)

Thanks Jon!

+1

I’ve done some grepping for a few of these gotchas in llvm/test and clang/test. Of course this isn’t looking for the check prefixes that are actually used, but CHECK would be by far the most common, so I think we can take it as a non-definitive proxy for other cases.

Gotcha A (missing colon): This has a fair number of examples in the wild. Usually the colon is just missing, although I’ve seen a few examples that had a semicolon instead of colon, which is an easy typo to make. The real problem here is coming up with a heuristic that will distinguish “things that are likely a mistake” from “things that are likely just a comment” so that the diagnostic has a reasonably low false-positive rate.

For example, in the wild I’ve seen // TEMPORARY CHECK: X which FileCheck will treat as a real directive. On the other hand we don’t want it to look at // These CHECK lines are here on purpose and diagnose that as a missing colon.

I’ve been poking at this over the past day or so and I now think the most reasonable heuristic is: If the directive is preceded by an alphanumeric (+ whitespace) or an = or ‘,’ then it’s likely a comment or a RUN: line that we don’t want to diagnose. This means we would not diagnose a missing colon in // TEMPORARY CHECK; X but that sort of case is extremely rare. We would diagnose // (TEMPORARY) CHECK; X however.

The previous heuristic we discussed was that the diagnostic is reported only if, other than whitespace, the directive is at the beginning of a line or at the beginning of a comment (for some definition of comment). It appears that your new heuristic strictly expands the cases that would be diagnosed. Is that right? Is there a specific case you saw in the test suites that is covered by your new heuristic but not by the previous heuristic?

Your new heuristic does have additional false positives in the test suites, like the following:

// NOTE: CHECK lines have been autogenerated by gen_ast_dump_json_test.py
// FIXME: CHECK-NOT is broken somehow, it doesn’t work here. Check adjacency instead.

This makes me wonder if such heuristics are going to create a frustrating FileCheck user experience when trying to write comments.

Consider what happens if we implement your heuristic in a parameterized way using new environment variables with the following defaults:

FILECHECK_COMMENT_REGEX=’([a-zA-Z0-9]\s*|[=,])$’

FILECHECK_COMMENT_APPLY=diagnostics

So far, this is exactly your proposed heuristic except:

  1. It’s slightly more to implement: just a couple of environment variables to read. Not that hard.

  2. We can easily tweak the regex per test suite as we discover how this heuristic fares with the rest of check-all and with the rest of the prefixes (beyond CHECK).

  3. Specific test suites that want total freedom in writing comments and are ready to protect them with something like ‘COM:’ could specify something like:

FILECHECK_COMMENT_REGEX=’\bRUN:|\bCOM:’
FILECHECK_COMMENT_APPLY=diagnostics,directives

At some point in the future, when all test suites are compatible with 3, it could become the default (and we might then drop FILECHECK_COMMENT_APPLY). At that point:

A. Diagnostics would no longer be limited by heuristics’ shaky assumptions about intentions within comments. They know precisely what are just comments.

B. Users would know precisely where they can and cannot write the names of directives (both with and without the colon).

C. Users would have a way to comment out directives while making their intention very obvious (no more mangling).

This seems like both a short-term and long-term win for very little extra upfront effort.

Another question was whether we should require FileCheck directives to be preceded by a punctuation mark of some kind. I think that ought to be its own separate discussion.

Well, if we eventually require directive-like text to be FileCheck-commented, as described above, this question goes away.

Gotcha B (space between the directive and the colon): Some tests have this bug, so it would be worth catching.

James Henderson observed that legalizing it could help prettify some cases where we’re matching whitespace or the entire line. I don’t think it’s that valuable personally.

If we implement a reasonable diagnostic heuristic for the missing-colon case (Gotcha A), then we’ll catch this mistake in the same net.

Agreed.

Gotchas C,D (missing hyphen): I found exactly one case in the wild. I’d say the value is debatable.

(It’s a CHECKNEXT in llvm/test/CodeGen/PowerPC/testCompareslleqsi.ll if someone wants to fix it.)

Gotcha E (underscore instead of hyphen): I found 40 examples across clang/test and llvm/test. I am certain I have caught a few cases in review and pretty sure I’ve had to fix some of these that I typo’d myself. I’d say this is worth doing.

No arguments here.

Multiple suffixes: I believe there are NO multiple-suffix combinations that FileCheck currently supports. The tool should detect any multiple suffix combinations and report them as errors. Currently it looks for a limited set (basically, -NOT in combination with almost anything else), but it’s easy for someone to infer that if FileCheck doesn’t complain, then it will Do The Right Thing™ with other combinations. We should not be that user-unfriendly; we should complain about all multiple-suffix combinations.

Agreed.

Thanks.

Joel

Gotcha B (space between the directive and the colon): Some tests have this bug, so it would be worth catching.

James Henderson observed that legalizing it could help prettify some cases where we’re matching whitespace or the entire line. I don’t think it’s that valuable personally.

If we implement a reasonable diagnostic heuristic for the missing-colon case (Gotcha A), then we’ll catch this mistake in the same net.

Fine by me, as long as “# CHECK:” continues to be acceptable instead to allow the colons to line up with other check directives on other lines.

Gotchas C,D (missing hyphen): I found exactly one case in the wild. I’d say the value is debatable.

(It’s a CHECKNEXT in llvm/test/CodeGen/PowerPC/testCompareslleqsi.ll if someone wants to fix it.)

I’m not keen on this becoming something that is forbidden, since I could imagine myself running into this on occasion, particularly when using custom check prefixes shared between cases - it’s not unusual for me to name such prefixes as something like “FOO”, “BAR” and “FOOBAR”, where FOO is for case 1, BAR is for case2 and FOOBAR is used by both. If BAR happened to be a different name that clashes with a known suffix (e.g. LABEL), then I’d get an unwanted false positive.

Gotcha E (underscore instead of hyphen): I found 40 examples across clang/test and llvm/test. I am certain I have caught a few cases in review and pretty sure I’ve had to fix some of these that I typo’d myself. I’d say this is worth doing.

Multiple suffixes: I believe there are NO multiple-suffix combinations that FileCheck currently supports. The tool should detect any multiple suffix combinations and report them as errors. Currently it looks for a limited set (basically, -NOT in combination with almost anything else), but it’s easy for someone to infer that if FileCheck doesn’t complain, then it will Do The Right Thing™ with other combinations. We should not be that user-unfriendly; we should complain about all multiple-suffix combinations.

I agree that checking for them all seems reasonable, though perhaps it might be nice for this to be user-configurable somehow (I’m thinking FileCheck tests themselves might run into things confusingly). It seems to me that it would be easy to auto-generate the combinatorial list in code and then do a simple check for the prefix being in the list.

As an update, after lots of fixes from a number of different people (thanks everyone!), the current list of false-positives on ninja check-llvm for the more stringent Gotcha A diagnostic is:

LLVM :: Analysis/CostModel/X86/vselect-cost.ll
LLVM :: CodeGen/AMDGPU/GlobalISel/smrd.ll
LLVM :: CodeGen/AMDGPU/fp_to_uint.ll
LLVM :: CodeGen/AMDGPU/global-constant.ll
LLVM :: CodeGen/AMDGPU/merge-tbuffer.mir
LLVM :: CodeGen/AMDGPU/smrd.ll
LLVM :: CodeGen/AMDGPU/unhandled-loop-condition-assertion.ll
LLVM :: CodeGen/ARM/build-attributes.ll
LLVM :: CodeGen/ARM/float-helpers.s
LLVM :: CodeGen/ARM/select-imm.ll
LLVM :: CodeGen/ARM/struct_byval_arm_t1_t2.ll
LLVM :: CodeGen/Mips/cconv/arguments-float.ll
LLVM :: CodeGen/Mips/cconv/arguments-hard-float-varargs.ll
LLVM :: CodeGen/Mips/cconv/arguments-hard-float.ll
LLVM :: CodeGen/Mips/cconv/arguments-varargs.ll
LLVM :: CodeGen/Mips/cconv/arguments.ll
LLVM :: CodeGen/Mips/cconv/return-hard-fp128.ll
LLVM :: CodeGen/Mips/cconv/return-hard-struct-f128.ll
LLVM :: CodeGen/Mips/ci2.ll
LLVM :: CodeGen/Mips/countleading.ll
LLVM :: CodeGen/Mips/divrem.ll
LLVM :: CodeGen/Mips/dynamic-stack-realignment.ll
LLVM :: CodeGen/Mips/inlineasm-operand-code.ll
LLVM :: CodeGen/Mips/mips64muldiv.ll
LLVM :: CodeGen/PowerPC/ppc64-crsave.mir
LLVM :: CodeGen/X86/avx-cast.ll
LLVM :: CodeGen/X86/fp-intrinsics.ll
LLVM :: CodeGen/X86/splat-for-size.ll
LLVM :: CodeGen/X86/sse-scalar-fp-arith.ll
LLVM :: CodeGen/X86/vec_shift6.ll
LLVM :: CodeGen/X86/vector-compare-combines.ll
LLVM :: CodeGen/X86/vector-narrow-binop.ll
LLVM :: DebugInfo/COFF/vframe-fpo.ll
LLVM :: DebugInfo/X86/debug-info-static-member.ll
LLVM :: FileCheck/check-empty-tag.txt
LLVM :: FileCheck/dump-input-annotations.txt
LLVM :: FileCheck/no-multi-suffixes.txt
LLVM :: FileCheck/var-scope.txt
LLVM :: MC/AsmParser/expr-shr.s
LLVM :: MC/Mips/relocation-n64.s
LLVM :: MC/Mips/relocation.s
LLVM :: MC/RISCV/compressed-relocations.s
LLVM :: MC/RISCV/relocations.s
LLVM :: MC/RISCV/rv32b-aliases-valid.s
LLVM :: MC/RISCV/rv64b-aliases-valid.s
LLVM :: MC/RISCV/rva-aliases-valid.s
LLVM :: MC/RISCV/rvi-aliases-valid.s
LLVM :: Transforms/InstCombine/double-float-shrink-2.ll
LLVM :: Transforms/LoopFusion/cannot_fuse.ll
LLVM :: tools/llvm-objdump/ELF/dynamic-section-machine-specific.test
LLVM :: tools/llvm-readobj/ELF/RISCV/section-types.test
LLVM :: tools/llvm-readobj/ELF/section-types.test
LLVM :: tools/llvm-readobj/ELF/symbol-binding.test

All of these are variants in some form or another of comments that contain the check prefix, but which are intentionally not actual CHECK lines.

As an update, after lots of fixes from a number of different people (thanks everyone!), the current list of false-positives on ninja check-llvm for the more stringent Gotcha A diagnostic is:

LLVM :: Analysis/CostModel/X86/vselect-cost.ll
LLVM :: CodeGen/AMDGPU/GlobalISel/smrd.ll
LLVM :: CodeGen/AMDGPU/fp_to_uint.ll
LLVM :: CodeGen/AMDGPU/global-constant.ll
LLVM :: CodeGen/AMDGPU/merge-tbuffer.mir
LLVM :: CodeGen/AMDGPU/smrd.ll
LLVM :: CodeGen/AMDGPU/unhandled-loop-condition-assertion.ll
LLVM :: CodeGen/ARM/build-attributes.ll
LLVM :: CodeGen/ARM/float-helpers.s
LLVM :: CodeGen/ARM/select-imm.ll
LLVM :: CodeGen/ARM/struct_byval_arm_t1_t2.ll
LLVM :: CodeGen/Mips/cconv/arguments-float.ll
LLVM :: CodeGen/Mips/cconv/arguments-hard-float-varargs.ll
LLVM :: CodeGen/Mips/cconv/arguments-hard-float.ll
LLVM :: CodeGen/Mips/cconv/arguments-varargs.ll
LLVM :: CodeGen/Mips/cconv/arguments.ll
LLVM :: CodeGen/Mips/cconv/return-hard-fp128.ll
LLVM :: CodeGen/Mips/cconv/return-hard-struct-f128.ll
LLVM :: CodeGen/Mips/ci2.ll
LLVM :: CodeGen/Mips/countleading.ll
LLVM :: CodeGen/Mips/divrem.ll
LLVM :: CodeGen/Mips/dynamic-stack-realignment.ll
LLVM :: CodeGen/Mips/inlineasm-operand-code.ll
LLVM :: CodeGen/Mips/mips64muldiv.ll
LLVM :: CodeGen/PowerPC/ppc64-crsave.mir
LLVM :: CodeGen/X86/avx-cast.ll
LLVM :: CodeGen/X86/fp-intrinsics.ll
LLVM :: CodeGen/X86/splat-for-size.ll
LLVM :: CodeGen/X86/sse-scalar-fp-arith.ll
LLVM :: CodeGen/X86/vec_shift6.ll
LLVM :: CodeGen/X86/vector-compare-combines.ll
LLVM :: CodeGen/X86/vector-narrow-binop.ll
LLVM :: DebugInfo/COFF/vframe-fpo.ll
LLVM :: DebugInfo/X86/debug-info-static-member.ll
LLVM :: FileCheck/check-empty-tag.txt
LLVM :: FileCheck/dump-input-annotations.txt
LLVM :: FileCheck/no-multi-suffixes.txt
LLVM :: FileCheck/var-scope.txt
LLVM :: MC/AsmParser/expr-shr.s
LLVM :: MC/Mips/relocation-n64.s
LLVM :: MC/Mips/relocation.s
LLVM :: MC/RISCV/compressed-relocations.s
LLVM :: MC/RISCV/relocations.s
LLVM :: MC/RISCV/rv32b-aliases-valid.s
LLVM :: MC/RISCV/rv64b-aliases-valid.s
LLVM :: MC/RISCV/rva-aliases-valid.s
LLVM :: MC/RISCV/rvi-aliases-valid.s
LLVM :: Transforms/InstCombine/double-float-shrink-2.ll
LLVM :: Transforms/LoopFusion/cannot_fuse.ll
LLVM :: tools/llvm-objdump/ELF/dynamic-section-machine-specific.test
LLVM :: tools/llvm-readobj/ELF/RISCV/section-types.test
LLVM :: tools/llvm-readobj/ELF/section-types.test
LLVM :: tools/llvm-readobj/ELF/symbol-binding.test

All of these are variants in some form or another of comments that contain the check prefix, but which are intentionally not actual CHECK lines.

Thanks for working on this. That report makes me think the diagnostic is going to be frustrating. What do you think?

Joel

Just to add to this, I have sometimes used a semi-column instead of a column and wondered why it didn’t work. Is that included in gotcha A?

All of these are variants in some form or another of comments that contain the check prefix, but which are intentionally not actual CHECK lines.

Thanks for working on this. That report makes me think the diagnostic is going to be frustrating. What do you think?

This seems small compared to the number of tests, and even smaller compared to the number of total lines of test. It’s also small compared to the number of true positives it has found.

These cases can all be easily reworded to avoid the new diagnostic, and we could even add the preferred spelling to avoid it in the diagnostic itself, i.e. “if you want to avoid this, surround with backticks” (or whatever we decide on).

IMO, that makes it worthwhile even despite a little discomfort.

Jon

Yeah, I’ve been considering that to be the same kind of bug in these measurements/proposals.

All of these are variants in some form or another of comments that contain the check prefix, but which are intentionally not actual CHECK lines.

Thanks for working on this. That report makes me think the diagnostic is going to be frustrating. What do you think?

This seems small compared to the number of tests, and even smaller compared to the number of total lines of test. It’s also small compared to the number of true positives it has found.

These cases can all be easily reworded to avoid the new diagnostic, and we could even add the preferred spelling to avoid it in the diagnostic itself, i.e. “if you want to avoid this, surround with backticks” (or whatever we decide on).

IMO, that makes it worthwhile even despite a little discomfort.

OK. You’ve been looking at it more than I have.

Which of the heuristic proposals are you testing with now?

Any interest in introducing a FileCheck comment syntax? Maybe the regex is ‘\bRUN:|\bCOM:’? “COM:” would likely be clearer than directive name mangling, either when “fixing” comments for this diagnostic or when just trying to disable a directive for debugging, etc.

Joel