[RFC] Support/GlobPattern: add operator to invert matches

Intro

Hi everyone, I wanted to get some feedback on a feature that expands the functionality of the Glob pattern matcher.

Problem

Currently, GlobPattern has no way to exclude specific terms from a match.

Solution

Add an “inverted match” operator: {!foo}. It matches any string that is not exactly "foo".

Note that this operator follows the existing brace expansion syntax (i.e., {foo,bar} matches either foo or bar).

Details

Prepend a ! to a term within curly brackets and the whole bracket expansion becomes “inverted”. Once one term is inverted, they all must be – this is enforced through pattern compilation errors.

This operator respects its surroundings just like other brace expressions: the text before and after the braces must still match.

The pattern: test{!_debug} will match against strings starting with test but not ending with _debug.

So, it will match test, test_qaz, test_qux but not test_debug or missionimpossible.

You can supply multiple inverted terms.

the_{!dog,!cat}_can_fly will match against the_bird_can_fly or the_alligator_can_fly but not the_dog_can_fly or the_cat_can_fly.
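The rules above can be modeled in a few lines. This is only a sketch of the intended semantics in Python, not the code in the glob-negation tree. It assumes at most one brace group per pattern, treats everything outside the braces as literal text (no * or ?), and takes {!foo} to exclude only an exact match of the term, as described under Solution. The function name compile_inverted_glob and the error messages are hypothetical.

```python
def compile_inverted_glob(pattern):
    """Compile a pattern with at most one {...} group into a matcher.

    Assumption: text outside the braces is literal, so wildcards such
    as * and ? are not modeled in this sketch.
    """
    open_idx = pattern.find("{")
    if open_idx == -1:
        return lambda text: text == pattern
    close_idx = pattern.find("}", open_idx)
    if close_idx == -1:
        raise ValueError("unmatched '{' in pattern")

    prefix, suffix = pattern[:open_idx], pattern[close_idx + 1:]
    terms = pattern[open_idx + 1:close_idx].split(",")
    inverted = [t.startswith("!") for t in terms]

    # Once one term is inverted, all of them must be.
    if any(inverted) and not all(inverted):
        raise ValueError("cannot mix inverted and plain terms in one brace group")

    if not any(inverted):
        # Plain brace expansion: {foo,bar} matches foo or bar.
        accepted = {prefix + t + suffix for t in terms}
        return lambda text: text in accepted

    excluded = {t[1:] for t in terms}

    def match(text):
        # The surrounding literal text must still match.
        if len(text) < len(prefix) + len(suffix):
            return False
        if not (text.startswith(prefix) and text.endswith(suffix)):
            return False
        middle = text[len(prefix):len(text) - len(suffix)]
        # The part covered by the braces must not be an excluded term.
        return middle not in excluded

    return match


matches = compile_inverted_glob("the_{!dog,!cat}_can_fly")
print(matches("the_bird_can_fly"), matches("the_dog_can_fly"))
```

Rejecting mixed groups such as {!a,b} at compile time keeps the meaning of a group unambiguous: it is either a list of accepted terms or a list of excluded ones.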

Implementation

See my tree glob-negation.

This tree also includes a code snippet that uses the new operator to exclude certain types from arithmetic overflow checks.

For example, you can disable unsigned integer overflow sanitization for all types except size_t with an ignorelist.txt:

[unsigned-integer-overflow]
type:{!size_t}
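To make the effect of that entry concrete, here is a small Python model. It is an illustration only, not the sanitizer's implementation: it assumes a type matched by the entry is skipped by the check, it handles only entries whose whole pattern is an inverted brace group, and the helper name is_ignored and the sample type list are hypothetical.

```python
def is_ignored(type_name, entry):
    """Return True if an entry such as 'type:{!size_t}' ignores type_name.

    Only the whole-pattern inverted form '{!a,!b}' is modeled here.
    """
    prefix, _, pattern = entry.partition(":")
    if prefix != "type" or not (pattern.startswith("{") and pattern.endswith("}")):
        raise ValueError("this sketch only handles 'type:{...}' entries")
    terms = pattern[1:-1].split(",")
    if not all(t.startswith("!") for t in terms):
        raise ValueError("this sketch only handles inverted terms")
    excluded = {t[1:] for t in terms}
    # Every type except the excluded ones matches, and is therefore ignored.
    return type_name not in excluded


types = ["size_t", "unsigned int", "uint64_t"]
still_checked = [t for t in types if not is_ignored(t, "type:{!size_t}")]
print(still_checked)
```

Under these assumptions size_t is the only type that stays instrumented, which is the behavior the entry is meant to express.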

Another way to accomplish this is to introduce “allowlists” in the sanitizer space. Currently, there is no allowlisting available. However, I have a sanitize-allowlist tree which aims to implement that.

Conclusion

Adding an inverted match operator can usefully expand the functionality of the GlobPattern class and of those that use it.


@efriedma-quic @ellishg @vitalybuka

If we decide to support this inverted match extension, it probably should not be allowed to occur more than once in a pattern; otherwise matching could easily end up with a high time complexity.

(Notes about recent changes to GlobPattern.cpp: I optimized the base algorithm in D156046, "[Support] Rewrite GlobPattern", reducing the time complexity to O(|S|*|P|). The brace expansion extension has exponential time complexity, so we use GlobPattern::create(Pattern, /*MaxSubPatterns=*/1024) in llvm/lib/Support/SpecialCaseList.cpp.)
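The growth is easy to see in a toy model. The Python sketch below is not the GlobPattern code; it assumes non-nested brace groups, expands them into plain sub-patterns, and stops once a cap is exceeded. The function name expand_braces and the error messages are hypothetical; only the 1024 default mirrors the MaxSubPatterns value quoted above.

```python
from itertools import product


def expand_braces(pattern, max_subpatterns=1024):
    """Expand non-nested {a,b} groups into a list of plain sub-patterns."""
    parts = []  # each entry is a list of alternatives for one segment
    i = 0
    while i < len(pattern):
        if pattern[i] == "{":
            end = pattern.find("}", i)
            if end == -1:
                raise ValueError("unmatched '{' in pattern")
            parts.append(pattern[i + 1:end].split(","))
            i = end + 1
        else:
            nxt = pattern.find("{", i)
            if nxt == -1:
                nxt = len(pattern)
            parts.append([pattern[i:nxt]])  # literal segment, one alternative
            i = nxt

    # The number of sub-patterns is the product of the group sizes,
    # so it grows exponentially with the number of groups.
    total = 1
    for alternatives in parts:
        total *= len(alternatives)
        if total > max_subpatterns:
            raise ValueError("too many brace expansions")

    return ["".join(combo) for combo in product(*parts)]


print(expand_braces("lib{a,b}_{x,y}.cpp"))
print(len(expand_braces("{a,b}" * 10)))
```

Ten two-way groups already reach the cap, and an eleventh would be rejected. An inverted group cannot be expanded into a finite list like this, which is one reason allowing it several times in a pattern raises complexity concerns.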

In IRPGO we resolve this problem by using glob patterns to either match positively or negatively. If we use skip or forbid then the pattern is used to block instrumentation. If allow is used then we allow instrumentation instead. I believe something like this should cover the cases you mentioned and might even be more powerful.

https://clang.llvm.org/docs/UsersManual.html#instrumenting-only-selected-files-or-functions

However, I do see the value since it can greatly simplify common cases like test{!_debug} that you mentioned.

Should this be type:size_t=allow then?

I agree that we can extend the Sanitizer special case list syntax (see the Clang 20.0.0git documentation) to support inverted matches.

I feel nervous about adding an inverted match feature to glob, as {!foo} is rarely seen in the wild. Bash extended glob has !(pattern-list), which matches anything except one of the given patterns, but it is used very little.

I agree that {!foo} is a weird and not-often-used syntax. I was trying to match the existing syntax that GlobPattern supports for other brace expansions. Anyways, I will implement type:...=allow for SCLs with a special focus on the arithmetic overflow sanitizers being supported since this is my main goal.

Thanks for the valuable discussion on this RFC. I hope to open a PR soon :slight_smile: