Format of special case list for sanitizers

The documentation for the sanitizer special case list format[0] says,

The meanining of * in regular expression for entity names is different - it is treated as in shell wildcarding.

In SpecialCaseList::parse, we see that this is just replacing * with .*:

    // Replace * with .*
    for (size_t pos = 0; (pos = Regexp.find("*", pos)) != std::string::npos;
         pos += strlen(".*")) {
      Regexp.replace(pos, strlen("*"), ".*");
    }

This seems to introduce more problems than it solves, since (i) this doesn’t really behave like a shell globbing wildcard as advertised, and (ii) if the user tries to use * as a regex quantifier, this will match incorrectly: A* matches the empty string and any number of As, while A.* matches all strings that start with at least one A.

If it’s forgivable to break compatibility here, we should do regular expressions _or_ shell globbing, and not a hybrid format.

I’d prefer shell globbing for paths in src entities, but that isn’t as useful for function names. Most filenames will contain periods, which also need to be escaped properly as regular expressions. (This also limits the usefulness of treating literals separately.)

(Just a note: the way that regular expressions are concatenated in ::parse appears to have a bug if a pattern contains a pipe.)

Ryan

0: Sanitizer special case list — Clang 16.0.0git documentation

diff --git a/lib/Support/SpecialCaseList.cpp b/lib/Support/SpecialCaseList.cpp
index c312cc1..2972cb1 100644
--- a/lib/Support/SpecialCaseList.cpp
+++ b/lib/Support/SpecialCaseList.cpp
@@ -133,7 +133,7 @@ bool SpecialCaseList::parse(const MemoryBuffer *MB, std::string &Error) {
     // Add this regexp into the proper group by its prefix.
     if (!Regexps[Prefix][Category].empty())
       Regexps[Prefix][Category] += "|";
- Regexps[Prefix][Category] += "^" + Regexp + "$";
+ Regexps[Prefix][Category] += "^(" + Regexp + ")$)";
   }
   return true;
}

Hi Ryan,

Alexey Samsonov <vonosmas@gmail.com> writes:

Hi Ryan,

The documentation for the sanitizer special case list format[0] says,

> The meanining of * in regular expression for entity names is different
- it is treated as in shell wildcarding.

In SpecialCaseList::parse, we see that this is just replacing * with .*:

    // Replace * with .*
    for (size_t pos = 0; (pos = Regexp.find("*", pos)) != std::string::npos;
         pos += strlen(".*")) {
      Regexp.replace(pos, strlen("*"), ".*");
    }

This seems to introduce more problems than it solves, since (i) this
doesn’t really behave like a shell globbing wildcard as advertised, and
(ii) if the user tries to use * as a regex quantifier, this will match
incorrectly: A* matches the empty string and any number of As, while A.*
matches all strings that start with at least one A.

If it’s forgivable to break compatibility here, we should do regular
expressions _or_ shell globbing, and not a hybrid format.

I agree that the current format description is misleading, e.g. "foo*bar"
will also match "fooxxx/yyy/zzzbar", which might be unexpected for user
expecting a shell globbing. For now we should at least change documentation
to reflect that. In retrospective, replacing "*" with ".*" doesn't look like
a good idea at all, and for simplicity I'd prefer to just use regular
expressions. Adding special case for "file paths" isn't nice:
1) as you mention, we have to do careful escaping
2) at the moment special case list format is generic and is not tied to "src"
or "fun" entities, it's specific sanitizers that introduce logic on top of
it. I'd prefer to treat all special case list entries in a similar way.

However, I'm really afraid of breaking compatibility :frowning: I know several users
of blacklist that already use "*" meaning ".*", and it would be challenging
to migrate them - e.g. you have to keep two different blacklist files for
older and newer Clang...

TBH, this doesn't seem like that big of a deal to me. The "*" behaviour
is strange and confusing, and I'd expect most people to use sanitizers
mostly from their newer compiler if they use multiple - they work
better. Compiling with older compilers is important for a lot of use
cases, but I can't see why people would run sanitizers from two
different versions of the compiler at the same time.

The documentation for the sanitizer special case list format[0] says,

> The meanining of * in regular expression for entity names is different -
it is treated as in shell wildcarding.

In SpecialCaseList::parse, we see that this is just replacing * with .*:

    // Replace * with .*
    for (size_t pos = 0; (pos = Regexp.find("*", pos)) !=
std::string::npos;
         pos += strlen(".*")) {
      Regexp.replace(pos, strlen("*"), ".*");
    }

This seems to introduce more problems than it solves, since (i) this
doesn’t really behave like a shell globbing wildcard as advertised, and
(ii) if the user tries to use * as a regex quantifier, this will match
incorrectly: A* matches the empty string and any number of As, while A.*
matches all strings that start with at least one A.

If it’s forgivable to break compatibility here, we should do regular
expressions _or_ shell globbing, and not a hybrid format.

This really doesn't seem compelling enough to merit breaking compatibility,
especially since there are already users in the wild.

-- Sean Silva

Alexey Samsonov wrote:

Hi Ryan,

    The documentation for the sanitizer special case list format[0] says,

     > The meanining of * in regular expression for entity names is
    different - it is treated as in shell wildcarding.

    In SpecialCaseList::parse, we see that this is just replacing * with .*:

         // Replace * with .*
         for (size_t pos = 0; (pos = Regexp.find("*", pos)) !=
    std::string::npos;
              pos += strlen(".*")) {
           Regexp.replace(pos, strlen("*"), ".*");
         }

    This seems to introduce more problems than it solves, since (i) this
    doesn’t really behave like a shell globbing wildcard as advertised,
    and (ii) if the user tries to use * as a regex quantifier, this will
    match incorrectly: A* matches the empty string and any number of As,
    while A.* matches all strings that start with at least one A.

    If it’s forgivable to break compatibility here, we should do regular
    expressions _or_ shell globbing, and not a hybrid format.

I agree that the current format description is misleading, e.g.
"foo*bar" will also match "fooxxx/yyy/zzzbar", which might be unexpected
for user expecting a shell globbing. For now we should at least change
documentation to reflect that. In retrospective, replacing "*" with ".*"
doesn't look like a good idea at all, and for simplicity I'd prefer to
just use regular expressions. Adding special case for "file paths" isn't
nice:
1) as you mention, we have to do careful escaping
2) at the moment special case list format is generic and is not tied to
"src" or "fun" entities, it's specific sanitizers that introduce logic
on top of it. I'd prefer to treat all special case list entries in a
similar way.

However, I'm really afraid of breaking compatibility :frowning: I know several
users of blacklist that already use "*" meaning ".*", and it would be
challenging to migrate them - e.g. you have to keep two different
blacklist files for older and newer Clang...

FWIW, I think ASAN is going to have more users tomorrow than today, and the longer we wait the more painful it will be to change. The sooner we make the change, the easier it'll be and the less time we'll have to live with the wart. That said, I don't know how to arrange for the switch, maybe add a -fold-format flag which converts the * to .*

Well, suppose you build your project with Clang 3.6 and have
my/project/asan_blacklist.txt
checked in your VCS. Now, if you will decide to switch to Clang 3.7 you'd
have to *atomically*
upgrade the contents of my/project/asan_blacklist.txt. If some developers
are still using Clang 3.6,
you'd have to keep the older version there, and do magic tricks in the
build system so that it
would choose between two versions of asan_blacklist.txt. I'm not saying
it's impossible
(things will be easier for you if you use default blacklist, for instance),
it's just painful.

You will also not be able to see the problem instantly - your old blacklist
will be treated
differently, and suddenly some functions blacklisted long time ago would be
silently un-blacklisted,
triggering long-closed bugs.

-fsanitize-old-blacklist-format, which Nick suggests below, is an option.
However, that would
also require passing different flags with different compiler versions. And,
again, we could decide
to add more blacklist features in the future. Or GCC people could implement
some version of it...

Another option is to add versioning to the file format (we had to do smth.
similar for coverage
file format, for instance). E.g. if old file was: