Refactoring llvm::SpecialCaseList

Hi folks!

While working on the implementation of "[RFC] Add support for controlling diagnostics severities at file-level granularity through command line", I ended up re-using llvm::SpecialCaseList, and it felt like things could be improved here a little bit to ease maintenance and re-use going forward.

Hence I’d like to perform some NFC refactorings in this area and wanted to see if people have concerns before attempting that.

I believe the major refactoring point is separating the logic that parses a SpecialCaseList input file from the logic that performs matching on the parsed result.

The format is generic, and currently re-used by 3 major components:

All of these use cases parse the special-case-list files using the same logic and then customize bits and pieces of the matching logic (some need line numbers, some don’t, some have very different matching criteria than others, and some want to verify certain information in the parsed format).

As a result, all of these implementations inherit from llvm::SpecialCaseList just to use its parser, and then tweak its matching logic to accommodate their use cases. This has turned the base implementation into a mess that’s really hard to reason about.

Having a shared parser and letting people use some standard matching logic separately should improve simplicity of code here.
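To make the proposed split concrete, here is a minimal sketch of what "parser produces plain data, matcher consumes it" could look like. Every name in it (ParsedEntry, ParsedSection, parseSpecialCaseList, firstMatchLine) is hypothetical and not an existing LLVM API, and it uses exact string comparison instead of globs to stay short. It also assumes entries that appear before any section header go into a default "*" section.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the real LLVM classes.
struct ParsedEntry {
  std::string Prefix;  // e.g. "src"
  std::string Pattern; // e.g. "my_file.cc"
  unsigned LineNo;     // 1-based line number in the input
};

struct ParsedSection {
  std::string Name;
  unsigned LineNo;
  std::vector<ParsedEntry> Entries;
};

// Parser: turns text into plain data and knows nothing about matching.
std::vector<ParsedSection> parseSpecialCaseList(const std::string &Text) {
  std::vector<ParsedSection> Sections;
  std::istringstream In(Text);
  std::string Line;
  unsigned LineNo = 0;
  while (std::getline(In, Line)) {
    ++LineNo;
    if (Line.empty() || Line[0] == '#')
      continue; // skip blank lines and comments
    if (Line.front() == '[' && Line.back() == ']') {
      Sections.push_back({Line.substr(1, Line.size() - 2), LineNo, {}});
      continue;
    }
    auto Colon = Line.find(':');
    if (Colon == std::string::npos)
      continue; // sketch only: silently skip malformed lines
    if (Sections.empty())
      Sections.push_back({"*", LineNo, {}}); // assumed default section
    assert(!Sections.empty());
    Sections.back().Entries.push_back(
        {Line.substr(0, Colon), Line.substr(Colon + 1), LineNo});
  }
  return Sections;
}

// One possible matcher built on top of the parsed data. A client with
// different needs could write its own without touching the parser.
// Returns the line number of the first matching entry, or 0 if none.
unsigned firstMatchLine(const std::vector<ParsedSection> &Sections,
                        const std::string &Section, const std::string &Prefix,
                        const std::string &Query) {
  for (const ParsedSection &S : Sections) {
    if (S.Name != Section)
      continue;
    for (const ParsedEntry &E : S.Entries)
      if (E.Prefix == Prefix && E.Pattern == Query)
        return E.LineNo;
  }
  return 0;
}
```

With this shape, a client that needs line numbers reads them from ParsedEntry, and one that doesn't simply ignores them. Neither has to inherit from the parser.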


The second bit I’d like to change is replacing the usage of StringMaps in the matching logic with BumpPtrAllocators.

Various entities (see llvm/include/llvm/Support/SpecialCaseList.h in llvm/llvm-project) use a StringMap solely to keep strings alive. Afterwards, all the usages of this StringMap actually iterate over all the entries.
Hence I’d like to use a BumpPtrAllocator for keeping strings alive and a std::vector to store & iterate over all the entries.
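A rough sketch of that storage layout follows. So that it builds without LLVM headers, it uses a hypothetical StringArena as a stand-in for the allocator. In-tree code would more likely pair llvm::BumpPtrAllocator with llvm::StringSaver and hand out StringRefs. Entry and EntryList are made-up names as well.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for BumpPtrAllocator + StringSaver (an assumption made so the
// sketch is self-contained). std::deque keeps references to existing
// elements valid when new elements are appended at the end.
class StringArena {
  std::deque<std::string> Storage;

public:
  std::string_view save(std::string_view S) {
    Storage.emplace_back(S);
    return Storage.back();
  }
};

struct Entry {
  std::string_view Pattern; // points into the arena
  unsigned LineNo;
};

class EntryList {
  StringArena Arena;          // owns the string bytes
  std::vector<Entry> Entries; // kept in the order they were added

public:
  EntryList() = default;
  // Copying would leave the views pointing into the source's arena.
  EntryList(const EntryList &) = delete;
  EntryList &operator=(const EntryList &) = delete;

  void add(std::string_view Pattern, unsigned LineNo) {
    Entries.push_back({Arena.save(Pattern), LineNo});
  }

  // Linear scan in insertion order; returns the line number of the first
  // entry equal to Query, or 0 if there is none.
  unsigned match(std::string_view Query) const {
    for (const Entry &E : Entries)
      if (E.Pattern == Query)
        return E.LineNo;
    return 0;
  }
};
```

Since every lookup already walks all entries, the vector loses nothing compared to the map and makes the iteration order explicit.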

This will cause a slight behavior change. Currently sections with the same name are “merged”, e.g.:

[foo]
src:my_file.cc

[bar]
src:your_file.cc

[foo]
src:your_file.cc

will create only a single Section for foo on line 1.

First of all, the "Sanitizer special case list" page in the Clang 20.0.0git documentation doesn’t mention anything about declaring the same section multiple times, so I believe all bets are off here. Moreover, the new implementation will store these as two separate Sections, one for the first foo declaration (line 1) and another for the second one, which seems better.

In terms of matching behavior, we might see changes as well. Previously, with the example file above, all the entries that belong to the foo section could be matched first, hence your_file.cc would match foo. Now it can match against bar instead.

But because we were actually using StringMaps all over the place, it wasn’t guaranteed which entry would match when multiple entries matched a query. Hence, again, I think using std::vector in these places is actually going to make matching behavior more “reasonable” by making sure we match entries in the order the user provided them.
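As an illustration of the ordering argument, the sketch below stores the example file as three separate sections in declaration order and returns the first one that lists a given file. Section, firstSectionFor and exampleSections are hypothetical names. The line numbers assume the blank lines in the example count, and exact string comparison replaces the real glob matching.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of a section; not the real LLVM type.
struct Section {
  std::string Name;
  unsigned LineNo; // line of the "[name]" header
  std::vector<std::string> Files;
};

// Walks sections in the order the user wrote them and returns the first
// one that lists File, or nullptr if no section does.
const Section *firstSectionFor(const std::vector<Section> &Sections,
                               const std::string &File) {
  for (const Section &S : Sections)
    for (const std::string &F : S.Files)
      if (F == File)
        return &S;
  return nullptr;
}

// The example file from this post, with the two foo declarations kept
// as separate sections instead of being merged.
std::vector<Section> exampleSections() {
  return {{"foo", 1, {"my_file.cc"}},
          {"bar", 4, {"your_file.cc"}},
          {"foo", 7, {"your_file.cc"}}};
}
```

Here a query for your_file.cc deterministically lands on bar, because bar is declared before the second foo. With a hash-keyed container, the winner would depend on iteration order rather than on what the user wrote.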


If no one has concerns about these two points, I’d like to start implementing them as soon as https://github.com/llvm/llvm-project/pull/112517 lands.

Oh, I also forgot to mention: there are discussions around using a JSON-based format for special-case lists in "[RFC] Add support for controlling diagnostics severities at file-level granularity through command line". Separating the parser from the matcher should also ensure such a change could happen with far fewer changes and risks.