Get rid of checker name handling boilerplate

I’m fed up with the boilerplate code that manages the checker names and bug types in cases when a single checker class implements multiple checkers.

The problem

(Feel free to skip this section if you’re familiar with the code of checkers that have multiple sub-checkers.)

As a relatively simple example, DivZeroChecker contains the lines

enum CheckKind { CK_DivideZero, CK_TaintedDivChecker, CK_NumCheckKinds };
bool ChecksEnabled[CK_NumCheckKinds] = {false};
CheckerNameRef CheckNames[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BugTypes[CK_NumCheckKinds];

and then the initialization of the bug types performed by ugly scattered lazy blocks like

if (!BugTypes[CK_TaintedDivChecker])
  BugTypes[CK_TaintedDivChecker].reset(
      new BugType(CheckNames[CK_TaintedDivChecker], "Division by zero",
                  categories::TaintedData));

while the checker names are filled manually querying mgr.getCurrentCheckerName() as

checker->CheckNames[DivZeroChecker::CK_TaintedDivChecker] =         
    mgr.getCurrentCheckerName();                                    

in the registerXXX functions.

However, there are much more complex situations, e.g. in MallocChecker where the bug type data members look like

mutable std::unique_ptr<BugType> BT_DoubleFree[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_DoubleDelete;
mutable std::unique_ptr<BugType> BT_Leak[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_UseFree[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_BadFree[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_FreeAlloca[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_MismatchedDealloc;
mutable std::unique_ptr<BugType> BT_OffsetFree[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_UseZerroAllocated[CK_NumCheckKinds];
mutable std::unique_ptr<BugType> BT_TaintedAlloc;

(and their lazy initializations are scattered everywhere in the source file…)

My goals and plans

  • I want to replace all std:unique_ptr<BugType> members with plain BugTypes that are initialized eagerly where they are declared.
  • I want to get rid of direct calls to the magical method mgr.getCurrentCheckerName() and ad hoc data members that store CheckerNameRefs in checker-specific ways.
  • I don’t want to touch the source code of “simple” checker classes that only define a single checker.
  • I won’t perform large-scale refactoring like splitting big fat checker classes into multiple smaller classes – and extrapolating from the history of the analyzer, it’s unlikely that anybody else will do so in the future. The perfect is the enemy of the good; I want practical incremental improvements instead of dreams that will are postponed forever.
  • I’m planning to standardize the practical pattern that checker classes with multiple checkers define the possible CheckKinds as enumerators in an enum that can also be used as array indexes.
  • I’m planning to tweak the constructor of BugType and the registerChecker machinery to interact with this standardized CheckKinds. (The name “CheckKinds” may change.)
  • I may introduce a new base class (e.g. MultiChecker) for checker classes that have multiple sub-checkers; but I’m also thinking about making the plain Checker smart enough to provide either “simple” checkers or multicheckers.

I know about the old (abandoned) commit ⚙ D67336 [analyzer][NFC] Introduce SuperChecker<>, a convenient alternative to Checker<> for storing subcheckers which had similar goals.

However there are several differences between that change and my plans:

  • It introduces a hierarchical model where a SuperChecker has multiple SubCheckers, while in my suggestion the MultiChecker has an array of equally valid checker names (corresponding to different check kinds).
    • In the super/sub checker model, the subcheckers cannot be registered without also registering the superchecker; my plan doesn’t have this kind of limitation.
  • The SuperChecker commit does not help with the clear construction of BugTypes (although it could be extended to do so).

@Szelethus @steakhal @NoQ @Xazax-hun What do you think about this plan? Do you see any fundamental issues, or is it OK if I start working on implementing this?

Im busy this week, but Ill have a look at this next week if I have the time. Ping me if I dont.
Thank you.

It’s completely fine if you answer next week – this is not especially urgent (although I don’t want to postpone indefinitely).

Yes please. I think this should work without any changes to the checker infrastructure. This is a good separate first PR that should be a relatively trivial change. If we have single checkers that are still using the lazy initialization those could potentially be changed as well.

For the rest, I’d need to see the patches to have opinions on them. Do you think the new way of organizing multiple checks would also be useful to split modeling and diagnostics?

Unfortunately this cannot be done separately from the other changes that I’m planning (see below for explanation) – but my plans are straightforward enough that they may be similar to the “relatively trivial change” that you imagine.

The main reason why I need to change the architecture is that the different checker names from the mgr.getCurrentCheckerName() calls (which are performed in the registerXXXChecker() functions) must be transferred to the method BugType::getCheckerName() (which is called e.g. for diagnostics generation). If we want nice, eager initialization of the BugTypes (like const BugType BT_Undef{this, "Garbage return value"}; in ReturnUndefChecker), then all the BugTypes are initialized by the constructor of the checker class, which happens within the first registerXXXChecker() for that checker class. This means that the bug types must be initialized with a pointer to the checker, which they can use later to retrieve the right checker name in a standardized way and the checker class must have a storage area where it can record all the checker names that are registered for it.

For the “simple” checkers (one class, one name) this is already implemented:

  • The checker class has a data member CheckerNameRef Name which is inherited from CheckerBase.
  • CheckerManager::registerChecker() constructs the checker object.
    • The constructor of the checker constructs the BugTypes with the constructor that takes const CheckerBase *Checker as its first parameter (and passes this to these parameters).
      • This Checker pointer is saved as a data member of the BugType.
    • registerChecker saves currentCheckerName in the data member checker->Name.
  • Later, when BugType::getCheckerName() is called, it calls Checker->getName() (where Checker is the data member of the BugType) which returns the data member Name of the checker object.

In checker classes that “cover” multiple names, this cannot work, because the single Name of the (singleton) checker object cannot transfer all the different names; so each such class implements its own procedure, which usually has the following structure:

  • The checker class declares its own enum CheckKind { CK_Foo, CK_Bar, ..., CK_NumCheckKinds }.
  • The checker class also declares its own CheckerNameRef CheckNames[CK_NumCheckKinds]; to hold all the different names;
  • (We still have the inherited generic CheckerNameRef Name but it is less useful, because it’s just one name among the many. I suspect that its use might be buggy in some cases.)
  • In the registerXXX functions EITHER CheckerManager::registerChecker() constructs the checker object OR CheckerManager::getChecker() retrieves the already constructed checker
    • In the constructor we cannot properly initialize the BugTypes (there is no suitable constructor), so we just prepare null unique_ptrs for later lazy initialization.
    • (By the way, registerChecker still saves currentCheckerName in the data member checker->Name.)
  • After calling either getChecker or registerChecker; we save mgr.getCurrentCheckerName() in the array checker->CheckNames (at the element whose index is the current CheckKind).
  • Later, in the middle of the analysis, when we need the BugType, we lazily initialize it with a constructor that takes CheckerNameRef CheckerName as its first parameter.
    • This CheckerName is retrieved from the array CheckerNames.

My plan is that I want to “standardize” the status quo, e.g. in the following manner:

  • MultiCheckers need to declare their own enum CheckKind { ..., CK_NumCheckKinds} which is associated with their checker class (probably as a template parameter).
  • The base class of multicheckers declares a data member Names which is an array (or SmallVector) of CheckerNameRefs.
    • The “simple” checkers may keep the current CheckerNameRef Name or they may end up with a single-element Names array – I don’t know which approach would yield the simpler code.
  • The registerXXX functions call a newly introduced function called registerCheckerPart which takes the CheckKind as an argument, and does the following:
    • Constructs the checker object (like registerChecker) if it hasn’t been constructed yet; otherwise retrieves it (like getChecker).
    • The constructor of the checker constructs all the BugTypes with a newly introduced constructor that takes const CheckerBase *Checker and CheckKind as its first two parameters. (E.g we would be able to declare a bug type like BugType TooMuchSpam{this, CK_SpamIssues, "Too much spam", categories::Spam}; within the checker class.)
    • registerCheckerPart saves mgr.getCurrentCheckerName() in the array Names array of the (either freshly constructed or retrieved) checker object at the element whose index is the current CheckKind (which was passed to registerCheckerPart).
  • Later, when BugType::getCheckerName() is called, it uses a getter method to retrieve Checker->Names[CheckKind] (where Checker and CheckKind are data members of the BugType which store the two first arguments of the newly introduced constructor).

Definitely, I’ll do a search and quickly fix them if there are any.

I wouldn’t say that my plan is a “new way of organizing multiple checks”, I just want to improve consistency and reduce duplicated code within the already existing framework – so I’d say that it’s does not change the difficulty of splitting modeling and diagnostics.

I skipped over the implementation details for the most part as I’m not really in the context or mood right now.

Speaking of the motivation, I share all your pain points.

I wonder however, if there is a better alternative to raw-enum-based dispatch techniques. Have you considered somehow using tag types, or tag types inheriting from the existing base checker class?

Maybe that leads to an even better user interface. I have no clue if that’s the case, and I’m also fine with the current direction btw. A deeper discussion would be warranted once we have the patches of course.

I wonder however, if there is a better alternative to raw-enum-based dispatch techniques. Have you considered somehow using tag types, or tag types inheriting from the existing base checker class?

For this application I really prefer the raw enums, because indentifiers of the subcheckers (the CheckKind values):

  • need to be runtime values (not just template parameters);
  • all CheckKind types must be convertible to a single type T (because we want to store CheckKinds in a data member of BugType);
  • the CheckKind values should be usable as indices of a suitable data structure (e.g. the Names of a checker).

These conditions are trivially satisfied by a raw enum where the values of the enumerators are 0, 1, 2, … (with T = unsigned and plain arrays/vectors). On the other hand, IMO these conditions are specific enough that any other implementation is just an awkward way of implementing an enum type. (“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”)

In general, I really dislike using modern, fancy, tricky C++ constructs in situations where the old straightforward solutions produce clear and concise code. There is no reason to be swept away by the fashions and use something new just to demonstrate that we can.

Of course, I’m not opposed to modern, fancy, tricky solutions that have concrete practical benefits – but I wouldn’t think about calling in the heavy artillery until I hit a roadblock that does require artillery strikes.

Okay okay. I guess once you propose something, I’ll see if I get inspired or not.
I have no problems with enums.

I’m progressing with this project when I have time for it, and I think I’ll be able to publish a concrete (although not necessarily final) commit for review and discussion in the foreseeable future.

It turns out that checker name handling is entwined with checker option handling, so my changes will need to change the checker option handling code – although in the first commit I can leave this with a “FIXME: this should be changed later” comment.

I uploaded a concrete (but not necessarily final) implementation for review at [NFC][analyzer] Framework for multipart checkers by NagyDonat · Pull Request #130985 · llvm/llvm-project · GitHub

Reviews are welcome :grinning_face_with_smiling_eyes: feel free to suggest changes if you think that they would make this framework more clear and useful.