I had initially emailed Yitzhak Mandelbaum about this and he suggested that I bring the discussion to a wider audience. This is a copy/paste of a series of emails I sent Yitzhak.
Even with module support in C++20, I think we’re going to be living with the evils of the preprocessor for quite some time and not just due to legacy code bases.
I think clang-tidy is in a good place to migrate existing code bases away from macro usage and towards a modern replacement. However, writing a preprocessor oriented check brings on an additional level of difficulty for the check author. Namely:
- the preprocessing is decoupled from the construction of the AST
- there isn’t any good way of analyzing code that is “disabled” via conditional compilation (e.g. `#if`) in either a preprocessor-based check or an AST-based check; no AST is constructed for any of the disabled code
- the PP callbacks talk to your code in terms of source locations, but you have no good means of correlating these source locations to AST nodes
- the PP callbacks force you to deal with raw text in order to recognize certain constructs; e.g. there is no PP callback for a `#pragma once` directive. You have to parse the text after the `#pragma` to see if it’s a `#pragma once`. Maybe this is easily solved by adding another callback to PPCallbacks.
- if you’re analyzing the expressions used in conditional compilation directives, you have to lex and parse the conditional expression yourself and build some sort of home-brew AST around it; e.g. `#if defined(FOO) && FOO > 1` just triggers the `PPCallbacks::Defined` method, but doesn’t give you insight into the condition (although you do know whether the whole thing evaluated to true or false)
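To make the last point concrete, here is a minimal, self-contained sketch (plain C++, not the Clang API; all names are hypothetical) of the kind of home-brew expression tree a check author currently has to build for a condition like `defined(FOO) && FOO > 1`:

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// One node in a home-brew AST for preprocessor conditions.
struct CondNode {
  enum Kind { Defined, Compare, And } kind;
  std::string macro;                   // Defined/Compare: the macro name
  std::string op;                      // Compare: ">", "<", "==", ...
  long value = 0;                      // Compare: the integer literal operand
  std::unique_ptr<CondNode> lhs, rhs;  // And: the two subexpressions
};

// Hand-written parser for the tiny grammar
//   primary := "defined" "(" IDENT ")" | IDENT OP NUMBER
//   expr    := primary ("&&" primary)*
struct CondParser {
  std::string s;
  size_t i = 0;
  explicit CondParser(std::string in) : s(std::move(in)) {}

  void skip() { while (i < s.size() && std::isspace((unsigned char)s[i])) ++i; }

  std::string ident() {
    skip();
    size_t b = i;
    while (i < s.size() && (std::isalnum((unsigned char)s[i]) || s[i] == '_')) ++i;
    return s.substr(b, i - b);
  }

  std::unique_ptr<CondNode> primary() {
    auto n = std::make_unique<CondNode>();
    std::string id = ident();
    if (id == "defined") {               // defined(MACRO)
      skip(); assert(s[i] == '('); ++i;
      n->kind = CondNode::Defined;
      n->macro = ident();
      skip(); assert(s[i] == ')'); ++i;
    } else {                             // MACRO > 1, MACRO == 2, ...
      n->kind = CondNode::Compare;
      n->macro = id;
      skip();
      size_t b = i;
      while (i < s.size() && std::strchr("<>=!", s[i])) ++i;
      n->op = s.substr(b, i - b);
      skip();
      n->value = std::strtol(s.c_str() + i, nullptr, 10);
      while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;
    }
    return n;
  }

  std::unique_ptr<CondNode> expr() {
    auto lhs = primary();
    skip();
    while (i + 1 < s.size() && s[i] == '&' && s[i + 1] == '&') {
      i += 2;
      auto n = std::make_unique<CondNode>();
      n->kind = CondNode::And;
      n->lhs = std::move(lhs);
      n->rhs = primary();
      lhs = std::move(n);
      skip();
    }
    return lhs;
  }
};
```

Every preprocessor-oriented check that cares about conditions ends up re-inventing something like this, which is exactly the duplication a shared facility would eliminate.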
As a check author, it would be nice if I had a DSL that allowed me to write matchers against preprocessor actions. In particular, it would be useful to have some sort of AST for the conditions in `#if`/`#elif`-style blocks so that I could apply deeper analysis to conditionally compiled code.
If I’m writing a check that replaces a macro with something else (`constexpr` constant, `inline` function, etc.), then I need to analyze the AST nodes containing all the places where the macro is expanded.
From PPCallbacks I get a SourceLocation of where the expansion occurs, but I don’t get any associated AST. I have to match `translationUnitDecl()`s and visit all the nodes in the AST looking for matching source locations.
With existing matchers you can write something that matches the expansion of a specific macro, but if you don’t know the macro a priori (because you’re monitoring via PPCallbacks), you can’t register the matcher appropriately.
So… to address these shortcomings, I think we need a new piece of infrastructure that bridges these gaps. Some of these can be solved (perhaps inelegantly) with the existing preprocessor/lexer/parser/sema dance, but especially the second point above (analyzing conditionally disabled code) can’t be solved without some changes to those classes (if it’s even possible). Currently, tools like cppcheck just reprocess the file multiple times with different defines in order to touch all the branches, but I suspect there’s something better that could be done.
My little brainstorm last night was some sort of DSL that let me match preprocessor actions and an AST for preprocessor conditions that I could match with the DSL. Less clear is some sort of DSL for associating AST nodes with macro expansion locations.
[cont’d in a later email]
I was thinking about this a little more last night, and what you want to see when snooping the preprocessor is a tree like this:
```
#include "foo.h"
#ifndef FOO_H
  #define FOO_H
  #include "bar.h"
  #ifndef BAR_H
    #define BAR_H
    #define BAR_USES_GRONK 1
    #if defined(BAR_USES_GRONK)
      #if BAR_USES_GRONK
        #define GRONK_VERSION 6
      #endif
    #endif
  #endif
#endif
```
Instead of individual notifications about each directive, you want to process the whole translation unit and get back a tree of preprocessing directives against which you can match with a pattern.
Snooping the individual directives forces you to build lots of complex state machine goop that is bespoke for each preprocessor oriented check that looks beyond a single directive in isolation.
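As a rough illustration of what a tree-building client might do, here is a self-contained sketch (plain C++, not the Clang API; `#elif`/`#else` handling is elided and all names are hypothetical) that folds a flat, in-order stream of directive events into nested blocks:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One node in the directive tree: the directive text plus everything
// nested inside it (for #if-style directives).
struct DirectiveNode {
  std::string text;
  std::vector<DirectiveNode> children;
};

// Fold a flat stream of directive events (what PPCallbacks delivers one
// at a time) into a tree. #if/#ifdef/#ifndef open a nested block; #endif
// closes it. (#elif/#else are omitted for brevity.) The stack only ever
// holds the chain of currently open blocks, and we only append to the
// deepest one, so the stored pointers stay valid.
DirectiveNode buildTree(const std::vector<std::string> &events) {
  DirectiveNode root{"<translation unit>", {}};
  std::vector<DirectiveNode *> open{&root};  // stack of unclosed blocks
  for (const auto &e : events) {
    if (e.rfind("#endif", 0) == 0) {
      open.pop_back();
    } else if (e.rfind("#if", 0) == 0) {     // matches #if, #ifdef, #ifndef
      open.back()->children.push_back({e, {}});
      open.push_back(&open.back()->children.back());
    } else {
      open.back()->children.push_back({e, {}});
    }
  }
  return root;
}
```

Run over the directive stream from the example above, this produces exactly the nesting shown there, with `#define GRONK_VERSION 6` four conditional levels deep.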
For the bug “Identify suspicious preprocessor macro usage”, we want to look for things like:

```
#define USE_FOO 0
// ...
#if defined(USE_FOO)
// ...
#endif
```
This is suspicious because we’ve defined an object-like macro to a value, but later we only check if it is defined. A false positive should not be issued for code like:
```
#define USE_FOO 0
// ...
#if defined(USE_FOO)
// ...
#if USE_FOO
// ...
#endif
#endif
```
Here, “// …” means that arbitrarily large amounts of intervening code could appear (which is exactly what makes this bug difficult to find in your code), including additional conditional preprocessor blocks.
Here you see that we need the entire context of the preprocessor conditional blocks; trying to identify them without the whole tree of conditionals easily involves writing one-off state machines that won’t help any other check.
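As a sketch of what this check boils down to once the global view is available, here is a deliberately simplified, self-contained version (plain C++, hypothetical names; nesting and real token handling elided) that flags value-defined macros whose only later use in conditionals is a `defined(...)` test:

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Flag object-like macros that are #define'd to a value but that only ever
// appear in a defined(X) test -- never in a value test like "#if X".
std::vector<std::string> findSuspicious(const std::vector<std::string> &lines) {
  std::set<std::string> valueDefined, definedTested, valueTested;
  for (const auto &l : lines) {
    std::istringstream in(l);
    std::string directive, operand;
    in >> directive >> operand;
    if (directive == "#define") {
      std::string value;
      if (in >> value)                        // "#define X 0": has a value
        valueDefined.insert(operand);
    } else if (directive == "#if") {
      if (operand.rfind("defined(", 0) == 0)  // "#if defined(X)"
        definedTested.insert(operand.substr(8, operand.size() - 9));
      else                                    // "#if X"
        valueTested.insert(operand);
    }
  }
  std::vector<std::string> suspicious;
  for (const auto &m : valueDefined)
    if (definedTested.count(m) && !valueTested.count(m))
      suspicious.push_back(m);
  return suspicious;
}
```

The real check needs the conditional tree to do this correctly (a value test nested inside the `defined` test must suppress the warning), but even this flat version shows that the analysis is inherently whole-translation-unit, not per-callback.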
It’s made worse by the fact that the `#elif` callbacks only give you a source range around the condition. The preprocessor has already parsed the source range into a stream of tokens and evaluated the combined expression to tell you in your callback the result of the condition (true, false, skipped IIRC). However, if your goal is to inspect deeper into the conditional expressions, you have to re-lex the condition and parse out the expression evaluation again yourself.
An immediate first step that seems worthwhile is to enhance the preprocessor to invoke a callback for `#if`/`#elif` that passes along the stream of tokens parsed out of the conditional expression, to avoid duplicating that work.
Following that it seems worthwhile to create a PPCallbacks implementation that builds the tree I describe at the start of this email. From there you can traverse the tree on the EndOfMainFile notification to perform a more global analysis of preprocessor directives.
[cont’d in a later email]
I’ve hacked up something that builds a tree of preprocessing directives and I threw it into Phabricator as a point of discussion: D118711
This code is ugly and doesn’t really follow LLVM/Clang’s way of doing things (e.g. `dyn_cast`), but it serves as an object of discussion.