Closing the gap between the preprocessor and the AST

I had initially emailed Yitzhak Mandelbaum about this and he suggested that I bring the discussion to a wider audience. This is a copy/paste of a series of emails I sent Yitzhak.

Even with module support in C++20, I think we’re going to be living with the evils of the preprocessor for quite some time and not just due to legacy code bases.

(As an aside, lately I’ve been working on these issues: “Create modernize-enum-from-macros check” and “Identify suspicious preprocessor macro usage”, which have me thinking a lot about the preprocessor.)

I think clang-tidy is in a good place to migrate existing code bases away from macro usage and towards a modern replacement. However, writing a preprocessor oriented check brings on an additional level of difficulty for the check author. Namely:

  1. preprocessing is decoupled from the construction of the AST
  2. there isn’t any good way of analyzing code that is “disabled” via conditional compilation (e.g. #if) in either a preprocessor-based check or an AST-based check; no AST is constructed for any of this disabled code.
  3. the PP callbacks talk to your code in terms of source locations, but you have no good means of correlating these source locations to AST nodes
  4. the PP callbacks force you to deal with raw text in order to recognize certain constructs, e.g. there is no PP callback for a #pragma once directive. You have to parse the text after the #pragma to see if it’s a #pragma once. Maybe this is easily solved by adding another callback to PPCallbacks.
  5. If you’re analyzing the expressions used in conditional compilation directives, you have to lex and parse the conditional expression yourself and build some sort of home-brew AST around it, e.g. #if defined(FOO) && FOO > 1 just triggers the PPCallbacks::If and PPCallbacks::Defined methods, but doesn’t give you insight into the condition (although you do know whether the whole thing evaluates to true or false). See the sketch after this list.
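
To make item 5 concrete, here is roughly everything the existing interface hands you (a minimal sketch against the clang::PPCallbacks interface as it exists today, to the best of my knowledge):

  #include "clang/Lex/PPCallbacks.h"
  #include "clang/Lex/Token.h"

  class ConditionSnooper : public clang::PPCallbacks {
  public:
    // Fired for an #if: a location, a source range covering the raw
    // condition text, and the evaluated result. No structured view of
    // the expression is provided.
    void If(clang::SourceLocation Loc, clang::SourceRange ConditionRange,
            ConditionValueKind ConditionValue) override {}

    // Fired for each defined(X) in a condition, with no link back to
    // the enclosing #if.
    void Defined(const clang::Token &MacroNameTok,
                 const clang::MacroDefinition &MD,
                 clang::SourceRange Range) override {}
  };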

As a check author, it would be nice if I had a DSL that would allow me to write matchers against the preprocessor actions. In particular it would be useful to have some sort of AST for the conditions in
#if/#elif style blocks so that I could apply deeper analysis to conditionally compiled code.
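
To make that concrete, something like this purely hypothetical matcher DSL is what I have in mind (none of these preprocessor matchers exist today; all the names are invented):

  // Hypothetical: match an #if whose condition tests defined(FOO) but
  // never consults FOO's value -- the suspicious pattern described below.
  auto Matcher = ppIfDirective(
      hasCondition(definedExpr(macroName("FOO"))),
      unless(hasDescendant(macroValueUse(macroName("FOO")))));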

If I’m writing a check that replaces a macro with something else (constexpr constant, inline function, etc.) then I need to analyze the AST nodes that contain all the places where the macro is expanded.
From PPCallbacks I get a SourceLocation of where the expansion occurs, but I don’t get any associated AST. I have to match translationUnitDecl()s and visit all the nodes in the AST looking for
matching source locations.

With existing matchers you can write something that matches the expansion of a specific macro, but if you don’t know the macro a priori (because you’re monitoring via PPCallbacks), you can’t register the matcher appropriately.
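
For a macro known a priori, the existing narrowing matcher isExpandedFromMacro is what I mean, e.g.:

  using namespace clang::ast_matchers;

  // Matches any statement expanded from FOO; the macro name is just a
  // string, but it has to be supplied when the matcher is built.
  auto Matcher = stmt(isExpandedFromMacro("FOO"));

The name can be any runtime string, but the matcher must be registered before the translation unit is processed, which is exactly the chicken-and-egg problem described above.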

So… to address these shortcomings, I think we need a new piece of infrastructure that bridges these gaps. Some of these can be solved (perhaps inelegantly) with the existing preprocessor/lexer/parser/sema dance, but especially #2 above can’t be solved without some changes to those classes (if it’s even possible). Currently, tools like cppcheck just reprocess the file multiple times with different defines in order to touch all the branches, but I suspect there’s something better that could be done.

My little brainstorm last night was some sort of DSL that let me match preprocessor actions and an AST for preprocessor conditions that I could match with the DSL. Less clear is some sort of DSL for associating AST nodes with macro expansion locations.

[cont’d in a later email]

I was thinking about this a little more last night, and what you want to see when snooping the preprocessor is a tree like this:

  #include "foo.h"
  #ifndef FOO_H
    #define FOO_H
    #include "bar.h"
    #ifndef BAR_H
      #define BAR_H
      #define BAR_USES_GRONK 1
      #if defined(BAR_USES_GRONK)
        #if BAR_USES_GRONK
          #define GRONK_VERSION 6
        #endif
      #endif
    #endif
  #endif

Instead of individual notifications about each directive, you want to process the whole translation unit and get back a tree of preprocessing directives against which you can match with a pattern.
Snooping the individual directives forces you to build lots of complex state machine goop that is bespoke for each preprocessor-oriented check that looks beyond a single directive in isolation.
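
As a sketch, the nodes of such a tree might look something like this (illustrative names only, not an existing Clang API):

  #include <memory>
  #include <string>
  #include <vector>

  struct PPDirectiveNode {
    enum class Kind { Include, Define, Undef, If, Ifdef, Ifndef, Elif, Else, Pragma };
    Kind DirectiveKind;
    std::string Text;  // raw text of the directive itself
    // Directives nested inside this one, e.g. everything guarded by an #if:
    std::vector<std::unique_ptr<PPDirectiveNode>> Children;
  };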

For the bug “Identify suspicious preprocessor macro usage”, we want to look for things like

  #define USE_FOO 0
  // ...
  #if defined(USE_FOO)
    // ...
  #endif

This is suspicious because we’ve defined an object-like macro to a value, but later we only check if it is defined. A false positive should not be issued for code like:

  #define USE_FOO 0
  // ...
  #if defined(USE_FOO)
    // ...
    #if USE_FOO
      // ...
    #endif
  #endif

Here “// …” means that arbitrarily large amounts of intervening code could appear (which is exactly what makes this bug difficult to find in your code) – including additional conditional preprocessor blocks.

Here you can see that we need the entire context of the preprocessor conditional blocks; trying to identify these patterns without the whole tree of conditionals easily devolves into writing one-off state machines that
won’t help any other check.

It’s made worse by the fact that the #if/#elif callbacks only give you a source range around the condition. The preprocessor has already lexed the source range into a stream of tokens and evaluated the combined expression in order to tell your callback the result of the condition (true, false, or skipped, IIRC). However, if your goal is to inspect deeper into the conditional expressions, you have to re-lex the condition and re-parse the expression yourself.
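
Today, the best a check can do is recover the raw text from that range and lex it again itself, something like:

  #include "clang/Basic/LangOptions.h"
  #include "clang/Basic/SourceManager.h"
  #include "clang/Lex/Lexer.h"

  // Recover the raw condition text from the range handed to the If
  // callback; tokenizing and parsing it is then entirely up to the check.
  llvm::StringRef conditionText(clang::SourceRange Range,
                                const clang::SourceManager &SM,
                                const clang::LangOptions &LangOpts) {
    return clang::Lexer::getSourceText(
        clang::CharSourceRange::getTokenRange(Range), SM, LangOpts);
  }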

An immediate first step that seems worthwhile is to enhance the preprocessor to invoke a callback for #if/#elif that passes along the stream of tokens lexed from the conditional expression, to avoid duplicating work.
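
Something along these lines, say (hypothetical; no such overload exists today):

  class PPCallbacks {
    // ...
    // Hypothetical new overload: the condition tokens the preprocessor
    // already lexed, handed to the callback so checks need not re-lex.
    virtual void If(clang::SourceLocation Loc,
                    clang::SourceRange ConditionRange,
                    ConditionValueKind ConditionValue,
                    llvm::ArrayRef<clang::Token> ConditionTokens);
  };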

Following that it seems worthwhile to create a PPCallbacks implementation that builds the tree I describe at the start of this email. From there you can traverse the tree on the EndOfMainFile notification to perform a more global analysis of preprocessor directives.

[cont’d in a later email]

I’ve hacked up something that builds a tree of preprocessing directives and I threw it into Phabricator as a point of discussion: D118711

This code is ugly and doesn’t really follow LLVM/Clang’s way of doing things (e.g. dyn_cast), but it serves as an object of discussion.


This might be useful: D118788 [Experimental] String-injection consteval metaprogramming.

I’m imagining this:

You could convert the preprocessor’s actions into implicitly generated consteval metaprograms, with string injection statements. (Whether it would be best to do this during the initial parsing via additional hooks into Parser/Sema, or afterward while processing your PP Tree, I can’t say.)

#defines would become declarations – for arg-taking macros, you would generate consteval const char *-returning functions which accept const char * arguments; otherwise you would generate a const char *-typed static const variable declaration. (Oh one other thing you’d need is a consteval string concatenation expression, to represent A##B - but that has already been implemented elsewhere and could easily be added.)

#ifs etc. would be turned into ordinary if etc. statements within an enclosing consteval {} metaprogram. Within each branch would be a string injection statement, or series of them, taking the content of that branch as its input, with any macros represented as variables or function calls within.
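
For instance, here is a rough sketch of how a simple conditional might lower, taking __inj as the string-injection statement from D118788 and handwaving exactly how the macro’s value is consulted:

  // Written source:
  //
  //   #define USE_FOO 1
  //   #if USE_FOO
  //     int foo_enabled;
  //   #endif
  //
  // Lowered form (sketch):
  static constexpr const char *USE_FOO = "1";
  consteval {
    if (/* evaluation of USE_FOO's token stream */ true) {
      __inj("int foo_enabled;");
    }
  }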

A tool intended to check the AST of alternate preprocessor paths could alter the definition of certain of the (#define-generated) variables, then call a TreeTransform override to ActOn the relevant MetaprogramDecls again, which would result in new code being parsed into (i.e. generating AST nodes in) their enclosing context, if any of that makes sense.

To be sure, string injection as a language feature has gotten a bad rap, precisely due to its similarity to the preprocessor; but in this case that similarity seems like a good thing: it provides an accurate means of representing preprocessor content in the AST and of mimicking preprocessor actions well after preprocessing has finished, via the ordinary Sema processing of such nodes.

Yitzhak pointed me to this paper “SuperC: Parsing All of C by Taming the Preprocessor” which provides a feasible mechanism for implementing a solution to #2 above. They have a user guide for SuperC as well.

I’ve forked a copy of the code from a ZIP file on their web site into a repository on github. SuperC is just one application of their extensible compiler framework. Their framework – and SuperC – are all written in Java.

The paper is an interesting read and the mechanisms they describe appear to be feasible for implementing in clang. The behavior of parsing all the branches of conditional compilation into the AST would be something you would opt-in and not something that you would want enabled by default. There’s no benefit for regular compilation in doing the additional parsing, but as the clangd folks have been discussing with their parsing needs for IDE fixes, you need this additional work if you want automated refactoring tools like clang-tidy to be able to apply fixes across conditional compilation markers.

If we were to explore such a mechanism for libtooling, I’m thinking we’d want to work by adapting the ideas in the paper to the existing lexer, preprocessor and parsing code in clang without directly imitating their implementation. I’m not sure what the LLVM policy is on incorporating ideas in such papers into the code base; presumably such efforts have been done before in compiler passes or something?

Honestly, the Preprocessor should live in the AST.

Clang-Format is constantly struggling with the preprocessor IIRC, and I personally am doing some work in the Preprocessor and it’s extremely confusing.


Honestly, the Preprocessor should live in the AST.

That’s essentially the approach taken by the SuperC paper (PDF at the link I provided earlier), except that it takes a middle ground: rather than representing all preprocessing tokens in the AST, it expands most macros while preserving a flattened conditional for all the conditionally compiled blocks of code.

I think what we have now is what is needed to support compilation, but falls short of other needs like formatting, static analysis and refactoring. The preprocessor is also a very hot path since every character of every transitively included source file needs to go through it.

I think we need to keep this “hot path” minimally impacted by any improvements made to support the other use cases.


I have a bit of trouble following that paper, largely because there are no examples of the pseudo-AST it would produce from a given input.

That seems like the most productive way to discuss this: what sort of AST dump would we like to see for a given input?

I’m interested to hear other possibilities, but (to expand on my earlier remarks) I imagine the easiest-to-work-with-and-understand preprocessor representation in the AST is a combination of the following:

  1. static constexpr VarDecls for non-parameterized #defines
  2. “preprocessor metaprogram” nodes specific to various contexts (PPMetaprogramDecl, PPMetaprogramStmt, PPMetaprogramExpr, PPMetaprogramParmVarDecl, PPMetaprogramTemplateParameter/Argument, PPMetaprogramCXXBaseSpecifier, etc.)
  3. Ordinary IfStmts/ElseStmts within those metaprograms.
  4. PPStringInjectionStmts allowed directly within metaprograms or within…
  5. consteval FunctionDecls for parameterized macro defs, whose CallExprs appear within metaprograms.

This would maximize the use of existing AST nodes. We probably don’t want to have to handle e.g. a PreprocessorIfStmt that has no relation to an ordinary IfStmt.
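
As a strawman answer to the AST-dump question upthread, #define N 4 under this scheme might dump as something like (entirely hypothetical output):

  |-VarDecl <macro N> implicit N 'const char *' static constexpr cinit
  | `-StringLiteral 'const char[2]' lvalue "4"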

This AST infra is already efficiently implemented in D118788, except for the additional Metaprogram kinds – only declaration and statement contexts are implemented (both via MetaprogramDecl).

On the matter of efficiency, while it is not necessary to perform initial preprocessing via Sema processing of these AST nodes, I actually bet it would be at least as fast to do it that way. Certainly the memory usage would be less and the SourceLocation understandability would be greater: in D118788 there is no copying of the “expansions” (strings to be injected) to new buffers, as the clang Preprocessor does for macros, which in extreme cases can result in so many allocations as to require 64-bit SourceLocations to index into them ([RFC] Clang SourceLocation overflow - #10 by davrec).

The trickiest thing to represent in an easily understandable way would be #undefs. (Also unbalanced delimiters in conditionals or expansions, but those are less common, and so hacky-looking ASTs might be more justifiable in such cases.)

If we represent #defines as constant-evaluated declarations, #undefs would present the exact same challenge as trying to change the value of a constexpr variable. The only good approach would be to allow for, and specially handle where necessary, mutability in these implicit compile-time variables, though I am unclear on whether that would introduce additional problems.


Exactly what I was thinking in regards to formatting, clang-format has to jump through so many hoops to get around macros living outside of the AST.

Good point about performance btw; I think a good baseline is: if a macro is seen 4(?) or more times, it should be added to the AST.

Speaking of #undefs, there are also the push_macro and pop_macro pragmas, plus the redefine_macro pragma I’m working on right now, so macro history NEEDS to continue being supported and not become a forgotten edge case.

More brainstorming on this:

The best solution (probably the only solution) to handling non-constness of macro “declarations” in a consteval context is to just encompass all the preprocessor semantics in one big implicit MetaprogramDecl at the start of the translation unit, within which macro definitions are local variables. This is pretty much Legalize’s original “preprocessor tree” idea (the branches of the tree would be defined by ordinary IfStmts), except with full preprocessor semantics also represented within the tree.

By keeping all the preprocessor code in a single compound statement, we can easily represent changes in macro definitions via changes in variable assignments. And, this consolidation would provide better semantics by keeping the metaprogram lookup scope entirely local/isolated from the rest of the program. Not to mention it would be cleaner and would introduce less churn by not interfering with the current AST structure.

The AST links between the two would be the important part. Each generated Decl would have a pointer to the meta-Stmt which generated it, a la InstantiatedFrom for template-instantiated declarations. We could hold links in the opposite direction too: each meta-Stmt could hold begin_generated_nodes and end_generated_nodes pointers indicating the range of AST nodes it generated; this would allow tools to traverse the preprocessor metaprogram and the “generated” AST in parallel, in sync with the state of each.
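
In code, those links might look something like this (illustrative only):

  // On each generated Decl: the meta-Stmt that produced it,
  // a la InstantiatedFrom for template instantiations.
  class PPMetaStmt;
  struct GeneratedDeclInfo {
    PPMetaStmt *GeneratedFrom = nullptr;
  };

  // On each meta-Stmt: the range of Decls it generated, enabling
  // parallel traversal of the metaprogram and the generated AST.
  struct GeneratedNodeRange {
    clang::Decl *BeginGeneratedNodes = nullptr;
    clang::Decl *EndGeneratedNodes = nullptr;
  };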

Another issue: there would probably need to be a single __macro_type encompassing all possible parameterizations (FOO(A,B), FOO(A), FOO), to avoid type conflicts on macro redefinitions. A special variadic void function pointer type callable during constant evaluation would probably do. Push/pop etc. could then be implemented with builtins operating on this type.

Rough example:

Written translation unit:

#define FOO(m,n) void bar##m##n()

FOO(1,a);

#undef FOO
#define FOO(NAME) void NAME()

#ifdef FOO
  FOO(foo);
#endif

#pragma push_macro("FOO")
#undef FOO
#define FOO nonsense
#pragma pop_macro("FOO")

FOO(bar);
void nonMacroCode() { }

Implicit representation:

// Builtins
consteval void __inj(const char *, ...); // (instructs compiler to parse args)
consteval const char * __concatenate(const char *, const char *); 
using __macro_type = void (*)(...);
consteval void __undef(__macro_type &M) { M = nullptr; }
consteval void __push_macro(const __macro_type &M);
consteval void __pop_macro(__macro_type &M);
// __defined(ID) := ID exists and ID != nullptr
// ...

//*** 1st Decl in TU: Implicit preprocessor metaprogram ***//
consteval {
  __macro_type FOO;
  FOO = [](...){
    const char *m = /*fetch from va_args*/, *n = /*fetch from va_args*/;
    __inj("void ", __concatenate(__concatenate("bar", m), n), "()");
  };

  FOO("1","a") //(A)
  __inj(";");

  __undef(FOO);
  FOO = [](...){
    const char *NAME = /*fetch from va_args*/;
    __inj("void ", NAME, "()");
  };

  if (__defined(FOO)) {
    FOO("foo"); //(B)
    __inj(";");
  }

  __push_macro(FOO);
  __undef(FOO);
  FOO = [](...) { __inj("nonsense"); };
  __pop_macro(FOO);
  FOO("bar"); //(C)
  __inj("; void nonMacroCode() { }");  //(D)
}

//*** Remaining Decls: Generated AST ***//
// void bar1a(); //GeneratedFrom = A
// void foo();   //GeneratedFrom = B
// void bar();   //GeneratedFrom = C
// void nonMacroCode() { } // GeneratedFrom = D

One little nitpick: the point of _Pragma("redefine_macro(\"FOO, nonsense\")") is that it can be used inside function-like macros.

Honestly, the Preprocessor should live in the AST.

I’m not at all convinced this is correct. Modeling the preprocessor directives as an AST (separate from the Clang AST) seems plausible and may be worth experimenting with in a fork, as there are benefits from such an approach (like AST matchers over the preprocessor). However, the preprocessor is on the hot path for the compiler and there is massive potential to regress compile-time performance with such a change (not to mention how incredibly invasive that change is likely to be within the compiler). It may be possible, with sufficient memoization of macro expansions, etc., that we get somewhat comparable performance out of an AST-driven preprocessor, but this alternative preprocessor would have to be hidden behind compiler flags and we’d have to be pretty aggressive as a community at tearing out one preprocessor or the other once it was clear which was preferred (maintaining two parallel preprocessors long-term is too big of a burden on the community).

Modelling the preprocessor AST as part of the Clang AST (instead of as a separate AST) seems really difficult to me. We need to resolve the preprocessor actions in order to make the correct AST nodes – consider things like int array[N] where N is a macro; that impacts template instantiation, overload resolution, etc. Also, the model for pragmas in Clang is that the preprocessor parses the pragma directive and then inserts annotation tokens into the parsing stream that are then handled at parse time instead of preprocessor time to complete the pragma action (such as OpenMP pragmas creating special AST nodes), so I don’t see how that gets modeled via an AST integrated with Clang’s semantic AST.

Responding to @LegalizeAdulthood earlier in the thread:

Even with module support in C++20, I think we’re going to be living with the evils of the preprocessor for quite some time and not just due to legacy code bases.

Clang is a C compiler too, and with that in mind, not everyone considers the preprocessor to be evil… so I agree, the preprocessor is going nowhere (in fact, it’s gained new features in both C and C++ for their upcoming releases).

My little brainstorm last night was some sort of DSL that let me match preprocessor actions and an AST for preprocessor conditions that I could match with the DSL. Less clear is some sort of DSL for associating AST nodes with macro expansion locations.

FWIW, my suggestion is to leave the Clang preprocessor alone and form this DSL around the existing preprocessor functionality if at all possible. It’s going to be far more likely to succeed by adding incremental features to the existing preprocessor facilities than it is to rework the way the preprocessor works in the entirety of Clang.

To be clear it is not necessary to interfere with the Preprocessor to represent it in the AST. (I did say I bet doing preprocessing in Sema would be at least as fast, but that’s far from certain, and I probably shouldn’t have introduced the confusion by mentioning it at all – moving away from the current preprocessor structure is just never going to be worth the trouble, you are 100% right on that point.)

At a superficial level Preprocessor AST nodes could be added which were akin to sugar Type nodes - helpful information, but having no semantic effect whatsoever. This could be done via PPCallbacks so that the PP AST is only produced when a particular flag was on; that would already be a huge step forward.

However, it would also be nice if these nodes did have integrated semantics, so that AST transformations caused by changes in #defines could be modeled using the same sort of TreeTransform overrides and perhaps other techniques the static analyzer folks use, instead of special hand-rolled solutions.

For example, we could model certain #defines as dependent variables, and a translation unit as an “instantiation” of a given set of those variables, storing information on the dependencies within the AST as we do for template-instantiated declarations. Tools could choose to keep certain #defines dependent during parsing, and instantiate them in various ways later, using the preprocessing semantics embedded in the AST nodes.

If this is desirable – it isn’t a big need at present for me, but I can see how it would be for others – the necessary Parser/Sema infrastructure to perform preprocessing is already available and encapsulated in the MetaprogramDecl and StringInjectionStmt nodes in D118788. I.e. the implicit AST nodes you would need are already there. All that would remain is to decide on the best user-facing AST structure.

Sounds like there are a few proposals here:

  1. Traversable data structures for PP stuff: tools definitely miss this. There are a few libraries covering subsets: clang-tidy has utils/IncludeInserter, and there’s tooling::syntax::TokenBuffer. These build on PPCallbacks. I think this is a great place to start, and I would suggest choosing a scope narrower than “the preprocessor”, e.g. conditional compilation directives. Pseudoparser’s PPStructure is related (but a real multi-file parse is sufficiently different that it doesn’t make sense to actually share, IMO).

  2. Unifying PP structure with the AST, metaprograms, etc.: maybe I’m being dense, but is this possible in general? E.g., what’s the tree for:

#define H ) + (
int x = (2H1);

Such terrible code may be rare, though I’m sure there are more common variants. I don’t think we can ban it.

And I don’t think we’d be in a better state if clang used a new, AST-based internal model plus all of the complexity of the existing PP facilities as a fallback for the cases where that model failed.
There’s a lot of value in recognizing the common cases (e.g. a conditional block that forms exactly a set of declarations at namespace scope), but sadly, given the language, it’s too hard IMO for this to be the fundamental model; it needs to be an opt-in best-effort layer on top.

:+1:, though a bit sad about it

Aaron, have you looked at the “SuperC” paper that I linked earlier? They don’t represent every preprocessor directive as part of the AST, only the conditional compilation blocks where they “flatten” out the conditional in a way so as to deal with nested conditionals. The paper claims good performance and I put their code up on github (also linked earlier). It’s all written in Java and is more of a use case for their compiler infrastructure tool than it is specifically targeted at the C preprocessor. (They have other use cases they address with their tool as well.)

In their use case, they were only interested in exploring the combination of blocks of code that can be generated by conditional compilation. They didn’t care about macro expansion or anything else the preprocessor does, except insofar as it impacts the condition of a conditionally selected block of code.

A refactoring tool certainly cares a great deal more about the preprocessor directives and how they appear in the source files. I’m guessing clang-format cares, but not as much as a refactoring tool.

The compiler cares even less than clang-format (which is currently the problem; everything is optimized for the level of attention needed by the compiler).

I’m not a particular fan of “modes” in an API, but perhaps what is needed is to extend the existing preprocessor with “modes of interest” so a client like clang-format or clang-tidy can request more detailed retention of preprocessing directives in the resulting AST, without unduly impacting the compiler.

As for a preprocessing AST for performing more sophisticated matching and analysis of preprocessor directives, I think that can be grafted on top of the existing PPCallbacks API, but it still suffers from the unfortunate situation that the preprocessor skips – without any sort of tokenizing or interpretation – disabled conditional compilation blocks. So even just grafting a PPAST on top of the PPCallbacks API, it’s still going to be incomplete without getting more feedback from the preprocessor. It can certainly be started that way, but it’s going to fall short for any reasonable refactoring use case (and possibly the formatting use case as well).

This latter part is what I’ve been hacking on when I have time (which isn’t much lately LOL).


I haven’t followed this discussion closely, but it may be that some of what is desired is already provided by the -detailed-preprocessing-record option? (See Options.td in the llvm-project repository on GitHub.)
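
For reference, with that flag enabled the preprocessor keeps a clang::PreprocessingRecord that can be iterated after the fact; a minimal sketch:

  #include "clang/Lex/PreprocessingRecord.h"
  #include "clang/Lex/Preprocessor.h"
  #include "llvm/Support/Casting.h"

  // Walk the retained record of macro definitions/expansions and
  // inclusion directives. Note: skipped conditional blocks are still
  // absent from the record.
  void walkRecord(clang::Preprocessor &PP) {
    if (clang::PreprocessingRecord *Rec = PP.getPreprocessingRecord())
      for (clang::PreprocessedEntity *E : *Rec)
        if (auto *Exp = llvm::dyn_cast<clang::MacroExpansion>(E))
          (void)Exp->getSourceRange(); // each expansion's source range
  }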

If you look at the implementation of the preprocessor it advances the text pointer over entire blocks of conditionally compiled code that is “disabled”. They never even hit the tokenizer. This is (part of) the problem.
