[RFC] A C++ pseudo parser for tooling

Could an index be persisted to disk, reloaded at startup

(Clangd does have this too, FWIW)

and incrementally changed as the user edits?

User edits mean we need to parse new code.
Absent a pseudoparser, the only thing that parses code is clang.
Clang needs a full AST, so this step is slow.

What I meant is that clang could reparse only the code the user has
edited, rather than reparsing the entire source file from scratch.
I believe rust-analyzer does exactly this for Rust source, and my
understanding is that it is considered fast enough.

Yes, you could do this, but not with clang as the parser.

If it were modified to efficiently persist the salient parser state (including the partial AST) at checkpoints every few lines during translation (see https://en.wikipedia.org/wiki/Persistent_data_structure), and were then able to serialize that history and reload from the checkpoint just above a modification, then it wouldn’t have to start from scratch every time. Again, I would spend some time thinking about how pre-compiled headers and modules work.
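
Very roughly, the shape I have in mind is something like this (a sketch only, with hypothetical types; none of these names exist in clang today):

    #include <iterator>
    #include <map>
    #include <memory>

    // Hypothetical: an immutable snapshot of everything the parser needs to
    // resume (token position, scope stack, partial AST). Sharing structure with
    // earlier snapshots keeps each one cheap, as in a persistent data structure.
    struct ParserState { /* token position, scopes, partial AST ... */ };

    class CheckpointedParser {
      // Snapshots keyed by the source offset at which they were taken.
      std::map<unsigned, std::shared_ptr<const ParserState>> Checkpoints;

    public:
      // Called every few lines during the initial parse.
      void checkpoint(unsigned Offset, std::shared_ptr<const ParserState> S) {
        Checkpoints[Offset] = std::move(S);
      }

      // After an edit at Offset, resume from the last checkpoint strictly
      // before it instead of reparsing the whole translation unit.
      std::shared_ptr<const ParserState> resumeFrom(unsigned Offset) const {
        auto It = Checkpoints.lower_bound(Offset);
        if (It == Checkpoints.begin())
          return nullptr; // no checkpoint precedes the edit: parse from scratch
        return std::prev(It)->second;
      }
    };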

To be clear, the folks making this proposal are pretty closely connected to how PCH and modules work - I think it’s fair to assume they’ve spent quite a while thinking about how these things work and could be applied.

Yeah, we’ve been spending years of effort on improving these things (and will continue to), and on applying them in clangd – clangd snapshots the initial #includes of the main file (called the preamble), and uses this preamble to build the AST of the main file (building the AST with an up-to-date/stale preamble is fast).
However, it doesn’t solve everything: the initial build of the preamble can be extremely slow (tens of seconds, or minutes); cross-file refactorings might want to edit an unopened file; and there are use cases where it is impractical to have a real compilation environment set up.
These are the areas we aim to improve with a pseudo parser.
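
To illustrate what the preamble is, here’s a made-up main file (splitLines is just a placeholder):

    // main.cpp, as clangd sees it
    #include <string>   // The leading run of #includes/#defines is the
    #include <vector>   // "preamble": clangd snapshots it and reuses that
                        // snapshot across edits instead of reparsing it.

    // Everything below the preamble is re-parsed (against the cached preamble)
    // as the user edits the file.
    std::vector<std::string> splitLines(const std::string &Text);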

there are use cases where it is impractical to have a real compilation environment set up.

To me, this seems like it’ll be the biggest win here.

Figuring out how to build the software, and then how to get a compilation database out of the build (if that’s even possible), in order to set up indexing/etc…for every OSS codebase I simply want to explore the code of…it is, indeed, completely impractical. Anything beyond “git clone $X” and opening the resultant files in an editor is not gonna happen.

So, overall, I’ve continued to make do with the primitive text-based tools, because those work universally. If the more advanced tools could work at least as well without needing any additional per-project setup, but then get progressively better if you do end up providing more information, that seems awesome. If you can make it work. :slight_smile:

Hi,

We’d like to propose a pseudo-parser which can approximately parse C++ (including broken code). It parses a file in isolation, without needing headers, compile flags etc. Ambiguities are resolved heuristically, like clang-format. Its output is a clang::syntax tree, which maps the token sequence onto the C++ grammar.
Our motivation comes from wanting to add some low latency features (file outline, refactorings etc) in clangd, but we think this is a useful building block for other tools too.

This is quite interesting and exciting!

This started landing and I don’t see a clear acceptance of the premise here. In fact, this project runs counter to several key design principles of clang. I only noticed it because of the LLVM Weekly mention.

If this were a new LLVM target, there would be much more discussion and yet this project appears to be of similar scale. It’s not completed, yet is being committed piecemeal. The patch reviews aren’t broadly covering key clang contributors.

3 Likes

(Sorry for the late response @alexr, the reply didn’t hit our inbox after email->discourse migration)

This started landing and I don’t see a clear acceptance of the premise here

Yes. Some RFCs seem to attract discussion of the details but the process doesn’t really drive towards an accept/deny conclusion. We understood this to mean it wasn’t terribly controversial as a peripheral part of the clang tooling libraries (not clang itself). Maybe this was wrong?

In fact, this project runs counter to several key design principles of clang

This sounds interesting/important, can you elaborate? I don’t think these were raised during the RFC.

If this were a new LLVM target, there would be much more discussion and yet this project appears to be of similar scale

Compared to a new target, it has much less impact on existing code/tools - there are no plans for this to be part of clang (the binary) or existing pieces of LLVM to depend on it.
Do you think it would be better if this kind of effort lived in clang-tools-extra/? The reason it’s under clang/ is to align with the existing Tooling/ libraries and because clang-tools-extra generally doesn’t contain libraries today.

It’s not completed, yet is being committed piecemeal

AIUI this is the preferred approach.

The patch reviews aren’t broadly covering key clang contributors.

This is a good point. We’ve gotten high-level design feedback from Richard Smith, @chandlerc, and Yitzhak Mandelbaum, but I don’t think they have the interest/bandwidth to review the implementation. Are there people interested in doing this?

1 Like

Clang has specific support for faster parsing modes to handle the needs of IDE integration. Syntax coloring being an example. There was an explicit intent to avoid the prior trap of having different [and disagreeing] parsers for different IDE features.

This project is large enough to have a maintenance cost for everyone who uses the monorepo. What folder it appears in doesn’t really change that.

Only for projects where there’s enough of a clear path and clear objectives. The only proposal here is “we want to write a new parser.” What is the final objective? What features does this enable? Why can’t Clang itself be used? If the problem is parsing performance, where are the bottlenecks in the current design of Clang that justify writing an entirely new parser, essentially from scratch?

I’d really like to hear from somebody like Doug Gregor, John McCall, Michael Spencer, etc. The comments have all come from the same company, which would naturally be in support of the work of peers.

1 Like

I should add that I’m aware of projects that have used Clang to parse incomplete code (e.g. skipping transitive includes) with success at scraping GitHub. Imagine an anti-plagiarism tool as a motivating example.

Clang has specific support for faster parsing modes to handle the needs of IDE integration.

We’re aware of:

  • using clang PCH for incremental parsing. Yes, this technique works pretty well, and we use it heavily in clangd. However it is not suitable for all purposes: initial PCH creation is too slow for large TUs (>10 seconds), and it requires build system integration.
  • single-file parse mode: this is far too inaccurate for our purposes.

Are there others that we have missed?

There was an explicit intent to avoid the prior trap

Can you link to where this was made explicit?

Only for projects where there’s enough of a clear path and clear objectives. The only proposal here is “we want to write a new parser.” What is the final objective? What features does this enable? Why can’t Clang itself be used? If the problem is parsing performance, where are the bottlenecks in the current design of Clang that justify writing an entirely new parser, essentially from scratch?

I think most of these questions are addressed in the Introduction and Scope sections of the design doc.

Unlike the clang parser (a precise C++ parser), this parser pursues different goals: parsing C++ code without build system integration; handling broken/partial C++ code well; and being a faster tool for processing C++ source.
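
As a concrete sample of the input we care about (made up, and deliberately ill-formed), a pseudo parser should still recover a usable outline from something like:

    #include "missing_header.h"   // not resolvable without the real include paths

    class Widget {
    public:
      void resize(Size S);        // `Size` is undeclared here; guess it's a type
      int area() const { return w * h;   // missing `}`: recover and keep going
    };

    void Widget::draw() {         // the outline should still show Widget::draw()
      paint(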

Extending clang is an option, but we don’t think it is a good idea:

  • The clang AST unavoidably expresses semantic information. For example, it can’t represent references without knowing the referents, and it describes whether a name resolves to a global or member variable.
    Clang establishes these invariants by performing semantic analysis and by discarding invalid code.
    We have worked on weakening invariants in a targeted way (see RecoveryExpr) - this approach will not scale to the whole AST.
  • The clang parser is deeply coupled to the semantic analysis that is being skipped, and would have to be massively restructured to make this optional. Moreover, a precise C++ parser must rely on full semantic analysis to make parsing decisions (see the example after this list), while an accurate single-file parser must be able to pursue multiple parse branches. These don’t seem easy to reconcile.
  • From the implementation side, modifying clang to achieve similar goals would be invasive, complex and risky.
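
To make the “parsing decisions need semantic analysis” point concrete, here is the classic ambiguity, written as two small compilable readings of the same token sequence:

    // Reading 1: `a` names a type, so the tokens `a * b ;` declare `b`.
    struct a {};
    void f1() { a * b; }   // declaration of a local variable `b`, of type `a*`

    // Reading 2: `a` names a variable, so the identical tokens are an expression.
    namespace alt {
    int a, b;
    void f2() { a * b; }   // expression statement: multiply `a` by `b`
    }

Clang commits to one reading by asking Sema what “a” names; a standalone parser has to guess from context or carry both branches forward.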

I’d really like to hear from somebody like Doug Gregor, John McCall, Michael Spencer, etc.

+1, would like to hear thoughts on it, @rjmccall (sorry, don’t know discourse handles for others). It would be especially helpful to hear from people active both on Clang and tooling applications, @AaronBallman

I should add that I’m aware of projects that have used Clang to parse incomplete code (e.g. skipping transitive includes) with success at scraping GitHub. Imagine an anti-plagiarism tool as a motivating example.

That’s interesting to know (we might have missed that during the investigation before the project), can you give a concrete project? Anti-plagiarism seems like it may require significantly less precision than e.g. indexing or refactoring.

Not familiar with the specifics beyond the Introduction and Scope you linked, but to sing a familiar tune: how invasive would it be to introduce any necessary hooks into the Parser, so that this could be represented as a tool which overrides the Sema behavior you don’t want?

Actually in this case even hooks may not be necessary: clang’s semantic analysis is triggered by calls to Sema::ActOn* within Parser, but since the Parser and Sema libraries are independent, Parser doesn’t know or care what Sema::ActOn* does, only what it returns. So, might it be possible to create a separate .cpp of dummy ActOn implementations which return empty AST nodes, or nodes constructed with the bare minimum info relied on by the Parser, and include that in a tool’s sources in place of clangSema?

The Introduction/Scope makes a few other points beyond semantic analysis which might also be significant, but the Sema link seems like the most important.

FWIW, I’m also concerned about adding an additional parser. The goal of being able to parse partial or broken code is one we’ve wanted to support before (like in clang-format), but is really difficult when you consider just how many C and C++ extensions exist in the wild. Both C and C++ are being continually updated, so this secondary parser is likely to be a pretty significant maintenance burden. Also, things like pragmas are a bit interesting – the preprocessor handles part of that parsing functionality and then inserts annotation tokens into the token stream so that the real parser can finish parsing; is the intention to reimplement the preprocessor as well? Once I start to think about things like template instantiations, concept satisfaction, etc, I get even more worried.

Clang currently has some ability to retain information in the AST for erroneous constructs; ideally we could use something similar to that so that we can reuse as much of the parser and semantics engine as we can.

1 Like

Actually in this case even hooks may not be necessary: clang’s semantic analysis is triggered by calls to Sema::ActOn* within Parser, but since the Parser and Sema libraries are independent, Parser doesn’t know or care what Sema::ActOn* does, only what it returns. So, might it be possible to create a separate .cpp of dummy ActOn implementations which return empty AST nodes, or nodes constructed with the bare minimum info relied on by the Parser, and include that in a tool’s sources in place of clangSema?

Sharing the Parser code is an obvious approach, but it is infeasible:

  • The clang parser is designed around being able to gather semantic information (from Sema) to make forward progress, e.g. an identifier must be resolved definitively (by Sema::classifyName). In a pseudo parser, by contrast, identifiers are resolved heuristically (e.g. inferred from the context). The parser would need to be redesigned to satisfy this. Moreover, many methods in Sema have contracts that are hard or impossible to satisfy with a dummy implementation.
  • The clang parser is coupled to the clang AST. The clang AST is designed to fully express C++ semantics rather than the syntax, which is not suitable for our purposes: we need to describe all grammatical constructs.

FWIW, I’m also concerned about adding an additional parser. The goal of being able to parse partial or broken code is one we’ve wanted to support before (like in clang-format), but is really difficult when you consider just how many C and C++ extensions exist in the wild. Both C and C++ are being continually updated, so this secondary parser is likely to be a pretty significant maintenance burden.

This is a fair point. We expect that adding new features to the pseudo parser will be relatively cheap, since it mostly means modifying the grammar (which is already provided by the C++ standard).

Also, things like pragmas are a bit interesting – the preprocessor handles part of that parsing functionality and then inserts annotation tokens into the token stream so that the real parser can finish parsing; is the intention to reimplement the preprocessor as well?

No, we don’t intend to reimplement a real preprocessor. Parsing pragmas is not in our initial plan – pragma contents will just be represented as a list of raw tokens in a directive. If there is great demand for more precise parsing, someone could add a sub-grammar for them.

Once I start to think about things like template instantiations, concept satisfaction, etc, I get even more worried.

These are not in the scope of the pseudo parser.

Clang currently has some ability to retain information in the AST for erroneous constructs; ideally we could use something similar to that so that we can reuse as much of the parser and semantics engine as we can.

Yeah, the clang AST does have some facilities (the invalid bit on Decl, RecoveryExpr) for broken code, but they are ad-hoc and extremely incomplete.

Based on our previous work on RecoveryAST, scaling the whole clang AST to cover erroneous constructs is not feasible. Those efforts were extremely limited in scope compared to what we need here:

  • one category of node (Expr) rather than all categories. (And one that tends to affect a small blast radius vs e.g. Type or Decl).
  • the weak semantics were limited to one new node kind (RecoveryExpr) and thus its parents (Expr), rather than all node kinds
  • threw away most syntax/semantics information (preserving only subexpressions and possibly a type; see the small example below)
  • recovered from errors in only the most common cases, rather than every case

Nevertheless this relatively small enhancement required many changes across clang and consumers of its AST, including tens of crashes due to violated invariants that spanned multiple release cycles.
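
For a concrete (deliberately broken) example of what that scope means in practice:

    int takeTwo(int, int);

    int test(int X) {
      // `undeclaredFn` doesn't resolve, so no ordinary CallExpr can be built.
      // With recovery, the call is kept as a RecoveryExpr that preserves the
      // argument subexpressions (X and takeTwo(X, 1)), typically with an
      // unknown/dependent type: the "subexpressions and possibly type" above.
      return undeclaredFn(X, takeTwo(X, 1));
    }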

So in a sense the Parser already has some semantics baked in; the division isn’t quite so clean. And a patch to clarify that border would be very invasive, so a full rewrite really is the best way to totally remove Sema dependencies. I understand.

I support your effort given the first point but have to disagree strongly on this point. Clang represents almost all syntax in the AST, so much so that the few frustrating exceptions should probably be considered bugs.

Among these are, as you mention, int a, b; being represented as two separate decls; also, if I recall correctly, some attributes are moved around silently during parsing, and of course macros aren’t represented. (Any others? Can’t be more than a handful.)

It would be ideal if these could be fixed in the AST, as they are a common source of frustration (Q: how do I get such-and-such info from the AST? A: you can’t, it’s lost during parsing). But given the validity of your first rationale this is an entirely separate matter.

I support this effort and would love to see fuzzy parser experiments upstream. The main reason is that Clang already provides some IDE functionality (e.g. clangd, but even before that). And some (most?) proprietary IDEs have tiered parsers, i.e. they start with something primitive but fast, so that go-to-definition and some other features work within milliseconds of opening a new file (at least for the most common scenarios), and they start a full-fledged, more precise parser in the background and fall back to the results from that parser after it is done.

If Clang wants to provide all the language related facilities for IDEs, I think a fuzzy parser would fit this use case pretty well and it has the potential of significantly improving the user experience when working on big projects.

1 Like

Sure there can :slight_smile: I agree we should try to reduce the gaps, but I don’t think the clang AST’s ad-hoc space-saving design is really compatible with getting to 100% and staying there.

Some completely missing:

  • no distinction between foo() the expression and foo(); the statement, so semicolons are not tracked. (This ties to Expr being a subclass of Stmt). (Deliberate, presumably to save memory)
  • QualifiedTypeLoc does not retain the location of const. (Deliberate, presumably to save memory; see the small illustration after this list.)
  • Many nodes are missing “less-critical” punctuation or other tokens, such as commas in FunctionTypeLoc, the lparen in decltype(auto), public in class T : virtual public S {}
  • explicit function/variable template instantiations: there’s no node that carries the locations of the explicit instantiation at all. (bug? class templates are OK IIRC)
  • Representation of broken code is lossy due to lack of redundancy. e.g. node bounds are calculated from token locations, so a missing ) means RParenLoc is null and we don’t know where the node ends. (Tradeoff of the design)
  • Representation of broken code is missing due to forcing invariants. e.g. vector<invalid, 2> is represented as BuiltinType: int. (Consequence of AST being semantics-first)
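
A small illustration of the first two bullets (my own made-up example):

    void g();

    void f() {
      g();                 // In the AST this is just a CallExpr; the trailing ';'
                           // has no node of its own, so "the statement" and "the
                           // expression" cover the same source range.

      const char *p = "";  // Both declarations produce a const-qualified pointee
      char const *q = "";  // type, but QualifiedTypeLoc stores no location for
                           // the `const` keyword, so the AST can't say where (or
                           // on which side of `char`) it was written.
    }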

Some where important abstractions like “range” break down, leaving you to crawl the whole AST:

  • declarators and decl-specifier-sequences are not represented directly in the AST. Technically the information is present in the TypeLocs, but to analyze the syntax you need to reconstruct these syntactic parts, which is extremely fiddly.
  • AttributedTypeLoc’s range is often attribute’s range (excluding the attributed type)
  • the range of Decls with attributes often excludes the attribute

Clang represents almost all syntax in the AST

Clang has pointers from semantics → tokens that cover most tokens (I think <90%), and this is valuable.

However there’s more to representing the syntax than this. One useful test is whether you can implement a tree walk over (range, interpretation) pairs such that ranges nest properly. This is effectively impossible with clang’s AST (It’s highly desirable and obvious, and AFAIK nobody’s managed to do it. I’d love to be proven wrong!)
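
Here’s a generic sketch of that test, with hypothetical node/range fields rather than real clang types (the point is only the invariant being checked):

    #include <string>
    #include <vector>

    // Hypothetical syntax node: a source range plus an interpretation
    // ("CallExpr", "declarator", ...), with children in source order.
    struct Node {
      unsigned Begin, End;          // half-open range of token indices
      std::string Interpretation;
      std::vector<Node> Children;
    };

    // The invariant a syntax-first tree should satisfy: every child's range is
    // contained in its parent's, and sibling ranges don't overlap.
    bool rangesNestProperly(const Node &N) {
      unsigned Prev = N.Begin;
      for (const Node &C : N.Children) {
        if (C.Begin < Prev || C.End > N.End)
          return false;             // escapes the parent or overlaps a sibling
        if (!rangesNestProperly(C))
          return false;
        Prev = C.End;
      }
      return true;
    }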

You could construct a syntactic model (clang::syntax::Tree) from a clang AST, and this was the first thing we tried. (buildSyntaxTree). This made a nice demo but we couldn’t make it robust without building most of a parser and using the AST as a guide, at which point why make the AST mandatory?

1 Like

Thanks for this perspective, I had not considered the need for fine-grained syntax details (distinguishing char const from const char etc), and was simply not aware of some of the others you mention. While each of these is solvable, the larger point – that it is not necessarily a virtue to cram more and more details into the AST which are not needed by the ordinary compilations for which the AST must be optimized – is well taken.

Btw @LegalizeAdulthood, check out [Pseudo] Token/TokenStream, PP directive parser. · llvm/llvm-project@7c1ee5e · GitHub recently introduced by @sam-mccall for this new parser, that might be what you were describing in the other thread.

FWIW, I’m also concerned about adding an additional parser. The goal of being able to parse partial or broken code is one we’ve wanted to support before (like in clang-format), but is really difficult when you consider just how many C and C++ extensions exist in the wild. Both C and C++ are being continually updated, so this secondary parser is likely to be a pretty significant maintenance burden.

This is a fair point. We expect that adding new features to the pseudo parser will be relatively cheap, since it mostly means modifying the grammar (which is already provided by the C++ standard).

Yes, but the point is, it’s not clear that the community should pay that burden. As a concrete example, when we added the _ExtInt datatype to Clang, would a reviewer have been correct to suggest “you also need to update the pseudo parser”? If so, when we renamed _ExtInt to _BitInt, would we have to do it in both places as well? If the answer to both of these is “no”, then what’s to prevent this parser from falling so far out of line with the rest of the compiler that we then need to consider ripping it out (which is the point I’m at with the -ast-print option that’s only updated on a best-faith effort, as an example).

No, we don’t intend to reimplement a real preprocessor.

That’s good to hear!

These are not in the scope of the pseudo parser.

So this new parser won’t support templates or concepts? That seems like a surprise for a C++ pseudo parsing tool. :slight_smile:

Based on our previous work on RecoveryAST, scaling the whole clang AST to cover erroneous constructs is not feasible.

Okay, that’s certainly good feedback. I remember some of the issues that RecoveryExpr has caused, and I can imagine it’d be hard to shoe-horn your needs on top of what Clang’s AST needs are without some really hard tension between the scenarios.

I’m still concerned about the amount of burden having two C++ parsers in the project puts on the community as a whole. Given that Clang and LLVM are designed to be consumed as a library, have you considered making this a separate tool in clang-tools-extra? (I’m presuming that you don’t expect the community to update the pseudo parser every time we add a new extension or implement a new standard feature, and having the project in clang-tools-extra makes it more clear that Clang doesn’t maintain the other parser.) Or a downstream project rather than an upstream one?

I don’t think so, no. My expectation is that the basic guarantee “don’t break the build or the tests” applies. So to rename a clang::tok::Kind that was already used in the grammar, you’d have to do something. Feature-wise, the pseudoparser may lag (or lead!) clang.

This seems comparable to when people modifying the AST must/must not update clang-tidy, clangd etc. It’s definitely not zero!

Only that there’s no hard requirement they match.

  • In particular, there are no plans for clang to depend on this, and I’d consider that a large enough shift that it would need its own RFC
  • it’s hard to have such a requirement because a heuristic parser is always going to misparse some code and so diverge from clang anyway

Being unreliable enough that it gets removed from the tree would constitute failure. We should be careful that such a failure doesn’t damage clang, but we needn’t avoid any project that might fail.

Parsing templates and concepts (and their usages) is certainly in scope. You originally asked about template instantiation and concept satisfaction, which are semantic questions, right?

(Of course, parsing depends on template instantiation, but not for a parser that’s willing to guess and be wrong sometimes…)

Yes, and we’re open to moving it. I agree it’d send a useful signal. The reasons we put it under lib/Tooling/ instead:

  • precedent that clang-tools-extra contains standalone tools and clang/lib/ contains libraries. (Accidental?)
  • relatedness to existing libraries under lib/Tooling, particularly lib/Tooling/Syntax
  • nobody mentioned this until after code started landing :slight_smile:

@alexr wasn’t convinced upthread that placing it under clang-tools-extra would improve the maintenance burden, maybe we should couple it with an explicit doc laying out expectations?

We’d have to take clangd out of tree to depend on it. clangd benefits a lot from being in-tree, and I think clang benefits too (it encourages us to fix things in clang rather than work around them).

1 Like

I don’t think so, no. My expectation is that the basic guarantee “don’t break the build or the tests” applies. So to rename a clang::tok::Kind that was already used in the grammar, you’d have to do something. Feature-wise, the pseudoparser may lag (or lead!) clang.
This seems comparable to when people modifying the AST must/must not update clang-tidy, clangd etc. It’s definitely not zero!

I read this as more of a yes than a no, at least based on the practices in Clang of having a switch statement with an llvm_unreachable() at the bottom of it for a fair amount of parsing work. When a new token kind is added, we’ll have to update both parsers because of “don’t break the build or the tests”. Even without reaching an unreachable, there will be additional warnings about no longer having a fully covered switch (in -Werror configurations), etc. So in practice, it seems likely that changes to Clang’s parser will require code reviewers and authors to consider impacts on both parsers. :frowning:
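
As a sketch of the pattern I mean (using a hypothetical enum standing in for shared kinds like clang’s token kinds):

    #include "llvm/Support/ErrorHandling.h"

    // Hypothetical kind used by the second parser; in practice the switches
    // would be over shared enums such as clang's token kinds.
    enum class PseudoTokenKind { Identifier, Keyword, Punctuator };

    const char *describe(PseudoTokenKind K) {
      switch (K) {
      case PseudoTokenKind::Identifier: return "identifier";
      case PseudoTokenKind::Keyword:    return "keyword";
      case PseudoTokenKind::Punctuator: return "punctuator";
      }
      llvm_unreachable("unhandled PseudoTokenKind");
    }

    // Adding a new enumerator without updating this switch triggers -Wswitch
    // (an error under -Werror), and reaching llvm_unreachable at runtime
    // aborts: that's what forces both parsers to be touched together.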

Being unreliable enough that it gets removed from the tree would constitute failure. We should be careful that such a failure doesn’t damage clang, but we needn’t avoid any project that might fail.

Maybe I was unclear, so I’ll try again with different words. Currently, it seems like you’re envisioning the maintenance of this to be a best-faith effort for people changing Clang’s parser (if you change Clang’s parser, it’d be nice if you also changed the pseudo parser, but not strictly required). We have other things that are best-faith effort like this, such as -ast-print. My experience with those best-faith efforts is that they all bit rot (almost every time – C indexing, AST dumping, and JSON node dumping are three other examples of this). Because of these experiences, I think we should be more conservative with adding new best-faith efforts to the compiler libraries. (This is somewhat orthogonal to your RFC and is more a statement about keeping maintenance costs in mind in general.)

@alexr wasn’t convinced upthread that placing it under clang-tools-extra would improve the maintenance burden, maybe we should couple it with an explicit doc laying out expectations?

I don’t think it eliminates the maintenance burdens – it’s still likely to be easy to break the pseudo parser unintentionally and that will require someone to go fix it. So my personal preference is for this to live out-of-tree until it’s far closer to completion and these sort of kinks are worked out downstream rather than upstream. However…

We’d have to take clangd out of tree to depend on it. clangd benefits a lot from being in-tree, and I think clang benefits too (it encourages us to fix things in clang rather than work around them).

This may be a compelling reason for it to live in-tree. I see the relationship between clang and clangd as much more mutually beneficial. However, I don’t think you’d have to take clangd out of tree; couldn’t you allow in-tree clangd to optionally use the out-of-tree pseudo parser in the same way we allow other optional dependencies in tree (like Z3)? This shifts the burden from “clang proper” to “folks who maintain clangd now need to work with two parsers”, but that seems like where the burden largely belongs too.

1 Like