[RFC] A C++ pseudo parser for tooling

Hello everyone,

We’d like to propose a pseudo-parser which can approximately parse C++ (including broken code). It parses a file in isolation, without needing headers, compile flags, etc. Ambiguities are resolved heuristically, as in clang-format. Its output is a clang::syntax tree, which maps the token sequence onto the C++ grammar.
Our motivation comes from wanting to add some low-latency features (file outline, refactorings, etc.) to clangd, but we think this is a useful building block for other tools too.
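
To make “parses a file in isolation” concrete, here is the kind of input the parser has to handle, with the structure we would hope to recover noted in comments (a hand-written illustration, not prototype output; the header and macro names are made up):

```cpp
// widget.cpp, seen in isolation: no compile flags, "config.h" not available.
#include "config.h"              // contents unknown to the pseudo-parser

LIB_EXPORT int compute(int x) {  // LIB_EXPORT is an unexpanded macro; we guess it
  return helper(x) * FACTOR;     // decorates a function definition named `compute`.
}

class Cache {                    // structure recoverable without any semantics:
  int lookup(int key);           //   class `Cache` with a method `lookup`,
};                               //   plus the function `compute` above.
```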

Design is discussed in detail here: https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing

The proposal is based on our experience with a working prototype. Initially, we will focus on building the foundation. We consider the first version experimental, and plan to use and validate it with applications in clangd (the detailed plan is described here).

As soon as we have consensus on the proposal, we plan to start this work in the clang repository (code would be under clang/Tooling/Syntax). We hope we can start sending out patches for review at the end of November.

Eager to hear your thoughts. Comments and suggestions are much appreciated.

Thanks,
Haojian

Hi Haojian,

I remember the NetBeans C++ autocompletion had such an 'inaccurate' but fast parser. I was wondering if you have looked at this, and if there is other prior art in the area.

A long time ago I believe I talked with Vladimir about that. Adding him in CC.

Best, Vassil

Unfortunately it’s not possible to parse C++ even close to accurately without preprocessing (and so build-system integration). There are predefined macros that determine what code is conditionally included, conditionally included code can change basically anything, redefine anything. Macros can expand to an arbitrary token sequence (or even create new tokens through stringization or concatenation). It means that any identifier can become any token sequence. That’s even before we mention how name lookup is needed for disambiguation. To parse C++ you in fact need to do full preprocessing and a large chunk of semantic analysis.
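
A contrived but perfectly legal illustration (the header and macro names are invented, but the pattern is common):

```cpp
// Nothing in this file alone tells you how it parses; it depends entirely on
// the unseen header and on predefined macros.
#include "project_config.h"      // may or may not define LEGACY_API

#if LEGACY_API
#  define HANDLE struct handle_t *
#else
#  define HANDLE int
#endif

HANDLE make_handle();            // the shape of this declaration depends on the macro

#define BEGIN_NS namespace v1 {
#define END_NS   }

BEGIN_NS                         // the namespace braces are hidden inside macros;
int helper() { return 0; }       // without expansion, a parser never sees them
END_NS
```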

Given how inaccurate the parse from the best possible “single source file” parser is - it’s not clear what the use case is for it. clang-format (largely) only makes whitespace changes, so there is limited opportunity for inaccuracies in its parse to lead to errors.

To generate file outlines and do refactoring I suspect you’re better off waiting for a proper parse than using a completely inaccurate one. In the dev environment I use, past versions of the indexer had tried to do such an approximate parse, and current versions do a full correct C++ parse, so I’ve experienced the difference first-hand. It’s night and day.

Just my 2c. -Andrew

Yeah, FWIW I’d +1 Andrew’s comments here - it was sort of one major premise of clang being designed as a reusable library, that C++ is just too complicated to reimplement separately/repeatedly in various tools.

For something that’s going to change significant code - how slow is a clang-based solution? What’s the tradeoff being made?

Unfortunately it’s not possible to parse C++ even close to accurately without preprocessing (and so build-system integration).

We’re not convinced this is true, if we’re talking about “average code”.
Our measurements show tree-sitter achieving 95%+ average accuracy on a large codebase.
(We hope to achieve higher accuracy, better handling of broken code, and finer-grained info by specializing for C++).

Certainly there are cases where it’s not possible to parse without both preprocessing and semantic analysis, but these aren’t most code. The strategy here is to make informed guesses and rely on error-tolerance to avoid too much fallout from a bad guess. (This is the third category of error listed under error-resilience in the doc).
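
As a sketch of what an “informed guess” looks like (the same kind of heuristic clang-format already applies; the identifiers are hypothetical and their declarations live in unseen headers):

```cpp
// `a`, `b`, `x`, `y` and `w` are declared in headers we never see.
void f() {
  a * b;      // multiplication expression, or declaration of a pointer `b`?
  x<y> * w;   // chained comparisons, or a declaration using the template-id `x<y>`?
}
// Idiomatic code almost always intends the declaration reading here, so we
// guess "declaration"; if the guess is wrong, error-resilience keeps the
// damage local to the enclosing statement.
```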

Macros can expand to an arbitrary token sequence (or even create new tokens through stringization or concatenation). It means that any identifier can become any token sequence.

That’s even before we mention how name lookup is needed for disambiguation. To parse C++ you in fact need to do full preprocessing and a large chunk of semantic analysis.

These are covered in some detail in the design document, I’d be interested in your thoughts there, especially real-world examples that are important and not solvable in this way.
(Though yes, we expect to get some cases wrong and to fail catastrophically on code where PP is used in unidiomatic ways, just as clang-format does).

Given how inaccurate the parse from the best possible “single source file” parser is - it’s not clear what the use case is for it.

Some use cases are listed in the doc; granted, if the parse is too inaccurate it won’t be useful for them.
FWIW several of these use-cases are places where we’re using regexes today.

clang-format (largely) only makes whitespace changes, so there is limited opportunity for inaccuracies in its parse to lead to errors.

Sure. It can lead to style errors though. We enforce both clang-format and a style guide on a large part of our codebase, and it works.
Of course this is only weak evidence as clang-format must infer much less structure.

To generate file outlines and do refactoring I suspect you’re better off waiting for a proper parse than using a completely inaccurate one.

Funny you should mention :) clangd does provide an AST-based outline, and it’s great. For our internal deployment, the editor team decided to go with a (closed-source, relatively simple) pseudo-parser outline instead. It was worse, but OK, and having it immediately available was judged more important.
This made me pretty sad, but I find it hard to disagree.

In the dev environment I use, past versions of the indexer had tried to do such an approximate parse, and current versions do a full correct C++ parse, so I’ve experienced the difference first-hand. It’s night and day.

Agree. This is why we have an AST-based indexer (and many flavors of it, just-in-time, background, networked). This won’t go away.
However the time to build that index can be a night and a day, too. People edit large codebases on small laptops…
We think this can be two orders of magnitude faster. If there’s a way to do that with clang, I’d love to hear it!

Yeah, FWIW I’d +1 Andrew’s comments here - it was sort of one major premise of clang being designed as a reusable library, that C++ is just too complicated to reimplement separately/repeatedly in various tools.

Yes. This is a good argument for reusable implementations, but I’m not sure one is enough.
cf. clang-format not using clang beyond the lexer, and the success attributable to that.
Ideally we’d share an implementation there; in practice, its maturity as a product and concrete design choices in its parser combine to make that hard.

For something that’s going to change significant code - how slow is a clang-based solution? What’s the tradeoff being made?

Basically it’s the difference between interactive latency and not.
For our internal clangd deployment (because those are the numbers I have) 90%ile is most of a minute to parse headers, and several minutes in the build system to get ready (generated headers, flags…).
Secondarily, it’s the difference between just using the tool and having to “set it up”. We do a lot of user support for clangd and I can tell you this is a nontrivial concern. (For people who build with something that’s not recent mainline clang/gcc, target weird platforms, don’t build on the machine they edit on, use non-cmake build systems, …)

You can see this tradeoff play out in the recent discussions about whether an “east const” conversion belongs in clang-format vs clang-tidy: one of the arguments for putting it in clang-format is it’s the only way to make it fast and easy enough that people want to use it.
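
(For reference, the transformation in question is purely mechanical - both spellings denote the same type:)

```cpp
const int *p;   // "west const"
int const *q;   // "east const" - same type, just a different spelling
```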

  • People tend to think “idiomatic average code” is whatever biased, tiny sample of code they happen to have seen (search for Bjarne’s adaptation of the blind men and the elephant parable). I estimate there are 10 billion lines of production C++ in the world, and many argue that is an underestimate by as much as 10x (see http://www.tomazos.com/howmuchcpp.pdf). So Google has between 0.2-2.0% of it (and I’ve seen that too BTW - I’m a former SWE). You’ll find that, due to various influences (like styleguide/monorepo/culture), Google’s code is somewhat more homogeneous than the larger population (this can be seen via open-sourced Google code - not relying on proprietary knowledge). If you really want to test it properly I would use the ACTCD19 dataset (see https://codesearch.isocpp.org/faq.html). But I understand it would be difficult to set up, as getting the 10,000s of packages building with clang (for the baseline comparison) rather than gcc is non-trivial (as far as I know).

  • You say that the use cases are listed in the design document. What section/page are you referring to? I couldn’t find them. Even if we accepted that the accuracy was 95% (which kind of sounds, due to its roundness, like a made-up stat to be honest - plus it’s not clear what the denominator/unit you’re using is), for most use cases I can think of screwing up 5% of the time would be grossly unacceptable and counter-productive. Programming is hard, and the last thing you need is error-prone tooling making it harder.

Yeah, FWIW I’d +1 Andrew’s comments here - it was sort of one major premise of clang being designed as a reusable library, that C++ is just too complicated to reimplement separately/repeatedly in various tools.

Yes. This is a good argument for reusable implementations, but I’m not sure one is enough.
cf. clang-format not using clang beyond the lexer, and the success attributable to that.
Ideally we’d share an implementation there; in practice, its maturity as a product and concrete design choices in its parser combine to make that hard.

Given the long time scale of these things - any chance of a plan to converge clang-format and this new thing eventually? (so we have 2 rather than 3 versions of C++ understanding in the LLVM project)

For something that’s going to change significant code - how slow is a clang-based solution? What’s the tradeoff being made?

Basically it’s the difference between interactive latency and not.
For our internal clangd deployment (because those are the numbers I have) 90%ile is most of a minute to parse headers, and several minutes in the build system to get ready (generated headers, flags…).

How much of this work is equivalent/shared/cached by the build system? (eg: if I just did a build, then I wanted to refactor a function - how long are we talking there?)

Secondarily, it’s the difference between just using the tool and having to “set it up”. We do a lot of user support for clangd and I can tell you this is a nontrivial concern. (For people who build with something that’s not recent mainline clang/gcc, target weird platforms, don’t build on the machine they edit on, use non-cmake build systems, …)

The second one I have less concern for, I’ll admit.

Nobody knows what "most C++ programmers" do.
        -- Bjarne Stroustrup

Csaba

Yeah, FWIW I’d +1 Andrew’s comments here - it was sort of one major premise of clang being designed as a reusable library, that C++ is just too complicated to reimplement separately/repeatedly in various tools.

Yes. This is a good argument for reusable implementations, but I’m not sure one is enough.
cf. clang-format not using clang beyond the lexer, and the success attributable to that.
Ideally we’d share an implementation there; in practice, its maturity as a product and concrete design choices in its parser combine to make that hard.

Given the long time scale of these things - any chance of a plan to converge clang-format and this new thing eventually? (so we have 2 rather than 3 versions of C++ understanding in the LLVM project)

In an ideal world we’d like a single pseudo-parser; in practice that is hard, for reasons described in the doc: clang-format’s parser is specialized for formatting purposes and is highly coupled with clang-format internals (extending/refactoring it is possible but hard, and introduces risk to a mature product), and clang-format supports non-C-family languages, which is not a goal for us. So we don’t plan to share code with clang-format (at least in the initial stages).

For something that’s going to change significant code - how slow is a clang-based solution? What’s the tradeoff being made?

Basically it’s the difference between interactive latency and not.
For our internal clangd deployment (because those are the numbers I have) 90%ile is most of a minute to parse headers, and several minutes in the build system to get ready (generated headers, flags…).

How much of this work is equivalent/shared/cached by the build system? (eg: if I just did a build, then I wanted to refactor a function - how long are we talking there?)

If the preamble is built, the 90%ile of BuildAST is ~1.7s; clangd also caches ASTs, and if a cached AST is hit, AST-based operations (hover, go-to-def, etc.) are generally fast - the 95%ile is ~400ms.

  • People tend to think “idiomatic average code” is whatever biased, tiny sample of code they happen to have seen (search for Bjarne’s adaptation of the blind men and the elephant parable). I estimate there are 10 billion lines of production C++ in the world, and many argue that is an underestimate by as much as 10x (see http://www.tomazos.com/howmuchcpp.pdf). So Google has between 0.2-2.0% of it (and I’ve seen that too BTW - I’m a former SWE). You’ll find that, due to various influences (like styleguide/monorepo/culture), Google’s code is somewhat more homogeneous than the larger population (this can be seen via open-sourced Google code - not relying on proprietary knowledge). If you really want to test it properly I would use the ACTCD19 dataset (see https://codesearch.isocpp.org/faq.html). But I understand it would be difficult to set up, as getting the 10,000s of packages building with clang (for the baseline comparison) rather than gcc is non-trivial (as far as I know).

Yes, this is a good and interesting point. Our measurement was based on two datasets (one of google-style internal code, one of LLVM open-source code); I’ll admit that they might not reflect the whole C++ world.
I agree that we should cover “general” C++ code. We could run more experiments/measurements on 3rd-party code (happy to do that if we need more data); in practice I think we might follow a development flow like clang-format’s – start with google-style/llvm-style code, then extend and polish the parser to support more general code based on users’ feedback and bug reports.

  • You say that the use cases are listed in the design document. What section/page are you referring to? I couldn’t find them.

I think they are (implicitly) mentioned in the “Scope” and “Work plan” sections (will make them clearer).

IDE use cases (for clangd)

  • provide code-folding, outline, syntax highlighting, and selection features without a long “warmup” time;
  • a fast index that provides approximate results;

Other use cases we aim to support:

  • smart diff and merge tool for C++ code;
  • a fast linter, a cpplint replacement, with clang-tidy-like extensibility;
  • syntactic grep/sed tools;

Even if we accepted that the accuracy was 95% (which kind of sounds, due to its roundness, like a made-up stat to be honest - plus it’s not clear what the denominator/unit you’re using is)

The accuracy measurement compares annotation ranges generated by different parsers (tree-sitter vs the clang AST); accuracy is calculated from the number of perfectly-matched ranges versus mismatched ranges (a rough sketch of the computation follows the list below).
The annotations cover some critical pieces of C++ source code:

  • identifiers that introduce a new source code entity: variable, function, class, enum, namespace
  • curly brace structure: class-body, compound-stmt-body, enum-body, initializer-list.
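
Concretely, the computation is roughly as follows (my own sketch of the metric, not the actual evaluation harness; the exact denominator we use may differ):

```cpp
#include <set>
#include <string>
#include <tuple>

// An annotation: half-open [Begin, End) offsets in the file plus a kind tag
// like "class-name" or "compound-stmt-body".
struct Range {
  unsigned Begin, End;
  std::string Kind;
  bool operator<(const Range &O) const {
    return std::tie(Begin, End, Kind) < std::tie(O.Begin, O.End, O.Kind);
  }
};

// Ranges produced by the pseudo-parser that exactly match a clang-AST range,
// divided by the total number of clang-AST ranges.
double accuracy(const std::set<Range> &Pseudo, const std::set<Range> &ClangAST) {
  unsigned Matched = 0;
  for (const Range &R : ClangAST)
    if (Pseudo.count(R))
      ++Matched;
  return ClangAST.empty() ? 1.0 : double(Matched) / ClangAST.size();
}
```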

for most use cases I can think of screwing up 5% of the time would be grossly unacceptable and counter-productive. Programming is hard, and the last thing you need is error-prone tooling making it harder.

I’m not sure this is true; I think it mainly depends on the use case. Our internal editor team has their own C++ pseudo-parser (written for instant outline and symbol indexing). It has been used for years, and they’re happy with it.
IMO letting users wait a few minutes in a newly-opened editor until all IDE features are available is a poor experience.

  • I don’t know what “fast index to provide approximate results” means. Results of what? Do you mean generating an index? What will the index be used for?

  • Syntax highlighting is the only use case of those listed that can tolerate inaccuracy. For the rest, a correct parse will be more productive. The trouble is that if people start depending on these features in their workflow, when they fail (and they often will) it will be very disruptive. The cost of the disruption outweighs the time saved waiting for a correct parse.

  • I think you are better off spending your time on optimizing the correct parser infrastructure. I’m sure more can be done - particularly in terms of caching, persisting and reusing state (think PCH, modules, etc.).

IDE use cases (for clangd)

  • provide code-folding, outline, syntax highlighting, and selection features without a long “warmup” time;
  • a fast index that provides approximate results;

Other use cases we aim to support:

  • smart diff and merge tool for C++ code;
  • a fast linter, a cpplint replacement, with clang-tidy-like extensibility;
  • syntactic grep/sed tools;
  • I don’t know what “fast index to provide approximate results” means. Results of what? Do you mean generating an index? What will the index be used for?

clangd has a symbol index to enable codebase-wide operations. (see SymbolIndex and some documentation)
These include:

  • go-to-definition: finding a definition associated with a declaration visible in the AST
  • code completion: for contexts where the AST cannot provide all results efficiently, such as namespace scopes (including results from non-included headers)
  • cross-references: finding references from files that are not part of the current AST
Today this index is built from ASTs in various ways (see docs), which takes many hours for large codebases (on machines too slow to build).
Most results are missing for a long time. Many users turn off indexing (e.g. to avoid battery drain) and the results stay missing. If compile-flag metadata is missing for the project, these features don’t work at all.

The idea for clangd is to augment (not replace) this index with a pseudo-parser based index that processes each file once. It would be halfway between the AST index and grep. This index would provide the same operations with lower fidelity, and would be replaced by the AST-based index as it completes.
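
To make “augment, not replace” concrete, here is a minimal sketch of the layering, using a deliberately simplified stand-in interface (clangd’s real SymbolIndex is richer - fuzzy-find, refs, etc.):

```cpp
#include <string>

// Deliberately simplified stand-in for an index interface.
struct Symbol {
  std::string Name;
  std::string DefinitionLocation;
};

class Index {
public:
  virtual ~Index() = default;
  virtual bool lookup(const std::string &Name, Symbol &Out) const = 0;
};

// Prefer accurate AST-derived results where they exist; the pseudo-parser
// index fills the gaps until background AST indexing catches up, at which
// point its answers are shadowed.
class LayeredIndex : public Index {
  const Index &ASTIndex;
  const Index &PseudoIndex;

public:
  LayeredIndex(const Index &AST, const Index &Pseudo)
      : ASTIndex(AST), PseudoIndex(Pseudo) {}

  bool lookup(const std::string &Name, Symbol &Out) const override {
    return ASTIndex.lookup(Name, Out) || PseudoIndex.lookup(Name, Out);
  }
};
```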

  • Syntax highlighting is the only use case of those listed that can tolerate inaccuracy. For the rest, a correct parse will be more productive. The trouble is that if people start depending on these features in their workflow, when they fail (and they often will) it will be very disruptive. The cost of the disruption outweighs the time saved waiting for a correct parse.

Our experience with clangd is that people very often value latency over correctness when editing C++ code, and this is a situational, quantitative question.
As examples, we’ve failed to replace cpplint and our heuristic outline with clang-tidy and our AST-based outline. Despite being inaccurate and incomplete, users find them useful and are not willing to wait.

  • I think you are better off spending your time on optimizing the correct parser infrastructure. I’m sure more can be done - particularly in terms of caching, persisting and reusing state (think PCH, modules, etc.).

We have worked on projects over several years to improve these things (and other aspects such as error-resilience). We agree there’s more that can be done, and will continue to work on this. We don’t believe this approach will get anywhere near a 100x latency improvement, which is what we’re looking for.

That’s definitely not true. I use “etags” quite regularly as a quick code index, for example. I use “git grep” regularly as a symbol reference “index”. My editor matches curly braces to find the bounds of a block, completely ignoring the fact that someone might have done “#define END }”. I use diff/merge tools that completely ignore C syntax clues, and run sed to make global changes in C++ projects.

These existing tools are all quite far from accurate, yet they all work well enough to be extremely useful – and, most importantly, they work without building intermediate artifacts, or doing any build system integration. If this new project can do a better job than these tools, and keep the same simplicity of use, it sounds like a win to me!

Yeah, FWIW I’d +1 Andrew’s comments here - it was sort of one major premise of clang being designed as a reusable library, that C++ is just too complicated to reimplement separately/repeatedly in various tools.

Yes. This is a good argument for reusable implementations, but I’m not sure one is enough.
cf. clang-format not using clang beyond the lexer, and the success attributable to that.
Ideally we’d share an implementation there; in practice, its maturity as a product and concrete design choices in its parser combine to make that hard.

Given the long time scale of these things - any chance of a plan to converge clang-format and this new thing eventually? (so we have 2 rather than 3 versions of C++ understanding in the LLVM project)

For something that’s going to change significant code - how slow is a clang-based solution? What’s the tradeoff being made?

Basically it’s the difference between interactive latency and not.
For our internal clangd deployment (because those are the numbers I have) 90%ile is most of a minute to parse headers, and several minutes in the build system to get ready (generated headers, flags…).

How much of this work is equivalent/shared/cached by the build system? (eg: if I just did a build, then I wanted to refactor a function - how long are we talking there?)

The build system stuff is cacheable[1], so once you’ve done that, a tool might take 30 seconds (per file) each time you run it.

For single-file operations (think go-to-definition), this is enough to avoid the tool. See the (lack of) popularity of clang-rename :-).
This can be mitigated with PCH/preamble as in clangd, which still takes 30 seconds to prepare, but now you can perform subsequent operations quickly. This startup delay is the #1 user complaint about clangd. (We have several significant optimizations here that trade off accuracy, and it’s still slow.)
The PCH is typically hundreds of megabytes per source file, so caching it silently/indefinitely makes people unhappy - ask me how I know! In clangd we retain it while the user has the file open, which works OK for a stateful program.

For codebase-wide operations (find-refs) the parsing time easily gets into hours.
You can mitigate this by building an index, but that takes hours and it’s a significant barrier.

Bottom line: users want tools that are predictably fast (<100ms, including the first run).

[1] In practice, because build system caches are mutable user-controlled state, cache sharing isn’t transparent. Either tools don’t share a cache with the ‘real’ build, or the user is required to do a real build to get accurate results - I’ve seen both.

What about pushing the state to a server? Have a server that has the entire
index, and keeps it up to date whenever a VCS commit is made to the main
branch. Users then only need to compute deltas, which should be much smaller.

Sincerely,
Demi Marie Obenour (she/her/hers)

Fair enough - thanks for the context!

Minor implementation question/request/probably already covered: Hopefully this’ll be fuzz tested (both asan fuzzing for crashes, but also general behavioral correctness fuzzing too, if practical) from the start?
Is it expected that this will fail gracefully? (if it can’t correctly perform the transformation, it’ll be able to explicitly notify the user? (I guess not - presumably some of the things it doesn’t understand it won’t know that it doesn’t understand, and may mangle things - and the user has to look at the diff to understand whether the tool got it right?))

  • I think you are better off spending your time on optimizing the correct
    parser infrastructure. I’m sure more can be done - particularly in terms
    of caching, persisting and reusing state (think PCH, modules, etc.).

We have worked on projects over several years to improve these things (and
other aspects such as error-resilience). We agree there’s more that can be
done, and will continue to work on this. We don’t believe this approach
will get anywhere near a 100x latency improvement, which is what we’re
looking for.

What about pushing the state to a server? Have a server that has the entire
index, and keeps it up to date whenever a VCS commit is made to the main
branch.

We have this for clangd’s index: https://clangd.llvm.org/guides/remote-index
It works great (try it out with LLVM!) but needing to deploy a server means 90% of users won’t ever touch it.

(A shared repository of serialized ASTs is something we’re considering in a tightly controlled corp environment, but the barriers are pretty huge: size, security, and the fact that it only works if everyone uses the same precise version of the tool. And it only makes sense at all if you’re sure you can download 300MB in less than 30 seconds!)

David Blaikie wrote:

Minor implementation question/request/probably already covered: Hopefully this’ll be fuzz tested (both asan fuzzing for crashes, but also general behavioral correctness fuzzing too, if practical) from the start?
Thanks, we should definitely do crash fuzzing at least; will add it to the doc. I don’t really understand how behavioral fuzzing would work. At Google we have access to a firehose of incomplete code snapshots that we’d also use for evaluation.
(Regarding stability: C++ isn’t the ideal implementation language here. But we’d like to depend on the clang lexer, be depended on by clangd, and be accessible to the C++ community).

Is it expected that this will fail gracefully? (if it can’t correctly perform the transformation, it’ll be able to explicitly notify the user?)
Two parts here: what does the parser produce and what does the tool on top of it do?

The parser’s tree output will include places where it knows there was an error (and maybe how the error was corrected).
But in case of ambiguity it won’t know, e.g. if it parsed “Foo(42);” as a function call but it was actually a constructor.

The behavior on top is up to the tool, e.g. an indexer needs to make a decision because it’s not useful to display a warning each time you query the index.

and the user has to look at the diff to understand whether the tool got it right?
Ultimately yes.
If you don’t have both a human reviewer and automated tests for every change, you’re on shaky ground.
This is true for human-authored changes and clang-based tools like clang-tidy too.
(Of course the accuracy level is still important!)

Indeed that is not going to be useful outside of an on-premises corporate
environment with extremely fast network connectivity. And the security
considerations are stringent, especially since clangd is written in C++ and
I am not sure it can be trusted with untrusted input.

Could an index be persisted to disk, reloaded at startup, and incrementally
changed as the user edits? That would avoid having to have a daemon
constantly running.

Sincerely,
Demi Marie Obenour (she/her/hers)

Such an indexer architecture is what most people are using for codebases of less than a few million lines, which is almost all of them. Only a handful of companies like Google, Bloomberg, etc. have to deal with the scalability issues of codebases larger than that. In those outlier settings I’d say a common practice is to break off just the million or so lines around what you’re working on (i.e. it and its dependencies and some dependents), and just index that subpart. There are then other centralized shared tools that regularly index the checked-in state across the entire codebase, for when you need to venture outside of that subpart (e.g. to look up external refs). The latter doesn’t sync with your unchecked-in changes, but that is a small price to pay compared to using an inferior approximate parser that is always getting things wrong (go-to-definition/autocomplete/call-graph/etc. regularly failing).