Unfortunately it’s not possible to parse C++ even close to accurately without preprocessing (and so build-system integration).
We’re not convinced this is true, if we’re talking about “average code”.
Our measurements show tree-sitter achieving 95%+ average accuracy on a large codebase.
(We hope to achieve higher accuracy, better handling of broken code, and finer-grained info by specializing for C++).
Certainly there are cases where it’s not possible to parse without both preprocessing and semantic analysis, but these cases don’t make up most code. The strategy here is to make informed guesses and rely on error-tolerance to avoid too much fallout from a bad guess. (This is the third category of error listed under error-resilience in the doc.)
Macros can expand to an arbitrary token sequence (or even create new tokens through stringization or concatenation). This means that any identifier can become any token sequence.
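To make the macro point concrete, here is a small sketch. All macro and function names are made up for illustration. Without running the preprocessor, a parser cannot know that `BEGIN`/`END` are braces, that `DECLARE(...)` is a declaration, or that `answer_value` exists at all.

```cpp
#include <string>

// Hypothetical macros, for illustration only.
#define STR(x) #x        // stringization: turns the argument into a string literal
#define CAT(a, b) a##b   // concatenation: pastes two tokens into one new token
#define DECLARE(type, name) type CAT(name, _value) = 42;
#define BEGIN {
#define END }

// Before preprocessing this does not even look like a function body.
int macro_demo() BEGIN
  DECLARE(int, answer)   // expands to: int answer_value = 42;
  return answer_value;   // this identifier only exists after token pasting
END

// STR(a + b) expands to the string literal "a + b".
std::string stringized() { return STR(a + b); }
```

A pseudo-parser would have to guess here, for example by treating `BEGIN`/`END` as unknown identifiers and recovering at the next recognizable declaration.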
That’s even before we mention how name lookup is needed for disambiguation. To parse C++ you in fact need to do full preprocessing and a large chunk of semantic analysis.
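The classic example of the name-lookup problem is the statement `a * b;`, sketched below with hypothetical names. The token sequence is identical in both namespaces, yet it is a declaration in one and an expression in the other, depending only on what `a` names.

```cpp
// Hypothetical names, for illustration only.
namespace as_type {
struct a {};
inline bool f() {
  a * b;         // `a` names a type: this declares a pointer `b`
  b = nullptr;
  return b == nullptr;
}
}  // namespace as_type

namespace as_value {
struct Counter { int calls = 0; };
// Overloaded operator with a side effect, so the expression is observable.
inline Counter& operator*(Counter& lhs, Counter&) {
  ++lhs.calls;
  return lhs;
}
inline int f() {
  Counter a, b;
  a * b;         // `a` names a variable: this calls operator*
  return a.calls;
}
}  // namespace as_value
```

A parser that sees only `a * b;` has to pick one reading. A heuristic guess (say, preferring the declaration) is right for much idiomatic code, but it is still a guess.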
These are covered in some detail in the design document. I’d be interested in your thoughts there, especially on real-world examples that are important and not solvable in this way.
(Though yes, we expect to get some cases wrong and to fail catastrophically on code where PP is used in unidiomatic ways, just as clang-format does).
Given how inaccurate the parse from the best possible “single source file” parser is, it’s not clear what the use case is for it.
Some use cases are listed in the doc. Granted, if the parse is too inaccurate, it won’t be useful for them.
FWIW several of these use-cases are places where we’re using regexes today.
clang-format (largely) only makes whitespace changes, so there is limited opportunity for inaccuracies in its parse to lead to errors.
Sure. It can lead to style errors though. We enforce both clang-format and a style guide on a large part of our codebase, and it works.
Of course this is only weak evidence as clang-format must infer much less structure.
To generate file outlines and do refactoring I suspect you’re better off waiting for a proper parse than using a completely inaccurate one.
Funny you should mention that.
clangd does provide an AST based outline, and it’s great. For our internal deployment, the editor team decided to go with a (closed-source, relatively simple) pseudo-parser outline instead. It was worse, but OK, and having it immediately available was judged more important.
This made me pretty sad, but I find it hard to disagree.
In the dev environment I use, past versions of the indexer had tried to do such an approximate parse, and current versions do a full correct C++ parse, so I’ve experienced the difference first-hand. It’s night and day.
Agree. This is why we have an AST-based indexer (and many flavors of it, just-in-time, background, networked). This won’t go away.
However the time to build that index can be a night and a day, too. People edit large codebases on small laptops…
We think this can be two orders of magnitude faster. If there’s a way to do that with clang, I’d love to hear it!