Hi
I’m new to Clang, and I wrote a program that refactors existing source code.
I build the AST and use visitors to refactor the code, but building the AST takes a very long time.
I tried running the work in threads (it got faster, but not enough).
Can I make it faster?
Or am I doing something wrong?
Regards, Moshe
Hello! You’re not doing anything wrong – building the AST takes time because we have to do all of the tokenization (including macro expansion, etc), parsing, semantic analysis, and template instantiations to get to the point of having a complete AST that can be visited.
I’m certain there are compile-time improvements we could make in those areas, but there aren’t any configuration options that will make it go faster.
Thank you for the answer.
Can I tell Clang to exclude the includes from the AST, to make it faster?
Or is there any difference in speed if I use matchers instead of visitors?
Also, if I refactor in multiple stages, can I somehow save the AST of all the includes?
I’m not refactoring any of the C includes, etc.
Thanks in advance.
Nope – the AST likely wouldn’t be correct in that case anyway, because it’d be missing declarations, types, etc. that are used by the TU being compiled.
Sometimes! AST matchers will memoize matches on previously-matched AST nodes, so matching can go faster depending on what you’re trying to match. I think it can also skip matching on child nodes when it’s able to (but I’m not 100% sure I remember that detail correctly). There’s a rough matcher sketch below.
Not easily, but it may depend on how you’re doing that refactoring. For example, Transformer might do something smart there, but I’m not certain.
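For what it’s worth, here is a minimal matcher sketch (my own example, not from this thread; the callback class and the "fn" binding name are placeholders). The point is that narrowing matchers like isExpansionInMainFile() keep the matching itself out of the headers, even though the headers still have to be parsed:

```cpp
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace clang::ast_matchers;

// Called once per match; a real tool would emit a replacement here.
class FunctionReporter : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &Result) override {
    if (const auto *FD = Result.Nodes.getNodeAs<FunctionDecl>("fn"))
      llvm::outs() << "matched: " << FD->getNameAsString() << "\n";
  }
};

void registerMatchers(MatchFinder &Finder, FunctionReporter &CB) {
  // The narrower the matcher, the less work the finder does per node;
  // isExpansionInMainFile() skips declarations that come from headers.
  Finder.addMatcher(functionDecl(isExpansionInMainFile()).bind("fn"), &CB);
}
```

You’d then run it the usual way, e.g. `Tool.run(newFrontendActionFactory(&Finder).get())`.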
Maybe not applicable in your case, but if you build LLVM in debug mode while developing your changes, that also makes it VERY slow.
As I said, I build the AST multiple times in my code. Since the includes don’t change from stage to stage, can I create the AST for all the includes once at the beginning and import it somehow in the other stages? That way I could at least save the time spent building the AST for the includes in every stage.
You could perhaps use modules or PCH for the includes, but that’s still going to incur time to build the AST consuming those (we have to deserialize them into AST nodes; that should be faster than going through lexing, parsing, and semantic analysis, but it still has overhead). Aside from that, no, there’s no real way to do what you’re asking.
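As a sketch of the PCH route (assuming the refactoring is driven through a LibTooling ClangTool; the file names and the SyntaxOnlyAction placeholder are mine, not from this thread), you could build a PCH for the shared includes once, e.g. with `clang++ -x c++-header common_includes.h -o common_includes.h.pch`, and then inject `-include-pch` into every stage:

```cpp
#include "clang/Frontend/FrontendActions.h"
#include "clang/Tooling/ArgumentsAdjusters.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/CommandLine.h"

using namespace clang::tooling;

static llvm::cl::OptionCategory ToolCategory("refactor-tool options");

int main(int argc, const char **argv) {
  auto ExpectedParser = CommonOptionsParser::create(argc, argv, ToolCategory);
  if (!ExpectedParser) {
    llvm::errs() << ExpectedParser.takeError();
    return 1;
  }
  ClangTool Tool(ExpectedParser->getCompilations(),
                 ExpectedParser->getSourcePathList());

  // Deserialize the shared includes from the PCH instead of reparsing them
  // in every stage; "common_includes.h.pch" is a placeholder path.
  Tool.appendArgumentsAdjuster(getInsertArgumentAdjuster(
      {"-include-pch", "common_includes.h.pch"},
      ArgumentInsertPosition::BEGIN));

  // SyntaxOnlyAction stands in for the real refactoring FrontendAction.
  return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
}
```

Note that the PCH has to be built with flags compatible with the TU, and Clang still has to deserialize it, so this cuts the cost of the includes rather than eliminating it.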
Depending on which API you use to ask Clang to build the AST, you could use virtual filesystems for some speedup.
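For example (a rough sketch, again assuming ClangTool; the path and contents are placeholders), ClangTool::mapVirtualFile overlays an in-memory file over the real filesystem, so frequently-read headers can be served from memory:

```cpp
#include "clang/Frontend/FrontendActions.h"
#include "clang/Tooling/Tooling.h"

int runWithInMemoryHeader(clang::tooling::ClangTool &Tool) {
  // Any #include resolving to this path is read from memory, not from disk.
  Tool.mapVirtualFile("common_includes.h",
                      "#pragma once\n#include <vector>\n#include <string>\n");
  // SyntaxOnlyAction again stands in for the real refactoring action.
  return Tool.run(
      clang::tooling::newFrontendActionFactory<clang::SyntaxOnlyAction>()
          .get());
}
```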
Is there any interest in adding queries and caches into Clang? The Rust compiler, and later the Swift compiler, rely heavily on queries and caches.
I don’t know enough about them to know what level of interest there might be. Do you have a link handy that explains what these are and how they work?
Rust and Swift are both moving in that direction.
Thank you for the links! That’s an interesting idea that is worth thinking about further. My initial reaction from reading those docs is “how well does this work for C and C++ code though?” – if you request compilation of a function, you have to also go find (recursively) all the types used by the function signature (just to get the declaration AST node) which potentially means opening up an arbitrary number of files to find the correct declarations (and redeclarations), etc. So a bottom-up compilation approach seems like it could be expensive for some requests before you’ve built up a lot of those “base” AST nodes from the system headers.
I don’t know whether opening files is the problem, or whether it’s Sema and CodeGen repeatedly querying properties of an AST node that are costly to calculate. In that case, caching might help.
One of the bigger issues in C++ is going to be that an identifier’s ‘value’ (that is, what it is declared as) affects parsing, so the parser needs to know all the identifiers, and what they ‘mean’, and more importantly, what they mean in the ‘current scope’.
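A classic illustration (my own example, not from this thread): the same token sequence parses as two different things depending on what x was declared as in the current scope, so the parser can’t even build the right AST without that information:

```cpp
void f() {
  int x = 2, y = 3;
  x * y;        // 'x' is a variable here: an expression statement (multiply)
}

void g() {
  using x = int;
  x * y;        // 'x' names a type here: a declaration of a pointer 'y'
}
```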
If I understand you correctly, you are talking about name resolution: is x a local variable, a function parameter, a template parameter, or a global variable?
Still, Sema and CodeGen are probably running over the AST several times; caching might help?!?
Sema is run once (it’s what builds the AST) unless you’re instantiating templates (then we need to re-run a bit of Sema to generate the instantiated AST from the template AST), and CodeGen only walks over the AST once as well.
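To make that concrete with a tiny example of my own (not from the thread): the template below is parsed once, but each distinct instantiation makes Sema produce another specialized FunctionDecl in the AST, which you can see with `clang -Xclang -ast-dump -fsyntax-only`:

```cpp
template <typename T>
T square(T v) { return v * v; }

int main() {
  int a = square(3);       // Sema instantiates square<int> here
  double b = square(2.5);  // and square<double> here
  return a + static_cast<int>(b);
}
```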
I think there are other concerns here as well, like point of instantiation mattering for templates and so on. I think the idea of “can you give me the AST for the main() function” is a pretty reasonable request, but demand-driven compilation itself strikes me as being pretty difficult to do for C and C++.
Modules (and PCH) are basically the caching we’re able to do without changing program semantics or parsing results.