[RFC] Identifying wasteful template function bodies

I work on a C++ project for which compilation time is a significant
concern. One of my colleagues was able to significantly shorten the
time Clang took to compile our project, by manually outlining
independently-typed code from large template functions.

This makes intuitive sense to me, because when instantiating a
template function, Clang traverses the body of the function. The
longer the function body, the more nodes in the AST Clang has to
traverse, and the more time it takes. Programmers can read the
function and see that some statements in the function body remain the
same no matter what types the function is instantiated with. By
extracting these statements into a separate, non-template function,
programmers can reduce the amount of nodes Clang must traverse.

I created a contrived example that demonstrates how splitting up a
long template function can improve compile time. (Beware, the files
are large, I needed something that would take Clang a hefty amount of
time to process.)
https://gist.github.com/modocache/77b8ac09280c08bd88f84b92ff43a28b

In the example above, 'example.cpp' defines a template function
'foo<T, U, V, W, X, Y, Z>', whose body is ~46k LoC. It then
instantiates 'foo' 10 times, with 10 different combinations of
template type parameters. In total, 'clang -c -O1 example.cpp -Xclang
-disable-llvm-passes -Xclang -emit-llvm' takes ~35 seconds in total to
compile. Each additional instantiation of 'foo' adds an additional ~3
seconds to the total compile time.

Only the last statement in 'foo' is dependent upon the template type
parameters to 'foo'. 'example-outlined.cpp' moves ~46k LoC of
independently-typed statements out of 'foo' and into a function named
'foo_prologue_outlined', and has 'foo' call 'foo_prologue_outlined'.
'foo_prologue_outlined' is not a template function. The result is
identical program behavior, but a total compile time of just ~5
seconds (~85% faster). Additional instantiations of 'foo' in
'example-outlined.cpp' cost almost no additional compile time.

Although the functions in our project are not as long, some of them
take significantly longer than 35 seconds to compile. By outlining
independently-typed statements, we've been able to reduce compile time
of some functions, from 300s to 200s (1/3rd faster). So, my colleagues
and I are looking for other functions we can manually outline in order
to reduce the amount of time Clang takes to compile our project. To
this end, it would be handy if Clang could tell us, for example, “hey,
I just instantiated 'bar<int, float, double>', but X% of the
statements in that function did not require transformation,” where
'X%' is some threshold that could be set in the compiler invocation.
For now I'm thinking the option to set this warning threshold could be
called '-Wwasteful-template-threshold=' -- but I'm aware that sounds
awkward, and I'd love suggestions for a better name.

I think implementing this feature is possible by adding some state to
TreeTransform, or the Clang template instantiators that derive from
that class. But before I send a patch to do so, I'm curious if anyone
has attempted such a thing before, or if anyone has thoughts or
comments on this feature. I'd prefer not to spend time implementing
this diagnostic in Clang if it's predestined to be rejected in code
review, so please let me know what you think!

(I've cc'ed some contributors who I think have worked in this space
like @rnk, or those who might have better naming suggestions like
@rtrieu.)

- Brian Gesiak

Oops, sorry, I meant to send this to cfe-dev! Looking forward to any
and all advice/opinions, though. Thanks! - Brian

I like this idea. I’m not sure a warning is the best way to surface it. The first alternative that occurs to be would be the -ftime-trace JSON file, so you can dump out complete info about the most often instantiated templates, how many nodes they contained, and how many of them were dependent.

Which brings me to wonder, if clang already tracks whether an Expr is instantiation dependent, why can’t it do this optimization itself? I assume there are good reasons. I recall there are some invariants about Decl or Expr pointer identity, but maybe those are handled by AlwaysRebuild. Another thing that occurs to me is that some semantic analysis or warnings may not fire if the DeclContext of an Expr is dependent. I wonder what the savings would be if we could enumerate those checks, store them in the template pattern, and run only the checks that matter without re-traversing the entire AST. The last reason I can think of is ADL. I think that would defer name lookup for almost every CallExpr in a template.

I like this idea. I’m not sure a warning is the best way to surface it. The first alternative that occurs to be would be the -ftime-trace JSON file, so you can dump out complete info about the most often instantiated templates, how many nodes they contained, and how many of them were dependent.

Which brings me to wonder, if clang already tracks whether an Expr is instantiation dependent, why can’t it do this optimization itself? I assume there are good reasons. I recall there are some invariants about Decl or Expr pointer identity, but maybe those are handled by AlwaysRebuild. Another thing that occurs to me is that some semantic analysis or warnings may not fire if the DeclContext of an Expr is dependent. I wonder what the savings would be if we could enumerate those checks, store them in the template pattern, and run only the checks that matter without re-traversing the entire AST. The last reason I can think of is ADL. I think that would defer name lookup for almost every CallExpr in a template.

I /think/ ADL still allows calls to be resolved up-front if the parameters aren’t dependent. Yeah, here the non-dependent call is resolved and the dependent call is left with an UnresolvedLookupExpr: https://godbolt.org/z/kgmDHj

Awesome, thanks all!

Note that GitHub - aras-p/ClangBuildAnalyzer: Clang build analysis tool using -ftime-trace already has some information that helps in these kinds of diagnosing. It consumes the traces from -ftime-trace. It has good ways of saying which template instantiations took too long over your code base, and hacky ways of saying which templates took too long to instantiate (I wrote the hacky thing). Here's a more recent view into the output that you can get: https://github.com/aras-p/ClangBuildAnalyzer/blob/master/tests/self\-win\-clang\-cl\-9\.0rc2/\_AnalysisOutputExpected\.txt

I will say that one gap in this profiling tool is large quantities of tiny functions / templates, e.g. std::move. -ftime-trace won't trace events that are sufficiently small, so it can miss things like std::move that involve a huge quantity of tiny instantiations.

ClangBuildAnalyzer is interesting. I used it to make one change to LLVM already:
https://github.com/llvm/llvm-project/commit/285cf9a84e1bd2f601b097fc29f75ea4c7779eba
You can see the analysis output that change was informed by here:
https://reviews.llvm.org/P8171

The next area of interest I identified was printAliasInstr, which produced this patch:
https://reviews.llvm.org/D70650