Factoid: Each class template instantiation costs 1KiB

Messing around with clang -cc1 -print-stats, I think I can show that every class template instantiation costs a minimum of 1KiB of memory. To me, that seems like a lot of data just to capture temporary type traits that are often used solely for template metaprogramming purposes. I figured that sharing this factoid might make the cost concrete, and might motivate talented C++ library developers to think more about optimizing their instantiation counts.

This is what I hacked together to try to measure the AST memory used per class template instantiation:

$ cat t.cpp 
template <int N>
struct ClassTemplate {
  static const int sdm = ClassTemplate<N-1>::sdm + 1;
};
template <>
struct ClassTemplate<0> {
  static const int sdm = 0;
};
int main() {
  return ClassTemplate<MULTIPLIER * 100>::sdm;
}

$ for i in $(seq 1 10) ; do clang -cc1 -DMULTIPLIER=$i t.cpp  -print-stats |& \
   grep Bytes\ used | cut -d' ' -f 3 |  awk '{ sum += $0 } END { print sum }'  ; done
386824
500824
614824
728824
842824
956824
1070824
1184824
1298824
1412824

I’m not a good enough shell programmer to do the subtraction here, but I threw the numbers into a spreadsheet and did the obvious difference calculations: each step adds 114000 bytes, and dividing by the multiplier step of 100 gives 1140 bytes per instantiation. The counts of each record kind go up in very obviously linear ways:
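
For what it’s worth, the subtraction can be done in the shell too. Here is a small awk pipeline, fed the byte counts from above, that prints each step’s delta and that delta divided by the multiplier step of 100:

```shell
# Print the delta between consecutive "Bytes used" measurements, and the
# per-instantiation cost (delta divided by the multiplier step of 100).
printf '%s\n' 386824 500824 614824 728824 842824 956824 \
              1070824 1184824 1298824 1412824 |
  awk 'NR > 1 { delta = $1 - prev; print delta, delta / 100 } { prev = $1 }'
```

Every step comes out as 114000 bytes, i.e. 1140 bytes per instantiation, matching the spreadsheet result.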

*** AST Context Stats:
  2080 types total.    
    1 ConstantArray types, 48 each (48 bytes)
    63 Builtin types, 32 each (2016 bytes)
    1 FunctionProto types, 48 each (48 bytes)
    1 InjectedClassName types, 48 each (48 bytes)
    5 Pointer types, 48 each (240 bytes)
    1003 Record types, 32 each (32096 bytes)
    1006 TemplateSpecialization types, 48 each (48288 bytes)
Total bytes = 82784          
0/1001 implicit default constructors created
0/1001 implicit copy constructors created
0/1001 implicit move constructors created
0/1001 implicit copy assignment operators created
0/1001 implicit move assignment operators created
0/1001 implicit destructors created                      
                                                                                 
Number of memory regions: 207
Bytes used: 1147908              
Bytes allocated: 1171456
Bytes wasted: 23548 (includes alignment, etc)
                                                                                 
*** Decl Stats:
  3025 decls total.            
    1 ExternCContext decls, 72 each (72 bytes)
    1 Function decls, 168 each (168 bytes)
    1002 Var decls, 104 each (104208 bytes)
    1 NonTypeTemplateParm decls, 88 each (88 bytes)
    8 Field decls, 80 each (640 bytes)
    1005 CXXRecord decls, 144 each (144720 bytes)
    1001 ClassTemplateSpecialization decls, 184 each (184184 bytes)
    5 Typedef decls, 88 each (440 bytes)
    1 ClassTemplate decls, 88 each (88 bytes)
Total bytes = 434608
                                                                                 
*** Stmt/Expr Stats:
  9021 stmts/exprs total.
    1000 SubstNonTypeTemplateParmExpr, 40 each (40000 bytes)
    1 UnresolvedLookupExpr, 64 each (64 bytes)
    1001 MaterializeTemporaryExpr, 24 each (24024 bytes)
    1006 IntegerLiteral, 32 each (32192 bytes)
    1002 ConstantExpr, 24 each (24048 bytes)
    1 DependentScopeDeclRefExpr, 56 each (56 bytes)
    2004 DeclRefExpr, 32 each (64128 bytes)                   
    1001 ImplicitCastExpr, 24 each (24024 bytes)
    2003 BinaryOperator, 32 each (64096 bytes)
    1 ReturnStmt, 16 each (16 bytes)
    1 CompoundStmt, 16 each (16 bytes)
Total bytes = 272664                                                    

Digging in a bit, the per-instantiation cost splits roughly 8-to-5 between Decls and Stmts (about 430 bytes of Decls to about 270 bytes of Stmts), which suggests that the template body matters a lot, and mine is pretty simple. It seems like one could use a similar methodology to stamp out unique_ptrs and count the per-instantiation memory overhead, and I expect it would be quite high.

I’m not sure what to do with this information, other than to reflect on the fact that the compiler currently doesn’t give the developer any feedback about the costs they are paying to compile their program. The state of the art is basically -ftime-trace, which mostly tells people that the compiler spends a lot of time instantiating base or standard library templates. I’ve seen some limited success with this, but only C++ experts seem to find the data actionable. I wonder if we could build a trace flame graph weighted by memory allocation, given that the cost ratio of RAM to compute is increasing. :thinking:

Maybe what this all points to is that header-only library evolution is a dead end when it comes to compile time, and that the old ways, i.e. forward declarations, pimpl patterns, and other forms of implementation hiding, are still relevant if you care about fast compiles, even in a modular world. :person_shrugging:


Sigh

Yeah, this is something I’m aware of. We have a few things that make this particularly bad, and Richard and I discussed it after his great talk at LLVM-Dev last year (which was roughly on this topic!).

I looked into it for a while, and there is a bit of surgery that could happen that would help, but things would get worse for a while before they got better.

ONE of the worst offenders here is how we store the declarations for a DeclContext. Currently, we do this as an intrusive linked-list, where each declaration holds a link to the NEXT declaration, as well as to the DeclContext. This means we end up not being able to reuse any AST nodes during template instantiation, so basically EVERY node ends up getting re-instantiated.

At one point I looked at fixing the first part (changing DeclContext to hold a std::vector of Decls instead of a linked list), but it ends up messing with all of our filtered iteration over Decls in quite a few places, which made stack sizes increase, etc. I didn’t have as much time as I wanted to look into it, unfortunately.
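
For readers less familiar with the internals, here is a rough sketch of the two layouts being compared (heavily simplified, not Clang’s actual declarations):

```cpp
#include <cassert>

struct DeclContext;

// Today's layout: every Decl carries an intrusive "next" link plus a back
// pointer to its owning context, so each node belongs to exactly one
// DeclContext and cannot be shared across instantiations.
struct Decl {
  Decl *NextInContext = nullptr;
  DeclContext *Ctx = nullptr;
};

struct DeclContext {
  Decl *FirstDecl = nullptr; // head of the intrusive linked list
  // The alternative discussed above: let the context own the list, e.g.
  //   std::vector<Decl *> Decls;
  // freeing Decl of its NextInContext link.

  void add(Decl *D) {
    D->NextInContext = FirstDecl;
    D->Ctx = this;
    FirstDecl = D;
  }
};
```

Because the links live inside the Decl itself, instantiating a template means allocating fresh nodes for every member rather than pointing at the pattern’s nodes.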

The second part of the problem (Decls having a link to their DeclContext) is one that I don’t know how to fix, nor even have an idea for. We end up needing this information so often that I don’t have a good idea of how to split it out. BUT if we could figure out how to do so, canonicalized decls could be re-used in a number of places, particularly in instantiations, which would decrease the cost.

IIRC, this is the talk Richard gave that showed how Carbon does instantiation (which I thought was clever!): https://www.youtube.com/watch?v=j0BL52NdjAU

We obviously couldn’t do anything like that, but he does point out some cases where we do a very bad job due to the above.


From the build system implementor’s perspective, this was always true (I implemented CMake’s support for building C++ modules). This is because, during development, any change to a module interface unit is highly likely to trigger recompilation of anything importing it. So you still want to hide implementation details from such files. PIMPL is also a technique in the same major direction as module files: controlling your API and ABI surface.
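
To make the “controlling your API and ABI surface” point concrete, here is a minimal PIMPL sketch (Widget is a made-up name, and it is shown in one file for brevity; normally Impl and the out-of-line definitions live in the .cpp so consumers never see the details):

```cpp
#include <cassert>
#include <memory>

// Public header part: callers see only an incomplete Impl.
class Widget {
public:
  Widget();
  ~Widget(); // defined out of line, where Impl is complete
  int value() const;

private:
  struct Impl;                 // incomplete in the header
  std::unique_ptr<Impl> impl_; // all state hidden behind one pointer
};

// Implementation-file part: the details live behind the pointer.
struct Widget::Impl {
  int value = 42;
};
Widget::Widget() : impl_(std::make_unique<Impl>()) {}
Widget::~Widget() = default;
int Widget::value() const { return impl_->value; }
```

Changing Widget::Impl then only recompiles the one translation unit that defines it, whether the class is textually included from a header or exported from a module interface unit.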

Note that, AFAIK, forward declarations can be tricky with modules because you have to pay attention to which module a type is attached to when you forward declare it. That is, you can’t just forward declare another module’s types in your module interface unit, because their symbols incorporate the name of the module they actually belong to (because weak module ownership[1] turned out to be a dead end), and there’s no syntax like namespace foo { class bar; } for foreign modules; you just have to import them.


  1. It is very possible this is the wrong technical term for what I’m talking about. ↩︎


Just to be clear, I’m not pushing for optimizations to Clang here. My goal in sharing this data is to give folks involved in library design discussions, say at WG21, data with which to push back and advocate for simpler designs with as few helper templates as possible.

Richard’s talk was great, and maybe one day someone will come along and make an attempt to reimplement the entirety of Clang template instantiation along those lines, but… until that day, library designers should be aware that these costs are fixed, and they aren’t getting smaller any time soon. Also, we haven’t even started talking about debug info. Call it an exercise for the reader. =D


Thanks for doing this, @rnk.

Perhaps we can teach Clang’s statistics and timing logic to compute the average memory used per template instantiation in a given translation unit.
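
As a stopgap, one can approximate that average today by scraping the -print-stats output. A rough sketch, using the stat lines quoted earlier in the thread as canned input (in real use you’d pipe in clang -cc1 ... -print-stats 2>&1 instead; note this crudely charges all AST memory to the specializations):

```shell
# Divide total "Bytes used" by the ClassTemplateSpecialization decl count
# to get a crude upper bound on average bytes per instantiation.
stats='Bytes used: 1147908
    1001 ClassTemplateSpecialization decls, 184 each (184184 bytes)'
bytes=$(printf '%s\n' "$stats" | awk '/Bytes used:/ { print $3 }')
count=$(printf '%s\n' "$stats" | awk '/ClassTemplateSpecialization/ { print $1 }')
echo $((bytes / count)) # integer division
```

On these numbers that prints 1146, in the same ballpark as the 1140 measured by differencing, which is some evidence the attribution isn’t wildly off.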

Based on your comments, I’d like to open another direction. Clang was built under the assumption that the AST is immutable. However, that’s limiting in quite a few places (ASTImporter, lldb, and other tools that work at a high level and synthesize code). I wonder if we could relax the immutable-AST assumption and make the AST mutable in the template instantiation area. That would mean we could implement the overlay approach in TreeTransform with far better node reuse. I bet Richard has thought about how that could be implemented in Clang :wink:

Understood, yet tempting…

cc: @mizvekov, @devincoughlin.

Given that this has been the assumption in Clang for 15+ years, I would be pretty wary of trying to walk that back now. That has impacts on tooling, but also, reimplementing the template instantiation engine is something I think we want to avoid for as long as we can due to the risk (silent ABI changes due to template instantiation differences are a serious concern).
