clang memory usage with C++ template metaprogramming

Several years ago I embarked on a project involving some heavy-duty C++
template metaprogramming. In the end I abandoned it because the compile
times and memory usage with g++ were prohibitive.

On seeing clang's promised reduction of such requirements, I thought I'd
go back to my project and see how clang fared when compiling it.
Although it does indeed run much faster than g++, it actually uses
*more* memory. I'm just posting here to ask whether this is to be
expected. If it might indicate some issue, or if you'd like to know
where all this memory is being used, I'd be happy to try some profiling.

Here are some numbers for one particular compilation:

g++ 4.3.3:
  Wall clock time: 9:38.92
  Peak memory usage: ~1.40GiB

g++ 4.4.3:
  Wall clock time: 7:01.17
  Peak memory usage: ~1.37GiB

clang++ (svn r105478):
  Wall clock time: 0:15.59
  Peak memory usage: ~1.50GiB

TBH I'm astonished that clang was able to gobble up so much memory in so
little time!

John Bytheway

I've also seen this with template metaprogramming-heavy code, but aside from some idle speculation (we think it has to do with type-location information in Clang), we haven't looked into it closely.

  - Doug

Several years ago I embarked on a project involving some heavy-duty
C++ template metaprogramming. In the end I abandoned it because
the compile times and memory usage with g++ were prohibitive.

On seeing clang's promised reduction of such requirements, I
thought I'd go back to my project and see how clang fared when
compiling it. Although it does indeed run much faster than g++, it
actually uses *more* memory. I'm just posting here to ask whether
this is to be expected. If it might indicate some issue, or if
you'd like to know where all this memory is being used, I'd be
happy to try some profiling.

<snip>

I've also seen this with template metaprogramming-heavy code, but
aside from some idle speculation (we think it has to do with
type-location information in Clang), we haven't looked into it
closely.

Fair enough. I was curious, so I ran valgrind/massif to get an idea.
In short:

16.53% (259,009,024B) in 722 places, all below massif's threshold
14.49% (227,086,336B) clang::DeclContext::CreateStoredDeclsMap
12.85% (201,326,592B) clang::SourceManager::createInstantiationLoc
12.83% (201,068,544B) clang::ASTContext::CreateTypeSourceInfo
08.86% (138,792,960B) clang::ASTContext::getTemplateSpecializationType
06.82% (106,841,236B) clang::TokenLexer::ExpandFunctionArguments
04.94% (77,463,552B) clang::CXXConstructorDecl::Create
02.86% (44,883,968B) clang::ClassTemplateSpecializationDecl::Create
02.71% (42,532,864B) clang::ParmVarDecl::Create
02.25% (35,332,096B) clang::TagDecl::startDefinition
02.15% (33,763,328B) clang::TemplateArgumentList::TemplateArgumentList
02.14% (33,554,432B) std::vector<clang::Type*,
std::allocator<clang::Type*> >::_M_insert_aux
02.06% (32,329,728B) clang::CXXRecordDecl::Create
02.05% (32,157,696B) clang::CXXMethodDecl::Create
01.82% (28,585,984B) clang::CXXDestructorDecl::Create
01.59% (24,907,776B) clang::ASTContext::getFunctionType
01.27% (19,861,504B) clang::ASTContext::getLValueReferenceType
01.08% (16,908,288B) clang::TypedefDecl::Create

So indeed type-location information is a significant part, but nothing
is overwhelmingly dominant, which I guess is a good sign, even if it
means no single change would recover much.

I wonder idly: How plausible would it be to allow execution in a mode
where no source information was maintained, and thus reduce memory usage
(at the expense of useful errors/warnings)? Such a mode might be useful
at times. I'm guessing it would be prohibitively difficult.

John Bytheway

Several years ago I embarked on a project involving some heavy-duty
C++ template metaprogramming. In the end I abandoned it because
the compile times and memory usage with g++ were prohibitive.

On seeing clang's promised reduction of such requirements, I
thought I'd go back to my project and see how clang fared when
compiling it. Although it does indeed run much faster than g++, it
actually uses *more* memory. I'm just posting here to ask whether
this is to be expected. If it might indicate some issue, or if
you'd like to know where all this memory is being used, I'd be
happy to try some profiling.

<snip>

I've also seen this with template metaprogramming-heavy code, but
aside from some idle speculation (we think it has to do with
type-location information in Clang), we haven't looked into it
closely.

Fair enough. I was curious, so I ran valgrind/massif to get an idea.
In short:

16.53% (259,009,024B) in 722 places, all below massif's threshold
14.49% (227,086,336B) clang::DeclContext::CreateStoredDeclsMap

In theory, we might be able to use a smaller data structure for DeclContexts with only a few elements in them, which would probably help reduce memory usage when we're dealing with many instantiations of small templates.

12.85% (201,326,592B) clang::SourceManager::createInstantiationLoc
06.82% (106,841,236B) clang::TokenLexer::ExpandFunctionArguments

There must be some preprocessor metaprogramming going on in this example, too? That's pretty big for the preprocessor.

12.83% (201,068,544B) clang::ASTContext::CreateTypeSourceInfo

Yes, this is the type-source information I mentioned. If we make template instantiation "perfect" with respect to type-source information, so that any dependent type instantiates down to something that is structurally identical to the form it had when it was written in the source, then we could avoid allocating memory for type-source information in each type instantiation. We're not too far from this goal, but it has to be *perfect* for us to use the optimization.

04.94% (77,463,552B) clang::CXXConstructorDecl::Create
02.05% (32,157,696B) clang::CXXMethodDecl::Create
01.82% (28,585,984B) clang::CXXDestructorDecl::Create

A number of these could be eliminated if we were to lazily create the implicitly-declared default constructor, copy constructor, copy-assignment operator, and destructor.

08.86% (138,792,960B) clang::ASTContext::getTemplateSpecializationType

02.15% (33,763,328B) clang::TemplateArgumentList::TemplateArgumentList
01.59% (24,907,776B) clang::ASTContext::getFunctionType
01.27% (19,861,504B) clang::ASTContext::getLValueReferenceType
01.08% (16,908,288B) clang::TypedefDecl::Create

Not much we can do about these, except look for ways to make the various AST nodes smaller.

So indeed type-location information is a significant part, but nothing
is overwhelmingly dominant, which I guess is a good sign, even if it
means no single change would recover much.

I wonder idly: How plausible would it be to allow execution in a mode
where no source information was maintained, and thus reduce memory usage
(at the expense of useful errors/warnings)? Such a mode might be useful
at times. I'm guessing it would be prohibitively difficult.

We discussed this back when we improved type-source location information, but I am very much against having such a mode: the AST should always be the same, for all clients, or the size of the testing matrix explodes and we get far worse coverage. We should spend time optimizing the system as a unified whole rather than trying to separate out the less-efficient bits that provide needed functionality.

  - Doug

Fair enough. I was curious, so I ran valgrind/massif to get an
idea. In short:

<snip>

12.85% (201,326,592B) clang::SourceManager::createInstantiationLoc
06.82% (106,841,236B) clang::TokenLexer::ExpandFunctionArguments

There must be some preprocessor metaprogramming going on in this
example, too? That's pretty big for the preprocessor.

Yes, there is. Give me variadic templates and constexpr functions and I
can dispose of most of it :).

12.83% (201,068,544B) clang::ASTContext::CreateTypeSourceInfo

Yes, this is the type-source information I mentioned. If we make
template instantiation "perfect" with respect to type-source
information, so that any dependent type instantiates down to
something that is structurally identical to the form it had when it was
written in the source, then we could avoid allocating memory for
type-source information in each type instantiation. We're not too far
from this goal, but it has to be *perfect* for us to use the
optimization.

A laudable goal.

04.94% (77,463,552B) clang::CXXConstructorDecl::Create
02.05% (32,157,696B) clang::CXXMethodDecl::Create
01.82% (28,585,984B) clang::CXXDestructorDecl::Create

A number of these could be eliminated if we were to lazily create the
implicitly-declared default constructor, copy constructor,
copy-assignment operator, and destructor.

That sounds like the easiest of these; if so, it's a shame they're not
a larger proportion of the problem.

I wonder idly: How plausible would it be to allow execution in a
mode where no source information was maintained, and thus reduce
memory usage (at the expense of useful errors/warnings)? Such a
mode might be useful at times. I'm guessing it would be
prohibitively difficult.

We discussed this back when we improved type-source location
information, but I am very much against having such a mode: the AST
should always be the same, for all clients, or the size of the
testing matrix explodes and we get far worse coverage. We should
spend time optimizing the system as a unified whole rather than
trying to separate out the less-efficient bits that provide needed
functionality.

Yeah, that feels wise.

Thanks for the insight,

John Bytheway

Actually, we still don't unique non-canonical TSTs
(TemplateSpecializationTypes); it's possible that doing so would cut
down on memory usage in these cases, at least when none of the
template arguments are expressions.

We also make some unnecessary TSTs during argument expansion.

John.