Massive performance regression in clang 3.9

Hi,

I have observed a massive performance regression when upgrading from clang 3.7
to 3.9. I have also checked that current tip-of-trunk seems to have this
regression too.

I have managed to isolate the patch at which the regression appeared as
r248431 on 23rd Sept 2015.

I have attached a simple test-case that demonstrates the problem rather
effectively.

Prior to r248431, this code would compile in under 20 seconds, consuming
approx. 400Mb of memory. GCC, for comparison, takes longer at around 50
seconds on my machine but consumes only ~250Mb.

Following patch r248431 and all the way up to tip-of-trunk, clang is
terminated by the OOM handler due excessive memory usage in excess of 7
Gigabytes!!!

This seems a little unreasonable -- the test-code in question is a little
pathological maybe but is designed to exploit template memoisation, so the
excessive memory usage would imply to me that this memoisation is no longer
working.

Is anyone else able to confirm my findings? If so, I will raise this as a bug
report. Richard, I have copied you as the author of the patch in question
(hope you don't mind!).

Cheers,
Andy

test.cpp (1.24 KB)

That’s not the problem. The issue is that this:

(sizeof…(Is) + Is)…

substitutes into the sizeof… subexpression once for each element of Is, which massively amplifies any change in the behaviour of sizeof…

r248431 changes our behaviour such that sizeof… now scans its parameter’s expansion looking for unexpanded packs (which can appear if it’s used inside an alias template), making the instantiation of that fragment quadratic in the size of the pack Is.

I’ll take a look and see if we can handle this a bit better.

Hi,

I have observed a massive performance regression when upgrading from clang
3.7
to 3.9. I have also checked that current tip-of-trunk seems to have this
regression too.

I have managed to isolate the patch at which the regression appeared as
r248431 on 23rd Sept 2015.

I have attached a simple test-case that demonstrates the problem rather
effectively.

Prior to r248431, this code would compile in under 20 seconds, consuming
approx. 400Mb of memory. GCC, for comparison, takes longer at around 50
seconds on my machine but consumes only ~250Mb.

Following patch r248431 and all the way up to tip-of-trunk, clang is
terminated by the OOM handler due excessive memory usage in excess of 7
Gigabytes!!!

This seems a little unreasonable -- the test-code in question is a little
pathological maybe but is designed to exploit template memoisation, so the
excessive memory usage would imply to me that this memoisation is no longer
working.

That's not the problem. The issue is that this:

(sizeof...(Is) + Is)...

substitutes into the sizeof... subexpression once for each element of Is,
which massively amplifies any change in the behaviour of sizeof...

r248431 changes our behaviour such that sizeof... now scans its
parameter's expansion looking for unexpanded packs (which can appear if
it's used inside an alias template), making the instantiation of that
fragment quadratic in the size of the pack Is.

I'll take a look and see if we can handle this a bit better.

As of r284653, the performance and memory usage for simple sizeof...
expansions should be a lot better. But your approach is still O(N^2 log N)
when compiled with Clang. You can remove a factor of N by passing around
the size of the pack instead of recomputing it:

template <unsigned Size, typename List, bool Odd>
struct PartialVariadicIndicesSequence;

template <unsigned Size, unsigned... Is>
struct PartialVariadicIndicesSequence<Size, VariadicIndices<Is...>, false> {
  using Type = VariadicIndices<Is..., Size + Is ...>;
};

template <unsigned Size, unsigned... Is>
struct PartialVariadicIndicesSequence<Size, VariadicIndices<Is...>, true> {
  using Type = VariadicIndices<Is..., Size + Is ..., Size * 2>;
};

template <unsigned N>
struct MakeVariadicIndicesImpl : PartialVariadicIndicesSequence<
    N / 2,
    typename MakeVariadicIndicesImpl<N / 2>::Type,
    ((N % 2) != 0)
  > { };

Is anyone else able to confirm my findings? If so, I will raise this as a

As of r284653, the performance and memory usage for simple sizeof...
expansions should be a lot better.

Wonderful! Brilliant! Thanks! We're now within a shade of the compilation
time of gcc again. Memory consumption is back to a much more reasonable
400Mb.

But your approach is still O(N^2 log N)
when compiled with Clang. You can remove a factor of N by passing around
the size of the pack instead of recomputing it [...]

And this makes a huge difference, decreasing the compile time to a mere second.
Thanks for the suggestion. It is one worth remembering. Intuitively it seems
that sizeof... should just "know" the size of the pack (i.e. O(1) lookup) in
the way that the normal sizeof "knows" the size of an object, but obviously it
isn't the case.

As mentioned before, libc++ uses a similar such construct here:

http://llvm.org/viewvc/llvm-project/libcxx/trunk/include/__tuple?revision=280007&view=markup#l86

(granted, not for more recent versions of clang, but I think still used with
gcc for example)

It would probably be worth doing the same trick here, especially as sizeof...
is calculated seven times. I should have some time later when I can put
together a patch.

Thanks again,
Andy