libc++ Performance (compared to libstdc++)

Hi everyone,

I was chatting with Marshall offline last week, and I mentioned that several of my users had noted general performance regressions switching from libstdc++ to libc++. Marshall said that he's heard similar things, but has received few specific reports. He did recall looking at the problem which I believe is described here (Careful With That STL map insert, Eugene · Aras' website), which is still a problem. I'll certainly admit that I'd not investigated most of these in detail (with the exception of a std::complex -ffast-math issue, ⚙ D18639 Use __builtin_isnan/isinf/isfinite in complex). We do have a few performance-related libc++ bugs open:

21192 – Reading from stdin is 10x slower than libstdc++ - Reading from stdin is 1-2 orders of magnitude slower than using libstdc++ [I just tested this myself and updated the bug report].
19708 – libc++'s std::find is about 50% slower than the libstdc++'s - std::find is significantly slower than libstdc++.
20837 – libc++ std::sort has O(n^2) worst case, standard mandates O(N log(N)) - libc++'s std::sort is O(N^2) in the worst case (instead of O(N*ln(N))).
26886 – std::stable_sort exceeds maximum complexity defined by c++11 standard - libc++'s std::stable_sort also has a worst-case complexity issue.
15456 – A faster implementation of std::function - A faster implementation of std::function is possible
16747 – Poor performance of degenerated unordered_multimap and 21275 – Performance of unordered_multimap::insert much slower than GCC 4.8.2 - Our unordered_multimap insert is much slower than libstdc++'s. In PR16747, Howard interestingly explains libc++ has this problem because of an additional (i.e. not-required-by-the-standard) guarantee that libc++ provides regarding member ordering.

but very few are related to containers.

Baptiste Wicht has a benchmark covering use of several common standard algorithms with vectors, lists and dqueues (articles/bench.cpp at master · wichtounet/articles · GitHub) which he used for his post C++ benchmark – std::vector VS std::list VS std::deque | Blog blog("Baptiste Wicht");, and I've compiled this using LLVM/Clang/libc++ r271873 @ -O3, using both libc++ and libstdc++ 4.8.5, and run on an Intel Xeon E5-2699 v3 @ 2.30GHz running Linux 3.10.0. If you try this yourself, note that even on a fast machine the benchmark takes several hours to run.

$ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu
$ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-llvm

Of the 248 tests, libc++ was faster by at least 5% in 58 of the tests and libstdc++ was faster by at least 5% in 94 of the tests. libc++ was faster by at least 20% in 14 of the tests and libstdc++ was faster by at least 20% in 64 of the tests. The real problem, however, comes from the extremums. libc++ is never more than 65% faster than libstdc++:

destruction___Trivial_128_ list -0.65
destruction___Trivial_4096_ vector -0.40
destruction___Trivial_1024_ list -0.38
destruction___Trivial_1024_ vector -0.37
random_remove___NonTrivialArray_32_ vector -0.3

but libc++ is sometimes over 10x slower than libstdc++:

fill_back___NonTrivialStringMovable list_inserter 9.96
fill_back___NonTrivialStringMovable vector_reserve 10.21
fill_back___NonTrivialStringMovableNoExcept vector_reserve 10.82
fill_back___NonTrivialStringMovableNoExcept vector_inserter 11.15
fill_back___NonTrivialStringMovable vector_inserter 11.93

I've attached the full list.

A second benchmark, http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp (C++ Set Performance 2 « A howl on the wind…), modified only to repeat each test 30 instead of 7 times, and compiled as before:

uint32_t std::set erase: -0.37
std::string std::set erase: -0.30
std::string std::set insertion: -0.23
std::string std::unordered_set erase: -0.16
std::string std::unordered_set iterate: -0.15
std::string std::set lookup: -0.15
uint32_t std::set insertion: -0.13
uint32_t std::unordered_set iterate: -0.072
std::string std::set iterate: -0.062
uint32_t std::set iterate: -0.054
uint32_t std::set lookup: -0.015
uint32_t std::unordered_set erase: 0.085
std::string std::unordered_set insertion: 0.22
std::string std::unordered_set lookup: 0.30
uint32_t std::unordered_set insertion: 0.51
uint32_t std::unordered_set lookup: 0.61

In this benchmark, libc++ beats libstdc++ by more than 5% in 10 tests, and libstdc++ beats libc++ by more than 5% in 5 tests. Again, however, libc++'s downside is larger, being up to 61% slower (in the 'uint32_t std::unordered_set lookup' test) than libstdc++. libstdc++ loses only by 37% to libc++, at most, in the 'uint32_t std::set erase' test. Also, I can easily imagine that users are more-likely to notice a performance difference in lookup than in erase.

To pick another benchmark, I compiled and ran the one from Data-Oriented Hash Table – Nathan Reed’s coding blog - and this must be good because the post ends with, "And remember, if Chandler Carruth and Mike Acton give you advice about data structures, listen to them. ;)". I modified the benchmark only by adding constexpr to min() and max() of XorshiftRNG to make it compile with libc++. This benchmarks many configurations and takes nearly an hour to run. I'll summarize the results I'll say that libc++ is almost always slower than libstdc++, and that as the element size and/or the number of elements increases it gets worse. Here are the relative timing differences; these are tests for unordered_map:

Fill for 8-byte elements, 32-byte elements, 128-byte elements, 1K-byte elements, 4K-byte elements:
100000 -0.022 0.026 0.10 0.057 0.058
200000 -0.028 0.0041 0.12 0.083 0.057
300000 -0.022 -0.023 0.018 0.040 0.035
400000 -0.010 -0.0094 -0.077 0.048 0.049
500000 0.22 0.28 0.18 0.17 0.094
600000 -0.064 -0.081 -0.11 0.0033 0.020
700000 -0.059 -0.043 -0.089 -0.0047 0.027
800000 -0.037 -0.053 -0.072 0.0092 0.035
900000 0.36 0.30 0.20 0.15 0.099
1000000 0.31 0.23 0.17 0.13 0.098

The first column is the number of elements; negative numbers mean libc++ is faster. For 1000000 4K elements, libstdc++ is faster by 9.8%. For 1000000 8-byte elements, libstdc++ is faster by nearly 32%.

Pre-sized fill:
100000 0.024 0.020 0.12 0.091 0.042
200000 0.015 0.016 0.16 0.041 0.084
300000 0.00038 0.033 0.17 0.090 0.090
400000 0.070 0.041 0.069 0.084 0.094
500000 0.069 0.025 0.061 0.12 0.10
600000 -0.0096 0.023 0.016 0.074 0.072
700000 0.060 0.0013 0.036 0.094 0.085
800000 0.029 -0.0048 0.035 0.083 0.075
900000 -0.011 0.0025 -0.037 0.078 0.085
1000000 0.0022 0.019 0.011 0.084 0.081

Time for 100K lookups:
100000 0.12 0.13 0.089 0.11 0.071
200000 0.13 0.12 0.10 0.11 0.072
300000 0.098 0.12 0.053 0.085 0.080
400000 0.18 0.11 0.088 0.059 0.030
500000 0.12 0.080 0.072 0.075 0.033
600000 0.095 0.10 0.076 0.063 0.017
700000 0.17 0.097 0.12 0.083 0.043
800000 0.16 0.13 0.11 0.092 0.050
900000 0.094 0.086 0.056 0.041 -0.0047
1000000 0.14 0.11 0.092 0.073 0.016

Time for 100K failed lookups:
100000 0.15 0.15 0.11 0.087 0.077
200000 0.14 0.11 0.083 0.074 0.059
300000 0.10 0.091 0.090 0.12 0.097
400000 0.10 0.12 0.061 0.11 0.099
500000 0.19 0.20 0.12 0.12 0.12
600000 0.18 0.18 0.12 0.11 0.095
700000 0.082 0.096 0.068 0.079 0.070
800000 0.20 0.19 0.14 0.13 0.12
900000 0.20 0.17 0.10 0.11 0.097
1000000 0.25 0.22 0.17 0.14 0.13

Time to remove half the elements:
100000 0.15 0.15 0.078 0.066 0.066
200000 0.12 0.12 0.039 0.055 0.060
300000 0.12 0.070 0.024 0.038 0.095
400000 0.16 0.16 0.15 0.13 0.13
500000 0.10 0.12 0.12 0.13 0.13
600000 0.077 0.015 0.036 0.026 0.049
700000 0.086 0.16 0.16 0.17 0.14
800000 0.076 0.044 0.051 0.043 0.064
900000 0.11 0.11 0.11 0.12 0.13
1000000 0.10 0.090 0.061 0.081 0.10

Many of our users have code that is sensitive to the performance of standard containers and algorithms, and this preliminary benchmarking lends support to the anecdotes that libc++ is slower than libstdc++. Worryingly, the extremes of these differences are pretty large. Obviously application impact can't be judged by some benchmarks I happened to find on the internet, but this is something we, as a community, should look at more closely.

Thanks again,
Hal

psums.txt (10.7 KB)

Hi Hal,

Thanks for putting this together. I’ll start looking at the benchmarks and seeing what I can do.
Here are my drive by thoughts (more to come tomorrow)

these benchmarks could be a good starting point for an permanten/commit-based libc++ performance regression
test like LLD got (http://lists.llvm.org/pipermail/llvm-dev/2016-January/094132.html)

From: "Eric Fiselier" <eric@efcs.ca>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "cfe-dev" <cfe-dev@lists.llvm.org>, "Marshall Clow"
<mclow.lists@gmail.com>
Sent: Saturday, July 2, 2016 1:03:53 AM
Subject: Re: libc++ Performance (compared to libstdc++)

Hi Hal,

Thanks for putting this together. I'll start looking at the
benchmarks and seeing what I can do.
Here are my drive by thoughts (more to come tomorrow)

> Hi everyone,

> I was chatting with Marshall offline last week, and I mentioned
> that
> several of my users had noted general performance regressions
> switching from libstdc++ to libc++. Marshall said that he's heard
> similar things, but has received few specific reports. He did
> recall
> looking at the problem which I believe is described here (
> http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/
> ), which is still a problem. I'll certainly admit that I'd not
> investigated most of these in detail (with the exception of a
> std::complex -ffast-math issue, http://reviews.llvm.org/D18639 ).
> We
> do have a few performance-related libc++ bugs open:

> https://llvm.org/bugs/show_bug.cgi?id=21192 - Reading from stdin is
> 1-2 orders of magnitude slower than using libstdc++ [I just tested
> this myself and updated the bug report].

> https://llvm.org/bugs/show_bug.cgi?id=19708 - std::find is
> significantly slower than libstdc++.

> https://llvm.org/bugs/show_bug.cgi?id=20837 - libc++'s std::sort is
> O(N^2) in the worst case (instead of O(N*ln(N))).

> https://llvm.org/bugs/show_bug.cgi?id=26886 - libc++'s
> std::stable_sort also has a worst-case complexity issue.

> https://llvm.org/bugs/show_bug.cgi?id=15456 - A faster
> implementation
> of std::function is possible

> https://llvm.org/bugs/show_bug.cgi?id=16747 and
> https://llvm.org/bugs/show_bug.cgi?id=21275 - Our
> unordered_multimap
> insert is much slower than libstdc++'s. In PR16747, Howard
> interestingly explains libc++ has this problem because of an
> additional (i.e. not-required-by-the-standard) guarantee that
> libc++
> provides regarding member ordering.

> but very few are related to containers.

> Baptiste Wicht has a benchmark covering use of several common
> standard algorithms with vectors, lists and dqueues (
> https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp
> ) which he used for his post
> http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html
> , and I've compiled this using LLVM/Clang/libc++ r271873 @ -O3,
> using both libc++ and libstdc++ 4.8.5, and run on an Intel Xeon
> E5-2699 v3 @ 2.30GHz running Linux 3.10.0. If you try this
> yourself,
> note that even on a fast machine the benchmark takes several hours
> to run.

> $ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0
> bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu

> $ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include
> -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o
> /tmp/b-llvm

> Of the 248 tests, libc++ was faster by at least 5% in 58 of the
> tests
> and libstdc++ was faster by at least 5% in 94 of the tests. libc++
> was faster by at least 20% in 14 of the tests and libstdc++ was
> faster by at least 20% in 64 of the tests. The real problem,
> however, comes from the extremums. libc++ is never more than 65%
> faster than libstdc++:

> destruction___Trivial_128_ list -0.65

> destruction___Trivial_4096_ vector -0.40

> destruction___Trivial_1024_ list -0.38

> destruction___Trivial_1024_ vector -0.37

> random_remove___NonTrivialArray_32_ vector -0.3

Whoo hoo!

> but libc++ is sometimes over 10x slower than libstdc++:

> fill_back___NonTrivialStringMovable list_inserter 9.96

> fill_back___NonTrivialStringMovable vector_reserve 10.21

> fill_back___NonTrivialStringMovableNoExcept vector_reserve 10.82

> fill_back___NonTrivialStringMovableNoExcept vector_inserter 11.15

> fill_back___NonTrivialStringMovable vector_inserter 11.93

This is almost certainly testing against GCC's COW string, which is
obviously going to be a lot faster.
I looked at the tests and it looks like the strings are being copied
(not moved) into the vector, so the name is a bit of the type is a
bit of a misnomer.

Indeed. I had forgotten that libstdc++ uses COW strings (I imagine that if I were to test using the GCC 5+ implementation we might see a different result).

> I've attached the full list.

> A second benchmark,
> http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp (
> https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/ ),
> modified only to repeat each test 30 instead of 7 times, and
> compiled as before:

> uint32_t std::set erase: -0.37

> std::string std::set erase: -0.30

> std::string std::set insertion: -0.23

> std::string std::unordered_set erase: -0.16

> std::string std::unordered_set iterate: -0.15

> std::string std::set lookup: -0.15

> uint32_t std::set insertion: -0.13

> uint32_t std::unordered_set iterate: -0.072

> std::string std::set iterate: -0.062

> uint32_t std::set iterate: -0.054

> uint32_t std::set lookup: -0.015

> uint32_t std::unordered_set erase: 0.085

> std::string std::unordered_set insertion: 0.22

> std::string std::unordered_set lookup: 0.30

> uint32_t std::unordered_set insertion: 0.51

> uint32_t std::unordered_set lookup: 0.61

> In this benchmark, libc++ beats libstdc++ by more than 5% in 10
> tests, and libstdc++ beats libc++ by more than 5% in 5 tests.
> Again,
> however, libc++'s downside is larger, being up to 61% slower (in
> the
> 'uint32_t std::unordered_set lookup' test) than libstdc++.
> libstdc++
> loses only by 37% to libc++, at most, in the 'uint32_t std::set
> erase' test. Also, I can easily imagine that users are more-likely
> to notice a performance difference in lookup than in erase.

I can't reproduce the unordered_set<uint32_t> insertion results, but
I could reproduce the 60% difference for lookup and I managed to
significantly reduce it.
I checked in the optimization in r274423 which results in a ~45%
speedup for unordered_set<...>::find.

Great! I re-ran this test and now see this:

uint32_t std::set erase: -0.35
std::string std::set erase: -0.32
std::string std::set insertion: -0.23
std::string std::unordered_set erase: -0.17
std::string std::set lookup: -0.16
std::string std::unordered_set iterate: -0.15
uint32_t std::set insertion: -0.12
std::string std::set iterate: -0.070
uint32_t std::unordered_set iterate: -0.022
uint32_t std::set lookup: 0.0065
uint32_t std::set iterate: 0.0072
uint32_t std::unordered_set lookup: 0.030
uint32_t std::unordered_set erase: 0.072
std::string std::unordered_set lookup: 0.12
std::string std::unordered_set insertion: 0.20
uint32_t std::unordered_set insertion: 0.54

This is significantly better. I still see the insertion issue, however, and I confirmed that I still see that difference even if I strip down the benchmark only to test std::unordered_set<uint32_t> insertion. I'll file a separate bug report for this issue.

Thanks again,
Hal

If you're looking for "fair" comparisons between the library portions of libc++ and libstdc++, it may make more sense to use g++ as the compiler instead. I'm pretty sure that libc++ works with g++, and I also believe that the libstdc++ that ships with gcc 5.0+ no longer uses the non-conforming COW strings. I would suggest using clang++ with a newer libstdc++, but I believe that isn't currently supported due to some ABI tag weirdness.

From: "Ben via cfe-dev Craig" <cfe-dev@lists.llvm.org>
To: cfe-dev@lists.llvm.org
Sent: Tuesday, July 5, 2016 11:40:16 AM
Subject: Re: [cfe-dev] libc++ Performance (compared to libstdc++)

If you're looking for "fair" comparisons between the library portions
of
libc++ and libstdc++, it may make more sense to use g++ as the
compiler
instead. I'm pretty sure that libc++ works with g++, and I also
believe
that the libstdc++ that ships with gcc 5.0+ no longer uses the
non-conforming COW strings. I would suggest using clang++ with a
newer
libstdc++, but I believe that isn't currently supported due to some
ABI
tag weirdness.

Good point. Also, using the newer libstdc++ with Clang should now be supported in Clang trunk as of a few days ago (the ABI tag attribute support landed).

-Hal