Performance issues of dynamic_cast in libcxxabi

Recently I did a benchmark for the performance of dynamic_cast. I compared the performance of the libc++14 dynamic_cast implementation with the libstdc++10 version and the result surprised me a little. Here are the interesting parts.

The first benchmark tests the performance when dynamic_cast is used for down-casting a non-virtual base sub-object pointer to the most derived type along a single chain of class inheritance hierarchy:

template <std::size_t Depth>
struct Chain : Chain<Depth - 1> {};

template <>
struct Chain<0> {
  virtual ~Chain();

template <typename Dyn, typename From, typename To = Dyn>
static void DynCast() {
  Dyn obj;
  From* from_ptr = &obj;
  To* to_ptr = dynamic_cast<To*>(from_ptr);

// Dynamic cast from Chain<0>* to Chain<X>* for X = 1, 2, 3, ...
BENCHMARK(DynCast<Chain<1>, Chain<0>>)->Name("Chain, 1 level");
BENCHMARK(DynCast<Chain<2>, Chain<0>>)->Name("Chain, 2 levels");
BENCHMARK(DynCast<Chain<3>, Chain<0>>)->Name("Chain, 3 levels");
// etc.

I believe the scenario in this benchmark is the most used scenario where people use dynamic_cast. The Itanium ABI also encourages implementations to optimize such cases. The benchmark results are:

The blue bars represents libstdc++ and the orange bars represents libc++. The CPU time consumed of libc++ is proportional to the number of levels on the class hierarchy, while the libstdc++ implementation has a constant CPU time consumption.

A similar pattern also appears when trying to down-casting a non-virtual base sub-object pointer to the most derived type on a DAG-shaped class inheritance network:

template <std::size_t Index, std::size_t Depth>
struct Dag : Dag<Index, Depth - 1>, 
             Dag<Index + 1, Depth - 1> {};

template <std::size_t Index>
struct Dag<Index, 0> {
  virtual ~Dag();

BENCHMARK(DynCast<Dag<0, 3>, Dag<3, 0>>)->Name("DAG, 3 levels");
BENCHMARK(DynCast<Dag<0, 4>, Dag<4, 0>>)->Name("DAG, 4 levels");
BENCHMARK(DynCast<Dag<0, 5>, Dag<5, 0>>)->Name("DAG, 5 levels");

The results are: (this is my first post and I cannot post more than 1 figure :frowning: )

DAG, 3 levels                      60.6 ns
DAG, 4 levels                       118 ns
DAG, 5 levels                       253 ns

DAG, 3 levels                      7.10 ns
DAG, 4 levels                      7.11 ns
DAG, 5 levels                      7.14 ns

This time the performance of libc++ is much worse than libstdc++ with a 35x slow down when down-casting 5 levels.

I went to the __dynamic_cast source code in libc++ and seems like that the parameter src2dst_offset is not used for optimizing the two simple cases above. Even if the dynamic type of the object is the same as the destination type and src2dst_offset hints that a direct conversion can be immediately made, __dynamic_cast still traverses the class inheritance graph to find a public path. Is this behavior intentional? Can we improve the performance by leveraging the src2dst_offset parameter?

P.S. You can find the benchmark source code and benchmark results at here.

Just to have a more apples-to-apples comparison, maybe use -DLIBCXX_CXX_ABI={libcxxabi, libsupc++} instead? That would directly compare libc++abi and libsupc++. I don’t expect it to make a difference, but removing variables is always a good idea when benchmarking.

Thanks for your reply. I build two libc++ versions that links against libcxxabi and libsupc++ from the latest master branch of llvm-project, respectively. The new benchmark is done on these two new libc++ builds.

Here are the results: (figures are measured as multiples of some baseline)

                    libcxxabi   libsupc++
Chain, 1 level      6.589       5.579
Chain, 2 levels     8.511       5.675
Chain, 3 levels     10.168      5.643
Chain, 4 levels     11.986      5.605
Chain, 5 levels     13.915      5.579
Chain, 6 levels     15.929      5.579
Chain, 7 levels     18.384      5.676
Chain, 8 levels     20.165      5.655
Chain, 9 levels     22.444      5.749
DAG, 3 levels       53.078      6.618
DAG, 4 levels       103.082     6.554
DAG, 5 levels       216.760     6.558

The results are indeed similar as I posted before.

These numbers really don’t look great. Would you maybe be interested in improving the implementation and contribute your benchmarks?

Sure I’m interested. I can start crafting a patch and see what improvements we can get.

1 Like

I got a problem. Where should I put my benchmark code within the LLVM source tree? I don’t see a benchmark directory under libcxxabi.

Ideally we would add a benchmarks sub-directory in libcxxabi/, but that should be a separate effort. I think for now you can just put them in libcxx/benchmarks. Maybe add a subdirectory libcxxabi there, so we can move the benchmarks once we add the infrastructure to libcxxabi/

I have created a revision on Phabricator: ⚙ D137315 [libc++abi] Improve performance of __dynamic_cast

If you are looking to optimise dynamic_cast, there is still low-hanging fruit with final classes.

For example, I sometimes convert
typeid(p) == typeid(T*)

which is slightly faster.

I have always thought this should be a standard compile-time transform.

Do you mean the compiler should inline the conversion code when we’re downcasting to the dynamic type? This can avoid a function call into libcxxabi but the actual performance gain should be tested.

Yup, exactly.

Maybe we can make a draft patch to clang and perform some performance benchmarks on real-world examples.