Performance issues of dynamic_cast in libcxxabi

Recently I did a benchmark for the performance of dynamic_cast. I compared the performance of the libc++14 dynamic_cast implementation with the libstdc++10 version and the result surprised me a little. Here are the interesting parts.

The first benchmark tests the performance when dynamic_cast is used for down-casting a non-virtual base sub-object pointer to the most derived type along a single chain of class inheritance hierarchy:

template <std::size_t Depth>
struct Chain : Chain<Depth - 1> {};

template <>
struct Chain<0> {
  virtual ~Chain();
};

template <typename Dyn, typename From, typename To = Dyn>
static void DynCast() {
  Dyn obj;
  From* from_ptr = &obj;
  To* to_ptr = dynamic_cast<To*>(from_ptr);
}

// Dynamic cast from Chain<0>* to Chain<X>* for X = 1, 2, 3, ...
BENCHMARK(DynCast<Chain<1>, Chain<0>>)->Name("Chain, 1 level");
BENCHMARK(DynCast<Chain<2>, Chain<0>>)->Name("Chain, 2 levels");
BENCHMARK(DynCast<Chain<3>, Chain<0>>)->Name("Chain, 3 levels");
// etc.

I believe the scenario in this benchmark is the most used scenario where people use dynamic_cast. The Itanium ABI also encourages implementations to optimize such cases. The benchmark results are:

The blue bars represents libstdc++ and the orange bars represents libc++. The CPU time consumed of libc++ is proportional to the number of levels on the class hierarchy, while the libstdc++ implementation has a constant CPU time consumption.

A similar pattern also appears when trying to down-casting a non-virtual base sub-object pointer to the most derived type on a DAG-shaped class inheritance network:

template <std::size_t Index, std::size_t Depth>
struct Dag : Dag<Index, Depth - 1>, 
             Dag<Index + 1, Depth - 1> {};

template <std::size_t Index>
struct Dag<Index, 0> {
  virtual ~Dag();
};

BENCHMARK(DynCast<Dag<0, 3>, Dag<3, 0>>)->Name("DAG, 3 levels");
BENCHMARK(DynCast<Dag<0, 4>, Dag<4, 0>>)->Name("DAG, 4 levels");
BENCHMARK(DynCast<Dag<0, 5>, Dag<5, 0>>)->Name("DAG, 5 levels");

The results are: (this is my first post and I cannot post more than 1 figure :frowning: )

libc++:
DAG, 3 levels                      60.6 ns
DAG, 4 levels                       118 ns
DAG, 5 levels                       253 ns

libstdc++:
DAG, 3 levels                      7.10 ns
DAG, 4 levels                      7.11 ns
DAG, 5 levels                      7.14 ns

This time the performance of libc++ is much worse than libstdc++ with a 35x slow down when down-casting 5 levels.

I went to the __dynamic_cast source code in libc++ and seems like that the parameter src2dst_offset is not used for optimizing the two simple cases above. Even if the dynamic type of the object is the same as the destination type and src2dst_offset hints that a direct conversion can be immediately made, __dynamic_cast still traverses the class inheritance graph to find a public path. Is this behavior intentional? Can we improve the performance by leveraging the src2dst_offset parameter?

P.S. You can find the benchmark source code and benchmark results at here.

Just to have a more apples-to-apples comparison, maybe use -DLIBCXX_CXX_ABI={libcxxabi, libsupc++} instead? That would directly compare libc++abi and libsupc++. I don’t expect it to make a difference, but removing variables is always a good idea when benchmarking.

Thanks for your reply. I build two libc++ versions that links against libcxxabi and libsupc++ from the latest master branch of llvm-project, respectively. The new benchmark is done on these two new libc++ builds.

Here are the results: (figures are measured as multiples of some baseline)

                    libcxxabi   libsupc++
===========================================
Chain, 1 level      6.589       5.579
Chain, 2 levels     8.511       5.675
Chain, 3 levels     10.168      5.643
Chain, 4 levels     11.986      5.605
Chain, 5 levels     13.915      5.579
Chain, 6 levels     15.929      5.579
Chain, 7 levels     18.384      5.676
Chain, 8 levels     20.165      5.655
Chain, 9 levels     22.444      5.749
-------------------------------------------
DAG, 3 levels       53.078      6.618
DAG, 4 levels       103.082     6.554
DAG, 5 levels       216.760     6.558

The results are indeed similar as I posted before.

These numbers really don’t look great. Would you maybe be interested in improving the implementation and contribute your benchmarks?

Sure I’m interested. I can start crafting a patch and see what improvements we can get.

1 Like

I got a problem. Where should I put my benchmark code within the LLVM source tree? I don’t see a benchmark directory under libcxxabi.

Ideally we would add a benchmarks sub-directory in libcxxabi/, but that should be a separate effort. I think for now you can just put them in libcxx/benchmarks. Maybe add a subdirectory libcxxabi there, so we can move the benchmarks once we add the infrastructure to libcxxabi/

I have created a revision on Phabricator: âš™ D137315 [libc++abi] Improve performance of __dynamic_cast

If you are looking to optimise dynamic_cast, there is still low-hanging fruit with final classes.

For example, I sometimes convert
dynamic_cast<T*>(p)
to
typeid(p) == typeid(T*)

which is slightly faster.

I have always thought this should be a standard compile-time transform.

Do you mean the compiler should inline the conversion code when we’re downcasting to the dynamic type? This can avoid a function call into libcxxabi but the actual performance gain should be tested.

Yup, exactly.

Maybe we can make a draft patch to clang and perform some performance benchmarks on real-world examples.

How would I go about implementing this? Would we indeed call typeid::operator=? What if the <typeinfo> system header was not included? Could we still just let the compiler implicitly include the <typeinfo> header? Or would you rather teach clang to directly generate the corresponding code inline?

If I understand correctly, it should be sufficient to rewrite every T* result = dynamic_cast<T*>(ptr); where T is a final class, into

void** vtable = *reinterpret_cast<void***>(ptr);
T* result = vtable[-1] == typeid(T) ? (ptr + src2dst_offset) : nullptr;

Offset -1 in the vtable stores the typeinfo of the object. I think the additional optimization

void** vtable = *reinterpret_cast<void***>(ptr);
T* result = vtable == vtable_of(T) ? (ptr + src2dst_offset) : nullptr;

should also be correct. Note how I am no longer following the pointer into the vtable. Instead, I am comparing the vtable pointer itself. This would be even better than typeid(p) == typeid(T*), as typeid(p) would still load the typeid from the vtable. (Also, I would not need to think about the cxx_abi_relative_vtable encoding for vtables during implementing this)

Afaict, comparing the vtable pointers is correct based on https://itanium-cxx-abi.github.io/cxx-abi/abi.html#vtable:

There may be multiple virtual tables for a particular class, if it is used as a base class for other classes. However, the virtual table pointers within all the objects (instances) of a particular most-derived class point to the same set of virtual tables.

However, there is the _LIBCXXABI_FORGIVING_DYNAMIC_CAST flag inside libcxxabi which allows for non-merged type information. I wonder if clang itself would then also need a -fforgiving-dynamic-cast flag?

[quote=“avogelsgesang, post:13, topic:66296, full:true”]

How would I go about implementing this? Would we indeed call typeid::operator=? [/quote]

I’m quite out of my depth here in terms of the internals of clang and dynamic_cast, but that would be my suggestion, since, as you note, there might be more problems with a more efficient/direct approach.

I think the additional optimization

void** vtable = reinterpret_cast<void**>(ptr); T* result = vtable ==
vtable_of(T) ? result : nullptr; |

should also be correct. Note how I am no longer following the pointer
into the vtable. Instead, I am comparing the vtable pointer itself. This
would be even better than |typeid(p) == typeid(T*)|, as |typeid(p)|
would still load the |typeid| from the vtable. (Also, I would not need
to think about the |cxx_abi_relative_vtable| encoding for vtables during
implementing this)

Afaict, comparing the vtable pointers is correct based on

Itanium C++ ABI

There may be multiple virtual tables for a particular class, if it
is used as a base class for other classes. However, the virtual
table pointers within all the objects (instances) of a particular
most-derived class point to the same set of virtual tables.

I also /think/ that Clang will always emit the vtable with the same
visibility as the RTTI (in the face of -fvisibility=hidden,
attribute((type_visibility(“default”)), etc.), so that, if checking
type equivalence via RTTI pointer comparison works, checking it via
vtable pointer comparison should also work in that final class case.
But I’m far from an expert here.

However, there is the |_LIBCXXABI_FORGIVING_DYNAMIC_CAST| flag inside
libcxxabi which allows for non-merged type information. I wonder if

clang> itself would then also need a |-fforgiving-dynamic-cast| flag?

Yeah, different ABI implementations use different strategies how to
decide type equality in various cases. (For example, GNU traditionally
deviates from the Itanium ABI and uses full RTTI type name comparison
instead of just pointer comparison.) I guess that would make any
Clang-level optimization here rather brittle—it would probably need to
know and take into account details of the targeted ABI implementation.