Recently I benchmarked the performance of dynamic_cast. I compared the libc++14 dynamic_cast implementation with the libstdc++10 version, and the results surprised me a little. Here are the interesting parts.
The first benchmark measures dynamic_cast when it down-casts a pointer to a non-virtual base sub-object to the most derived type along a single chain of class inheritance:
template <std::size_t Depth>
struct Chain : Chain<Depth - 1> {};

template <>
struct Chain<0> {
  virtual ~Chain();
};

template <typename Dyn, typename From, typename To = Dyn>
static void DynCast() {
  Dyn obj;
  From* from_ptr = &obj;
  To* to_ptr = dynamic_cast<To*>(from_ptr);
}

// Dynamic cast from Chain<0>* to Chain<X>* for X = 1, 2, 3, ...
BENCHMARK(DynCast<Chain<1>, Chain<0>>)->Name("Chain, 1 level");
BENCHMARK(DynCast<Chain<2>, Chain<0>>)->Name("Chain, 2 levels");
BENCHMARK(DynCast<Chain<3>, Chain<0>>)->Name("Chain, 3 levels");
// etc.
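For readers who want to try the chain case without Google Benchmark, here is a minimal sketch that swaps the harness for a plain std::chrono loop. TimeDynCast is a hypothetical helper, not part of the original benchmark, and the destructor is defined inline only so that the snippet links on its own. Absolute numbers from such a loop will be noisier than the ones reported in this post.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

template <std::size_t Depth>
struct Chain : Chain<Depth - 1> {};

template <>
struct Chain<0> {
  virtual ~Chain() = default;  // defined inline so the sketch links by itself
};

// Hypothetical helper: average nanoseconds per dynamic_cast<To*>(From*)
// performed on an object whose dynamic type is Dyn.
template <typename Dyn, typename From, typename To = Dyn>
double TimeDynCast(std::size_t iterations) {
  Dyn obj;
  From* volatile from_ptr = &obj;  // volatile read keeps the cast in the loop
  To* volatile to_ptr = nullptr;   // volatile write keeps the result alive

  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    to_ptr = dynamic_cast<To*>(from_ptr);
  }
  const auto stop = std::chrono::steady_clock::now();

  assert(to_ptr != nullptr);  // the down-cast is expected to succeed
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Calling TimeDynCast<Chain<3>, Chain<0>>(1000000) would correspond to the "Chain, 3 levels" case above.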
I believe this scenario is the most common way people use dynamic_cast. The Itanium ABI also encourages implementations to optimize such cases. The benchmark results are:
The blue bars represent libstdc++ and the orange bars represent libc++. The CPU time consumed by libc++ is proportional to the number of levels in the class hierarchy, while the libstdc++ implementation takes constant CPU time.
A similar pattern appears when down-casting a pointer to a non-virtual base sub-object to the most derived type in a DAG-shaped class inheritance network:
template <std::size_t Index, std::size_t Depth>
struct Dag : Dag<Index, Depth - 1>,
             Dag<Index + 1, Depth - 1> {};

template <std::size_t Index>
struct Dag<Index, 0> {
  virtual ~Dag();
};

BENCHMARK(DynCast<Dag<0, 3>, Dag<3, 0>>)->Name("DAG, 3 levels");
BENCHMARK(DynCast<Dag<0, 4>, Dag<4, 0>>)->Name("DAG, 4 levels");
BENCHMARK(DynCast<Dag<0, 5>, Dag<5, 0>>)->Name("DAG, 5 levels");
The results are (this is my first post, so I cannot post more than one figure):
libc++:
DAG, 3 levels 60.6 ns
DAG, 4 levels 118 ns
DAG, 5 levels 253 ns
libstdc++:
DAG, 3 levels 7.10 ns
DAG, 4 levels 7.11 ns
DAG, 5 levels 7.14 ns
This time libc++ is much slower than libstdc++: its time roughly doubles with each extra level, reaching a 35x slowdown when down-casting 5 levels.
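One possible reading of the doubling is that the Dag hierarchy itself doubles in size with each level. The sketch below counts the sub-objects implied by the Dag template above. SubobjectCount and LeafCount are hypothetical helpers written for this illustration, and the link between these counts and the measured times is an assumption, not something the benchmark proves.

```cpp
#include <cstddef>

// Number of sub-objects (including the complete object) in Dag<I, depth>.
// Each Dag<I, D> has two non-virtual bases of depth D - 1.
constexpr std::size_t SubobjectCount(std::size_t depth) {
  return depth == 0 ? std::size_t{1} : 1 + 2 * SubobjectCount(depth - 1);
}

// Number of Dag<target, 0> leaf sub-objects inside Dag<index, depth>.
constexpr std::size_t LeafCount(std::size_t target, std::size_t index,
                                std::size_t depth) {
  return depth == 0
             ? (index == target ? std::size_t{1} : std::size_t{0})
             : LeafCount(target, index, depth - 1) +
                   LeafCount(target, index + 1, depth - 1);
}

// The hierarchy roughly doubles per level: 15, 31, 63 sub-objects.
static_assert(SubobjectCount(3) == 15, "Dag<0, 3>");
static_assert(SubobjectCount(4) == 31, "Dag<0, 4>");
static_assert(SubobjectCount(5) == 63, "Dag<0, 5>");

// The source type of each benchmark is a unique base, so the cast is
// unambiguous even though other leaf types occur several times.
static_assert(LeafCount(3, 0, 3) == 1, "Dag<3, 0> is unique in Dag<0, 3>");
static_assert(LeafCount(5, 0, 5) == 1, "Dag<5, 0> is unique in Dag<0, 5>");
static_assert(LeafCount(1, 0, 3) == 3, "Dag<1, 0> occurs 3 times in Dag<0, 3>");
```

The sub-object counts grow by about 2x per level, the same ratio as the libc++ timings (60.6 ns, 118 ns, 253 ns), which would be consistent with a traversal that visits the whole graph.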
I looked at the __dynamic_cast source code in libc++, and it seems that the src2dst_offset parameter is not used to optimize the two simple cases above. Even if the dynamic type of the object is the same as the destination type and src2dst_offset hints that a direct conversion can be made immediately, __dynamic_cast still traverses the class inheritance graph to find a public path. Is this behavior intentional? Can we improve the performance by leveraging the src2dst_offset parameter?
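To make the idea concrete, here is a user-level analogue of such a fast path, written as a sketch rather than as libc++ or libstdc++ code. In the Itanium ABI, a non-negative src2dst_offset says that the source type is a unique public non-virtual base of the destination type at that offset. static_cast applies the same compile-time offset, so it stands in for the hint here. FastDownCast and the Other, Base, and Derived types are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <typeinfo>

struct Other {
  virtual ~Other() = default;
  long pad = 0;
};
struct Base {
  virtual ~Base() = default;
};
// Base is a unique, public, non-virtual base of Derived at a non-zero offset.
struct Derived : Other, Base {};

// Hypothetical fast path: if the dynamic type is exactly To, the fixed
// offset known at compile time is enough and no graph walk is needed.
// static_cast only compiles when From is an unambiguous non-virtual base
// of To, which mirrors the "non-negative hint" precondition.
template <typename To, typename From>
To* FastDownCast(From* from_ptr) {
  if (from_ptr == nullptr) return nullptr;
  if (typeid(*from_ptr) == typeid(To)) {
    return static_cast<To*>(from_ptr);  // direct conversion by fixed offset
  }
  return dynamic_cast<To*>(from_ptr);  // general case: full lookup
}

// Small self-check that the fast path agrees with dynamic_cast.
inline void CheckFastDownCast() {
  Derived derived;
  Base* base_ptr = &derived;
  assert(FastDownCast<Derived>(base_ptr) == dynamic_cast<Derived*>(base_ptr));

  Base plain;
  assert(FastDownCast<Derived>(&plain) == nullptr);
}
```

The sketch only covers the case where the dynamic type equals the destination type, which is the case both benchmarks above exercise. Everything else still goes through the ordinary dynamic_cast.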
P.S. You can find the benchmark source code and results here.