I have been working on a sorting library for a while, and it happens to contain reimplementations of many standard algorithms borrowed from libc++ (mostly because I needed to add a few more features to standard algorithms, akin to those in range-v3). I was working on a top-down in-place mergesort implementation, and I believe that I have found a potential albeit small optimization for std::inplace_merge, specifically in the helper function half_inplace_merge. Here is the original function borrowed from libc++ (coding style has been changed, don’t pay attention to that):

template<typename Compare, typename InputIterator1,

typename InputIterator2, typename OutputIterator>

auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,

InputIterator2 first2, InputIterator2 last2,

OutputIterator result, Compare comp)

→ void

{

for (; first1 != last1 ; ++result) {

if (first2 == last2) {

std::move(first1, last1, result);

return;

}

if (comp(*first2, *first1)) {

*result = std::move(*first2);

++first2;

} else {

*result = std::move(*first1);

++first1;

}

}

// first2 through last2 are already in the right spot.

}

I made a simple observation: at every loop iteration, we check whether first1 is different from last1 and whether first2 is equal to last2. However, we know that we will loop at least min(last1 - first1, last2 - first2) times before actually needing to check whether first1 == last1 or first2 == last2, so why not take advantage of that by comparing against an integer counter and skipping a few iterator comparisons? Moreover, the value min(last1 - first1, last2 - first2) is already known when half_inplace_merge is called, so we can directly pass them to the function? The modified function looked like this:

template<typename Compare, typename InputIterator1,

typename InputIterator2, typename OutputIterator,

typename Size>

auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,

InputIterator2 first2, InputIterator2 last2,

OutputIterator result, Size min_len,

Compare comp)

→ void

{

for (; min_len != 0 ; --min_len) {

if (comp(*first2, *first1)) {

*result = std::move(*first2);

++first2;

} else {

*result = std::move(*first1);

++first1;

}

++result;

}

for (; first1 != last1 ; ++result) {

if (first2 == last2) {

std::move(first1, last1, result);

return;

}

if (comp(*first2, *first1)) {

*result = std::move(*first2);

++first2;

} else {

*result = std::move(*first1);

++first1;

}

}

// first2 through last2 are already in the right spot.

}

However, at this point, the new “optimization” made the code dramatically slower than it was before when sorting a vector of one million integers with a mergesort built on the top of std::inplace_merge. It kind of baffled me until I finally noticed that the second loop had some information that the first didn’t have: the second loop knows that first1 != last1 and first2 != last2 while it runs. However, it is apparently possible to give that information to the first loop too. I defined the following macro:

#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while(0)

Then I changed the first loop (the “optimization”) as follows:

for (; min_len != 0 ; --min_len) {

ASSUME(first1 != last1);

ASSUME(first2 != last2);

if (comp(*first2, *first1)) {

*result = std::move(*first2);

++first2;

} else {

*result = std::move(*first1);

++first1;

}

++result;

}

This additional information solved the “dramatically slower” problem, and then running the benchmarks again showed that the mergesort based on the modified half_inplace_merge was generally slightly faster than the old one (sometimes the difference is lost in the noise, but there are also cases where the difference is neat albeit small). I guess that the ASSUME macro actually solved an aliasing problem which wasn’t present in the second loop due to its loop conditions.

Anyway, I benchmarked that code using MinGW GCC 6.2, so the results might be different with Clang. I wanted to tell that small optimization story, and felt that I needed to share it here since I borrowed the original code from libc++. Now, you can use it as you want if you feel that the small performance gain proves to be consistent and worth it (IIRC you have benchmarks for some standard algorithms).

The code used for the benchmark is a modified version of the following: https://github.com/Morwenn/cpp-sort/blob/master/benchmarks/bench.cpp

Here is a graph with the results I got; merge_sort-new uses the modified half_inplace_merge (the names on the left correspond to different data distributions used to test the sorting algorithms): https://i.imgur.com/hwzJ84U.

With plenty of love <3

Morwenn