Potential optimization for std::inplace_merge

I have been working on a sorting library for a while, and it happens to contain reimplementations of many standard algorithms borrowed from libc++ (mostly because I needed to add a few more features to standard algorithms, akin to those in range-v3). I was working on a top-down in-place mergesort implementation, and I believe that I have found a potential albeit small optimization for std::inplace_merge, specifically in the helper function half_inplace_merge. Here is the original function borrowed from libc++ (coding style has been changed, don’t pay attention to that):

template<typename Compare, typename InputIterator1,
typename InputIterator2, typename OutputIterator>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2, InputIterator2 last2,
OutputIterator result, Compare comp)
→ void
{
for (; first1 != last1 ; ++result) {
if (first2 == last2) {
std::move(first1, last1, result);
return;
}

if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
}
// first2 through last2 are already in the right spot.
}

I made a simple observation: at every loop iteration, we check whether first1 is different from last1 and whether first2 is equal to last2. However, we know that we will loop at least min(last1 - first1, last2 - first2) times before actually needing to check whether first1 == last1 or first2 == last2, so why not take advantage of that by comparing against an integer counter and skipping a few iterator comparisons? Moreover, the value min(last1 - first1, last2 - first2) is already known when half_inplace_merge is called, so we can directly pass them to the function? The modified function looked like this:

template<typename Compare, typename InputIterator1,
typename InputIterator2, typename OutputIterator,
typename Size>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2, InputIterator2 last2,
OutputIterator result, Size min_len,
Compare comp)
→ void
{
for (; min_len != 0 ; --min_len) {
if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
++result;
}

for (; first1 != last1 ; ++result) {
if (first2 == last2) {
std::move(first1, last1, result);
return;
}

if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
}
// first2 through last2 are already in the right spot.
}

However, at this point, the new “optimization” made the code dramatically slower than it was before when sorting a vector of one million integers with a mergesort built on the top of std::inplace_merge. It kind of baffled me until I finally noticed that the second loop had some information that the first didn’t have: the second loop knows that first1 != last1 and first2 != last2 while it runs. However, it is apparently possible to give that information to the first loop too. I defined the following macro:

#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while(0)

Then I changed the first loop (the “optimization”) as follows:

for (; min_len != 0 ; --min_len) {

ASSUME(first1 != last1);
ASSUME(first2 != last2);

if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
++result;
}

This additional information solved the “dramatically slower” problem, and then running the benchmarks again showed that the mergesort based on the modified half_inplace_merge was generally slightly faster than the old one (sometimes the difference is lost in the noise, but there are also cases where the difference is neat albeit small). I guess that the ASSUME macro actually solved an aliasing problem which wasn’t present in the second loop due to its loop conditions.

Anyway, I benchmarked that code using MinGW GCC 6.2, so the results might be different with Clang. I wanted to tell that small optimization story, and felt that I needed to share it here since I borrowed the original code from libc++. Now, you can use it as you want if you feel that the small performance gain proves to be consistent and worth it (IIRC you have benchmarks for some standard algorithms).

The code used for the benchmark is a modified version of the following: https://github.com/Morwenn/cpp-sort/blob/master/benchmarks/bench.cpp

Here is a graph with the results I got; merge_sort-new uses the modified half_inplace_merge (the names on the left correspond to different data distributions used to test the sorting algorithms): https://i.imgur.com/hwzJ84U.

With plenty of love <3

Morwenn

Thanks for all of this hard work. It looks like a promising optimization.

Would you be interested in submitting a patch?

/Eric

Sure, I guess I can use the _LIBCPP_UNREACHABLE macro from even though it looks like it’ll be yet another incude for the already pretty big .

If I understand the process correctly, I only have to submit an SVN patch to the cfe-commits mailing list. Are there any other legalese I should be aware of besides knowing the project’s license?

Thanks in advance.

Morwenn

From: "Morwenn Ed via cfe-dev" <cfe-dev@lists.llvm.org>
To: cfe-dev@lists.llvm.org
Sent: Thursday, October 27, 2016 2:12:39 PM
Subject: [cfe-dev] Potential optimization for std::inplace_merge

I have been working on a sorting library for a while, and it happens
to contain reimplementations of many standard algorithms borrowed
from libc++ (mostly because I needed to add a few more features to
standard algorithms, akin to those in range-v3). I was working on a
top-down in-place mergesort implementation, and I believe that I
have found a potential albeit small optimization for
std::inplace_merge , specifically in the helper function
half_inplace_merge . Here is the original function borrowed from
libc++ (coding style has been changed, don't pay attention to that):

template<typename Compare, typename InputIterator1,
typename InputIterator2, typename OutputIterator>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2, InputIterator2 last2,
OutputIterator result, Compare comp)
-> void
{
for (; first1 != last1 ; ++result) {
if (first2 == last2) {
std::move(first1, last1, result);
return;
}

if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
}
// first2 through last2 are already in the right spot.
}

I made a simple observation: at every loop iteration, we check
whether first1 is different from last1 and whether first2 is equal
to last2 . However, we know that we will loop at least min(last1 -
first1, last2 - first2) times before actually needing to check
whether first1 == last1 or first2 == last2 , so why not take
advantage of that by comparing against an integer counter and
skipping a few iterator comparisons? Moreover, the value min(last1 -
first1, last2 - first2) is already known when half_inplace_merge is
called, so we can directly pass them to the function? The modified
function looked like this:

template<typename Compare, typename InputIterator1,
typename InputIterator2, typename OutputIterator,
typename Size>
auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
InputIterator2 first2, InputIterator2 last2,
OutputIterator result, Size min_len,
Compare comp)
-> void
{
for (; min_len != 0 ; --min_len) {
if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
++result;
}

for (; first1 != last1 ; ++result) {
if (first2 == last2) {
std::move(first1, last1, result);
return;
}

if (comp(*first2, *first1)) {
*result = std::move(*first2);
++first2;
} else {
*result = std::move(*first1);
++first1;
}
}
// first2 through last2 are already in the right spot.
}

However, at this point, the new "optimization" made the code
dramatically slower than it was before when sorting a vector of one
million integers with a mergesort built on the top of
std::inplace_merge . It kind of baffled me until I finally noticed
that the second loop had some information that the first didn't
have: the second loop knows that first1 != last1 and first2 != last2
while it runs. However, it is apparently possible to give that
information to the first loop too. I defined the following macro:

#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); }
while(0)

FYI, for Clang, you'll need to use __builtin_assume(cond) (http://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions).

-Hal

Thanks, I didn’t that __builtin_assume existed in Clang.

So, IIRC to proceed with a patch, I should define a _LIBCPP_ASSUME macro wrapping it. Also, if I’m not mistaken, libc++ is sometimes used with g++ and MSVC, so the macro would either use __builtin_assume, __builtin_unreachable or __assume depending on the target. I guess that isn’t the right place to put such a macro, but I don’t know which is the right place.

Would you provide some guidance?

Thanks in advance :slight_smile:

Morwenn