libc++ red-black tree: halving the branches

I came up with a neat little trick (that may have already been considered) that halves the number of conditionals in the venerable red-black tree rebalancing operations. The key observation is that there is a big if whose two branches are identical except that “left” and “right” are switched.

Basically, you use a pointer-to-member for each of left and right, and then a single swap reflects the case. You can see this in an experimental modification of the code for __tree_balance_after_insert <http://pastie.org/3730128> when compared to the current version (it uses a unified __tree_rotate that takes a pointer-to-member for what its notion of “left” should be).

In my quick experiments I found that this halved the number of branches in the generated code (iirc 26 to 13 or something like that). The code size reduction wasn’t as drastic, but it was still ~30% (I’m guessing the optimizer is usually doing some heroics to fold some commonalities, but still failing to recognize the fundamental nature of the commonality). Correct me if I’m wrong, but the way that __tree_balance_after_insert and friends are written means that only one copy of that routine should appear in the final binary, so the code size savings are a fixed constant (or is the intent to have it get inlined into __tree<…>::__insert_node_at?). I don’t have libcxx set up as my stdlib currently, so I haven’t been able to compile a large program (e.g. LLVM/clang) with it and see how things go down in that regard.

Possible cons off the top of my head:

  • Increased register pressure. In the experimental version that I posted, the left and right pointer-to-members each get put in their own register, adding two registers of pressure. This can be improved by keeping just one of them and using xor to switch sides as necessary.
  • On architectures with only very simple addressing modes, this might require extra instructions for each memory access.

I have yet to do any performance or holistic code-size measurements, but I want to solicit feedback on the idea to see if it might be a path worth investigating.

–Sean Silva

I came up with a neat little trick (that may have already been considered) that halves the number of conditionals in the venerable red-black tree rebalancing operations. The key observation is that there is a big `if` whose two branches are identical except that "left" and "right" are switched.

Basically, you use pointer-to-members for each of `left` and `right`, and then a single swap reflects the case. You can see this in this experimental modification of the code for __tree_balance_after_insert <http://pastie.org/3730128> when compared to the current version (it uses a unified __tree_rotate that takes a pointer to member for what its notion of "left" should be).

Very clever. I believe this is good code to familiarize one's self with in order to add another good tool to the toolbox.

In my quick experiments I found that this halved the number of branches in the generated code (iirc 26 to 13 or something like that).

In counting branches for performance comparisons (as opposed to code size comparisons) it is important to count the number of branches executed by each path through the routine, as opposed to the total number of branches. Your rewrite, at first glance, seems to reduce the total, but not the number of branches executed by any given path.

The code size reduction wasn't as drastic, but it was still ~30% (I'm guessing the optimizer is usually doing some heroics to fold some commonalities, but still failing to recognize the fundamental nature of the commonality). Correct me if I'm wrong, but the way that __tree_balance_after_insert and friends are written means that only one copy of that routine should appear in the final binary, so the code size savings are a fixed constant (or is the intent to have it get inlined into __tree<...>::__insert_node_at?).

I believe you are correct, unless the client is using allocator smart pointers, which is exceedingly rare. Typically only the following will be instantiated:

void
std::__1::__tree_balance_after_insert<std::__1::__tree_node_base<void*>*>
    (std::__1::__tree_node_base<void*>*, std::__1::__tree_node_base<void*>*);

So 30% savings on 1 routine, which I'm guessing is on the order of about 100 instructions. That is, save about 30 non-inlined-and-instantiated-exactly-once instructions in code size.

I don't have libcxx set up as my stdlib currently so I haven't been able to compile a large program (e.g. LLVM/clang) with it and see how things go down in that regard.

Possible cons off the top of my head:
- Increased register pressure. In the experimental version that I posted, the left and right pointer-to-members each get put in their own register, adding two registers of pressure. This can be improved by keeping just one of them and using xor to switch sides as necessary.
- On architectures with only very simple addressing modes, this might require extra instructions for each memory access.

I have yet to do any performance or holistic code-size measurements, but I want to solicit feedback on the idea to see if it might be a path worth investigating.

I don't believe code size is going to be a driver on this one as the potential savings tops out in the neighborhood of 30 instructions (which is small compared to the entire R/B tree container code). The bottom line is going to be performance. Your rewrite is interesting, but for me, significantly harder to read. So it will have to motivate with significant performance wins. And if it does, I think __tree_remove would benefit from the same treatment. However if we're looking at zero or marginal performance wins, then I favor the current code: it is easier to reason about and maintain.

Howard

Not only this, but also whether the change impacts the predictability of the branches to the CPU’s branch predictor. This type of change (unifying code paths to share a set of branches) can have the really bad result of making all of the branches somewhere between ‘hard to predict’ and ‘unpredictable’, instead of them being ‘well predicted’. However, in some (rarer) cases it can have the opposite effect! So I would want to see very careful measurement of performance as well. I think it has a small possibility of making no difference, a small possibility of providing a significant performance boost, and a larger possibility of producing a significant performance loss.

There is a good tool for this on Linux: the perf events tool. It can measure the number of branches executed in a benchmark program, and the percentage which are unpredicted and mispredicted. The worst case here is mispredicted; the almost-as-bad case is unpredicted.

The other complication is that different chips have different prediction strategies, so the results may swing wildly between architectures.

You might want to look at
http://cvsweb.netbsd.org/bsdweb.cgi/src/common/lib/libc/gen/rb.c?rev=1.11&content-type=text/x-cvsweb-markup&only_with_tag=MAIN
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/sys/rbtree.h?rev=1.2&content-type=text/x-cvsweb-markup

for a version of red-black trees aggressively exploiting the symmetry.

Joerg

Really, even with the duplication of left/right logic? Seems to me that
if the code size reduction isn't a win and there are performance costs
you could still improve maintainability while maintaining the current
generated code structure by passing the member pointers as template arguments to a
helper function template.

The libc++ code base is open source and I am happy to review patches.

Howard

Awww... I'm not trying to get you to change anything. My question is
whether you think it's better for maintainability (in this case?) to
write out two totally symmetric branches with duplicated logic than it
is to eliminate that duplication, and if so, why. I could also take
your response to mean "I really don't have time to deal with questions
like that," which would be fine, too.

When I design code I often evaluate two or more versions of working, testable code, and then (hopefully) develop a favorite. When I have only one version of working and tested code in front of me, and if there are not customer complaints against that code, I generally tend to leave it alone (if it isn't broken don't fix it).

Even if I'm exploring hypothetical code, such as for teaching/presentation purposes, I generally write actual code to test my hypotheses. I'm just not smart enough to fully evaluate different designs without testing, or relying on closely related tests that I've already performed.

To properly answer your question I would have to write an alternative version of the code in question. So I'm asking you to pose your question in terms of alternative code I can study, instead of taking on the job myself of crafting that alternative. Not only will this save me time. It will also save you time when you have to review what I've tested only to discover that I misunderstood your English. The compiler will ensure that your C++ is unambiguous.

Howard

Well, I wasn't looking for an answer that's any more "proper" than your
statement above that the current code is easier to reason about and
maintain. It's OK with me if you don't feel like sticking your neck out
further on this one without something more solid to look at, but while
that project does sound like a lot of fun, it was a mostly-idle question
and I don't have time to take it on at the moment.

Cheers,

I’ve got a little bit for everybody:

Howard: I completely understand your points. By no means did I intend this code in its current form to get into libc++, it was mostly just an experiment. I would not myself advocate this code change without clear evidence of a performance win.

Chandler: perf is a really good tool for monitoring this stuff. There are a lot of low-level performance considerations, and measuring is the only way to really evaluate this kind of change. Also I think there are more “macro” optimizations to be made to the rebalancing code to reduce branches and branch mispredicts, bringing me to…

Joerg: wow, that code is insane. The author really groks the red-black tree algorithms like nothing I’ve ever seen before (not even CLRS). For example, in the insert rebalancing they put the Case 1 recoloring in the loop by itself; Cases 2 and 3 are put outside the loop, which makes so much sense since Cases 2 and 3 execute at most once (the tree is balanced after that code runs). Also, their use of an array to hold the left and right nodes is beautiful! I thought I was clever with the pointer-to-members, but the array version is simpler, and to boot they can use the result of comparisons like (father == grandpa->rb_right) directly to index which one to use.

Yet another case in the field of computing where a more complicated solution (pointer-to-members to determine left and right) turns out to have a better, simpler alternative.

Dave: from the point of view of code understandability, the current code is probably about as good as it gets, since it is essentially a line-for-line rendering of the pseudocode in the CLRS algorithms book, which to my knowledge is where pretty much everybody learns how red-black trees work (and which everybody uses as a reference for the rb-tree algorithms). Unfortunately, the CLRS code is more theoretically oriented, and their “programming language” for the algorithms lacks many of the more powerful capabilities that real computers have, such as dynamically choosing offsets for member access. Actually, my original inspiration for this was the “else” branch in RB-Insert-Fixup, which simply says “(same as then branch but with left and right reversed)”.

I have to say, that netbsd red-black tree has really piqued my interest. Two other red-black trees I have seen “in the wild”—the linux one and the nginx one—also follow the more to-the-book implementation like libc++'s; perhaps this is because the netbsd-like code has no significant advantage? Or maybe it is simply obscurity of this approach? idk, but I think I’m going to end up writing at least a couple red-black trees in order to test this… fun…


If that's not enough fun, have you considered comparing your red-black tree implementations with other tree implementations, say avl-trees?

I'm on Debian amd64 if you need to run your tests on different hardware.

I'll even give writing the avl-tree performance tests a whirl, once you show me how you run the perf tests with red-black trees - a tar ball I can unpack and "make check" with is all I need.

Yes, I've got a vested interest in this - my treedb project in SourceForge uses avl-trees and I'd like to add red-black-trees in at some stage :-)

If we are speaking of alternative implementations of std::map and co, does anyone know if a skip list has ever been used to implement those? They are in theory equivalent to red-black trees and AVL trees, but use randomness instead of rebalancing, which makes them much simpler: you don’t need a book to code them.

I have never heard of any performance comparison between the two though, and I am not really sure I could come up with an optimized implementation…

–Matthieu.

Skip lists tend to be fairly bad for modern memory architectures,
especially in multi-threaded applications. The problem is that they result
in a lot of cache thrashing, so they are only beneficial if you have a
really small hot set.

Joerg

Skip lists, in general, have very poor cache utilisation which affects their performance in real code but doesn't always show up in microbenchmarks. We based a lot of code around skip lists when they were cool in the '90s and would now be suffering for it if not for the fact that computers became so fast in the intervening period that we could strip out those caches completely and still be fast enough.

David

I have not looked closely at skip lists, but I believe that the various complexity and stability requirements on std::map and the other tree algorithms place strong restrictions on the implementation that can be used. In particular, they say that certain complexity requirements must be met, which isn't guaranteed by randomized algorithms.

I'm sure implementing such algorithms in an STL-like way would be interesting, and might well be a welcome contribution to boost.

Chris

Speaking of cache utilization, it's hard to beat btrees. I don't know
if you can implement the requirements of std::map with btrees, though:

https://github.com/boostcon/2011_presentations/raw/master/tue/proposed_b_tree_library.pdf
http://blip.tv/boostcon/the-proposed-boost-b-tree-library-5349968

OK, re-lurking...

One issue with the B-tree family is that those are not node-based structures. std::map et al. have the property that an inserted object stays at a fixed place in memory throughout its lifetime, which straight B-trees do not ensure at all (what with the splitting and fusing of nodes).

To get better caching I only know of specializing the allocator at the moment, so that all objects get allocated close to each other.

– Matthieu