[libcxx] optimizing shared_ptr atomics in destructors

Currently, when the last shared_ptr to an object is destroyed, libc++ performs two atomic decrements: one for the "strong" shared count and one for the "weak" shared count. I think we can do better than this in the uncontended case, but I would like some feedback on this optimization, particularly on the ARM side.

Here's the code change...
diff --git a/src/memory.cpp b/src/memory.cpp
index 08f2259..b459eb1 100644
--- a/src/memory.cpp
+++ b/src/memory.cpp
@@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
      return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
  }

  template <class T>
  inline T
  decrement(T& t) _NOEXCEPT
  {
      return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
  }

  } // namespace
@@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
  void
  __shared_weak_count::__release_weak() _NOEXCEPT
  {
-    if (decrement(__shared_weak_owners_) == -1)
+    if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
+        __on_zero_shared_weak();
+    else if (decrement(__shared_weak_owners_) == -1)
         __on_zero_shared_weak();
  }

The general idea is that if the current thread is destroying the last weak reference, then no other thread can legally be accessing this object. Given that, we can avoid an expensive atomic read-modify-write. On x86_64, a quick-and-dirty benchmark shows an 8% improvement for the combination of make_shared<int> and the accompanying destruction. I don't have performance numbers for other architectures at this point. That 8% is pretty promising though, as the atomic-operation improvement shows through even when measured alongside a heap allocation and deallocation.
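For what it's worth, the measurement loop is roughly of this shape (a simplified sketch using Google Benchmark; my actual quick-and-dirty harness differs in the details):

#include <benchmark/benchmark.h>
#include <memory>

// Times make_shared<int> plus the destruction of the last (only) shared_ptr,
// which is the path that hits both reference-count releases.
static void BM_make_shared_int(benchmark::State& state) {
    for (auto _ : state) {
        auto p = std::make_shared<int>(42);
        benchmark::DoNotOptimize(p);   // keep the allocation from being optimized away
    }                                  // p dies here: strong release, then weak release
}
BENCHMARK(BM_make_shared_int);
BENCHMARK_MAIN();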

Note that this optimization wouldn't be safe for the strong count, as the last strong count decrement can still contend with a weak_ptr::lock() call.
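To make the contention concrete, here is a rough paraphrase of the lock() logic on the strong count (simplified and written against std::atomic; it is not the exact libc++ source):

#include <atomic>

// libc++ biases its counts by -1: a stored value of 0 means one owner remains,
// and -1 means the object has expired.
bool try_lock_shared(std::atomic<long>& shared_owners) noexcept {
    long expected = shared_owners.load(std::memory_order_relaxed);
    while (expected != -1) {
        // A releasing thread could be decrementing concurrently with this
        // increment, so the release path has to publish its decrement
        // atomically rather than "check for zero, then free".
        if (shared_owners.compare_exchange_weak(expected, expected + 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed))
            return true;   // took a strong reference; the object stays alive
    }
    return false;          // strong count already hit -1; lock() fails
}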

This comes at the cost of an extra load-acquire for all but the last decrement (and sometimes even the last decrement). On x86, this is really cheap (just a regular mov). With aarch64 and 32-bit armv8 you currently get an extra load-acquire instruction, and with armv7 you get extra barriers.

I would hope / expect that on LL/SC architectures, the first acquire load could be folded into the load-exclusive of the atomic add. The check and branch (inside the LL/SC loop) would then be the only overhead. Is this a reasonable optimization to hope for in the future on the compiler front?

Also, I'm being a bit conservative here by making my atomic load an acquire operation. It might be safe to make the operation relaxed, but that seems risky to me, as __on_zero_shared_weak may end up touching unsynchronized data in those cases.
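In std::atomic terms, the new __release_weak logic amounts to the following (an illustrative rendering only; the real code goes through libc++'s internal atomic wrappers):

#include <atomic>

// The weak count is biased by -1, so a stored value of 0 means "this thread
// holds the only remaining weak reference".
template <class OnZero>
void release_weak(std::atomic<long>& shared_weak_owners, OnZero on_zero_shared_weak) {
    // The acquire load pairs with the release half of every other thread's
    // decrement, so once we observe 0 their accesses to the control block are
    // visible and it is safe to destroy it. With a relaxed load that pairing
    // would be lost, which is why relaxed feels risky.
    if (shared_weak_owners.load(std::memory_order_acquire) == 0)
        on_zero_shared_weak();
    // Otherwise take the normal path. fetch_sub returns the old value, so an
    // old value of 0 (new value -1) means we turned out to be the last owner.
    else if (shared_weak_owners.fetch_sub(1, std::memory_order_acq_rel) == 0)
        on_zero_shared_weak();
}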

> On x86_64, a quick-and-dirty benchmark shows an 8% improvement for the
> combination of make_shared<int> and the accompanying destruction. I don't
> have performance numbers for other architectures at this point.

Do you have a repo with this benchmark?

> I would hope / expect that on LL/SC architectures, the first acquire load
> could be folded into the load-exclusive of the atomic add. The check and
> branch (inside the LL/SC loop) would then be the only overhead. Is this a
> reasonable optimization to hope for in the future on the compiler front?

What do you mean exactly? Could you provide assembly? I think I understand
(it sounds clever and doable), but assembly is easier. :)
That can be benchmarked as well.

> It might be safe to make the operation relaxed, but that seems risky to me,
> as __on_zero_shared_weak may end up touching unsynchronized data in those
> cases.

I haven't thought enough about shared_ptr to convince myself either way.
Would be good to benchmark to see if it's even worth proving.

I will put the code up for review, including the pair of benchmarks that I have authored.

For armv7a, armv8, and aarch64, I used the following reduced .cpp code to generate assembly. I haven't figured out the libc++ cross-compiling setup yet, and I'm not well set up to run benchmarks on those platforms.

struct __shared_weak_count {
    long _shared_weak_owners;
    void __release_weak() noexcept;
    void __on_zero_shared_weak() noexcept;
};

void
__shared_weak_count::__release_weak() noexcept {
    if (__atomic_load_n(&_shared_weak_owners, 2 /*_AO_Acquire*/) == 0)
        __on_zero_shared_weak();
    else if (__atomic_add_fetch(&_shared_weak_owners, -1, 4 /*_AO_Acq_Rel*/) == -1)
        __on_zero_shared_weak();
}

ARMv7a assembly, with notes:

_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        ldr     r1, [r0]                @bcraig note: it would be nice to combine this load with the ldrex
        dmb     ish                     @bcraig note: … and this barrier with the one in BB#2
        cmp     r1, #0
        it      eq                      @bcraig note: unsure of why this is here…
        beq     _ZN19__shared_weak_count21__on_zero_shared_weakEv
        dmb     ish
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldrex   r1, [r0]
        subs    r2, r1, #1
        strex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        dmb     ish
        it      ne
        bxne    lr
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv

Aarch64 assembly, with notes:

_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        lda     r1, [r0]                @bcraig note: it would be nice to combine this with the ldaex
        cbz     r1, .LBB0_3
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldaex   r1, [r0]
        subs    r2, r1, #1
        stlex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        bne     .LBB0_4
.LBB0_3:                                @ %if.then5
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
.LBB0_4:                                @ %if.end6
        bx      lr

I believe that you will see this transformation if you do a non-atomic decrement of the loaded value and then a weak compare-and-swap to write the result back (with a back branch to retry if it failed). I'm not 100% sure though, because this will leave an ldrex that isn't paired with a store / release in the case where the loaded value is 0, and some ARM cores don't like it when that happens.

David


>         ldr     r1, [r0]                @bcraig note: it would be nice to combine this load with the ldrex
>
> I believe that you will see this transformation if you do a non-atomic decrement of the loaded value and then a weak compare-and-swap to write the result back (with a back branch to retry if it failed). I'm not 100% sure though, because this will leave an ldrex that isn't paired with a store / release in the case where the loaded value is 0, and some ARM cores don't like it when that happens.

No luck. With this body instead…

auto val = __atomic_load_n(&_shared_weak_owners, 2 /*_AO_Acquire*/);
while (val != 0 &&
       !__atomic_compare_exchange_n(&_shared_weak_owners, &val, val - 1,
                                    true /*weak*/, 4 /*Acq_Rel*/, 2 /*Acquire*/))
    ;
if (val == 0)
    __on_zero_shared_weak();

I still get the ldr, the ldrex, and the barriers for both. Even worse, this approach is a significant regression on x86: I get a lock CMPXCHG loop instead of a lock XADD.

Relevant review:

I just updated it with an extra benchmark. The performance of weak_ptr decrements on x86 did get significantly worse, but I think that's a fair trade-off: for every weak_ptr in the wild, there are likely ten shared_ptrs that should really be unique_ptrs, and those will benefit from this change.
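The weak_ptr benchmark is roughly of this shape (again an illustrative sketch, not the exact code in the review):

#include <benchmark/benchmark.h>
#include <memory>

// Copying and destroying a weak_ptr each iteration exercises the weak-count
// increment and __release_weak() without ever taking the "last owner" path.
static void BM_weak_ptr_copy_destroy(benchmark::State& state) {
    auto sp = std::make_shared<int>(42);
    std::weak_ptr<int> wp(sp);          // control block outlives every iteration
    for (auto _ : state) {
        std::weak_ptr<int> copy = wp;   // weak increment
        benchmark::DoNotOptimize(copy);
    }                                   // copy dies each iteration: weak decrement
    benchmark::DoNotOptimize(sp);
}
BENCHMARK(BM_weak_ptr_copy_destroy);
BENCHMARK_MAIN();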

That’s quite surprising (not the x86 part; that’s expected). The expand atomics pass is meant to merge a qualifying load into the corresponding ldrex, though it might only work for a non-atomic initial load. I wonder if it would make sense for this pass to expand atomicrmw into LL/SC pairs as a first step, run the other optimisations, and then collapse them again (actually, I think for ARMv<8.1 it already does the expansion early, but it’s been a while since I looked at the code).

David