Papers comparing SSO and COW strings

The "Why" section of the libcxx documentation states that "it is
generally accepted that building std::string using the "short string
optimization" instead of using Copy On Write (COW) is a superior
approach for multicore machines". [1a] Similar considerations lie at
the core of N2668, which effectively banned COW implementations in
C++11 [2].

The thing is that N2668 doesn't reference any particular research on
the speed and downsides of COW string implementations, and I'm having a
hard time finding any. So far I've seen the well-known article by Herb
Sutter [3] and one more paper [4], but both are built around a few
synthetic benchmarks and are 10+ years old. Unfortunately, I can't find
any benchmarks featuring real-world applications and measured on
modern hardware, which has changed a lot since then. For instance, atomics
have in some sense become both cheaper (with improvements in SMP
systems) and more expensive (with the wider spread of NUMA and a
constantly growing number of cores, which increases contention).

In theory, I see two different kinds of speed-up that may come from
non-COW strings:
1) Improvements that make existing code run faster. Possible reasons are:
    a) No need for atomic reference counters
    b) Improved data locality on NUMA systems for threads that
       maintain their own copies of their strings
    c) Short string optimization (which could technically co-exist
       with COW but normally doesn't; a notable exception is fbstring [5])
2) Improvements that allow writing better code. By limiting the
   number of cases where pointers and iterators may be invalidated, the
   C++11 standard allows wider use of non-owning references to strings.
   This goes well with string_view in C++17.
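
To make points 1a and 2 concrete, here is a minimal sketch of what I mean by a COW string. It is a toy written for this post, not the libstdc++ implementation: the names CowString, at_mut, copy_shares_buffer and write_detaches are all hypothetical. Every copy touches an atomic counter, and a mutable access on a shared string silently moves the data, so a pointer taken earlier no longer refers to that string's buffer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Toy COW string (hypothetical, for illustration only).
class CowString {
    struct Buffer {
        std::atomic<std::size_t> refs{1};
        std::string data;  // std::string used here as plain storage
        explicit Buffer(std::string s) : data(std::move(s)) {}
    };
    Buffer* buf_;

    // Drop one reference; the last owner frees the buffer.
    void release() {
        if (buf_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete buf_;
    }
    // Make a private copy if the buffer is currently shared.
    void unshare() {
        if (buf_->refs.load(std::memory_order_acquire) > 1) {
            Buffer* fresh = new Buffer(buf_->data);
            release();
            buf_ = fresh;
        }
    }

public:
    explicit CowString(const char* s) : buf_(new Buffer(s)) {}
    // Copying is cheap in bytes but costs an atomic increment (point 1a).
    CowString(const CowString& other) : buf_(other.buf_) {
        buf_->refs.fetch_add(1, std::memory_order_relaxed);
    }
    CowString& operator=(const CowString&) = delete;  // omitted for brevity
    ~CowString() { release(); }

    const char* data() const { return buf_->data.data(); }
    // Mutable access may reallocate, invalidating earlier pointers (point 2).
    char& at_mut(std::size_t i) { unshare(); return buf_->data[i]; }
};

// A copy shares the same underlying buffer.
bool copy_shares_buffer() {
    CowString a("hello");
    CowString b(a);
    return a.data() == b.data();
}

// A write through a shared string detaches it from the original.
bool write_detaches() {
    CowString a("hello");
    CowString b(a);
    const char* before = b.data();  // pointer taken before the write
    b.at_mut(0) = 'J';
    return b.data() != before && a.data()[0] == 'h' && b.data()[0] == 'J';
}
```

Note that even a mutable access that never writes has to call unshare(), which is the kind of hidden invalidation that makes non-owning references risky with a COW implementation.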

At the same time, code that relies heavily on the COW-ness of
strings may face a performance degradation with a non-COW
implementation. I wonder if anyone has reported seeing this on
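
As a rough illustration of the copy-heavy pattern I have in mind, here is a small timing sketch. It is another synthetic micro-benchmark, so it does not answer my question. It only shows the shape of the comparison. The helper time_copies is hypothetical, and std::shared_ptr<const std::string> is just my stand-in for a reference-counted, COW-like string.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper: copy `value` into a vector `n` times and
// return the elapsed wall-clock time in nanoseconds.
template <class T>
long long time_copies(const T& value, std::size_t n) {
    std::vector<T> sink;
    sink.reserve(n);  // keep vector growth out of the measurement
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) sink.push_back(value);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Calling time_copies(std::string(4096, 'x'), n) measures deep copies as a non-COW std::string performs them, while time_copies(std::make_shared<const std::string>(4096, 'x'), n) measures copies that only bump an atomic reference count. The results depend heavily on the string length, the allocator and the number of threads involved, which is why I'd like to see measurements from real applications.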

I'm looking for papers and articles that cover these topics: anything
from a documented and analyzed speed-up of a given application that
switched to libc++ (from e.g. pre-5.1 libstdc++) to a comprehensive
study. As for hardware, I'm primarily interested in x86_64,
but data on other architectures would also be useful.

Does anyone have relevant links?


[2] Concurrency Modifications to Basic String
[3] Optimizations That Aren't (In a Multithreaded World)
[5] folly/ at main · facebook/folly · GitHub