The "Why" section of the libcxx documentation states that "it is
generally accepted that building std::string using the "short string
optimization" instead of using Copy On Write (COW) is a superior
approach for multicore machines". [1a] Similar considerations lie at
the core of N2668, which effectively banned COW implementations in
C++11.

The thing is that N2668 doesn't reference any particular research on
the speed and downsides of COW string implementations and I'm having a
hard time finding any. So far I've seen the well-known article by Herb
Sutter and one more paper, but both are built around a few
synthetic benchmarks and are 10+ years old. Unfortunately I can't find
any benchmarks that feature real-world applications and were measured on
modern hardware, which has changed a lot since then. For instance, atomics
have in some sense become both cheaper (with improvements in SMP
systems) and more expensive (with the wider spread of NUMA and a
constantly growing number of cores, which increases contention).

In theory I see two different kinds of speed-up that may come from
dropping COW:
1) Improvements that make the existing code run faster. Possible reasons are:
   a) No need for atomic reference counters
   b) Improved data locality on NUMA systems for threads that
      maintain their own copies of their strings
   c) Short string optimization (which could technically co-exist
      with COW but normally doesn't. A notable exception is fbstring.)
2) Improvements that allow writing better code. By limiting the
   number of cases where pointers and iterators may be invalidated, the
   C++11 standard allows a wider use of non-owning references to strings.
   This goes well with string_view in C++17.
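
To make the terms above concrete, here is a minimal sketch of how one might check which strategy a given standard library uses. The function names (copies_share_buffer, stored_inline) are made up for this post and are not part of any library; the checks only look at observable addresses, so treat the results as a hint rather than a guarantee about the implementation.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if a copy of a long string shares the
// original's buffer, which is the observable signature of a COW string.
bool copies_share_buffer() {
    const std::string original(1024, 'x');  // far too long for any SSO buffer
    const std::string copy = original;      // const, so no accessor can unshare
    return original.data() == copy.data();
}

// Hypothetical helper: returns true if the characters of `s` live inside the
// string object itself, which is what the short string optimization does.
bool stored_inline(const std::string& s) {
    const auto begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto end = begin + sizeof(std::string);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= begin && data < end;
}
```

On a C++11-conforming library I would expect copies_share_buffer() to return false, and stored_inline() to return true only for strings short enough to fit the implementation's inline buffer, whose size differs between libraries.
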
At the same time, code that relies heavily on the COW-ness of
strings may face a performance degradation with a non-COW
implementation. I wonder if anyone has reported seeing this in
practice.

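
As an illustration of the kind of code I have in mind, here is a rough sketch that counts how many heap allocations are caused by copying one long string many times. CountingAllocator, CountedString and allocations_for_copies are names invented for this example, and the global counter is deliberately simplistic (not thread-safe). It measures allocations only, not time, so it is not a benchmark.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Global allocation counter for the sketch (assumption: single-threaded use).
static std::size_t g_allocations = 0;

// Minimal allocator that forwards to std::allocator and counts allocations.
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

// Copies one string of `length` characters `copies` times and returns the
// number of string-buffer allocations those copies caused.
std::size_t allocations_for_copies(std::size_t copies, std::size_t length) {
    const CountedString original(length, 'x');
    g_allocations = 0;                  // ignore the allocation for `original`
    std::vector<CountedString> pool;    // the vector itself is not counted
    pool.reserve(copies);
    for (std::size_t i = 0; i < copies; ++i) {
        pool.push_back(original);       // deep copy without COW, shared with COW
    }
    return g_allocations;
}
```

With a non-COW library I would expect one allocation per copy for a long string (for example allocations_for_copies(100, 1024)), while a COW library could get away with none. Code that passes and stores strings by value in hot paths is where that difference would show up.
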
I'm looking for papers and articles that cover these topics: anything
from a documented and analyzed speed-up of a given application that
switched to libc++ (from e.g. pre-5.1 libstdc++) to comprehensive
research. Regarding the hardware, I'm primarily interested in x86_64,
but data on other architectures would also be useful.

Does anyone have relevant links?
 Concurrency Modifications to Basic String
 Optimizations That Aren't (In a Multithreaded World)
 folly/FBString.md at main · facebook/folly · GitHub