mem* Implementations

We are planning to land implementations of the commonly used mem
functions (memcpy, memset, memcmp, and bcmp).
https://reviews.llvm.org/D72516 by Guillaume adds several initial
benchmarks for these functions.

Design Philosophy

These functions use a significant amount of compute in typical
programs. In “Profiling a warehouse-scale computer” [S. Kanev et
al., ISCA ‘15], the authors observed ~5% of processor cycles being
spent in these functions, making their performance critical. This has
influenced our design in several ways:

* C++ Implementation: We developed our implementations in C++. This
makes understanding and reasoning about the correctness of the
implementation far easier than with handcrafted assembly. By leaving
codegen to the compiler, we can leverage existing compiler features
(scheduling, FDO, etc.).
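As an illustration of the approach (a hypothetical sketch, not the actual llvm-libc code; `CopyBlock` and `my_memcpy` are made-up names), wide copies can be expressed as fixed-size blocks whose codegen is left entirely to the compiler:

```cpp
#include <cstddef>

// Hypothetical sketch, not the proposed implementation. A fixed-size
// block copy: the compiler lowers the builtin to wide loads/stores
// appropriate for the target, so no handwritten assembly is needed.
template <size_t kBlock>
static void CopyBlock(char *dst, const char *src) {
  __builtin_memcpy(dst, src, kBlock);
}

// A minimal memcpy built from such blocks (illustrative only).
void *my_memcpy(void *dst, const void *src, size_t count) {
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  size_t offset = 0;
  for (; offset + 32 <= count; offset += 32)  // bulk: 32-byte blocks
    CopyBlock<32>(d + offset, s + offset);
  for (; offset < count; ++offset)  // tail: byte-by-byte
    d[offset] = s[offset];
  return dst;
}
```

Because the block size is a compile-time constant, the compiler is free to schedule and vectorize each block with whatever instructions the target provides.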

* Small Size Distributions: Using a profiler to observe size
distributions for calls into libc functions, we found most operations
are small. This has made it critical to optimize short operation
latency, while preserving large operation throughput.

For `memcpy`, 96% of sizes are <= 128 bytes. 99% are <= 1024 bytes.
For `memset`, 91% of sizes are <= 128 bytes. 99.9% are <= 1024 bytes.
For `memcmp`, 99.5% of sizes are <= 128 bytes, ~100% are <= 1024 bytes.

In the rare cases where we have found exceptions to these
distributions, we’ve found efficiency bugs, such as a spuriously
copied large data structure.

* Small Code Footprint: Our implementations consist of concise
patterns for working with chunks of data. While further
specialization can produce better results on microbenchmarks, we did
not see these wins materialize on macrobenchmarks measuring
application productivity.
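For example, one common concise pattern (shown here as a hypothetical sketch, not the actual implementation) is the overlapping head/tail copy, which handles a whole range of sizes with straight-line code:

```cpp
#include <cstddef>

// Hypothetical sketch. Two fixed-width copies, one anchored at the
// head and one at the tail, may overlap in the middle; together they
// cover every count in [kSize, 2 * kSize] with no loop and no
// per-size branching.
template <size_t kSize>
void CopyHeadTail(char *dst, const char *src, size_t count) {
  __builtin_memcpy(dst, src, kSize);
  __builtin_memcpy(dst + count - kSize, src + count - kSize, kSize);
}
```

A small-size dispatcher can then branch once on the size class, e.g. `CopyHeadTail<8>` for counts in [8, 16], `CopyHeadTail<16>` for counts in (16, 32], and so on.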

* Avoiding Runtime Dispatch: Our implementation leverages the
compiler to choose appropriately wide instructions available
statically at compile time, but does not use runtime dispatch to
switch between implementations (say 128-/256-/512-bit vector widths).

In our experience, the overhead of dispatching between implementations
(via branches or the PLT) outweighed the benefits, costing ~0.5-1% on
macrobenchmarks.

Rather than tune for individual microarchitecture variations, we would
prefer to rely on fast string operations provided by the ISA (for
example “rep movsb” on x86). This lets us benefit from wider data
paths over time without building custom dispatch logic, which carries
its own overheads.
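A sketch of what relying on the ISA might look like on x86-64 (hypothetical code; `repmovsb_memcpy` is a made-up name, and the non-x86 branch simply falls back to the library):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: delegate the copy to the hardware's fast
// string support rather than dispatching on vector width in software.
void *repmovsb_memcpy(void *dst, const void *src, size_t count) {
#if defined(__x86_64__)
  void *result = dst;
  // rep movsb copies RCX bytes from [RSI] to [RDI]; the "+" operands
  // are consumed by the instruction, and memory is clobbered.
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(count)
               :
               : "memory");
  return result;
#else
  return memcpy(dst, src, count);  // portable fallback for other ISAs
#endif
}
```

On microarchitectures with enhanced rep movsb support, the same two-byte instruction sequence transparently gains the widest internal data path each processor generation provides.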

* Compiler/Library Codesign Opportunities: LLVM can lower
`memcmp() == 0` comparisons to calls to `bcmp`
(https://reviews.llvm.org/D56593). Equality comparison can be
optimized more aggressively, implementing specializations for early
mismatch detection.

For hot callsites, we can inline calls, leveraging locally available
information for size and alignment.
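To illustrate why the `bcmp` lowering helps (a hypothetical sketch, not the proposed implementation; `my_bcmp` is a made-up name): equality testing needs no lexicographic ordering, so it can compare word-sized chunks and return on the first mismatch:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch. bcmp only contracts to return nonzero when the
// buffers differ, so it can test 8-byte chunks for inequality and
// bail early, with no byte-order-sensitive subtraction as in memcmp.
int my_bcmp(const void *a, const void *b, size_t count) {
  const char *p = static_cast<const char *>(a);
  const char *q = static_cast<const char *>(b);
  size_t offset = 0;
  for (; offset + sizeof(uint64_t) <= count; offset += sizeof(uint64_t)) {
    uint64_t x, y;
    __builtin_memcpy(&x, p + offset, sizeof(x));
    __builtin_memcpy(&y, q + offset, sizeof(y));
    if (x != y) return 1;  // early mismatch: any nonzero value is valid
  }
  for (; offset < count; ++offset)  // tail bytes
    if (p[offset] != q[offset]) return 1;
  return 0;
}
```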

We see hardware support for these operations as the future. The
implementation of these instructions can access processor features
that aren’t available through the ISA and that can vary across
processor generations. When available, we would plan to fully inline
the implementations of these functions with a short, fixed instruction
sequence that provides the maximum performance available from the
hardware.

Thanks,
Chris Kennelly

> Rather than tune for individual microarchitecture variations, we would
> prefer to leverage on fast string operations provided by the ISA (for
> example “rep movsb” on x86). This allows us to leverage wider data
> paths over time, without having to build custom dispatch logic which
> carries its own overheads.

So you are going to write memcpy in C++ loops and assume LLVM will compile that to “rep movsb” unconditionally for all x86 architectures? Is this what LLVM does today with such loops, or is LLVM expected to change its codegen to support this?

What performance loss do you consider acceptable for large data copies by virtue of disallowing custom dispatch logic?

Thanks,

Jeff