[RFC] Adding a -memeq-lib-function flag to allow the user to specify a memeq function.

Hi all,

We’d like to suggest adding a -memeq-lib-function flag to allow the user to specify a **memeq()** function to improve string equality check performance.

Right now, when LLVM encounters a string equality check, e.g. `if (memcmp(a, b, s) == 0)`, it tries to expand it to an inline equality comparison if `s` is a small compile-time constant, and otherwise falls back on calling `memcmp()`.
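For illustration, here is roughly what that expansion amounts to for a constant `s == 8` (a sketch of the equivalent C++, not the code the backend actually emits):

#include <cstdint>
#include <cstring>

// Sketch: `memcmp(a, b, 8) == 0` becomes two 8-byte loads and one integer
// compare; no call and no byte-by-byte loop. (memcpy is the
// strict-aliasing-safe way to spell an unaligned load; it compiles away.)
inline bool eq8(const void *a, const void *b) {
  uint64_t x, y;
  memcpy(&x, a, 8);
  memcpy(&y, b, 8);
  return x == y;
}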

This is sub-optimal because memcmp has to compute much more than equality.

We propose adding a way for the user to specify a memeq library function (e.g. -memeq-lib-function=user_memeq) which will be called instead of memcmp() when the result of the memcmp call is only used for equality comparison.
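Presumably the user-supplied function has the same prototype as memcmp, with only the zero/nonzero distinction of its result being meaningful (`user_memeq` is just an example name):

#include <cstddef>

// Assumed contract for the function named by -memeq-lib-function: same
// prototype as memcmp, but callers only test the result against zero, so
// any nonzero value may be returned on mismatch.
extern "C" int user_memeq(const void *a, const void *b, size_t n);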

memeq can be made much more efficient than memcmp because equality comparison is trivially parallel while lexicographic ordering has a chain dependency.
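A sketch of the difference (illustrative only, not the proposed implementation):

#include <cstddef>
#include <cstdint>

// Equality: per-word differences are independent, so they can be computed
// and OR-reduced in any order (and vectorized).
bool words_equal(const uint64_t *a, const uint64_t *b, size_t nwords) {
  uint64_t diff = 0;
  for (size_t i = 0; i < nwords; ++i)
    diff |= a[i] ^ b[i];
  return diff == 0;
}

// Lexicographic ordering: the sign of the result is decided by the *first*
// mismatching byte, so each iteration depends on all previous ones having
// found no difference.
int bytes_compare(const unsigned char *a, const unsigned char *b, size_t n) {
  for (size_t i = 0; i < n; ++i)
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  return 0;
}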

We measured a very large improvement from this approach on our internal codebase. A significant portion of this improvement comes from the STL, typically std::string::operator==().

Note that this is a backend-only change. Because the C family of languages does not have a standard memeq() (POSIX had bcmp(), but it was marked legacy in 2001 and removed in POSIX.1-2008), C/C++ code cannot communicate the equality-comparison semantics to the compiler.

We did not add an RTLIB entry for memeq because the user environment is not guaranteed to contain a memeq() function as the libc has no such concept.

If there is interest, we could also contribute our optimized memeq to compiler-rt.

A proof-of-concept patch for this RFC can be found here: https://reviews.llvm.org/D56248

Comments & suggestions welcome!
Thanks,

Clement

Hi all,

We'd like to suggest adding a -memeq-lib-function flag to allow the user to specify a `memeq()` function to improve string equality check performance.

Hi, Clement,

We really shouldn't be adding backend flags for anything at this point (except for debugging and the like). A function attribute should be fine, or global metadata if necessary. A function attribute should play better with LTO, and so that's generally the recommended design point.

Right now, when LLVM encounters a string equality check, e.g. `if (memcmp(a, b, s) == 0)`, it tries to expand it to an inline equality comparison if `s` is a small compile-time constant, and otherwise falls back on calling `memcmp()`.

This is sub-optimal because memcmp has to compute much more than equality.

We propose adding a way for the user to specify a `memeq` library function (e.g. `-memeq-lib-function=user_memeq`) which will be called instead of `memcmp()` when the result of the memcmp call is only used for equality comparison.

`memeq` can be made much more efficient than `memcmp` because equality comparison is trivially parallel while lexicographic ordering has a chain dependency.

We measured a very large improvement from this approach on our internal codebase. A significant portion of this improvement comes from the STL, typically `std::string::operator==()`.

Note that this is a backend-only change. Because the C family of languages does not have a standard `memeq()` (POSIX had `bcmp()`, but it was marked legacy in 2001 and removed in POSIX.1-2008), C/C++ code cannot communicate the equality-comparison semantics to the compiler.

We did not add an RTLIB entry for memeq because the user environment is not guaranteed to contain a `memeq()` function as the libc has no such concept.

If there is interest, we could also contribute our optimized `memeq` to compiler-rt.

That would be useful.

Thanks again,

Hal

A proof-of-concept patch for this RFC can be found here: https://reviews.llvm.org/D56248

Comments & suggestions welcome!
Thanks,

Clement

Thanks for the suggestions, Hal,

So if I understand correctly, you’re recommending we add a module flag to LLVM, something like:

!llvm.module.flags = !{…, !123}
!123 = !{i32 1, !"memeq_lib_function", !"user_memeq"}

I’ve given it a try in the following patch: https://reviews.llvm.org/D56311
If this sounds reasonable I can start working on adding a CodeGenOptions entry to clang to see what this entails.

I don’t think the function attribute works here because we want this to be globally enabled instead of per-function (but maybe I misunderstood what you were suggesting).

I don’t think the function attribute works here because we want this to be globally enabled instead of per-function (but maybe I misunderstood what you were suggesting).

Hm, actually I think you might have been referring to a function attribute at call site, same as `trap-func-name` in `CodeGenModule::ConstructDefaultFnAttrList`:

if (AttrOnCallSite) {
  // Attributes that should go on the call site only.
  if (!CodeGenOpts.SimplifyLibCalls ||
      CodeGenOpts.isNoBuiltinFunc(Name.data()))
    FuncAttrs.addAttribute(llvm::Attribute::NoBuiltin);
  if (!CodeGenOpts.TrapFuncName.empty())
    FuncAttrs.addAttribute("trap-func-name", CodeGenOpts.TrapFuncName);
}

The nice thing is that `trap-func-name` paves the way as it's a pretty similar use case. Here's a v3 diff with this approach for comparison: https://reviews.llvm.org/D56313.

I don't think the function attribute works here because we want this to be globally enabled instead of per-function (but maybe I misunderstood what you were suggesting).

Our general scheme for these kinds of things is to add a per-function attribute, and then have the frontend add the attribute to every function in the module. The rationale for this scheme comes from how it interacts with the separate-compilation/LTO model: imagine I have some flag affecting code generation, say -fenable-memeq, and I use that flag when compiling some source files and not others, and then use LTO; I want the option to apply only to code in those functions which came from translation units compiled with the -fenable-memeq flag.

That having been said, I see several reasons to favor the per-call-site attribute over the function-level attribute in this case:

1. With function-level attributes, there's always the question of what to do with inlining. Either you block inlining upon an attribute mismatch, or you allow it and drop the conflicting attributes in some conservative sense. With call-site attributes, this is not an issue.

2. The attribute will apply rarely, and so while putting the attribute on all functions will increase the bitcode size / memory footprint in the common case, having it only on relevant call sites will not have that overhead.

3. While we have one function now, there could be a large number, and encoding these all in function-level attributes on every function will become unwieldy (because it will magnify the issues above). Having the frontend attach information per-call-site seems like a better approach.

To bikeshed about the name "trap-func-name": why "trap"?

-Hal

Hm, actually I think you might have been referring to a function attribute at call site, same as `trap-func-name` in `CodeGenModule::ConstructDefaultFnAttrList`:

if (AttrOnCallSite) {
    // Attributes that should go on the call site only.
    if (!CodeGenOpts.SimplifyLibCalls ||
        CodeGenOpts.isNoBuiltinFunc(Name.data()))
      FuncAttrs.addAttribute(llvm::Attribute::NoBuiltin);
    if (!CodeGenOpts.TrapFuncName.empty())
      FuncAttrs.addAttribute("trap-func-name", CodeGenOpts.TrapFuncName);
  }

The nice thing is that `trap-func-name` paves the way as it's a pretty similar use case. Here's a v3 diff with this approach for comparison: https://reviews.llvm.org/D56313.

This seems a somewhat odd and overcomplicated way to go about this.

Given that bcmp was required in POSIX until relatively recently, I will guess that almost all platforms support it already. From a quick check, glibc, FreeBSD, NetBSD, newlib, and musl all seem to contain it. So, couldn't we just add bcmp to the runtime function list for those platforms which support it? And, add an optimization to translate a call to memcmp into bcmp if it exists?

Of course, it would then also be a good idea to go back to POSIX and present the performance numbers to make a case for why it was actually a quite valuable function and should be reinstated into the standard as well.

If we are considering an optimization to convert calls to memcmp into bcmp, then does it make sense to add an intrinsic for bcmp like there is for memcmp? That way IR writers can express their requirements precisely: memcmp if you care about the direction of inequality, and bcmp if you do not.

I don’t think the function attribute works here because we want this to be globally enabled instead of per-function (but maybe I misunderstood what you were suggesting).

Our general scheme for these kinds of things is to add a per-function attribute, and then have the frontend add the attribute to every function in the module. The rationale for this scheme comes from how it interacts with the separate-compilation/LTO model: imagine I have some flag affecting code generation, say -fenable-memeq, and I use that flag when compiling some source files and not others, and then use LTO; I want the option to apply only to code in those functions which came from translation units compiled with the -fenable-memeq flag.

Thanks for the clarification.

That having been said, I see several reasons to favor the per-call-site attribute over the function-level attribute in this case:

  1. With function-level attributes, there's always the question of what to do with inlining. Either you block inlining upon an attribute mismatch, or you allow it and drop the conflicting attributes in some conservative sense. With call-site attributes, this is not an issue.
  3. While we have one function now, there could be a large number, and encoding these all in function-level attributes on every function will become unwieldy (because it will magnify the issues above). Having the frontend attach information per-call-site seems like a better approach.

Agreed.

  2. The attribute will apply rarely, and so while putting the attribute on all functions will increase the bitcode size / memory footprint in the common case, having it only on relevant call sites will not have that overhead.

This would mean that the frontend decides where to apply the optimization, so it has to know whether or not the result is used only for equality. I think the backend is a better place to do that, because there are some optimizations happening in the backend that will result in this pattern: the most obvious one is inlining, e.g. https://godbolt.org/z/nMtSAi. I can also think of constant folding or MergeICmps (which generates memcmps).
BTW the backend is already doing that analysis (see isOnlyUsedInZeroEqualityComparison()).
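(For reference, a simplified sketch of what that analysis checks, namely that every user of the call is an equality comparison against zero; the real LLVM source differs in details:)

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Simplified sketch of isOnlyUsedInZeroEqualityComparison(): true iff
// every use of V is an icmp eq/ne against the constant zero.
static bool onlyUsedInZeroEqualityComparison(const Value *V) {
  for (const User *U : V->users()) {
    const auto *Cmp = dyn_cast<ICmpInst>(U);
    if (!Cmp || !Cmp->isEquality())
      return false;
    const Value *Other =
        Cmp->getOperand(0) == V ? Cmp->getOperand(1) : Cmp->getOperand(0);
    const auto *C = dyn_cast<ConstantInt>(Other);
    if (!C || !C->isZero())
      return false;
  }
  return true;
}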

To bikeshed about the name "trap-func-name": why "trap"?

That’s the already existing call site attribute that gets attached to trap instructions. This was just to show that we’re already doing what I’m RFCing here, but for another use case. In my patch (https://reviews.llvm.org/D56313) I’m using “memeq-func-name”.

Hi David & James,

If we are considering an optimization to convert calls to memcmp into bcmp, then does it make sense to add an intrinsic for bcmp like there is for memcmp? That way IR writers can express their requirements precisely: memcmp if you care about the direction of inequality, and bcmp if you do not.

As mentioned in my answer to Hal above, I think the backend at least should be able to do this.

Then adding an intrinsic is only for convenience: if the backend can do the optimization automatically, it's enough for the frontend to emit the attribute. For example, for a language that has a runtime which has a memeq/bcmp, the frontend could just emit the call-site annotation. I have no strong opinion on whether the convenience justifies an intrinsic.

This seems a somewhat odd and overcomplicated way to go about this.

Given that bcmp was required in POSIX until relatively recently, I will guess that almost all platforms support it already. From a quick check, glibc, FreeBSD, NetBSD, newlib, and musl all seem to contain it. So, couldn't we just add bcmp to the runtime function list for those platforms which support it? And, add an optimization to translate a call to memcmp into bcmp if it exists?

That would indeed be much simpler, but this seems brittle to me. The approach you're suggesting works for us (Google) because we fully control our environment, but I'm afraid it will not work for others.
For example, someone might distribute Linux binaries built with LLVM to their users, but since there is nothing guaranteeing that bcmp is available in the user's libc, those users won't be able to run the binaries on their systems.
Is there a precedent for that approach?

Of course, it would then also be a good idea to go back to POSIX and present the performance numbers to make a case for why it was actually a quite valuable function and should be reinstated into the standard as well.

Indeed :)

However, there may be cases where the optimizer cannot determine that a call to memcmp can be reduced to bcmp. e.g. I pass the result of memcmp() to some externally-defined function unknown to LLVM. I would like to be able to specify bcmp explicitly in this case.

Ah yes, that makes sense.
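For instance (an illustrative example of such a case):

#include <cstddef>
#include <cstring>

extern void sink(int ordering);  // hypothetical externally-defined function

void f(const void *a, const void *b, size_t n) {
  // The full memcmp result escapes into code LLVM cannot see, so the
  // optimizer must assume the sign of the result matters and cannot
  // rewrite the call to bcmp. An explicit bcmp call is the only option if
  // the programmer knows only equality is ever observed.
  sink(memcmp(a, b, n));
}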

There are many library functions that are only available on some platforms and not others. LLVM can easily be told which platforms include a given function and which do not, and emit a call only for those which do.

For glibc Linux, we can be sure that bcmp is available on all systems – it is there now, it always has been, and it always will be in the future (due to backwards-compatibility guarantees). It's just defined as an alias for memcmp, however, so there's no advantage (nor, for that matter, disadvantage) to using it.

But of course it’s not only glibc which has it – it’s present in almost every environment out there.

e.g. for Linux there’s also a few other libc implementations that people use:
musl: includes it, calls memcmp.
uclibc-ng: includes it, alias of memcmp (only if you compile with UCLIBC_SUSV3_LEGACY, but “buildroot”, the way one generally uses uclibc, does so).
dietlibc: includes it, alias of memcmp.
Android bionic: doesn’t include it.

Some other platforms:
RTEMS (newlib): includes it, calls memcmp.
NetBSD: includes it, separate optimized implementation.
FreeBSD: includes it, separate optimized implementation.
Darwin/macOS/iOS/etc: includes it, alias of memcmp.
Windows: doesn’t include it.
Solaris: includes it (haven’t checked impl).
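Concretely, "includes it, calls memcmp" means little more than a forwarding definition, e.g. (a sketch, not any particular libc's source):

#include <cstring>

// A compatibility-only bcmp: correct, but by construction no faster than
// memcmp itself.
extern "C" int bcmp(const void *s1, const void *s2, size_t n) {
  return memcmp(s1, s2, n);
}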

I do note, sadly, that currently out of all these implementations, only NetBSD and FreeBSD seem to actually define a separate more optimized bcmp function. That does mean that this optimization would be effectively a no-op, for the vast majority of people.

For Google's purposes, if a call to bcmp is emitted by the compiler, you can of course just override it with a more optimal version. But if you can show a similar performance win in public code, it'd be great to attempt to push a more optimized version upstream at least to glibc. Some more precise numbers than "very large improvement" are probably necessary to show it's actually worth it. :)

This seems a somewhat odd and overcomplicated way to go about this.

Given that bcmp was required in POSIX until relatively recently, I will guess that almost all platforms support it already. From a quick check, glibc, FreeBSD, NetBSD, newlib, and musl all seem to contain it. So, couldn't we just add bcmp to the runtime function list for those platforms which support it? And, add an optimization to translate a call to memcmp into bcmp if it exists?

That would indeed be much simpler, but this seems brittle to me. The approach you're suggesting works for us (Google) because we fully control our environment, but I'm afraid it will not work for others.
For example, someone might distribute Linux binaries built with LLVM to their users, but since there is nothing guaranteeing that bcmp is available in the user's libc, those users won't be able to run the binaries on their systems.
Is there a precedent for that approach?

There are many library functions that are only available on some platforms and not others. LLVM can easily be told which platforms include a given function and which do not, and emit a call only for those which do.

For glibc Linux, we can be sure that bcmp is available on all systems – it is there now, it always has been, and it always will be in the future (due to backwards-compatibility guarantees). It's just defined as an alias for memcmp, however, so there's no advantage (nor, for that matter, disadvantage) to using it.

But of course it’s not only glibc which has it – it’s present in almost every environment out there.

e.g. for Linux there’s also a few other libc implementations that people use:
musl: includes it, calls memcmp.
uclibc-ng: includes it, alias of memcmp (only if you compile with UCLIBC_SUSV3_LEGACY, but “buildroot”, the way one generally uses uclibc, does so).

I'm afraid about the "almost" and "generally": what about users who don't?

NetBSD: includes it, separate optimized implementation.

Quoting from the NetBSD man page:
“The bcmp() interface is obsolete. Do not add new code using it. It will soon be purged. Use memcmp(9) instead. (The bcmp() function is now a macro for memcmp(9).)”

I do note, sadly, that currently out of all these implementations, only NetBSD and FreeBSD seem to actually define a separate more optimized bcmp function. That does mean that this optimization would be effectively a no-op, for the vast majority of people.

This might or might not be considered really an issue.

  • In my original proposal, people have to explicitly opt in to the feature and link in their memeq implementation; they do not get the improvement automatically.
  • In this proposal, they have to patch their libc, which might be slightly more painful depending on the system.

Here’s a patch with this proposal to see what this looks like: https://reviews.llvm.org/D56436

For Google’s purposes, if a call to bcmp is emitted by the compiler, you can of course just override it with a more optimal version.

For our purposes, anything is fine because we fully control our environment. I'm trying to make sure that we do not create pain for other users.

But if you can show a similar performance win in public code, it'd be great to attempt to push a more optimized version upstream at least to glibc. Some more precise numbers than "very large improvement" are probably necessary to show it's actually worth it. :)

We were planning to contribute it to compiler-rt. Contributing a deprecated function to the libc sounds… weird.

I'm afraid about the "almost" and "generally": what about users who don't?

Even so, it should be fine to enable it for those platforms which do include it.

I do note, sadly, that currently out of all these implementations, only NetBSD and FreeBSD seem to actually define a separate more optimized bcmp function. That does mean that this optimization would be effectively a no-op, for the vast majority of people.

This might or might not be considered really an issue.

Right, the issue is adding an effectively useless optimization to LLVM.

  • In my original proposal, people have to explicitly opt in to the feature and link in their memeq implementation; they do not get the improvement automatically.
  • In this proposal, they have to patch their libc, which might be slightly more painful depending on the system.

Users may also include a function named bcmp in their binary, which will override the one from libc.

Here’s a patch with this proposal to see what this looks like: https://reviews.llvm.org/D56436

It feels like this optimization would be better done in llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp, and the presence/name checking done via TargetLibraryInfo.cpp.
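Roughly, such a transform would have this shape (an illustrative sketch against the LLVM APIs, not the actual patch; LibFunc_bcmp is the TargetLibraryInfo entry such a patch would add):

#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Sketch: in SimplifyLibCalls, rewrite memcmp -> bcmp when the result is
// only compared against zero and TargetLibraryInfo says bcmp exists.
static Value *optimizeMemCmpToBCmp(CallInst *CI, IRBuilder<> &B,
                                   const TargetLibraryInfo *TLI) {
  if (!TLI->has(LibFunc_bcmp))               // platform has no bcmp
    return nullptr;
  if (!isOnlyUsedInZeroEqualityComparison(CI))
    return nullptr;                          // the full ordering is observed
  Module *M = CI->getModule();
  FunctionCallee BCmp = M->getOrInsertFunction(
      "bcmp", B.getInt32Ty(), B.getInt8PtrTy(), B.getInt8PtrTy(),
      CI->getArgOperand(2)->getType());
  return B.CreateCall(BCmp, {CI->getArgOperand(0), CI->getArgOperand(1),
                             CI->getArgOperand(2)});
}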

But if you can show a similar performance win in public code, it'd be great to attempt to push a more optimized version upstream at least to glibc. Some more precise numbers than "very large improvement" are probably necessary to show it's actually worth it. :)

We were planning to contribute it to compiler-rt. Contributing a deprecated function to the libc sounds… weird.

Yes, contributing an optimization for a deprecated function is indeed weird. Hence the need for reliable performance numbers to justify it – I'd never have thought that the performance cost of returning an ordering from memcmp would be important, and I suspect nobody else did.

I'm afraid about the "almost" and "generally": what about users who don't?

Even so, it should be fine to enable it for those platforms which do include it.

I do note, sadly, that currently out of all these implementations, only NetBSD and FreeBSD seem to actually define a separate more optimized bcmp function. That does mean that this optimization would be effectively a no-op, for the vast majority of people.

This might or might not be considered really an issue.

Right, the issue is adding an effectively useless optimization to LLVM.

  • In my original proposal, people have to explicitly opt in to the feature and link in their memeq implementation; they do not get the improvement automatically.
  • In this proposal, they have to patch their libc, which might be slightly more painful depending on the system.

Users may also include a function named bcmp in their binary, which will override the one from libc.

Here’s a patch with this proposal to see what this looks like: https://reviews.llvm.org/D56436

It feels like this optimization would be better done in llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp,

I’ll have a look at this approach.

and the presence/name checking done via TargetLibraryInfo.cpp.

Yes, that’s how it’s done in this version.

But if you can show a similar performance win in public code, it'd be great to attempt to push a more optimized version upstream at least to glibc. Some more precise numbers than "very large improvement" are probably necessary to show it's actually worth it. :)

We were planning to contribute it to compiler-rt. Contributing a deprecated function to the libc sounds… weird.

Yes, contributing an optimization for a deprecated function is indeed weird. Hence the need for reliable performance numbers to justify it – I'd never have thought that the performance cost of returning an ordering from memcmp would be important, and I suspect nobody else did.

Fair enough, let me give some numbers for this change.
Before that, a caveat with any benchmark for comparing strings is that the results depend a lot on the distribution of sizes and the content of those strings. So it makes more sense to benchmark an actual application, and we have our own custom benchmarks.
That being said, one of the cases where we have found this optimization to be impactful is operator==(const string&, const string&). libc++ has a family of benchmarks for this (BM_StringRelational_Eq), which I'm going to use here.

BM_StringRelational_Eq benchmarks the comparison of strings of 7 (Small), 63 (Large) and 5k (Huge) characters, in four scenarios:

  • The equal case (Control), which is theoretically the worst case as you have to prove that all bytes are equal.
  • The case when strings differ. In that case bcmp() only needs to prove that one byte differs to return nonzero. Typical cases where strings differ are at the start of the string (ChangeFirst), but also, interestingly, at the end (ChangeLast, when you are comparing strings with a common prefix, which happens frequently e.g. when comparing subsequent elements of a sorted list of strings). Another interesting case is the case when the change position is in the middle (ChangeMiddle).

For this comparison, the baseline is the call to memcmp, and the experiment is the following crude bcmp() implementation (assuming x86_64), which shows how we can take advantage of the observations above to optimize typical cases.

#include <cstddef>
#include <cstdint>
#include <cstring>

#define UNALIGNED_LOAD64(_p) (*reinterpret_cast<const uint64_t *>(_p))

extern "C" int bcmp(const void* p1, const void* p2, size_t n) throw() {
  const char* a = reinterpret_cast<const char*>(p1);
  const char* b = reinterpret_cast<const char*>(p2);
  if (n >= 8) {
    // Check the first and the last 8 bytes in parallel; either mismatch is
    // enough to prove inequality. This catches the common ChangeFirst and
    // ChangeLast cases without scanning the whole buffer.
    uint64_t u = UNALIGNED_LOAD64(a) ^ UNALIGNED_LOAD64(b);
    uint64_t v = UNALIGNED_LOAD64(a + n - 8) ^ UNALIGNED_LOAD64(b + n - 8);
    if ((u | v) != 0) {
      return 1;
    }
  }
  // Defer to memcmp for the middle bytes (and for n < 8).
  return memcmp(a, b, n);
}

Note that:

  • there is a bit of noise in the results, but you'll see that this quite dumb bcmp() reasonably improves {Large,Huge}_{ChangeFirst,ChangeLast} (note that the improvement to {Large,Huge}_ChangeLast cannot be achieved with the semantics of memcmp) without hurting the ChangeMiddle and Control cases.
  • the small-string case (size==7) is not modified by our change because operator== has a "fast path" for very small sizes.
  • We are still experimenting with the final bcmp() implementation (in particular improving the Control and ChangeMiddle cases through more parallelism). Our current version is better than memcmp() across the board on X86.

| Benchmark | base (ns) | exp (ns) |
| --- | --- | --- |
| BM_StringRelational_Eq_Empty_Empty_Control | 1.65 | 1.7 |
| BM_StringRelational_Eq_Empty_Small_Control | 1.37 | 1.4 |
| BM_StringRelational_Eq_Empty_Large_Control | 1.37 | 1.44 |
| BM_StringRelational_Eq_Empty_Huge_Control | 1.38 | 1.44 |
| BM_StringRelational_Eq_Small_Small_Control | 6.53 | 6.51 |
| BM_StringRelational_Eq_Small_Small_ChangeFirst | 1.96 | 1.94 |
| BM_StringRelational_Eq_Small_Small_ChangeMiddle | 5.06 | 4.95 |
| BM_StringRelational_Eq_Small_Small_ChangeLast | 6.77 | 6.84 |
| BM_StringRelational_Eq_Small_Large_Control | 1.38 | 1.41 |
| BM_StringRelational_Eq_Small_Huge_Control | 1.37 | 1.39 |
| BM_StringRelational_Eq_Large_Large_Control | 5.54 | 5.8 |
| BM_StringRelational_Eq_Large_Large_ChangeFirst | 6.25 | 3.06 |
| BM_StringRelational_Eq_Large_Large_ChangeMiddle | 5.5 | 5.94 |
| BM_StringRelational_Eq_Large_Large_ChangeLast | 6.04 | 3.42 |
| BM_StringRelational_Eq_Large_Huge_Control | 1.1 | 1.1 |
| BM_StringRelational_Eq_Huge_Huge_Control | 118 | 118 |
| BM_StringRelational_Eq_Huge_Huge_ChangeFirst | 5.65 | 3.02 |
| BM_StringRelational_Eq_Huge_Huge_ChangeMiddle | 69.5 | 66.9 |
| BM_StringRelational_Eq_Huge_Huge_ChangeLast | 118 | 3.43 |

I'm afraid about the "almost" and "generally": what about users who don't?

Even so, it should be fine to enable it for those platforms which do include it.

I do note, sadly, that currently out of all these implementations, only NetBSD and FreeBSD seem to actually define a separate more optimized bcmp function. That does mean that this optimization would be effectively a no-op, for the vast majority of people.

This might or might not be considered really an issue.

Right, the issue is adding an effectively useless optimization to LLVM.

  • In my original proposal, people have to explicitly opt in to the feature and link in their memeq implementation; they do not get the improvement automatically.
  • In this proposal, they have to patch their libc, which might be slightly more painful depending on the system.

Users may also include a function named bcmp in their binary, which will override the one from libc.

Here’s a patch with this proposal to see what this looks like: https://reviews.llvm.org/D56436

It feels like this optimization would be better done in llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp,

I’ll have a look at this approach.

You’re right, that’s a better place to do this indeed: https://reviews.llvm.org/D56593

I’d like to move forward with this. Since there are no objections to the approach suggested by James, I’ve tentatively sent the corresponding v5 implementation (https://reviews.llvm.org/D56593) for review.

FWIW, I'm still somewhat inclined to suggest LLVM add intrinsics for memcmp and bcmp, analogous to those for memcpy, memmove, and memset, so that we can model these cleanly and then expand them late (in the backend) instead of early, even in cases of short sequences, etc.

That doesn't make this initial patch bad; quite the opposite, I think the patch you have is going in the right direction. I just think we need to push beyond this a bit as well.