Scalable vector support and MOPS support

Hi,

Is there any plan to introduce SVE or MOPS support in the string library? (I suppose we could start by porting parts of the AOR library to C++.)

For scalable vectors, I am particularly interested in whether we can provide an abstraction layer that covers both Arm and RISC-V (or even future architectures).

Regards,
Yifan

Hi Yifan,

I think a SIMD abstraction layer is hard to do in the general case (though maybe you have some good ideas on how it could be done, or what it might look like for a subset of use cases?).

You can read a bit about the original mem* design philosophy in this thread: [libc-dev] mem* Implementations

I expect the string functions will continue to get tuned for different microarchitectures as more people turn their attention to them (and you are welcome to contribute). This is definitely an important area of work.

Regards,

David

If you think that SVE or MOPS can improve some of the implementations, we have to do two things:

  1. Demonstrate that they actually make things better using appropriate benchmarks.
  2. If the improvements are possible only on specific hardware, then we need to put an appropriate testing strategy in place, using emulators or real hardware.

About the AOR library, I have a few comments:

  1. We dropped the AOR library into the libc tree with the intention of absorbing its math functions. However, over the last year we have been able to put together high-quality greenfield implementations of the math functions. So, our plan is to delete the AOR directory completely in the coming weeks.
  2. It is better to keep the non-libc parts of the AOR library outside the libc.
  3. Our intention was never to absorb those parts of AOR which are in assembly. We want to keep the libc fully implemented in C++, in the worst case using inline assembly in unavoidable contexts.

Hi,

I was thinking about a subset of use cases in the mem* functions, where SVE enables a more branchless scan across a memory area.

As for MOPS, I think it follows the policy of choosing the hardware’s best offering, as it can provide an AArch64 version of rep movsb. However, I am aware that it will be a long time before we get real hardware with a MOPS implementation that performs reasonably in practical applications.
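For illustration, that “rep movsb analogue” is the CPYF* instruction triple from the Armv8.8-A MOPS extension. Here is a minimal sketch of a MOPS-based memcpy (the function name is mine, and it assumes a toolchain and CPU supporting MOPS):

#include <cstddef>

// Minimal sketch of a MOPS-based memcpy. The CPYFP/CPYFM/CPYFE triple
// (prologue/main/epilogue) performs a forward-only copy entirely in
// hardware, much like rep movsb on x86. Requires -march=armv8.8-a
// (or +mops) to assemble.
void *memcpy_mops(void *dst, const void *src, size_t count) {
  void *ret = dst;
  asm volatile("cpyfp [%0]!, [%1]!, %2!\n\t"
               "cpyfm [%0]!, [%1]!, %2!\n\t"
               "cpyfe [%0]!, [%1]!, %2!"
               : "+r"(dst), "+r"(src), "+r"(count)
               :
               : "cc", "memory");
  return ret;
}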

LLVM codegen already supports MOPS:
https://reviews.llvm.org/D117763
Do you need assembly/intrinsics in the libc, or does the backend need tuning?

@gchatelet owns/drives the development of the mem* functions. We are very sensitive to the performance of those functions, so, as I have said previously, any change there has to come with sufficient demonstration that we are actually benefiting.

Hi,

I agree with @dpxf and @sivachandra: the implementations of the string functions are micro-architecturally tuned, and if we want to add improvements we should benchmark them. What works for one uarch might not work for another. My contributions to the libc mem* functions for AArch64 were mainly targeting Neoverse N1, using lessons learned from AOR.

In AOR we now have a memcpy-sve.S version. It was added after benchmarking showed that an SVE loop for smaller sizes (up to 32 bytes, I believe) improved on the original cmp+branch code. I think it is feasible to add this to the libc and prove its improvement at these sizes with the available benchmarks (see the sketch below).
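For reference, the core of that small-size SVE path can also be written with ACLE intrinsics rather than assembly. A minimal sketch (my naming, assuming the size fits in one SVE vector; AOR’s actual memcpy-sve.S handles the general case):

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Branchless copy of up to one vector's worth of bytes: svwhilelt_b8
// builds a predicate covering exactly `size` lanes, replacing the
// cmp+branch ladder used for small sizes in scalar code.
void copy_small(uint8_t *__restrict dst, const uint8_t *__restrict src,
                size_t size) {
  const svbool_t pg = svwhilelt_b8(uint64_t{0}, uint64_t{size});
  svst1(pg, dst, svld1(pg, src));
}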

@sivachandra and Guillaume (couldn’t tag you, seems new users can only tag up to two users!?) I should have asked this before but do you have any CI with benchmarking to track changes in the mem* benchmarks?

do you have any CI with benchmarking to track changes in the mem* benchmarks?

Unfortunately, no, we don’t have public CI. I think that would be a great addition.
We do run the benchmarks internally at Google before doing a release, though.

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

template <class T> struct VecImpl;

// Thin wrapper mapping an element type to the matching SVE intrinsics.
#define IMPLEMENTATION(TYPE, BITCOUNT, SIGN)                                   \
  template <> struct VecImpl<TYPE> {                                           \
    using VecType = sv##TYPE;                                                  \
    static svbool_t init_predicates(bool val) {                                \
      return svdup_b##BITCOUNT(val);                                           \
    }                                                                          \
    static VecType init_register(TYPE val) {                                   \
      return svdup_##SIGN##BITCOUNT(val);                                      \
    }                                                                          \
    /* Saturating increment/decrement by the number of active elements. */     \
    template <class C> static C increase(C t, svbool_t preds) {                \
      return svqincp_b##BITCOUNT(t, preds);                                    \
    }                                                                          \
    template <class C> static C decrease(C t, svbool_t preds) {                \
      return svqdecp_b##BITCOUNT(t, preds);                                    \
    }                                                                          \
  };

IMPLEMENTATION(int8_t, 8, s);
IMPLEMENTATION(uint8_t, 8, u);

template <class T> using VectorType = typename VecImpl<T>::VecType;

// Scans forward from `address` for the first element equal to `val`, using
// first-faulting loads so the scan never faults on an unmapped page.
template <class T>
size_t find_first_unbounded(const T *__restrict address, T val) {
  size_t offset = 0;
  const svbool_t predicates = VecImpl<T>::init_predicates(true);
  svbool_t compare, valid;
  svsetffr(); // reset the first-fault register before the first load

  while (true) {
    VectorType<T> data = svldff1(predicates, &address[offset]);
    valid = svrdffr(); // which lanes were actually loaded
    if (__builtin_expect(svptest_last(predicates, valid), true)) {
      // Common case: the whole vector loaded successfully.
      compare = svcmpeq(predicates, data, val);
      offset = VecImpl<T>::increase(offset, predicates);
      if (__builtin_expect(svptest_any(predicates, compare), false)) {
        offset = VecImpl<T>::decrease(offset, predicates); // rewind to chunk start
        break;
      }
    } else {
      // Partial load: only consider the lanes that were read.
      compare = svcmpeq(valid, data, val);
      if (svptest_any(valid, compare)) {
        break;
      } else {
        offset = VecImpl<T>::increase(offset, valid);
        svsetffr(); // retry from the faulting element
      }
    }
  }
  // Count the elements before the first match in the final chunk.
  svbool_t breaked = svbrkb_z(predicates, compare);
  offset = VecImpl<T>::increase(offset, breaked);
  return offset;
}

(I did not check the code above carefully; it should be a translation of strlen-sve.S.)

As for the abstraction, I think a thin wrapper like this will work (more operations can be added, such as bounded scans and memory movement/comparison; see the sketch below). It provides the core function set for string operations while leaving some room for further internal use cases.
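To sketch one such extension: a bounded scan can reuse the same VecImpl wrapper, but with a whilelt-governed predicate instead of first-faulting loads (untested, and assuming the 8-bit instantiations above, since svwhilelt_b8 counts bytes):

// Bounded counterpart of find_first_unbounded: the governing predicate
// never extends past `size`, so plain svld1 loads are safe and no FFR
// management is needed.
template <class T>
size_t find_first_bounded(const T *__restrict address, size_t size, T val) {
  size_t offset = 0;
  while (offset < size) {
    const svbool_t active = svwhilelt_b8(uint64_t{offset}, uint64_t{size});
    VectorType<T> data = svld1(active, &address[offset]);
    const svbool_t compare = svcmpeq(active, data, val);
    if (svptest_any(active, compare)) {
      // Count the elements before the first match.
      const svbool_t before = svbrkb_z(active, compare);
      return VecImpl<T>::increase(offset, before);
    }
    offset = VecImpl<T>::increase(offset, active);
  }
  return size; // not found
}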

AFAICT it is quite hard to come up with the right abstraction.

As a matter of fact I’m working on the 3rd incarnation of the mem* functions framework (v1 is the one in production, v2 has been removed because it is too rigid, v3 will soon be posted as a patch).

Again, AFAICT there is no one-size-fits-all. Microarchitectural differences mean that even the high-level algorithm can change dramatically (e.g., the use of ZVA for bzero on AArch64). That said, I’m not opposed to abstraction layers (quite the opposite, actually); I’m just saying that the best one is usually not the first that comes to mind. So I would be in favor of tinkering with the code until good abstractions emerge.

Regarding strlen there is another caveat. In the general case it is undefined behavior in C to read past the '\0' character (in the sense that the read may fall outside the object). Now, we know that on most architectures we can at least read to the end of an aligned cache line safely, which allows for some optimizations, but this will not play nicely with sanitizers: they will, quite reasonably, flag the out-of-bounds reads.
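To make that concrete, here is the classic word-at-a-time pattern (a sketch, not libc’s actual code): every aligned 8-byte load is safe in practice because an aligned word never crosses a page boundary, yet the final load can extend past the object’s end, which a sanitizer will flag.

#include <cstddef>
#include <cstdint>

size_t strlen_wordwise(const char *s) {
  const char *p = s;
  // Advance byte-wise until p is 8-byte aligned.
  while (reinterpret_cast<uintptr_t>(p) % sizeof(uint64_t) != 0) {
    if (*p == '\0')
      return static_cast<size_t>(p - s);
    ++p;
  }
  // Aligned 8-byte loads; the load containing '\0' may read past the
  // object, which is the out-of-bounds read sanitizers complain about.
  for (;;) {
    uint64_t word;
    __builtin_memcpy(&word, p, sizeof(word));
    // Classic bit trick: non-zero iff some byte of `word` is zero.
    if ((word - 0x0101010101010101ULL) & ~word & 0x8080808080808080ULL)
      break;
    p += sizeof(word);
  }
  // Locate the exact terminator within the final word.
  while (*p != '\0')
    ++p;
  return static_cast<size_t>(p - s);
}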

Being sanitizer friendly is a goal of the libc, and indeed our current implementation of strlen is “slow” but sanitizer friendly. @michaelrj-google has an unsubmitted patch for a faster (though not microarchitecture-optimized) implementation in D129808 [libc] add unsafe mode to strlen.

It makes me curious, then, whether the fault-suppressing loads from SVE (svldff/svldnf, etc.) would make sanitizers happy?

I have to admit that sanitizers are not my forte, but I have seen some hints that it is possible to tell a sanitizer to ignore things or interpret them differently. Is there any way we can instrument libc strlen & friends with such information, so that we can use these optimizations on architectures where this is safe?

AFAIU the sanitizer will bail out if it can’t reason about the code (e.g. asm volatile).

  1. This can be fixed by using the sanitizer API to instruct the sanitizer what to do in such cases but then the libc code becomes dependent on the sanitizer code which brings a set of challenges (cyclic dependencies).

  2. We can also work the other way round by teaching the sanitizer to use the “slow but safe” version of strlen when the code is instrumented and the “fast but unsafe” one when it is not. In this case the sanitizer becomes aware of llvm-libc, and we never get to test the “unsafe” version that will eventually run in production. Because strlen is used as an implementation detail of a bunch of other libc functions, this lowers the actual coverage of the libc.

Again, AFAIU we cannot have the unsafe version instrumented in production. One possibility is to have a special compilation mode of the libc where we enable the sanitizer API (via the preprocessor) and run the instrumented unit and integration tests. This gives some confidence that the unsafe code is OK. Then, in “sanitized production” mode, we would disable the unsafe version when the sanitizer is enabled. It’s not ideal, but probably the closest we can get to correct.
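A rough shape of what that could look like (all option and function names here are hypothetical, not existing libc options; __asan_region_is_poisoned is the real sanitizer API entry point):

#include <cstddef>

#ifndef __has_feature
#define __has_feature(x) 0 // for compilers without __has_feature
#endif

#ifdef LIBC_COPT_TEST_UNSAFE_STRLEN // hypothetical build-time option
#include <sanitizer/asan_interface.h>
#endif

size_t strlen_fast_unsafe(const char *s); // wide reads, may read past '\0'
size_t strlen_byte_safe(const char *s);   // byte-wise, sanitizer friendly

size_t strlen_impl(const char *s) {
#if defined(LIBC_COPT_TEST_UNSAFE_STRLEN)
  // Instrumented test mode: exercise the unsafe version, then ask the
  // sanitizer explicitly whether the reported string region is valid.
  size_t len = strlen_fast_unsafe(s);
  if (__asan_region_is_poisoned(const_cast<char *>(s), len + 1))
    __builtin_trap();
  return len;
#elif __has_feature(address_sanitizer)
  // "Sanitized production" mode: fall back to the safe version.
  return strlen_byte_safe(s);
#else
  return strlen_fast_unsafe(s);
#endif
}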

Let me also add some info about prior art we are aware of and the problems we faced when we tried to do this ourselves:

  1. To make memory and string functions sanitizer friendly, we have seen examples from other libcs that switch to safe but performance-compromised versions when compiled with sanitizers. As @gchatelet has pointed out, this does not really build much confidence for the production environment.
  2. Using the sanitizer API to hand-craft instrumentation is fine in unavoidable scenarios. As @gchatelet has pointed out, we probably do not have other options when using inline assembly. But, inserting hand-crafted instrumentation for pure C/C++ does not seem like a scalable option.
  3. At Google, we are rolling out LLVM’s libc in a gradual fashion. We can choose to roll out a sanitizer-instrumented strlen for sanitizer builds. However, one problem we have hit is this: strlen is called from a few other sanitizer-intercepted functions. If we instrument strlen, then when the sanitizers call the real function from their interceptor, that real function ends up calling an instrumented strlen, which throws the sanitizers off.

Thanks @gchatelet and @sivachandra.

Sounds to me like it’s a bad idea to have a non-sanitized version. If you aren’t testing your production code then why test in the first place?

I would like to see if I can find a middle way, where we do size checks (very much like in memcpy) so that we can have an optimized loop for larger strlens. It will not be as fast as the ‘just do 32 bytes anyway’ approach, but it should be faster than a byte-sized cmp loop, which seems to be what the current C++ implementation would lead to?
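For reference, the “byte-sized cmp loop” in question is essentially what a straightforward C++ implementation compiles to; it never reads past the terminator:

#include <cstddef>

// One byte compared per iteration; safe for sanitizers, but it leaves
// the bandwidth of wide or vector loads on the table.
size_t strlen_bytewise(const char *s) {
  const char *p = s;
  while (*p != '\0')
    ++p;
  return static_cast<size_t>(p - s);
}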

So that leaves me with the benchmarking question: just as with memcpy, we can benchmark strlen in many different ways, making different choices for the ordering of cases look better than others. AOR has a benchmark that uses data captured from SPEC CPU 2017. Does anyone here have other data sets they’d like to see benchmarked?

I don’t know where my thoughts were when I wrote the above, but there is no middle ground to be had with strlen, since there’s no ‘length’ to check when you are looking for the first terminating character… so ignore that.