[RFC] Vector math function loop idiom recognition

Hello all,

A patch has been created in Phabricator to demonstrate the approach: ⚙ D150120 [RFC] Vector math function loop idiom recognition.

This patch extends the LoopIdiomRecognize pass to recognize a scalar math function call in a loop and transform it into a vector math function call.

For example, to transform from

  for (int i = 0; i < 100; i++)
    y[i] = exp(x[i]); // scalar math

to

  vexp(y, x, 100); // vector math

A vector math function computes the same mathematical function for a vector of operands stored in contiguous memory. By design, the average computation time per compute element in a vector math function is lower than that of the equivalent scalar math function once the number of compute elements exceeds a threshold value, resulting in better performance overall. A number of math libraries provide vector math functions, including the IBM MASS library, the Intel Math Kernel Library, etc. As an example, the vector math functions from the IBM MASS library are approximately 30 times faster per compute element (geometric mean) than their scalar equivalents on IBM Power10, measured by computing 1000 elements with valid but random input values.

The threshold value depends on the specific math function and library. The values may be generated from a heuristic and then provided to the pass in a table from TargetLibraryInfo.

The motivation for this patch is to achieve this performance benefit for various math libraries on different targets. Hence, the design is to transform the idioms into a new set of LLVM intrinsics for vector math functions, designed to be general across all math libraries and targets. A new pass added in the back-end then lowers the intrinsics to the actual vector math functions for its target. To demonstrate, this patch adds such a pass to the PowerPC target, lowering the intrinsics to the MASS library vector math functions.

In a more complex loop, preparation passes may be required before this patch can recognize the idiom. For example, when there are data dependencies on the input or output of the scalar math function within the loop, loop distribution may be required to split the dependencies into separate loops.
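For illustration (a sketch, not taken from the patch), a loop whose exp() result feeds a later statement could be distributed so that the math call stands alone:

  // Before: the exp() result is consumed in the same loop,
  // so the idiom is not recognizable as-is.
  for (int i = 0; i < 100; i++) {
    y[i] = exp(x[i]);
    z[i] = y[i] * 2.0;
  }

  // After loop distribution: the first loop matches the idiom
  // and can be transformed to vexp(y, x, 100).
  for (int i = 0; i < 100; i++)
    y[i] = exp(x[i]);
  for (int i = 0; i < 100; i++)
    z[i] = y[i] * 2.0;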

As a demonstration, this patch currently accepts the threshold value for profitability from a command-line option, and evaluates it only in loops whose trip count is known at compile time. For loops whose trip count is unknown at compile time, one solution is to version the loop and insert condition-checking code that is evaluated at run time.
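A sketch of what such versioning could look like; THRESHOLD here is a stand-in for a value that would come from the TargetLibraryInfo table:

  // Hypothetical run-time profitability check for an unknown trip count n.
  if (n >= THRESHOLD) {
    vexp(y, x, n);        // profitable: call the vector math function
  } else {
    for (int i = 0; i < n; i++)
      y[i] = exp(x[i]);   // original scalar loop
  }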

Regarding adding the new vector math functions, another idea is to expand the existing VecFunc.def to include them, instead of adding the new VectorMathFunc.def shown in this patch. With that, the VecDesc structure in TLI could be expanded and changed as below.

  struct VecDesc {
    StringRef ScalarFnName;           // scalar math function (e.g. exp)
    StringRef SIMDFnName;             // renamed: SIMD math function (e.g. expd2)
    ElementCount VectorizationFactor; // vectorization factor for the SIMD math function
    StringRef VectorFnName;           // new: vector math function (e.g. vexp)
    Intrinsic::ID IntrinID;           // new: LLVM intrinsic for the vector math function
  };

The new name “SIMD” replaces the old name “vector” to represent the type of math function that takes a vector data type as its parameter and return type, such as expd2. The name “vector” is then used for the new type of math function introduced in this patch, which operates on operands in contiguous memory, such as vexp. The names and definitions of query functions such as isFunctionVectorizable will also need to be changed accordingly. This approach may help better distinguish these two types of math functions in LLVM.
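To illustrate the distinction, here are the two shapes sketched from the examples above (the exact library prototypes may differ; for instance, the MASS vector routines take the element count by pointer):

  // “SIMD” math function: takes and returns a vector data type
  // (here the PowerPC AltiVec type), one vector of elements per call.
  vector double expd2(vector double vx);

  // “Vector” math function: operates on arrays of operands in
  // contiguous memory, with an explicit element count.
  void vexp(double y[], double x[], int n);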

Any comments would be appreciated.

Using LoopIdiomRecognize here seems impractical. Most practical use cases are going to involve some mix of math library calls, arithmetic, and non-trivial indexing, and LoopIdiom is very hard to extend in that direction; it doesn’t have the framework necessary to reason about such loops. I expect it makes more sense to integrate with the loop vectorizer.

More importantly, VPlan should learn how to use vector math libraries.


@efriedma-quic I think the motivation is to match the functions in libraries such as MASS, MKL, etc. that have been hand-optimized. Those implement only a specific kernel, not a mix of library calls, non-trivial indexing, etc.

Support in LoopVectorize would allow vectorizing user kernels, but not the use of such library calls. I am not sure how exactly these pointwise functions are hand-optimized (maybe GPU offloading with a sufficient number of elements), and finding such kernels in the wild, without the computation using the result in the same loop, might be rare. However, code can be written specifically with such idiom recognition in mind, and it remains portable even if the compiler does not recognize the idiom, so there is some use.

The approach of matching math kernels (maybe even something more complex such as gemm in the future) makes sense to me. It would be nice to have another library besides MASS make use of it as well, to show it does not only apply to IBM PowerPC. E.g., Intel’s equivalent might be v?Exp.
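For reference, the double-precision variant would look something like this (based on my reading of the MKL VML documentation; the exact prototype should be double-checked):

#include <mkl.h>

// Compute r[i] = exp(a[i]) for n contiguous doubles using MKL VML.
void exp_kernel(MKL_INT n, const double *a, double *r) {
    vdExp(n, a, r);
}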

Say you have a function that efficiently computes exp() for 128 floats, and you have a loop like

for (int i = 0; i < n; i++)
    a[i] = x * exp(a[i]);

You can transform that to something along the lines of

for (int i = 0; i < n; i += 128) {
    float tempbuffer[128];
    vec_exp(tempbuffer, a + i, 128);
    for (int j = 0; j < 128; ++j)
        a[i + j] = x * tempbuffer[j];
}

I can’t see LoopIdiom ever growing to handle that sort of thing… so I don’t want to add code we’re clearly going to have to abandon later when we want to handle real loops. And I don’t really want to encourage people to rewrite their code specifically for the sake of satisfying this very narrow LoopIdiom transform when we clearly should have the capability to optimize loops written normally.

@efriedma-quic Thank you for the suggestions!
Just to clarify, the proposed approach is not going to grow LoopIdiom to handle more complicated loops like the one you showed. To handle them, the plan is to use/improve other preparation passes, such as loop distribution, to create the simpler code pattern for LoopIdiom to process.
Do you think this plan sounds feasible compared to expanding LoopVectorize to handle it?

I can’t see how the cost modeling would work. If the point of distributing the loop is to unlock those transforms, we need to prove that they’re profitable when we distribute the loop… and if we’ve already done that, we might as well just do the transform at that point.

(Also, we currently don’t run LoopIdiom late enough in the pipeline that it would run after loop distribution.)

The vector function can take a run-time value as the size parameter, for example vec_exp(tempbuffer, a, n). I am not sure how LoopVectorize handles this case. Do you know if LoopVectorize has the utility to handle it?
For example, by using “n” as the size parameter, the output code for the same example could look like:

vec_exp(a, a, n);
for (int i = 0; i < n; ++i)
    a[i] = x * a[i];

And this code may perform better than having vec_exp(tempbuffer, a + i, 128) inside a loop.

In a different example with more statements in the loop body, such as:

for (int i = 0; i < n; i++) {
    b[i] = c[i] + k;
    a[i] = x * exp(a[i] + y);
    e[i] = b[i] * z;
}

If LoopVectorize is going to generate something like:

for (int i = 0; i < n; i += 128) {
    float tempbuffer1[128];
    float tempbuffer2[128];
    for (int j = 0; j < 128; ++j) {
        float temp = a[i+j] + y;
        tempbuffer1[j] = temp;
        b[i+j] = c[i+j] + k;
        e[i+j] = b[i+j] * z;
    }

    vec_exp(tempbuffer2, tempbuffer1, 128);

    for (int j = 0; j < 128; ++j) {
        float temp = x * tempbuffer2[j];
        a[i+j] = temp;
    }
}

It seems LoopVectorize would need to learn to distribute the loop anyway, if I understand correctly.

That’s roughly what I’m thinking, yes. (I’m not actively involved with VPlan stuff, so I can’t really give you any guidance there.)

It might make sense to do the computation in chunks even if you don’t need a temporary buffer. If you do all the exp() operations first and then all the multiply operations, you push the values out of the cache, so you might end up bottlenecked on memory accesses instead of the actual computation.
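A sketch of that chunking, reusing the hypothetical vec_exp from above and assuming, for brevity, that n is a multiple of the chunk size:

#define CHUNK 1024 // illustrative; pick a size whose data fits in cache

void scaled_exp(float *a, float x, int n) {
    // Each chunk's exp() results are multiplied while still cache-resident,
    // instead of doing all exp() calls and then all multiplies over n elements.
    for (int i = 0; i < n; i += CHUNK) {
        vec_exp(a + i, a + i, CHUNK); // hypothetical in-place vector exp
        for (int j = 0; j < CHUNK; ++j)
            a[i + j] *= x;
    }
}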