[RFC] Add struct-returning intrinsics for math functions with output pointers

Proposal

Math library functions with output pointers should be represented in LLVM as intrinsics that return structures. These intrinsics could then be emitted by clang when -fno-math-errno is set.

When emitting these intrinsics, clang would insert explicit stores that write the results to the output pointers.

The initial candidates for this would be:

  • void sincos(T val, T* sin_out, T* cos_out)
    • Becomes: { T, T } @llvm.sincos.*(T %val)
  • void sincospi(T val, T* sin_out, T* cos_out)
    • Becomes: { T, T } @llvm.sincospi.*(T %val)
  • T modf(T val, T* int_part_out)
    • Becomes: { T, T } @llvm.modf.*(T %val)

Note: The implementation of each of these intrinsics would likely be similar to the recently added llvm.frexp.* intrinsic (patch) and its associated clang builtin.

Motivation

Vectorization

Currently, sincos and sincospi can be vectorized (with -fveclib=ArmPL/sleefgnuabi), but this can introduce aliasing issues: LoopAccessAnalysis does not track the pointer operands of these calls and assumes that vectorizing library calls is safe.

Rather than update LoopAccessAnalysis to track pointer operands, modeling the out pointers with explicit stores in the IR would allow this analysis to work unchanged, solving the aliasing issues.

This would not come for free, as the vectorizer would need to be updated to handle widening calls with struct results, but that support is likely to be useful for more complex types in the future.

Note: This would also mean disallowing vectorizing library calls with in/out pointers, and likely require libraries to provide _stret variants.

New canonicalizations

When safe, llvm.sin and llvm.cos intrinsics applied to the same value could be combined into a single call to llvm.sincos. On targets that support sincos, this could provide some performance uplift.

This could also be done for sincospi (though currently there are no sinpi or cospi intrinsics).

Struct returns

Targets that implement struct-returning variants of these functions could lower to those directly, allowing the memory for the results to be elided.

Existing implementations (canonicalizations)

Both the merging of sin + cos into sincos and the lowering to _stret variants (of sincos[pi]) exist today, but both are done within SelectionDAG.

The main difference in this proposal is that they could be brought up to the IR level.

Questions

Are struct returns the best path for avoiding the aliasing issues when vectorizing library functions with output pointer parameters?

Should llvm.sincos be the canonical form of llvm.sin + llvm.cos of the same value?

An initial implementation of the llvm.sincos intrinsic and clang builtin is available here: Commits · MacDue/llvm-project · GitHub

This seems reasonable to me, for the cases listed. At least for scalar floating-point types, the lowering to machine code should be straightforward, and I guess if the target indicates that it can handle vectorized forms of the functions it must know how to lower those.

The presence of the structure return values makes me a bit nervous, because the ABI for handling floating-point structures can be tricky. But since these are intrinsics, I guess we don’t need to worry about the real ABI, and since there are only a small number of them and the functions are known to the backends anyway, that should be fine.

Is there any guarantee that sincos, when present, will return the same values as sin and cos would have individually? If not, I don’t think we should canonicalize to llvm.sincos without the afn fast-math flag.

Yep, the intention is only to lower to a function that returns a structure if there’s a known library call for that (one that returns the results in registers). Otherwise, some memory on the stack will be allocated and a library call using output pointers will be emitted.

Just so I’m on the same page, do you have any examples of floating-point structure specific ABI issues to look out for?

I’m not sure if the values are guaranteed to exactly match calling sin and cos individually (the sincos library call is an extension, so there’s probably nothing in the spec to ensure that). The idea for the fold comes from SDAG folding FSIN and FCOS to FSINCOS. The FSIN/COS ISD nodes come from the llvm.sin/cos intrinsics which are emitted when -fno-math-errno is set (which does not require fast-math/afn to be enabled).

Folding to llvm.sincos is not required for this RFC, though. My main concerns are correctly vectorizing math functions with output pointers and lowering to _stret variants when possible.