Complex proposal v3 + roundtable agenda

Ahead of Wednesday’s roundtable at the developers’ conference, here is version three of the proposal for first-class complex types in LLVM. I was not able to add Krzysztof Parzyszek’s suggestion of a “cunzip” intrinsic returning two vectors, as I could not find examples of intrinsics that return two values at the IR level. The Hexagon intrinsics declared to return two values do not, as far as I can determine, actually have both of their values used at the IR level. We can discuss this more at the roundtable.

Following is a general outline for Wednesday’s roundtable. Please have a look and make any suggestions you’d like about topics we should cover. Feel free to add to the list of questions to discuss as well.

LLVM Complex Types Roundtable

Introductions (name/affiliation if any/interest)

Reasons for a first-class type

  • Reasoning about algebraic optimization

  • Preserve semantics through vectorization and into target-specific lowering
    Different targets have support for different algorithms

  • Take advantage of faster & less precise algorithms with options/pragmas

  • Better diagnostics for users

  • Other motivations?

Open questions

  • A cunzip intrinsic would need to work at the IR level, returning two
    separate SSA values. How can this be done? The example given was a
    Hexagon-specific intrinsic that doesn’t appear to make use of the two
    destinations at the IR level.

  • Are separate extractreal/extractimag intrinsics sufficient for targets that
    support such operations (e.g. NEON’s VUZP)?

  • The proposal allows bitcasts of vector of complex, even though bitcasts of
    aggregates in general are disallowed. Is this special case reasonable?

  • If we allow such bitcasts, is czip necessary, or is shufflevector + bitcast
    to vector of complex sufficient?

  • Some frontends will likely want to communicate specific algorithms for
    computing complex values (e.g. C Annex G). What is the best way to do this?
    User compiler options? Pragmas? Function attributes? Something else?

  • What TTI interfaces would be useful to inform the optimizer how best to
    lower complex operations?

  • When should lowering be done?

  • Other questions?

I am looking forward to our discussion!

                                 -David

Proposal to Support Complex Operations in LLVM

Revision History

v1 - Initial proposal [1]

v2 - 2nd draft [2]

  • Added complex of all existing floating point types
  • Made complex a special aggregate
  • Specified literal syntax
  • Added special index values “real” and “imag” for insertvalue/extractvalue
    of complex
  • Added czip intrinsic to create vector of complex from vectors of
    real/imaginary
  • Added extractreal and extractimag intrinsics for vectors of complex
  • Added masked vector intrinsics

v3 - This proposal

  • Added vector-of-complex types
  • Added scalable vector support
  • Added bitcasts of vector of complex

Abstract

Several vendors and individuals have proposed first-class complex support in
LLVM. Goals of this proposal include better optimization, diagnostics and
general user experience.

Introduction and Motivation

Recently the topic of complex numbers arose on llvm-dev with several developers
expressing a desire for first-class IR support for complex [3] [4]. Interest in
complex numbers in LLVM goes back much further [5].

Currently clang chooses to represent standard types like “double complex” and
“std::complex” as structure types containing two scalar fields, for
example {double, double}. Consequently, arrays of complex type are represented
as, for example, [8 x {double, double}]. This has consequences for how clang
converts complex operations to LLVM IR. In general, clang emits loads of the
individual real and imaginary parts and feeds them into arithmetic operations.
Vectorization results in many shufflevector operations to massage the data into
sequences suitable for vector arithmetic.

All of the real/imaginary data manipulation obscures the underlying arithmetic.
It makes it difficult to reason about the algebraic properties of expressions.
For expressiveness and optimization ability, it would be useful to have a
higher-level representation for complex in LLVM IR. In general, it is desirable
to defer lowering of complex until the optimizer has had a reasonable chance to
exploit its properties.

First-class support for complex can also improve the user experience.
Diagnostics could express concepts in the complex domain instead of referring to
expressions containing shuffles and other low-level data manipulation. Users
that wish to examine IR directly will see much less gobbledygook and can more
easily reason about the IR.

Types

This proposal introduces new aggregate types to represent complex numbers.

c16 - Complex of 16-bit float
c32 - like float complex or std::complex<float>
c64 - like double complex or std::complex<double>
x86_c80 - Complex of x86_fp80
c128 - like long double complex or std::complex<long double>
ppc_c128 - Complex of ppc_fp128

Note that the references to C and C++ types above are simply explanatory.
Nothing in this proposal assumes any particular high-level language type will
map to the above LLVM types.

The “underlying type” of a complex type is the type of its real and imaginary
components.

The sizes of the complex types are twice that of their underlying types. The
real part of the complex shall appear first in the layout of the types. The
format of the real and imaginary parts is the same as for the complex type’s
underlying type. This should map to most common data representations of complex
in various languages.

These types are not considered floating point types for the purposes of
Type::isFloatTy and friends, llvm_anyfloat_ty, etc. in order to limit surprises
when introducing these types. New APIs will allow querying and creation of
complex types:

bool Type::isComplexTy() const;
bool Type::isComplex16Ty() const;
bool Type::isComplex32Ty() const;
bool Type::isComplex64Ty() const;
bool Type::isComplexX86_FP80Ty() const;
bool Type::isComplex128Ty() const;
bool Type::isComplexPPC_FP128Ty() const;

The types are a special kind of aggregate, giving them access to the insertvalue
and extractvalue operations with special notation (see below).

We can define vectors of complex:

<8 x c16> - Vector of eight complex of 16-bit float (256 bits total)
<4 x c32> - Vector of four complex of 32-bit float (256 bits total)
<4 x c64> - Vector of four complex of 64-bit float (512 bits total)

Such vectors may be scalable:

<vscale x 8 x c16>
<vscale x 4 x c32>
<vscale x 1 x c64>
<vscale x 2 x c64>

Analogous ValueTypes will be used by intrinsics.

def c16 : ValueType<32, uuu>;
def c32 : ValueType<64, vvv>;
def c64 : ValueType<128, www>;
def x86c80 : ValueType<160, xxx>;
def c128 : ValueType<256, yyy>;
def ppcc128 : ValueType<256, zzz>;

def v8c16 : ValueType<256, aaa>;
def v4c32 : ValueType<256, bbb>;
def v4c64 : ValueType<512, ccc>;

def nxv8c16 : ValueType<256, ddd>;
def nxv4c32 : ValueType<256, eee>;
def nxv1c64 : ValueType<128, fff>;
def nxv2c64 : ValueType<256, ggg>;

def llvm_anycomplex_ty : LLVMType;
def llvm_c16_ty : LLVMType<c16>;
def llvm_c32_ty : LLVMType<c32>;
def llvm_c64_ty : LLVMType<c64>;
def llvm_x86c80_ty : LLVMType<x86c80>;
def llvm_c128_ty : LLVMType<c128>;
def llvm_ppcc128_ty : LLVMType<ppcc128>;

def llvm_v8c16_ty : LLVMType<v8c16>;
def llvm_v4c32_ty : LLVMType<v4c32>;
def llvm_v4c64_ty : LLVMType<v4c64>;

The numbering of the ValueTypes will be determined after discussion. It may be
desirable to insert the scalar types before the existing vector types, grouping
them with the other scalar types or we may want to put them somewhere else.
Similarly, the vector types may be grouped with the other vector types or
somewhere else.

Literals

Literal complex values have the special spelling ‘(’ <real literal> ‘+’|’-’
<imaginary literal> ‘i’ ‘)’:

%v1 = c64 ( 5.67 + 1.56i )
%v2 = c64 ( 55.87 - 4.23i )
%v3 = c64 ( 55.87 + -4.23i )
%v4 = c32 ( 8.24e+2 + 0.0i )
%v5 = c16 ( 0.0 + 0.0i )

Note that the literal representation requires an explicit specification of the
imaginary part, even if zero. A “redundant” ‘+’ before a negative imaginary
value is allowed to facilitate reuse of floating point constants.

Operations

This proposal overloads existing floating point instructions for complex types
in order to leverage existing expression optimizations:

c64 %res = fadd c64 %a, c64 %b
v8c64 %res = fsub v8c64 %a, v8c64 %b
c128 %res = fmul c128 %a, c128 %b
v4c64 %res = fdiv v4c64 %a, v4c64 %b

The only valid comparisons of complex values shall be equality:

i1 %res = eq c32 %a, c32 %b
v8i1 %res = eq v8c32 %a, v8c32 %b
i1 %res = ne c64 %a, c64 %b
v8i1 %res = ne v8c64 %a, v8c64 %b

select is defined for complex:

c32 %res = select i1 %cmp, c32 %a, c32 %b
v4c64 %res = select v4i1 %cmp, v4c64 %a, v4c64 %b

Complex values may be casted to other complex types:

c32 %res = fptrunc c64 %a to c32
c64 %res = fpext c32 %a to c64

As a special case, vectors of complex may be bitcasted to vectors of their
underlying type:

v8f32 %res = bitcast <4 x c32> %a to <8 x float>

Complex types were defined as aggregates above, but special ones. One aspect of
their specialness is allowing bitcasts of vector of complex to equal-width
vectors of their underlying type.

insertvalue and extractvalue may be used with the special index values “real”
and “imag”:

%real = f32 extractvalue c32 %a, real
%tmp = c64 insertvalue c64 undef, f64 %r, real
%cplx = c64 insertvalue c64 %tmp, f64 %i, imag

The pseudo-value “real” shall evaluate to the integer constant zero and the
pseudo-value “imag” shall evaluate to the integer constant one, as if
extractvalue/insertvalue were written with 0/1. The use of any other index with
a complex value is undefined.

We also overload existing intrinsics:

declare c16 @llvm.sqrt.c16(c16 %val)
declare c32 @llvm.sqrt.c32(c32 %val)
declare c64 @llvm.sqrt.c64(c64 %val)
declare x86_c80 @llvm.sqrt.x86_c80(x86_c80 %val)
declare c128 @llvm.sqrt.c128(c128 %val)
declare ppc_c128 @llvm.sqrt.ppc_c128(ppc_c128 %val)

declare c16 @llvm.pow.c16(c16 %val, c16 %power)
declare c32 @llvm.pow.c32(c32 %val, c32 %power)
declare c64 @llvm.pow.c64(c64 %val, c64 %power)
declare x86_c80 @llvm.pow.x86_c80(x86_c80 %val, x86_c80 %power)
declare c128 @llvm.pow.c128(c128 %val, c128 %power)
declare ppc_c128 @llvm.pow.ppc_c128(ppc_c128 %val, ppc_c128 %power)

declare c16 @llvm.sin.c16(c16 %val)
declare c32 @llvm.sin.c32(c32 %val)
declare c64 @llvm.sin.c64(c64 %val)
declare x86_c80 @llvm.sin.x86_c80(x86_c80 %val)
declare c128 @llvm.sin.c128(c128 %val)
declare ppc_c128 @llvm.sin.ppc_c128(ppc_c128 %val)

declare c16 @llvm.cos.c16(c16 %val)
declare c32 @llvm.cos.c32(c32 %val)
declare c64 @llvm.cos.c64(c64 %val)
declare x86_c80 @llvm.cos.x86_c80(x86_c80 %val)
declare c128 @llvm.cos.c128(c128 %val)
declare ppc_c128 @llvm.cos.ppc_c128(ppc_c128 %val)

declare c16 @llvm.log.c16(c16 %val)
declare c32 @llvm.log.c32(c32 %val)
declare c64 @llvm.log.c64(c64 %val)
declare x86_c80 @llvm.log.x86_c80(x86_c80 %val)
declare c128 @llvm.log.c128(c128 %val)
declare ppc_c128 @llvm.log.ppc_c128(ppc_c128 %val)

declare half @llvm.fabs.c16(c16 %val)
declare float @llvm.fabs.c32(c32 %val)
declare double @llvm.fabs.c64(c64 %val)
declare x86_fp80 @llvm.fabs.x86_c80(x86_c80 %val)
declare fp128 @llvm.fabs.c128(c128 %val)
declare ppc_fp128 @llvm.fabs.ppc_c128(ppc_c128 %val)

Conversion to/from half-precision overloads the existing intrinsics.

llvm.convert.to.c16.* - Overloaded intrinsic to convert to c16.

declare c16 @llvm.convert.to.c16.c32(c32 %val)
declare c16 @llvm.convert.to.c16.c64(c64 %val)

llvm.convert.from.c16.* - Overloaded intrinsic to convert from c16.

declare c32 @llvm.convert.from.c16.c32(c16 %val)
declare c64 @llvm.convert.from.c16.c64(c16 %val)

In addition, new intrinsics will be used for complex-specific operations:

llvm.cconj.* - Overloaded intrinsic to compute the conjugate of a
complex value

declare c16 @llvm.cconj.c16(c16 %val)
declare c32 @llvm.cconj.c32(c32 %val)
declare c64 @llvm.cconj.c64(c64 %val)
declare x86_c80 @llvm.cconj.x86_c80(x86_c80 %val)
declare c128 @llvm.cconj.c128(c128 %val)
declare ppc_c128 @llvm.cconj.ppc_c128(ppc_c128 %val)

llvm.czip.* - Overloaded intrinsic to create a vector of complex from two
vectors of floating-point type (not all variants shown)

declare v4c32 @llvm.czip.v4c32(v4f32 %real, v4f32 %imag)
declare v4c64 @llvm.czip.v4c64(v4f64 %real, v4f64 %imag)

llvm.extractreal.* - Overloaded intrinsic to create a vector of floating-point
type from the real portions of a vector of complex (not all
variants shown)

declare v4f32 @llvm.extractreal.v4c32(v4c32 %val)
declare v4f64 @llvm.extractreal.v4c64(v4c64 %val)

llvm.extractimag.* - Overloaded intrinsic to create a vector of floating-point
type from the imaginary portions of a vector of complex
(not all variants shown)

declare v4f32 @llvm.extractimag.v4c32(v4c32 %val)
declare v4f64 @llvm.extractimag.v4c64(v4c64 %val)

Masked intrinsics are also overloaded. The complex types are considered a
single logical entity and thus the mask bits correspond to the complex value as
a whole, not the individual real and imaginary parts:

llvm.masked.load.* - Overloaded intrinsic to load complex under mask
(not all variants shown)

declare v4c32 @llvm.masked.load.v4c32.p0v4c32(<4 x c32>* %ptr,
                                              i32 %alignment,
                                              <4 x i1> %mask,
                                              <4 x c32> %passthrough)

declare v8c64 @llvm.masked.load.v8c64.p0v8c64(<8 x c64>* %ptr,
                                              i32 %alignment,
                                              <8 x i1> %mask,
                                              <8 x c64> %passthrough)

llvm.masked.store.* - Overloaded intrinsic to store complex under mask (not all
variants shown)

declare void @llvm.masked.store.v4c32.p0v4c32(<4 x c32> %val,
                                              <4 x c32>* %ptr,
                                              i32 %alignment,
                                              <4 x i1> %mask)

declare void @llvm.masked.store.v8c64.p0v8c64(<8 x c64> %val,
                                              <8 x c64>* %ptr,
                                              i32 %alignment,
                                              <8 x i1> %mask)

llvm.masked.gather.* - Overloaded intrinsic to gather complex under mask (not
all variants shown)

declare v4c32 @llvm.masked.gather.v4c32.p0v4c32(<4 x c32*> %ptrs,
                                                i32 %alignment,
                                                <4 x i1> %mask,
                                                <4 x c32> %passthrough)

declare v8c64 @llvm.masked.gather.v8c64.p0v8c64(<8 x c64*> %ptrs,
                                                i32 %alignment,
                                                <8 x i1> %mask,
                                                <8 x c64> %passthrough)

llvm.masked.scatter.* - Overloaded intrinsic to scatter complex under mask (not
all variants shown)

declare void @llvm.masked.scatter.v4c32.p0v4c32(<4 x c32> %val,
                                                <4 x c32*> %ptrs,
                                                i32 %alignment,
                                                <4 x i1> %mask)

declare void @llvm.masked.scatter.v8c64.p0v8c64(<8 x c64> %val,
                                                <8 x c64*> %ptrs,
                                                i32 %alignment,
                                                <8 x i1> %mask)

llvm.masked.expandload.* - Overloaded intrinsic to expandload complex under mask
(not all variants shown)

declare v4c32 @llvm.masked.expandload.v4c32.p0v4c32(c32* %ptr,
                                                    <4 x i1> %mask,
                                                    <4 x c32> %passthrough)

declare v8c64 @llvm.masked.expandload.v8c64.p0v8c64(c64* %ptr,
                                                    <8 x i1> %mask,
                                                    <8 x c64> %passthrough)

llvm.masked.compressstore.* - Overloaded intrinsic to compressstore complex
under mask (not all variants shown)

declare void @llvm.masked.compressstore.v4c32.p0v4c32(<4 x c32> %val,
                                                      c32* %ptr,
                                                      <4 x i1> %mask)

declare void @llvm.masked.compressstore.v8c64.p0v8c64(<8 x c64> %val,
                                                      c64* %ptr,
                                                      <8 x i1> %mask)

Conclusion

This proposal introduces new complex types, overloads existing floating point
instructions and intrinsics for common complex operations, and introduces new
intrinsics for complex-specific operations.

Goals of this work include better reasoning about complex operations within
LLVM, leading to better optimization, reporting and overall user experience.

This is a draft and subject to change.

[1] [llvm-dev] RFC: Complex in LLVM
[2] [llvm-dev] Complex proposal v2
[3] [llvm-dev] EuroLLVM Numerics issues
[4] [llvm-dev] EuroLLVM Numerics issues
[5] [LLVMdev] complex numbers with LLVM

Hi,

There’s growing interest among our users to make better use of dedicated hardware instructions for complex math and I would like to re-start the discussion on the topic. Given that the original thread was started a while ago, apologies if I missed anything already discussed earlier on the list or at the round-table. The original mail is quoted below.

In particular, I’m interested in the AArch64 side of things, like using FCMLA [1] for complex multiplications to start with.

To get the discussion going, I’d like to share an alternative pitch. Instead of starting with adding complex types, we could start with adding a set of intrinsics that operate on complex values packed into vectors instead.

Starting with intrinsics would allow us to bring up the lowering of those intrinsics to target-specific nodes incrementally without having to make substantial changes across the codebase, as adding new types would require. Initially, we could try to match IR patterns that correspond to complex operations late in the pipeline. We can then work on incrementally moving the point where the intrinsics are introduced earlier in the pipeline, as we adapt more passes to deal with them. This way, we won’t have to teach all passes about complex types at once or risk losing all the existing combines on the corresponding floating point operations.

I think if we introduce a small set of intrinsics for complex math (like @llvm.complex.multiply) we could use them to improve code-generation in key passes like the vectorizers and deliver large improvements to our users fairly quickly. There might be some scenarios which require a dedicated IR type, but I think we can get a long way with a set of specialized intrinsics at a much lower cost. If we later decide that dedicated IR types are needed, replacing the intrinsics should be easy and we will benefit from having already updated various passes to deal with the intrinsics.

We took a similar approach when adding matrix support to LLVM and I think that worked out very well in the end. The implementation upstream generates equivalent or better code than our earlier implementation using dedicated IR matrix types, while being simpler and impacting a much smaller area of the codebase.

An independent issue to discuss is how to generate complex math intrinsics.
As part of the initial bring-up, I’d propose matching the code Clang generates for operations on std::complex<> & co to introduce the complex math intrinsics. This won’t be perfect and will miss cases, but it allows us to deliver initial improvements without requiring extensive updates to existing libraries or frontends. I don’t think either the intrinsic-only or the complex-type variants are inherently more convenient for frontends to emit.

To better illustrate what this approach could look like, I put up a set of rough patches that introduce a @llvm.complex.multiply intrinsic (https://reviews.llvm.org/D91347), replace a set of fadd/fsub/fmul instructions with @llvm.complex.multiply (https://reviews.llvm.org/D91353) and lower the intrinsic for FCMLA on AArch64 (https://reviews.llvm.org/D91354). Note that those are just rough proof-of-concept patches.

Cheers,
Florian

[1] https://developer.arm.com/docs/ddi0596/h/simd-and-floating-point-instructions-alphabetic-order/fcmla-floating-point-complex-multiply-accumulate

Hi Florian,

The proposed experimental intrinsics are a difficult detour to accept
for performance reasons. With a complex type, the usual algebraic
simplifications fall out for free (or close to it). Teaching existing
optimizations how to handle the new complex intrinsics seems like a
LOT of unnecessary work.

That said, we recently had this same conversation at Simon Moll's
native predication sync-up meeting. Simon had some convincing ways to
workaround predicated intrinsic optimization (e.g. the
PredicatedInstruction class). Maybe we should explore a more
generalized solution that would cover complex intrinsics too?

Digressing a bit, have we ever discussed using a branch to develop
something like complex support? That way we would avoid an
experimental intrinsic implementation, but also not disturb the
codebase until the implementation is complete.

-Cameron


> The proposed experimental intrinsics are a difficult detour to accept
> for performance reasons. With a complex type, the usual algebraic
> simplifications fall out for free (or close to it). Teaching existing
> optimizations how to handle the new complex intrinsics seems like a
> LOT of unnecessary work.

Thanks for taking a look!

Could you expand a bit more on what kind of unnecessary work you expect? I would expect most of the code dealing with intrinsics to be easily migrated once/if we decide to switch to a dedicated type.

Concretely, for the lowering code, it should hopefully just boil down to updating the patterns that get matched in the backends (matching the complex multiply instruction instead of @llvm.complex.multiply). For supporting complex math in the vectorizers, we have to add support for cost-modeling and support widening the intrinsics. Again, wouldn’t changing from intrinsics to a type just mean adjusting from dealing with intrinsic calls to the corresponding instructions?

There are certainly some pieces around the edges that will need adjusting or become obsolete, but I would hope/expect the majority of the work to be re-usable.

As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.

I’ll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs by other people & backends. It also allows reusing the existing instructions for insert/extract/shuffles and the corresponding folds.

Also I’ve been looking at this through an AArch64 lens, so this proposal should be easy to map to the available instructions there. I would appreciate any thoughts on how this might clash with the available instructions on other targets.

> That said, we recently had this same conversation at Simon Moll’s
> native predication sync-up meeting. Simon had some convincing ways to
> workaround predicated intrinsic optimization (e.g. the
> PredicatedInstruction class). Maybe we should explore a more
> generalized solution that would cover complex intrinsics too?
>
> Digressing a bit, have we ever discussed using a branch to develop
> something like complex support? That way we would avoid an
> experimental intrinsic implementation, but also not disturb the
> codebase until the implementation is complete.

Personally I would be cautious about going down that road. I think there is a big benefit to being able to bring up support on the main branch, in a way that the code is exercised by default. It provides substantial test coverage and allows us to spot regressions early on, hence avoiding a ‘big switch’ at the end. And it allows people to collaborate more easily than on a separate branch, which also needs to be kept up-to-date.

I would expect adding dedicated complex types to be a big project, which will require a substantial amount of work to be done up front before yielding tangible benefits. If there’s a clear indication that dedicated types make our lives substantially easier, there’s buy-in and people willing to tackle this, I am all for it. Historically however, getting new types in has been challenging and there are quite a few pieces in LLVM that have been added but were never pushed across the finish line. Whatever direction we take, this is something I would like to avoid.

Cheers,
Florian

Some architectures have instructions that assist with complex arithmetic. Without intrinsics it may be hard to use such instructions, especially because of the arithmetic simplifications. Perhaps, depending on TTI, those intrinsics could be expanded into the explicit arithmetic?


>> The proposed experimental intrinsics are a difficult detour to accept
>> for performance reasons. With a complex type, the usual algebraic
>> simplifications fall out for free (or close to it). Teaching existing
>> optimizations how to handle the new complex intrinsics seems like a
>> LOT of unnecessary work.

> Thanks for taking a look!
>
> Could you expand a bit more on what kind of unnecessary work you expect? I would expect most of the code dealing with intrinsics to be easily migrated once/if we decide to switch to a dedicated type.
>
> Concretely, for the lowering code, it should hopefully just boil down to updating the patterns that get matched in the backends (matching the complex multiply instruction instead of @llvm.complex.multiply). For supporting complex math in the vectorizers, we have to add support for cost-modeling and support widening the intrinsics. Again, wouldn’t changing from intrinsics to a type just mean adjusting from dealing with intrinsic calls to the corresponding instructions?
>
> There are certainly some pieces around the edges that will need adjusting or become obsolete, but I would hope/expect the majority of the work to be re-usable.
>
> As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.

Yes, the IR passes are where I see complex types win. For example,
pretty much all of InstCombine's visitFMul(...) transformations should
not depend on type.*** So there's no need to duplicate all that code
for @llvm.complex.multiply.

It's true that teaching the pattern matcher to recognize intrinsics
gets us a long way. But we'd also need to update IR passes in places
where the Opcode is explicitly checked. E.g.:

  switch (Opcode) {
  <snipped>
  case Instruction::FAdd: <== ***HERE***
    return SimplifyFAddInst(LHS, RHS, FastMathFlags(), Q, MaxRecurse);

And when a transformation succeeds, we'd need to update how the result
is returned. E.g.:

  // (-X) + Y --> Y - X
  Value *X, *Y;
  if (match(&I, m_c_FAdd(m_FNeg(m_Value(X)), m_Value(Y))))
    return BinaryOperator::CreateFSubFMF(Y, X, &I); <== ***HERE***

Both of those come for free with a complex type, along with the
pattern matching code.

I'm not opposed to finding solutions for these problems, but this
pitch has been around for a while now, and it hasn't really taken off
yet. It's hard to rally behind it again.

(***) Let me point out that my number theory and abstract algebra are
*rusty*. So don't be fooled into thinking my initial assumptions about
complex numbers are correct.

I’ll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs by other people & backends. It also allows us to reuse the existing instructions for insert/extract/shuffles and the corresponding folds.

Generating the special instructions is good, but not if it costs other
optimizations. Let's get both. :smiley:

But in all seriousness, my opinion on this topic has hit stark
resistance before. I won't harp on it if others don't feel the same.
It's just pretty clear to me that experimental intrinsics are a major
hurdle to performant code for larger projects.

Hi,

There’s growing interest among our users to make better use of dedicated hardware instructions for complex math and I would like to re-start the discussion on the topic. Given that the original thread was started a while ago, apologies if I missed anything already discussed earlier on the list or at the round-table. The original mail is quoted below.

In particular, I’m interested in the AArch64 side of things, like using FCMLA [1] for complex multiplications to start with.

To get the discussion going, I’d like to share an alternative pitch. Instead of starting with adding complex types, we could start with adding a set of intrinsics that operate on complex values packed into vectors instead.

Starting with intrinsics would allow us to bring up the lowering of those intrinsics to target-specific nodes incrementally, without having to make substantial changes across the codebase, as adding new types would require. Initially, we could try and match IR patterns that correspond to complex operations late in the pipeline. We can then work on incrementally moving the point where the intrinsics are introduced earlier in the pipeline, as we adapt more passes to deal with them. This way, we won’t have to teach all passes about complex types at once or risk losing all the existing combines on the corresponding floating-point operations.

I think if we introduce a small set of intrinsics for complex math (like @llvm.complex.multiply) we could use them to improve code-generation in key passes like the vectorizers and deliver large improvements to our users fairly quickly. There might be some scenarios which require a dedicated IR type, but I think we can get a long way with a set of specialized intrinsics at a much lower cost. If we later decide that dedicated IR types are needed, replacing the intrinsics should be easy and we will benefit from having already updated various passes to deal with the intrinsics.

We took a similar approach when adding matrix support to LLVM and I think that worked out very well in the end. The implementation upstream generates equivalent or better code than our earlier implementation using dedicated IR matrix types, while being simpler and impacting a much smaller area of the codebase.

An independent issue to discuss is how to generate complex math intrinsics.
As part of the initial bring-up, I’d propose matching the code Clang generates for operations on std::complex<> & co to introduce the complex math intrinsics. This won’t be perfect and will miss cases, but it allows us to deliver initial improvements without requiring extensive updates to existing libraries or frontends. I don’t think either the intrinsic-only or the complex-type variant is inherently more convenient for frontends to emit.

To better illustrate what this approach could look like, I put up a set of rough patches that introduce a @llvm.complex.multiply intrinsic (https://reviews.llvm.org/D91347), replace a set of fadd/fsub/fmul instructions with @llvm.complex.multiply (https://reviews.llvm.org/D91353) and lower the intrinsic for FCMLA on AArch64 (https://reviews.llvm.org/D91354). Note that those are just rough proof-of-concept patches.

Cheers,
Florian

Hi Florian,

The proposed experimental intrinsics are a difficult detour to accept
for performance reasons. With a complex type, the usual algebraic
simplifications fall out for free (or close to it). Teaching existing
optimizations how to handle the new complex intrinsics seems like a
LOT of unnecessary work.

That said, we recently had this same conversation at Simon Moll's
native predication sync-up meeting. Simon had some convincing ways to
work around predicated intrinsic optimization (e.g. the
PredicatedInstruction class). Maybe we should explore a more
generalized solution that would cover complex intrinsics too?

The generalized pattern matching in the VP reference patch is not
VP-specific, e.g. it is parameterized in the abstraction. That means we
can lift InstCombine/InstSimplify once on top of that abstraction and
then instantiate that (it's literally a template parameter) for (0)
regular LLVM instructions, (1) constrained FP intrinsics, (2) complex
intrinsics, (3) VP... hypothetically even (4) constrained/complex/VP
intrinsics.

I'll send out a separate RFC on how that generalized pattern match works
- it's about time we get working on this since use cases keep piling up..

- Simon

Krzysztof Parzyszek via llvm-dev <llvm-dev@lists.llvm.org> writes:

Some architectures have instructions that assist with complex
arithmetic. Without intrinsics it may be hard to use such
instructions especially because of the arithmetic simplifications.
Perhaps, depending on TTI, those intrinsics could be expanded into the
explicit arithmetic?

Can you provide some examples of what you mean?

           -David

Examples of complex instructions? Hexagon has a cmpy instruction for complex multiplication, although it works on int16-based values. There is a whole set of these instructions (with various saturation/rounding behaviors), but each one of them can multiply two complex numbers (where a complex number is represented as two 16-bit values in a 32-bit register). That's a fairly complex pattern to match, and it could be quite easily disturbed by various arithmetic simplifications.

What I meant with the second part was that if we start with intrinsics, then on targets that don't support complex arithmetic, those intrinsics could be expanded into the explicit real-number-based arithmetic in instcombine. This could be communicated via a function in TargetTransformInfo. Instcombine already allows targets to handle their own intrinsics in it, so maybe this would still fall under that category. This could improve performance in the short term, while the knowledge of the intrinsics is gradually added to the rest of the code.

There certainly are some pieces around the edges that will need adjusting or become obsolete, but I would hope/expect the majority of the work to be re-usable.

As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.

Yes, the IR passes are where I see complex types win. For example,
pretty much all of InstCombine’s visitFMul(…) transformations should
not depend on type.*** So there’s no need to duplicate all that code
for @llvm.complex.multiply.

It’s true that teaching the pattern matcher to recognize intrinsics
gets us a long way. But we’d also need to update IR passes in places
where the Opcode is explicitly checked. E.g.:

  switch (Opcode) {
  <snipped>
  case Instruction::FAdd: <== ***HERE***
    return SimplifyFAddInst(LHS, RHS, FastMathFlags(), Q, MaxRecurse);

And when a transformation succeeds, we’d need to update how the result
is returned. E.g.:

  // (-X) + Y --> Y - X
  Value *X, *Y;
  if (match(&I, m_c_FAdd(m_FNeg(m_Value(X)), m_Value(Y))))
    return BinaryOperator::CreateFSubFMF(Y, X, &I); <== ***HERE***

Both of those come for free with a complex type, along with the
pattern matching code.

Thanks for expanding on this! I think for InstructionSimplify it should be fairly easy to treat @llvm.complex.multiply & co as their instruction counterparts. InstCombine (or other passes that create new instructions as they combine) will require additional changes.

This is certainly a trade-off compared to adding dedicated types. But given the additional complexity and overhead that adding new types entails, I think the intrinsics are a good alternative that can get us quite far, especially if we can re-use/build on some of the techniques Simon mentioned.

I’m not opposed to finding solutions for these problems, but this
pitch has been around for a while now, and it hasn’t really taken off
yet. It’s hard to rally behind it again.

If there’s been an earlier discussion about introducing a small set of intrinsics, I think I missed that. The references linked from the complex type proposal seem to all focus on adding a new type.

(***) Let me point out that my number theory and abstract algebra are
rusty. So don’t be fooled into thinking my initial assumptions about
complex numbers are correct.

I’ll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs by other people & backends. It also allows us to reuse the existing instructions for insert/extract/shuffles and the corresponding folds.

Generating the special instructions is good, but not if it costs other
optimizations. Let’s get both. :smiley:

But in all seriousness, my opinion on this topic has hit stark
resistance before. I won’t harp on it if others don’t feel the same.
It’s just pretty clear to me that experimental intrinsics are a major
hurdle to performant code for larger projects.

That’s an interesting point. I am expecting to get most benefits in terms of performance from better vectorizing complex math by using specialized hardware instructions and would expect benefits from better combines to be smaller. But besides reports from our users, I do not have anything to back this up and I suppose this highly depends on the particulars of the codebases.

As discussed earlier, the intrinsics pitch comes with some trade-offs when it comes to re-using existing combines and simplifications. But the key benefit I see is that it is much more lightweight (and hence hopefully easier to get started with) and gives us a path forward to make incremental and noticeable improvements in a relatively short time-span. Even if that doesn’t get us to a perfect state directly, it should make it much easier to start moving in the right direction. And once we hit the point at which we really need dedicated types, we will already have laid much of the groundwork with the intrinsics.

Cheers,
Florian

That’s a good point! I think one challenge will be that the hardware instructions might be quite different from target to target. I was thinking that we could just introduce/use those intrinsics on targets where they can be lowered efficiently.

Once we have support in the vectorizers, it might be beneficial to use the intrinsics even for targets that do not natively support them. I think for those cases, adding a pass that lowers them to regular IR instructions for targets that do not have dedicated instructions would be a great idea. I think we did something similar for the experimental reduction intrinsics.

Cheers,
Florian

There certainly are some pieces around the edges that will need adjusting or become obsolete, but I would hope/expect the majority of the work to be re-usable.

As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.

Yes, the IR passes are where I see complex types win. For example,
pretty much all of InstCombine's visitFMul(...) transformations should
not depend on type.*** So there's no need to duplicate all that code
for @llvm.complex.multiply.

It's true that teaching the pattern matcher to recognize intrinsics
gets us a long way. But we'd also need to update IR passes in places
where the Opcode is explicitly checked. E.g.:

 switch (Opcode) {
 <snipped>
 case Instruction::FAdd: <== ***HERE***
   return SimplifyFAddInst(LHS, RHS, FastMathFlags(), Q, MaxRecurse);

And when a transformation succeeds, we'd need to update how the result
is returned. E.g.:

 // (-X) + Y --> Y - X
 Value *X, *Y;
 if (match(&I, m_c_FAdd(m_FNeg(m_Value(X)), m_Value(Y))))
   return BinaryOperator::CreateFSubFMF(Y, X, &I); <== ***HERE***

Both of those come for free with a complex type, along with the
pattern matching code.

Thanks for expanding on this! I think for InstructionSimplify it should be fairly easy to treat @llvm.complex.multiply & co as their instruction counterparts. InstCombine (or other passes that create new instructions as they combine) will require additional changes.

This is certainly a trade-off compared to adding dedicated types. But given the additional complexity and overhead that adding new types entails, I think the intrinsics are a good alternative that can get us quite far, especially if we can re-use/build on some of the techniques Simon mentioned.

I'm not opposed to finding solutions for these problems, but this
pitch has been around for a while now, and it hasn't really taken off
yet. It's hard to rally behind it again.

If there’s been an earlier discussion about introducing a small set of intrinsics, I think I missed that. The references linked from the complex type proposal seem to all focus on adding a new type.

Sorry, I wasn't clear. I meant the pitch to optimize experimental
intrinsics in general. That general plan to optimize experimental
intrinsics has been around since the constrained FP intrinsics project
started. Probably 2016 or 2017.

(***) Let me point out that my number theory and abstract algebra are
*rusty*. So don't be fooled into thinking my initial assumptions about
complex numbers are correct.

I’ll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs by other people & backends. It also allows us to reuse the existing instructions for insert/extract/shuffles and the corresponding folds.

Generating the special instructions is good, but not if it costs other
optimizations. Let's get both. :smiley:

But in all seriousness, my opinion on this topic has hit stark
resistance before. I won't harp on it if others don't feel the same.
It's just pretty clear to me that experimental intrinsics are a major
hurdle to performant code for larger projects.

That’s an interesting point. I am expecting to get most benefits in terms of performance from better vectorizing complex math by using specialized hardware instructions and would expect benefits from better combines to be smaller. But besides reports from our users, I do not have anything to back this up and I suppose this highly depends on the particulars of the codebases.

As discussed earlier, the intrinsics pitch comes with some trade-offs when it comes to re-using existing combines and simplifications. But the key benefit I see is that it is much more lightweight (and hence hopefully easier to get started with) and gives us a path forward to make incremental and noticeable improvements in a relatively short time-span. Even if that doesn’t get us to a perfect state directly, it should make it much easier to start moving in the right direction. And once we hit the point at which we really need dedicated types, we will already have laid much of the groundwork with the intrinsics.

That seems fair.

Simon Moll <Simon.Moll@EMEA.NEC.COM> writes:

The generalized pattern matching in the VP reference patch is not
VP-specific, e.g. it is parameterized in the abstraction. That means we
can lift InstCombine/InstSimplify once on top of that abstraction and
then instantiate that (it's literally a template parameter) for (0)
regular LLVM instructions, (1) constrained FP intrinsics, (2) complex
intrinsics, (3) VP... hypothetically even (4) constrained/complex/VP
intrinsics.

I'll send out a separate RFC on how that generalized pattern match works
- it's about time we get working on this since use cases keep piling up..

I think this has some real promise for getting complex support up and
running more quickly. Since the vector predication intrinsics are
apparently already in
(http://llvm.org/docs/LangRef.html#vector-predication-intrinsics), it
makes sense to me to get these generic matchers in and build complex on
top of that, at least to get something working.

Do we have any idea of a timeline for such matchers?

                -David

Krzysztof Parzyszek via llvm-dev <llvm-dev@lists.llvm.org> writes:

Examples of complex instructions?

Sorry, I was referring specifically to this statement:

Without intrinsics it may be hard to use such instructions especially
because of the arithmetic simplifications.

I was asking the question in the context of intrinsics vs. a first-class
complex type. Matching things like hardware support for complex
multiply is doable with either mechanism. Your statement made it sound
like intrinsics were needed to *avoid* simplifications in order to match
them to hardware instructions and that a first-class complex type (using
"normal" LLVM arithmetic instructions) would not be matchable to some
hardware instructions. I was curious about what those cases would be.

                -David

Florian Hahn <florian_hahn@apple.com> writes:

Once we have support in the vectorizers, it might be beneficial to use
the intrinsics even for targets that do not natively support them. I
think for those cases, adding a pass that lowers them to regular IR
instructions for targets that do not have dedicated instructions would
be a great idea. I think we did something similar for the experimental
reduction intrinsics.

We don't *need* intrinsics to do such lowering. "Normal" LLVM
instructions with a first-class complex type are similarly lowerable.

I'm not necessarily opposed to intrinsics (especially if we have Simon's
generic matchers available) but I think it's important to be clear what
we're talking about. Intrinsics vs. first-class complex types have
different trade-offs. With Simon's generic matchers those trade-offs are
less stark to me though.

                  -David

Florian Hahn <florian_hahn@apple.com> writes:

Once we have support in the vectorizers, it might be beneficial to use
the intrinsics even for targets that do not natively support them. I
think for those cases, adding a pass that lowers them to regular IR
instructions for targets that do not have dedicated instructions would
be a great idea. I think we did something similar for the experimental
reduction intrinsics.

We don't *need* intrinsics to do such lowering. "Normal" LLVM
instructions with a first-class complex type are similarly lowerable.

Yes, I did not mean to imply that. The snippet above was not intended to compare intrinsics/first-class type versions. It was only meant to state that a generic lowering pass would be useful, so we can use the intrinsics even for targets that cannot lower them natively.

I'm not necessarily opposed to intrinsics (especially if we have Simon's
generic matchers available) but I think it's important to be clear what
we're talking about. Intrinsics vs. first-class complex types have
different trade-offs. With Simon's generic matchers those trade-offs are
less stark to me though.

Agreed. I think various places in the thread explore some trade-offs in detail.

Cheers,
Florian

Oh I see, thanks for the pointers! Constrained FP support is another interesting example.

Cheers,
Florian

I missed the proposal for a first-class complex type, so I was thinking about working directly on pairs of numbers.

I still think that intrinsics are better than a complex type. Intrinsics can be type-agnostic (i.e. overloaded), which would allow them to operate on complex numbers with any underlying type. Most applications want floating-point values, but on Hexagon we may want to use int16.

A complex type would pose another issue for vectorization: in general it's better, for the purpose of arithmetic, to have a vector of the real parts and a vector of the imaginary parts than to have vectors of complex elements (i.e. real and imaginary parts interleaved).

Simon Moll <Simon.Moll@EMEA.NEC.COM> writes:

The generalized pattern matching in the VP reference patch is not
VP-specific, e.g. it is parameterized in the abstraction. That means we
can lift InstCombine/InstSimplify once on top of that abstraction and
then instantiate that (it's literally a template parameter) for (0)
regular LLVM instructions, (1) constrained FP intrinsics, (2) complex
intrinsics, (3) VP... hypothetically even (4) constrained/complex/VP
intrinsics.

I'll send out a separate RFC on how that generalized pattern match works
- it's about time we get working on this since use cases keep piling up..

I think this has some real promise for getting complex support up and
running more quickly. Since the vector predication intrinsics are
apparently already in
(http://llvm.org/docs/LangRef.html#vector-predication-intrinsics), it
makes sense to me to get these generic matchers in and build complex on
top of that, at least to get something working.

Do we have any idea of a timeline for such matchers?

I am currently extracting/refactoring the generalized pattern matching
stuff from the VP reference patch. This patch should be on Phabricator
by end of next week.

- Simon

Is that universally true? I think it depends on the target. Let's take
Florian's FCMLA example. The inputs and output are interleaved. And if
you need just the reals/imags from an interleaved vector for something
else, LD2/ST2 should be pretty fast on recent chips.

On the other hand, if we had a non-interleaved complex representation
and wanted to use FCMLA, we'd need some number of zips and unzips to
interleave and deinterleave between the load and store. Those probably
aren't cheap in aggregate.

I haven't studied this across all targets, but my intuition says we
should leave the representation decision up to the targets. Maybe we
should have a larger discussion about it.

Although, it's worth noting that predication would likely be *much*
easier with a non-interleaved representation. I think. Again, I
haven't thought this completely through, but it's probably worth
talking about.