how experimental are the llvm.experimental.vector.reduce.* functions?

I'm interested in using @llvm.experimental.vector.reduce.smax/umax to
implement runtime overflow checking for vectors. Here's an example of
checked addition without vectors, followed by what I would do for
checked addition with vectors.

Frontend code (zig):

export fn entry() void {
    var a: i32 = 1;
    var b: i32 = 2;
    var x = a + b;
}

LLVM IR code:

define void @entry() #2 !dbg !41 {
Entry:
  %a = alloca i32, align 4
  %b = alloca i32, align 4
  %x = alloca i32, align 4
  store i32 1, i32* %a, align 4, !dbg !52
  call void @llvm.dbg.declare(metadata i32* %a, metadata !45, metadata !DIExpression()), !dbg !52
  store i32 2, i32* %b, align 4, !dbg !53
  call void @llvm.dbg.declare(metadata i32* %b, metadata !48, metadata !DIExpression()), !dbg !53
  %0 = load i32, i32* %a, align 4, !dbg !54
  %1 = load i32, i32* %b, align 4, !dbg !55
  %2 = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %0, i32 %1), !dbg !56
  %3 = extractvalue { i32, i1 } %2, 0, !dbg !56
  %4 = extractvalue { i32, i1 } %2, 1, !dbg !56
  br i1 %4, label %OverflowFail, label %OverflowOk, !dbg !56

OverflowFail: ; preds = %Entry
  tail call fastcc void @panic(%"[]u8"* @2, %StackTrace* null), !dbg !56
  unreachable, !dbg !56

OverflowOk: ; preds = %Entry
  store i32 %3, i32* %x, align 4, !dbg !57
  call void @llvm.dbg.declare(metadata i32* %x, metadata !50, metadata !DIExpression()), !dbg !57
  ret void, !dbg !58
}

You can see this takes advantage of @llvm.sadd.with.overflow, which is
not available with vectors. So here is a different approach (pseudocode):

%a_zext = zext %a to i33 # 1 more bit
%b_zext = zext %b to i33 # 1 more bit
%result_zext = add %a_zext, %b_zext
%max_result = @llvm.experimental.vector.reduce.umax(%result_zext)
%overflow = icmp %max_result > @max_i32_value
%result = trunc %result_zext to i32

You can imagine how this would work for signed integers, replacing zext
with sext and umax with smax (plus an smin check against the minimum
value, since signed overflow can also go negative).

This depends on an "experimental" API. Is it a bad idea to depend on
this API? Is it likely to be promoted to non-experimental soon? And can
anyone advise on how best to achieve my goal?

Kind regards,
Andrew

I don’t think I understand your pseudocode using llvm.experimental.vector.reduce.umax. All of the types you showed are scalar, but that intrinsic doesn’t work on scalars so I’m having a hard time understanding what you’re trying to do with it. llvm.experimental.vector.reduce.umax takes a vector input and returns a scalar result. Are you wanting to find if any of the additions overflowed or a mask of which addition overflowed?

The sadd.with.overflow intrinsics are in the process of gaining vector support, if it isn't already complete. Simon Pilgrim made some commits recently. I know the documentation in the LangRef hasn't been updated. They will return a vector for the overflow result instead of an i1 when vectors are used.
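
For reference, the vector form would presumably look something like the lines below (the signature is my assumption by analogy with the scalar intrinsic, since the LangRef hadn't been updated yet):

  %res = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> %a, <4 x i32> %b)
  %sum = extractvalue { <4 x i32>, <4 x i1> } %res, 0 ; per-element sums
  %ovf = extractvalue { <4 x i32>, <4 x i1> } %res, 1 ; per-element overflow flags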

The IR update to allow vector types was here:
https://reviews.llvm.org/D57090
…we didn’t update the docs at that time because it was not clear what the backend would do with that, but that might’ve changed with some of the more recent patches.

The add/sub (+mul) overflow intrinsics are being updated to support vectors to match the related add/sub saturation intrinsics. We haven’t updated the docs yet as legalization, vectorization and various minor bits of plumbing still need to be finished before it can be officially supported (Nikita Popov has been looking at the legalization recently).

Regarding the reduction functions - I think the integer intrinsics at least are relatively stable and we can probably investigate dropping the experimental tag before the next release (assuming someone has the time to take on the work) - it’d be nice to have the SLP vectorizer emit reduction intrinsics directly for these.
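
For anyone unfamiliar with them, an integer reduction call currently looks something like the line below (treat the name mangling as illustrative; it has varied between releases):

  %max = call i32 @llvm.experimental.vector.reduce.umax.i32.v4i32(<4 x i32> %v) ; horizontal unsigned max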

The floating point intrinsics are trickier, as they (may) have stricter ordering constraints that are still causing issues and may need tweaking (e.g. see PR36734).

The vector reduction intrinsics still need quite a lot of work. Apart from SplitVecOp, all legalizations are currently missing. This is only noticeable on AArch64 right now, because all other targets expand vector reductions prior to codegen.

Nikita

        I don't think I understand your pseudocode using
        llvm.experimental.vector.reduce.umax. All of the types you
        showed are scalar, but that intrinsic doesn't work on scalars
        so I'm having a hard time understanding what you're trying to
        do with it. llvm.experimental.vector.reduce.umax takes a
        vector input and returns a scalar result. Are you wanting to
        find if any of the additions overflowed or a mask of which
        addition overflowed?

Apologies for the confusion - let me try to clarify. Here is frontend
code that works now:

export fn entry() void {
    var a: @Vector(4, i32) = []i32{ 1, 2, 3, 4 };
    var b: @Vector(4, i32) = []i32{ 5, 6, 7, 8 };
    var x = a +% b;
}

This generates the following LLVM IR code:

define void @entry() #2 !dbg !41 {
Entry:
  %a = alloca <4 x i32>, align 16
  %b = alloca <4 x i32>, align 16
  %x = alloca <4 x i32>, align 16
  store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
  call void @llvm.dbg.declare(metadata <4 x i32>* %a, metadata !45, metadata !DIExpression()), !dbg !55
  store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
  call void @llvm.dbg.declare(metadata <4 x i32>* %b, metadata !51, metadata !DIExpression()), !dbg !56
  %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
  %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
  %2 = add <4 x i32> %0, %1, !dbg !59
  store <4 x i32> %2, <4 x i32>* %x, align 16, !dbg !60
  call void @llvm.dbg.declare(metadata <4 x i32>* %x, metadata !53, metadata !DIExpression()), !dbg !60
  ret void, !dbg !61
}

However, I used the +% operator, which in Zig is wrapping addition. Now I
want to implement the + operator for vectors, which Zig defines to panic
if any of the elements overflows. Here is how the IR could look for this:

define void @entry() #2 !dbg !41 {
Entry:
  %a = alloca <4 x i32>, align 16
  %b = alloca <4 x i32>, align 16
  %x = alloca <4 x i32>, align 16
  store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
  store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
  %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
  %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
  %2 = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> %0, <4 x i32> %1)
  %3 = extractvalue { <4 x i32>, <4 x i1> } %2, 0, !dbg !56
  %4 = extractvalue { <4 x i32>, <4 x i1> } %2, 1, !dbg !56
  %5 = call i1 @llvm.experimental.vector.reduce.umax.i1.v4i1(<4 x i1> %4)
  br i1 %5, label %OverflowFail, label %OverflowOk, !dbg !56

OverflowFail: ; preds = %Entry
  tail call fastcc void @panic(%"[]u8"* @2, %StackTrace* null), !dbg !56
  unreachable, !dbg !56

OverflowOk: ; preds = %Entry
  store <4 x i32> %3, <4 x i32>* %x, align 16, !dbg !60
  ret void, !dbg !61
}

You can see that this depends on @llvm.sadd.with.overflow working on
vector types, and on @llvm.experimental.vector.reduce.umax to check
whether any element overflowed. I will note that my earlier strategy with
sign extension and icmp would be a semantically equivalent alternative to
@llvm.sadd.with.overflow.
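
For completeness, here is roughly what the widening variant from my original pseudocode would look like for unsigned 32-bit elements (the reduce intrinsic's name mangling here is illustrative and may differ between versions):

  %a_ext = zext <4 x i32> %a to <4 x i33>
  %b_ext = zext <4 x i32> %b to <4 x i33>
  %wide = add <4 x i33> %a_ext, %b_ext ; cannot wrap in 33 bits
  %max = call i33 @llvm.experimental.vector.reduce.umax.i33.v4i33(<4 x i33> %wide)
  %overflow = icmp ugt i33 %max, 4294967295 ; did any element exceed the u32 max?
  %result = trunc <4 x i33> %wide to <4 x i32>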

Something like this should work I think.

; ModuleID = 'test.ll'
source_filename = "test.ll"

define void @entry(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %x) {
Entry:
%tmp = load <4 x i32>, <4 x i32>* %a, align 16
%tmp1 = load <4 x i32>, <4 x i32>* %b, align 16
%tmp2 = add <4 x i32> %tmp, %tmp1
%tmpsign = icmp slt <4 x i32> %tmp, zeroinitializer   ; sign of each element of %a
%tmp1sign = icmp slt <4 x i32> %tmp1, zeroinitializer ; sign of each element of %b
%sumsign = icmp slt <4 x i32> %tmp2, zeroinitializer  ; sign of each sum
%signsequal = icmp eq <4 x i1> %tmpsign, %tmp1sign    ; operands have the same sign...
%summismatch = icmp ne <4 x i1> %sumsign, %tmpsign    ; ...but the sum's sign differs
%overflow = and <4 x i1> %signsequal, %summismatch    ; per-element signed overflow mask
%tmp5 = bitcast <4 x i1> %overflow to i4              ; pack the mask into an i4
%tmp6 = icmp ne i4 %tmp5, 0                           ; did any element overflow?
br i1 %tmp6, label %OverflowFail, label %OverflowOk

OverflowFail: ; preds = %Entry
tail call fastcc void @panic()
unreachable

OverflowOk: ; preds = %Entry
store <4 x i32> %tmp2, <4 x i32>* %x, align 16
ret void
}

declare fastcc void @panic()

Thanks! I was able to get it working with your hint:

  %tmp5 = bitcast <4 x i1> %overflow to i4

(Thanks also to LebedevRI who pointed this out on IRC)

Until LLVM 9 when the llvm.*.with.overflow.* intrinsics gain vector
support, here's what I ended up with:

  %a = alloca <4 x i32>, align 16
  %b = alloca <4 x i32>, align 16
  %x = alloca <4 x i32>, align 16
  store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
  store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
  %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
  %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
  %2 = sext <4 x i32> %0 to <4 x i33>, !dbg !59
  %3 = sext <4 x i32> %1 to <4 x i33>, !dbg !59
  %4 = add <4 x i33> %2, %3, !dbg !59
  %5 = trunc <4 x i33> %4 to <4 x i32>, !dbg !59
  %6 = sext <4 x i32> %5 to <4 x i33>, !dbg !59
  %7 = icmp ne <4 x i33> %4, %6, !dbg !59
  %8 = bitcast <4 x i1> %7 to i4, !dbg !59
  %9 = icmp ne i4 %8, 0, !dbg !59
  br i1 %9, label %OverflowFail, label %OverflowOk, !dbg !59

The idea: extend and do the operation with more bits, truncate to get the
result, then re-extend the truncated result and check whether it matches
the pre-truncation value. For example, i32 2147483647 + 1 gives 2147483648
in i33; truncating to i32 yields -2147483648, which re-extends to a
different i33 value, so that element is flagged as overflowed.

This works pretty well unless the vector integer type is as big as or
larger than the native vector register. Here's a quick performance test:

https://gist.github.com/andrewrk/b9734f9c310d8b79ec7271e7c0df4023

Summary: safety-checked integer addition with no optimizations

<4 x i32>:
scalar = 893 MiB/s
vector = 3.58 GiB/s

<16 x i128>:
scalar = 3.6 GiB/s
vector = 2.5 GiB/s

I should mention the code I provided is equivalent to how the vector version of sadd.with.overflow is expanded in the backend by default. Having the vector intrinsic probably won't improve the generated code over what you can already achieve with plain IR, unless the target has some native way of detecting overflow on vectors, like the overflow flag on scalars.