Hi all,
We would like to propose a new VPlan transform that generates a conditional VPBasicBlock in the VPlan, allowing masked operations to be bypassed when the mask has no active lanes.
Summary
This RFC proposes an optimization to improve loop-vectorization performance by conditionally executing vector basic blocks based on mask activity. The optimization targets cases where conditional scalar blocks are infrequently taken.
E.g., the following scalar loop:
for.body:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc ]
%arrayidx = getelementptr inbounds i64, ptr %reg.24.val, i64 %indvars.iv
%2 = load i64, ptr %arrayidx, align 8
%3 = and i64 %2, %1
%or.cond.not = icmp eq i64 %3, %1
br i1 %or.cond.not, label %if.then9, label %for.inc
if.then9:
%xor = xor i64 %2, %shl11
store i64 %xor, ptr %arrayidx, align 8
br label %for.inc
for.inc:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
br i1 %exitcond.not, label %for.end.loopexit, label %for.body
for.end.loopexit:
br label %for.end
is transformed into the following conditional vector loop:
vector.body: ; preds = %if.then9.split, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %if.then9.split ]
%evl.based.iv = phi i64 [ 0, %vector.ph ], [ %index.evl.next, %if.then9.split ]
%avl = sub i64 %wide.trip.count, %evl.based.iv
%7 = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 2, i1 true)
%8 = getelementptr i64, ptr %reg.24.val, i64 %evl.based.iv
%9 = getelementptr inbounds i64, ptr %8, i32 0
%vp.op.load = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 %9, <vscale x 2 x i1> splat (i1 true), i32 %7)
%10 = and <vscale x 2 x i64> %vp.op.load, %broadcast.splat2
%11 = icmp eq <vscale x 2 x i64> %10, %broadcast.splat2
%12 = call i1 @llvm.vector.reduce.or.nxv2i1(<vscale x 2 x i1> %11)
%13 = icmp eq i1 %12, false
br i1 %13, label %if.then9.split, label %vector.if.bb
vector.if.bb: ; preds = %vector.body
%14 = xor <vscale x 2 x i64> %vp.op.load, %broadcast.splat
%15 = getelementptr i64, ptr %8, i32 0
call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %14, ptr align 8 %15, <vscale x 2 x i1> %11, i32 %7)
br label %if.then9.split
if.then9.split: ; preds = %vector.body, %vector.if.bb
%16 = zext i32 %7 to i64
%index.evl.next = add nuw i64 %16, %evl.based.iv
%index.next = add nuw i64 %index, %6
%17 = icmp eq i64 %index.next, %n.vec
br i1 %17, label %middle.block, label %vector.body, !llvm.loop !0
Current Status
In the current loop vectorizer implementation, control flow within predicated blocks is flattened using masks, and the resulting masked operations are merged into a single vector basic block.
For example, consider a scalar loop that contains conditional (predicated) blocks. After vectorization, the conditions are translated into masks, and the corresponding masked operations are emitted into the main loop body, regardless of whether any lanes are active.
This approach is inefficient when the conditional blocks are rarely executed.
E.g., the flattened vector loop:
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%evl.based.iv = phi i64 [ 0, %vector.ph ], [ %index.evl.next, %vector.body ]
%avl = sub i64 %wide.trip.count, %evl.based.iv
%7 = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 2, i1 true)
%8 = getelementptr i64, ptr %reg.24.val, i64 %evl.based.iv
%9 = getelementptr inbounds i64, ptr %8, i32 0
%vp.op.load = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 %9, <vscale x 2 x i1> splat (i1 true), i32 %7)
%10 = and <vscale x 2 x i64> %vp.op.load, %broadcast.splat2
%11 = icmp eq <vscale x 2 x i64> %10, %broadcast.splat2
%12 = xor <vscale x 2 x i64> %vp.op.load, %broadcast.splat
%13 = getelementptr i64, ptr %8, i32 0
call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %12, ptr align 8 %13, <vscale x 2 x i1> %11, i32 %7)
%14 = zext i32 %7 to i64
%index.evl.next = add nuw i64 %14, %evl.based.iv
%index.next = add nuw i64 %index, %6
%15 = icmp eq i64 %index.next, %n.vec
br i1 %15, label %middle.block, label %vector.body, !llvm.loop !0
middle.block: ; preds = %vector.body
br label %for.end.loopexit
Motivation
We observed in several benchmarks that certain conditional scalar blocks are seldom taken. As a result, the generated vector masks often contain no active lanes. Despite this, masked operations are still executed, which consumes cycles unnecessarily.
Our experiments indicate that skipping such masked operations entirely when no lanes are active can lead to measurable performance improvements. This optimization aims to eliminate the redundant execution of masked instructions whose masks are entirely inactive.
Proposed Transformation
- Collect masked operations bottom-up, starting from masked stores (a sketch of this step follows the list).
- Split the original vector loop, insert a new VPBB in the middle, and move all of the masked operations into the new VPBB.
- Insert a mask check: use any-of(%mask) to test whether the mask has any active lane.
- Fix up the branches of the new VPBB.
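To make the collection step concrete, here is a minimal sketch of how it could be structured, assuming the pass operates on the loop-body VPBasicBlock. VPWidenStoreRecipe, isMasked(), getMask(), and getDefiningRecipe() exist in today's VPlan; collectMaskedOps itself and the surrounding structure are illustrative, not a proposed patch.

#include "VPlan.h" // lives in llvm/lib/Transforms/Vectorize
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
using namespace llvm;

// Illustrative sketch only: collect the recipes that can be moved into
// the conditional block, i.e. masked wide stores plus any defining
// recipes whose results are used exclusively by already-collected ones.
static SetVector<VPRecipeBase *> collectMaskedOps(VPBasicBlock *Body) {
  SetVector<VPRecipeBase *> Masked;
  // Seed the set with the masked wide stores in the loop body.
  for (VPRecipeBase &R : *Body)
    if (auto *Store = dyn_cast<VPWidenStoreRecipe>(&R))
      if (Store->isMasked())
        Masked.insert(Store);

  // Grow the set bottom-up until no more recipes can be added. A
  // defining recipe may move only if every user is already in the set;
  // otherwise its result is needed unconditionally.
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (unsigned I = 0; I != Masked.size(); ++I) {
      for (VPValue *Op : Masked[I]->operands()) {
        // The mask itself must stay in the unconditional block, where
        // the any-of guard will use it.
        if (auto *Mem = dyn_cast<VPWidenMemoryRecipe>(Masked[I]))
          if (Op == Mem->getMask())
            continue;
        VPRecipeBase *Def = Op->getDefiningRecipe();
        if (!Def || Def->getParent() != Body || Masked.contains(Def))
          continue;
        if (all_of(Op->users(), [&](VPUser *U) {
              auto *UR = dyn_cast<VPRecipeBase>(U);
              return UR && Masked.contains(UR);
            }))
          Changed |= Masked.insert(Def);
      }
    }
  }
  return Masked;
}

On the running example, this would collect the store, the xor feeding it, and the store's address computation, while the load stays in vector.body because the mask computation also uses it.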
VPlan changes
vector.region {
vector.loop:
...
vp %any.active.mask = any-of(%Mask)
BranchOnCount %any.active.mask, 0
Successors: vector.loop.split, vector.if.bb
vector.if.bb:
masked operations ...
masked.store(...)
Successors: vector.loop.split
vector.loop.split:
...
}
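And here is a minimal sketch of the remaining CFG surgery, assuming the collected recipes have already been sunk next to each other so they form a contiguous range. VPBasicBlock::splitAt, VPBlockUtils::connectBlocks, VPBuilder, and the AnyOf/BranchOnCond VPInstruction opcodes exist in today's VPlan, but the way they are composed here is an assumption. Note that this version branches on "any lane active" being true, so the successor order is flipped relative to the sketch above.

// (same includes as the previous sketch)
// Illustrative sketch only: split the loop body around a contiguous
// range of masked recipes and guard it with an any-of check on the mask.
static void insertMaskBypass(VPBasicBlock *Body, VPRecipeBase *FirstMasked,
                             VPRecipeBase *LastMasked, VPValue *Mask) {
  // vector.if.bb: everything from the first masked recipe onwards.
  VPBasicBlock *IfBB = Body->splitAt(FirstMasked->getIterator());
  // vector.loop.split: everything after the last masked recipe,
  // including the original latch terminator.
  VPBasicBlock *SplitBB =
      IfBB->splitAt(std::next(LastMasked->getIterator()));

  // Guard: emit %any.active.mask = any-of(%Mask) plus a conditional
  // branch at the end of the unconditional block. BranchOnCond takes
  // its first successor when the condition is true, so vector.if.bb is
  // entered only when at least one lane is active.
  VPBuilder Builder(Body);
  VPValue *AnyActive = Builder.createNaryOp(VPInstruction::AnyOf, {Mask});
  Builder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});

  // The split already created the Body -> vector.if.bb edge; add the
  // bypass edge straight to the continuation block.
  VPBlockUtils::connectBlocks(Body, SplitBB);
}

Both paths then continue at vector.loop.split, which corresponds to if.then9.split in the IR shown earlier.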