[RFC][LV] Generating a conditional VPBB in VPlan that is skipped when the mask is inactive

Hi all,
We would like to propose a new VPlan transform that generates a conditional VPBasicBlock in the VPlan, so that masked operations can be bypassed when the mask is inactive.

Summary

This RFC proposes an optimization to improve performance in loop vectorization by conditionally executing vector basic blocks based on mask activity. The optimization targets loops in which the conditional scalar blocks are infrequently taken.
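
For reference, the scalar loop in the example below roughly corresponds to source code of the following shape (the function and variable names here are illustrative, not taken from any particular benchmark):

#include <cstddef>
#include <cstdint>

// Illustrative shape of the target loop: the if-body is rarely taken, so
// after vectorization its mask is usually all-false.
void flip_bits(std::uint64_t *a, std::size_t n, std::uint64_t mask,
               std::uint64_t bit) {
  for (std::size_t i = 0; i < n; ++i)
    if ((a[i] & mask) == mask)
      a[i] ^= bit;
}
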
E.g.

Scalar loop:

for.body:
  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc ]
  %arrayidx = getelementptr inbounds i64, ptr %reg.24.val, i64 %indvars.iv
  %2 = load i64, ptr %arrayidx, align 8
  %3 = and i64 %2, %1
  %or.cond.not = icmp eq i64 %3, %1
  br i1 %or.cond.not, label %if.then9, label %for.inc

if.then9:
  %xor = xor i64 %2, %shl11
  store i64 %xor, ptr %arrayidx, align 8
  br label %for.inc
for.inc:
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %for.end.loopexit, label %for.body

for.end.loopexit:
  br label %for.end

to a conditional vector loop:

vector.body:                                      ; preds = %if.then9.split, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %if.then9.split ]
  %evl.based.iv = phi i64 [ 0, %vector.ph ], [ %index.evl.next, %if.then9.split ]
  %avl = sub i64 %wide.trip.count, %evl.based.iv
  %7 = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 2, i1 true)
  %8 = getelementptr i64, ptr %reg.24.val, i64 %evl.based.iv
  %9 = getelementptr inbounds i64, ptr %8, i32 0
  %vp.op.load = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 %9, <vscale x 2 x i1> splat (i1 true), i32 %7)
  %10 = and <vscale x 2 x i64> %vp.op.load, %broadcast.splat2
  %11 = icmp eq <vscale x 2 x i64> %10, %broadcast.splat2
  %12 = call i1 @llvm.vector.reduce.or.nxv2i1(<vscale x 2 x i1> %11)
  %13 = icmp eq i1 %12, false
  br i1 %13, label %if.then9.split, label %vector.if.bb

vector.if.bb:                                     ; preds = %vector.body
  %14 = xor <vscale x 2 x i64> %vp.op.load, %broadcast.splat
  %15 = getelementptr i64, ptr %8, i32 0
  call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %14, ptr align 8 %15, <vscale x 2 x i1> %11, i32 %7)
  br label %if.then9.split

if.then9.split:                                   ; preds = %vector.body, %vector.if.bb
  %16 = zext i32 %7 to i64
  %index.evl.next = add nuw i64 %16, %evl.based.iv
  %index.next = add nuw i64 %index, %6
  %17 = icmp eq i64 %index.next, %n.vec
  br i1 %17, label %middle.block, label %vector.body, !llvm.loop !0

Current Status

In the current loop vectorizer implementation, control flow involving predicated blocks is flattened using masks, and the resulting masked operations are merged into a single vector basic block.

For example, consider a scalar loop that contains conditional (predicated) blocks. After vectorization, the conditions are translated into masks, and the corresponding operations are inserted into the main loop body, regardless of whether any lanes are active.

This approach leads to inefficiencies when the conditional blocks are rarely executed.

E.g. Flattened vector loop:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %evl.based.iv = phi i64 [ 0, %vector.ph ], [ %index.evl.next, %vector.body ]
  %avl = sub i64 %wide.trip.count, %evl.based.iv
  %7 = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 2, i1 true)
  %8 = getelementptr i64, ptr %reg.24.val, i64 %evl.based.iv
  %9 = getelementptr inbounds i64, ptr %8, i32 0
  %vp.op.load = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 %9, <vscale x 2 x i1> splat (i1 true), i32 %7)
  %10 = and <vscale x 2 x i64> %vp.op.load, %broadcast.splat2
  %11 = icmp eq <vscale x 2 x i64> %10, %broadcast.splat2
  %12 = xor <vscale x 2 x i64> %vp.op.load, %broadcast.splat
  %13 = getelementptr i64, ptr %8, i32 0
  call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %12, ptr align 8 %13, <vscale x 2 x i1> %11, i32 %7)
  %14 = zext i32 %7 to i64
  %index.evl.next = add nuw i64 %14, %evl.based.iv
  %index.next = add nuw i64 %index, %6
  %15 = icmp eq i64 %index.next, %n.vec
  br i1 %15, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  br label %for.end.loopexit

Motivation

We observed in several benchmarks that certain conditional scalar blocks are seldom taken. As a result, the generated vector masks often contain no active lanes. Despite this, masked operations are still executed, which consumes cycles unnecessarily.

Our experiments indicate that skipping such masked operations entirely—when no lanes are active—can lead to measurable performance improvements. This optimization aims to eliminate redundant execution of masked instructions when their masks are entirely inactive.

Proposed Transformation

  1. Collect the masked operations, walking bottom-up from each masked store.
  2. Split the original vector loop block, insert a new VPBB in the middle, and move all of the masked operations into the new VPBB.
  3. Insert a mask check: use any-of(%Mask) to test whether any lane of the mask is active.
  4. Fix up the branches of the new VPBB (a rough sketch of these steps is shown below).
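
Below is a rough C++ sketch of how such a transform could look on the VPlan side. This is purely illustrative: collectMaskedRecipes is a hypothetical helper, and the builder and CFG-utility calls only approximate the current VPlan APIs (exact names and signatures may differ).

// Sketch only: guard the masked recipes behind an any-of(mask) check.
// collectMaskedRecipes is a hypothetical helper; the other calls approximate
// the utilities in llvm/lib/Transforms/Vectorize.
static void addConditionalVPBBForMask(VPlan &Plan, VPBasicBlock *LoopBody,
                                      VPValue *Mask) {
  // 1. Collect the masked recipes, walking bottom-up from the masked store.
  SmallVector<VPRecipeBase *> Masked = collectMaskedRecipes(Mask);

  // 2. Split the loop body before the first masked recipe, create the
  //    conditional block, and move the masked recipes into it.
  VPBasicBlock *Split = LoopBody->splitAt(Masked.front()->getIterator());
  VPBasicBlock *VectorIfBB = Plan.createVPBasicBlock("vector.if.bb");
  for (VPRecipeBase *R : Masked)
    R->moveBefore(*VectorIfBB, VectorIfBB->end());

  // 3. Emit the mask check at the end of the (now shortened) loop body and
  //    branch on it.
  VPBuilder Builder(LoopBody);
  VPValue *AnyActive = Builder.createNaryOp(VPInstruction::AnyOf, {Mask});
  Builder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});

  // 4. Rewire the CFG: the true edge goes to vector.if.bb, the false edge
  //    (no active lanes) goes to the split block; vector.if.bb falls through
  //    to the split block.
  VPBlockUtils::disconnectBlocks(LoopBody, Split);
  VPBlockUtils::connectBlocks(LoopBody, VectorIfBB);
  VPBlockUtils::connectBlocks(LoopBody, Split);
  VPBlockUtils::connectBlocks(VectorIfBB, Split);
}

Note that the sketch branches on the any-of result directly (true edge to vector.if.bb), while the VPlan dump below branches on the negated condition, so its successor order is reversed; both express the same CFG.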

VPlan Changes

vector.region {
vector.loop:
  ...
  vp %any.active.mask = any-of(%Mask)
  BranchOnCount %any.active.mask, 0
Successors: vector.loop.split, vector.if.bb


vector.if.bb:
  masked operations ...
  masked.store(...)
Successors: vector.loop.split


vector.loop.split:
  ...
}

Patch