Hi all,
We would like to propose a new VPlan transform that generates a conditional VPBasicBlock in the VPlan, allowing masked operations to be bypassed when the mask has no active lanes.
Summary
This RFC proposes an optimization to improve loop-vectorization performance by conditionally executing vector basic blocks based on mask activity. The optimization targets cases where conditional scalar blocks are infrequently taken.
E.g., the following scalar loop:
for.body:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc ]
%arrayidx = getelementptr inbounds i64, ptr %reg.24.val, i64 %indvars.iv
%2 = load i64, ptr %arrayidx, align 8
%3 = and i64 %2, %1
%or.cond.not = icmp eq i64 %3, %1
br i1 %or.cond.not, label %if.then9, label %for.inc
if.then9:
%xor = xor i64 %2, %shl11
store i64 %xor, ptr %arrayidx, align 8
br label %for.inc
for.inc:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
br i1 %exitcond.not, label %for.end.loopexit, label %for.body
for.end.loopexit:
br label %for.end
is transformed into the following conditional vector loop:
vector.body: ; preds = %if.then9.split, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %if.then9.split ]
%evl.based.iv = phi i64 [ 0, %vector.ph ], [ %index.evl.next, %if.then9.split ]
%avl = sub i64 %wide.trip.count, %evl.based.iv
%7 = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 2, i1 true)
%8 = getelementptr i64, ptr %reg.24.val, i64 %evl.based.iv
%9 = getelementptr inbounds i64, ptr %8, i32 0
%vp.op.load = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 %9, <vscale x 2 x i1> splat (i1 true), i32 %7)
%10 = and <vscale x 2 x i64> %vp.op.load, %broadcast.splat2
%11 = icmp eq <vscale x 2 x i64> %10, %broadcast.splat2
%12 = call i1 @llvm.vector.reduce.or.nxv2i1(<vscale x 2 x i1> %11)
%13 = icmp eq i1 %12, false
br i1 %13, label %if.then9.split, label %vector.if.bb
vector.if.bb: ; preds = %vector.body
%14 = xor <vscale x 2 x i64> %vp.op.load, %broadcast.splat
%15 = getelementptr i64, ptr %8, i32 0
call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %14, ptr align 8 %15, <vscale x 2 x i1> %11, i32 %7)
br label %if.then9.split
if.then9.split: ; preds = %vector.body, %vector.if.bb
%16 = zext i32 %7 to i64
%index.evl.next = add nuw i64 %16, %evl.based.iv
%index.next = add nuw i64 %index, %6
%17 = icmp eq i64 %index.next, %n.vec
br i1 %17, label %middle.block, label %vector.body, !llvm.loop !0
Current Status
In the current loop vectorizer implementation, control flow within predicated blocks is flattened using masks, and the resulting masked operations are merged into a single vector basic block.
For example, consider a scalar loop that contains conditional (predicated) blocks. After vectorization, the conditions are translated into masks, and the corresponding masked operations are emitted into the main loop body, regardless of whether any lanes are active.
This approach is inefficient when the conditional blocks are rarely executed.
E.g., the flattened vector loop:
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%evl.based.iv = phi i64 [ 0, %vector.ph ], [ %index.evl.next, %vector.body ]
%avl = sub i64 %wide.trip.count, %evl.based.iv
%7 = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 2, i1 true)
%8 = getelementptr i64, ptr %reg.24.val, i64 %evl.based.iv
%9 = getelementptr inbounds i64, ptr %8, i32 0
%vp.op.load = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr align 8 %9, <vscale x 2 x i1> splat (i1 true), i32 %7)
%10 = and <vscale x 2 x i64> %vp.op.load, %broadcast.splat2
%11 = icmp eq <vscale x 2 x i64> %10, %broadcast.splat2
%12 = xor <vscale x 2 x i64> %vp.op.load, %broadcast.splat
%13 = getelementptr i64, ptr %8, i32 0
call void @llvm.vp.store.nxv2i64.p0(<vscale x 2 x i64> %12, ptr align 8 %13, <vscale x 2 x i1> %11, i32 %7)
%14 = zext i32 %7 to i64
%index.evl.next = add nuw i64 %14, %evl.based.iv
%index.next = add nuw i64 %index, %6
%15 = icmp eq i64 %index.next, %n.vec
br i1 %15, label %middle.block, label %vector.body, !llvm.loop !0
middle.block: ; preds = %vector.body
br label %for.end.loopexit
Motivation
We observed in several benchmarks that certain conditional scalar blocks are seldom taken. As a result, the generated vector masks often contain no active lanes. Despite this, masked operations are still executed, which consumes cycles unnecessarily.
Our experiments indicate that skipping such masked operations entirely when no lanes are active can lead to measurable performance improvements. This optimization aims to eliminate the redundant execution of masked instructions whose masks are entirely inactive.
Proposed Transformation
- Collect masked operations bottom-up, starting from masked stores (a sketch of this step follows the list).
- Split the original vector loop, insert a new VPBB in the middle, and move all of the masked operations into the new VPBB.
- Insert a mask check: use any-of(%mask) to test whether the mask has any active lane.
- Fix up the branches of the new VPBB.
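To make the collection step concrete, here is a minimal sketch of how it could be structured, assuming the pass operates on the loop-body VPBasicBlock. VPWidenStoreRecipe, isMasked(), getMask(), and getDefiningRecipe() exist in today's VPlan; collectMaskedOps itself and the surrounding structure are illustrative, not a proposed patch.

#include "VPlan.h" // lives in llvm/lib/Transforms/Vectorize
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
using namespace llvm;

// Illustrative sketch only: collect the recipes that can be moved into
// the conditional block, i.e. masked wide stores plus any defining
// recipes whose results are used exclusively by already-collected ones.
static SetVector<VPRecipeBase *> collectMaskedOps(VPBasicBlock *Body) {
  SetVector<VPRecipeBase *> Masked;
  // Seed the set with the masked wide stores in the loop body.
  for (VPRecipeBase &R : *Body)
    if (auto *Store = dyn_cast<VPWidenStoreRecipe>(&R))
      if (Store->isMasked())
        Masked.insert(Store);

  // Grow the set bottom-up until no more recipes can be added. A
  // defining recipe may move only if every user is already in the set;
  // otherwise its result is needed unconditionally.
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (unsigned I = 0; I != Masked.size(); ++I) {
      for (VPValue *Op : Masked[I]->operands()) {
        // The mask itself must stay in the unconditional block, where
        // the any-of guard will use it.
        if (auto *Mem = dyn_cast<VPWidenMemoryRecipe>(Masked[I]))
          if (Op == Mem->getMask())
            continue;
        VPRecipeBase *Def = Op->getDefiningRecipe();
        if (!Def || Def->getParent() != Body || Masked.contains(Def))
          continue;
        if (all_of(Op->users(), [&](VPUser *U) {
              auto *UR = dyn_cast<VPRecipeBase>(U);
              return UR && Masked.contains(UR);
            }))
          Changed |= Masked.insert(Def);
      }
    }
  }
  return Masked;
}

On the running example, this would collect the store, the xor feeding it, and the store's address computation, while the load stays in vector.body because the mask computation also uses it.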
VPlan changes
vector.region {
vector.loop:
...
vp %any.active.mask = any-of(%Mask)
BranchOnCount %any.active.mask, 0
Successors: vector.loop.split, vector.if.bb
vector.if.bb:
masked operations ...
masked.store(...)
Successors: vector.loop.split
vector.loop.split:
...
}
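And here is a minimal sketch of the remaining CFG surgery, assuming the collected recipes have already been sunk next to each other so they form a contiguous range. VPBasicBlock::splitAt, VPBlockUtils::connectBlocks, VPBuilder, and the AnyOf/BranchOnCond VPInstruction opcodes exist in today's VPlan, but the way they are composed here is an assumption. Note that this version branches on "any lane active" being true, so the successor order is flipped relative to the sketch above.

// (same includes as the previous sketch)
// Illustrative sketch only: split the loop body around a contiguous
// range of masked recipes and guard it with an any-of check on the mask.
static void insertMaskBypass(VPBasicBlock *Body, VPRecipeBase *FirstMasked,
                             VPRecipeBase *LastMasked, VPValue *Mask) {
  // vector.if.bb: everything from the first masked recipe onwards.
  VPBasicBlock *IfBB = Body->splitAt(FirstMasked->getIterator());
  // vector.loop.split: everything after the last masked recipe,
  // including the original latch terminator.
  VPBasicBlock *SplitBB =
      IfBB->splitAt(std::next(LastMasked->getIterator()));

  // Guard: emit %any.active.mask = any-of(%Mask) plus a conditional
  // branch at the end of the unconditional block. BranchOnCond takes
  // its first successor when the condition is true, so vector.if.bb is
  // entered only when at least one lane is active.
  VPBuilder Builder(Body);
  VPValue *AnyActive = Builder.createNaryOp(VPInstruction::AnyOf, {Mask});
  Builder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});

  // The split already created the Body -> vector.if.bb edge; add the
  // bypass edge straight to the continuation block.
  VPBlockUtils::connectBlocks(Body, SplitBB);
}

Both paths then continue at vector.loop.split, which corresponds to if.then9.split in the IR shown earlier.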