Why does LLVM not emit a gather / masked load / vector load when reading in this loop?

Hello everyone,

I’m not sure if this is the right place to ask, but I’m quite desperate and have been trying to figure this out for days. I’m writing a brush engine that works by JIT-compiling graphs using LLVM (via the inkwell bindings). Here’s the graph in my own IR, which I then lower to LLVM IR:

        let pen_pressure = ir.add_op(ir::Op::PenPressure);
        let pen_x = ir.add_op(ir::Op::PenX);
        let pen_y = ir.add_op(ir::Op::PenY);
        let ten = ir.add_op(ir::Op::ConstF32(200.0));

        let radius = ir.add_op(ir::Op::MulF32 {
            lhs: pen_pressure,
            rhs: ten,
        });
        let blur = ir.add_op(ir::Op::ConstF32(70.0 * 70.0));
        let circle_mask = ir.add_op(ir::Op::GetPrecomputedMask(
            ir::PrecomputedMaskKind::Circle { radius, blur },
        ));
        let offset_circle_mask = ir.add_op(ir::Op::OffsetPixelAligned {
            source: circle_mask,
            offset_x: pen_x,
            offset_y: pen_y,
        });

        let visible_subkey = ir.add_op(ir::Op::ConstU64(0));
        let layer_delta = ir.add_op(ir::Op::GetLayerDelta {
            subkey: visible_subkey,
        });

        let result = ir.add_op(ir::Op::Blend {
            source: offset_circle_mask,
            dest: layer_delta,
            blending_mode: crate::pixel_format::BlendingMode::Normal,
        });

Right now the core action of the JITted code is:

  • Loop over y, x coordinates within a 64x64 tile
  • Read from the destination layer
  • Read from the source layer at an offset
  • Blend the two with Porter–Duff Source Over
  • Write the result to the output buffer

It reads from the destination layer with vmovdqu64 and writes to the output buffer with vmovdqu64, which is good. However, it seems to completely scalarize the reads from the source layer.

The destination layer is provided as a sparse grid of pointers to 64x64 tiles of RGBA pixels with 8-bit channels. Since the output buffer aligns exactly with one of those 64x64 tiles, it makes sense that LLVM can fully vectorize the transfer.

The source layer is provided as a padded array of u8 alpha values such that for any index i in the array, it’s legal to load at index i ± 64. So I’d hope LLVM could emit either a branch that loads a full vector or zeroes one out, or at least a masked load.

Things I’ve tried:

  • Setting no wrap flags
  • Adding loop metadata to have LLVM vectorize it
  • Using selects on the pointers to the elements of the array (instead of doing *&array[i], do *safe_ptr when i is out of bounds)
  • Using selects on the pointers to the elements of the array AND selects on the results, in case LLVM can’t prove that *safe_ptr holds 0. So I’d also have alpha = alpha if i is in bounds, else 0
  • Having LLVM emit optimization remarks to see what’s happening (they gave me no clue)

Here is my backend code. Note: the backend code is an absolute mess right now; I’ll clean it up once I get the vectorization working. llvm_cpu_backend.rs · GitHub

Here is the resulting LLVM IR and assembly after the optimization passes.

I don’t think those are particularly nice to read, so if you’d rather ask questions instead, feel free to.

If anyone could help me or give me any pointers as to what is happening, I would really appreciate it. Thank you for your time.

Edit: I should probably also mention that coordinates are kept as i32 as much as possible to help with vectorization. Also, due to the architecture of the IR, I can’t directly emit vectorized instructions; I have to rely on the automatic vectorization pass.

Update: I got it to emit a vectorized load by adding an llvm.assume, which keeps SimplifyCFG from breaking a pattern that the vectorizer needs.

I’ll upload the actual code when I get home


Alright, I’m a bit late, but here’s a minimal reproducible example.
bug.ll
There are four files: two input files, bug.ll and fixed.ll, and two corresponding output files. As you can see, adding the simple llvm.assume makes the load fully vectorized (look for the FIX line in fixed.ll). I had an AI write and test these examples based on my actual code. It says it tested them on LLVM 18 through 20, and I also tested LLVM 21 myself and saw the same behavior.

C:\LLVM-21\bin\opt -O3 -mtriple=x86_64-pc-windows-msvc -mcpu=native bug.ll -S -o output_bug.ll
C:\LLVM-21\bin\opt -O3 -mtriple=x86_64-pc-windows-msvc -mcpu=native fixed.ll -S -o output_fixed.ll

(My CPU name is “tigerlake”.)

I might post an issue about this on GitHub; it doesn’t feel like behavior I’d expect.

I created an issue
Missed optimization after SimplifyCFG sinks pointer load into conditional block, preventing LoopVectorize from emitting masked load · Issue #191205 · llvm/llvm-project