Why does LLVM not emit a gather / masked load / vector load when reading in this loop?

Hello everyone,

I’m not sure if this is the right place to ask, but I’m quite desperate and have been trying to figure this out for days. I’m writing a brush engine that works by JIT-compiling graphs using LLVM (via the inkwell bindings). Here’s the graph in my own IR, which I then lower to LLVM IR:

        let pen_pressure = ir.add_op(ir::Op::PenPressure);
        let pen_x = ir.add_op(ir::Op::PenX);
        let pen_y = ir.add_op(ir::Op::PenY);
        let ten = ir.add_op(ir::Op::ConstF32(200.0));

        let radius = ir.add_op(ir::Op::MulF32 {
            lhs: pen_pressure,
            rhs: ten,
        });
        let blur = ir.add_op(ir::Op::ConstF32(70.0 * 70.0));
        let circle_mask = ir.add_op(ir::Op::GetPrecomputedMask(
            ir::PrecomputedMaskKind::Circle { radius, blur },
        ));
        let offset_circle_mask = ir.add_op(ir::Op::OffsetPixelAligned {
            source: circle_mask,
            offset_x: pen_x,
            offset_y: pen_y,
        });

        let visible_subkey = ir.add_op(ir::Op::ConstU64(0));
        let layer_delta = ir.add_op(ir::Op::GetLayerDelta {
            subkey: visible_subkey,
        });

        let result = ir.add_op(ir::Op::Blend {
            source: offset_circle_mask,
            dest: layer_delta,
            blending_mode: crate::pixel_format::BlendingMode::Normal,
        });

Right now the core action of the JITted code is:

  • Loop over y, x coordinates within a 64x64 tile
  • Read from the destination layer
  • Read from the source layer at an offset
  • Blend the two with Porter–Duff Source Over
  • Write the result to the output buffer

It reads from the destination layer with vmovdqu64 and writes to the output buffer with vmovdqu64, which is good. However, it seems to completely scalarize the reads from the source layer.

The destination layer is provided as a sparse grid of pointers to 64x64 tiles of RGBA pixels with 8-bit channels. Since the output buffer aligns exactly with one of those 64x64 tiles, it makes sense that LLVM can fully vectorize the transfer.

The source layer is provided as a padded array of u8 alpha values such that for any index i in the array, it’s legal to load at index i ± 64. So I’d hope LLVM could emit either a branch that loads a full vector or zeroes one out, or at least a masked load.

Things I’ve tried:

  • Setting no wrap flags
  • Adding loop metadata to have LLVM vectorize it
  • Using selects on the pointers to the elements of the array (instead of doing *&array[i], do *safe_ptr when i is out of bounds)
  • Using selects on the pointers to the elements of the array AND selects on the results, in case LLVM can’t prove that *safe_ptr holds 0. So I’d also have alpha = alpha if i is in bounds, else 0
  • Having LLVM emit optimization remarks to see what’s happening (they gave me no clue)

Here is my backend code. Note: the backend code is an absolute mess right now; I’ll clean it up once I get the vectorization working. llvm_cpu_backend.rs · GitHub

Here is the resulting LLVM IR and assembly after the optimization passes.

I don’t think those are particularly nice to read, so if you’d rather ask questions instead, feel free to.

If anyone could help me or give me any pointers as to what is happening, I would really appreciate it. Thank you for your time.

Edit: I should probably also mention that coordinates are kept as i32 as much as possible to help with vectorization. Also, due to the architecture of the IR, I can’t directly emit vectorized instructions; I have to rely on the automatic vectorization pass.

Update: I got it to emit a vectorized load by adding an llvm.assume, which keeps SimplifyCFG from breaking a pattern that the vectorizer needs.

I’ll upload the actual code when I get home


Alright, I’m a bit late, but here’s a minimal reproducible example.
bug.ll
There are four files: two input files, bug.ll and fixed.ll, and two corresponding output files. As you can see, adding the simple llvm.assume makes the load fully vectorized (look for the FIX line in fixed.ll). I had an AI write and test these examples based on my actual code. It says it tested them on LLVM 18 through 20, and I also tested LLVM 21 myself and saw the same behavior.

C:\LLVM-21\bin\opt -O3 -mtriple=x86_64-pc-windows-msvc -mcpu=native bug.ll -S -o output_bug.ll
C:\LLVM-21\bin\opt -O3 -mtriple=x86_64-pc-windows-msvc -mcpu=native fixed.ll -S -o output_fixed.ll

(My CPU name is “tigerlake”.)

I might post an issue about this on GitHub; it doesn’t feel like behavior I’d expect.

I created an issue
Missed optimization after SimplifyCFG sinks pointer load into conditional block, preventing LoopVectorize from emitting masked load · Issue #191205 · llvm/llvm-project