Why is the loop vectorizer not working on my function?

My function implements a simple loop:

void bar(int start, int end, float *A, float *B, float *C)
{
    for (int i = start; i < end; ++i)
        A[i] = B[i] * C[i];
}

This looks pretty much like the standard example. However, I built the function
with IRBuilder, so it does not come from C via clang, and I also changed the
function's signature slightly:

define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
   %0 = bitcast [8 x i8]* %arg_ptr to i32*
   %1 = load i32* %0
   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
   %3 = bitcast [8 x i8]* %2 to i32*
   %4 = load i32* %3
   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
   %6 = bitcast [8 x i8]* %5 to float**
   %7 = load float** %6
   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
   %9 = bitcast [8 x i8]* %8 to float**
   %10 = load float** %9
   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
   %12 = bitcast [8 x i8]* %11 to float**
   %13 = load float** %12
   br label %L0

L0: ; preds = %L0, %entrypoint
   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
   %15 = getelementptr float* %10, i32 %14
   %16 = load float* %15
   %17 = getelementptr float* %13, i32 %14
   %18 = load float* %17
   %19 = fmul float %18, %16
   %20 = getelementptr float* %7, i32 %14
   store float %19, float* %20
   %21 = add i32 %14, 1
   %22 = icmp sge i32 %21, %4
   br i1 %22, label %L1, label %L0

L1: ; preds = %L0
   ret void
}

As you can see, I use a phi instruction for the loop index, whereas I notice
that clang prefers stack allocation for it. So I am not sure why the loop
vectorizer is not working here. I tried many things, such as specifying an
architecture with vector units and forcing the vector width, with no success:

opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll

The only explanation I have is the use of the phi instruction. Is this
preventing the loop from being vectorized?

Frank

Hi Frank,

> %21 = add i32 %14, 1

Try

   %21 = add nsw i32 %14, 1

instead, for no-signed-wrap arithmetic.

If that is not working, please post the output of opt ... -debug-only=loop-vectorize ...
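
For reference, if the IR is built with IRBuilder as described above, the nsw flag can be requested when the add is created. A sketch only; Builder, Idx, and One stand for your IRBuilder<>, the current induction value, and the i32 constant 1:

```
// Instead of the plain add:
//   Value *Next = Builder.CreateAdd(Idx, One);     // emits: add i32 %14, 1
// request the no-signed-wrap form:
Value *Next = Builder.CreateNSWAdd(Idx, One);       // emits: add nsw i32 %14, 1
```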

Hi Arnold,

adding '-debug-only=loop-vectorize' to the command gives:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr: %7 = load float** %6
LV: Found an unidentified read ptr: %10 = load float** %9
LV: Found an unidentified read ptr: %13 = load float** %12
LV: We need to do 2 pointer comparisons.
LV: We can't vectorize because we can't find the array bounds.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

So it can't find the loop bounds if we use the potentially overflowing version of add. That's a good point; I should mark this addition as not overflowing.

When using the non-overflow version I get:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr: %7 = load float** %6
LV: Found an unidentified read ptr: %10 = load float** %9
LV: Found an unidentified read ptr: %13 = load float** %12
LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
LV: We need to do 2 pointer comparisons.
LV: We can perform a memory runtime check if needed.
LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Found trip count: 0
LV: The Widest type: 32 bits.
LV: The Widest register is: 32 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
LV: Scalar loop costs: 7.
LV: Selecting VF = : 1.
LV: The target has 8 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 2
LV(REG): At #4 Interval # 3
LV(REG): At #5 Interval # 3
LV(REG): At #6 Interval # 2
LV(REG): At #8 Interval # 1
LV(REG): At #9 Interval # 1
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 5
LV(REG): LoopSize: 11
LV: Vectorization is possible but not beneficial.
LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
LV: Unroll Factor is 1

It's not beneficial? I didn't expect that. Do you have an explanation as to why it's not beneficial?

Frank

> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.
> [...]
> LV: Vectorization is possible but not beneficial.
>
> It's not beneficial? I didn't expect that. Do you have an explanation
> as to why it's not beneficial?

It looks like the vectorizer is not picking up a TTI implementation from a target with vector registers (likely, you're just seeing the basic cost model). For what target is this?

-Hal

Hi Hal!

I am using the 'x86_64' target. The complete module dump is below, and here is the command line:

opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll

Frank

; ModuleID = 'test.ll'

target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"

target triple = "x86_64-unknown-linux-elf"

define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
   %0 = bitcast [8 x i8]* %arg_ptr to i32*
   %1 = load i32* %0
   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
   %3 = bitcast [8 x i8]* %2 to i32*
   %4 = load i32* %3
   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
   %6 = bitcast [8 x i8]* %5 to float**
   %7 = load float** %6
   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
   %9 = bitcast [8 x i8]* %8 to float**
   %10 = load float** %9
   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
   %12 = bitcast [8 x i8]* %11 to float**
   %13 = load float** %12
   br label %L0

L0: ; preds = %L0, %entrypoint
   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
   %15 = getelementptr float* %10, i32 %14
   %16 = load float* %15
   %17 = getelementptr float* %13, i32 %14
   %18 = load float* %17
   %19 = fmul float %18, %16
   %20 = getelementptr float* %7, i32 %14
   store float %19, float* %20
   %21 = add nsw i32 %14, 1
   %22 = icmp sge i32 %21, %4
   br i1 %22, label %L1, label %L0

L1: ; preds = %L0
   ret void
}

> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.

Yep, we don’t pick up the right TTI.

Try -march=x86-64 (or leave it out; you already have this info in the triple).

Then it should work (does for me with your example below).

> Then it should work (does for me with your example below).

That may depend on which CPU it picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).

-Hal

If I leave '-march=x86-64' out, it works for me too.

Regards this message:

LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!

I know that my pointers neither alias each other nor point to overlapping memory regions.

Is there a way to mark my pointers 'noalias' after loading, e.g. after

   %7 = load float** %6

in order to avoid the runtime check?

Frank

I would need this to work when calling the vectorizer through
the function pass manager. Unfortunately I am having the same
problem there:

LV: The Widest type: 32 bits.
LV: The Widest register is: 32 bits.

It's not picking up the target information, although I tried both with
and without the target triple in the module.

Any idea what could be wrong?

Frank

Hi Frank,

> I would need this to work when calling the vectorizer through
> the function pass manager. Unfortunately I am having the same
> problem there:

I am not sure which function pass manager you are referring to here. I assume you create your own (you are not using opt but configuring your own pass manager)?

Here is what opt does when it sets up its pass pipeline.

You need to have your/the target add the target’s analysis passes:

opt.cpp:

  PassManager Passes;

  … // Add target library info and data layout.

  Triple ModuleTriple(M->getTargetTriple());
  TargetMachine *Machine = 0;
  if (ModuleTriple.getArch())
    Machine = GetTargetMachine(Triple(ModuleTriple));
  OwningPtr<TargetMachine> TM(Machine);

  // Add internal analysis passes from the target machine.
  if (TM.get())
    TM->addAnalysisPasses(Passes); // <<<
  …
  // Likewise, to feed a function pass manager:
  TM->addAnalysisPasses(FPManager);

Here is what the target does:

void X86TargetMachine::addAnalysisPasses(PassManagerBase &PM) {
  // Add first the target-independent BasicTTI pass, then our X86 pass. This
  // allows the X86 pass to delegate to the target independent layer when
  // appropriate.
  PM.add(createBasicTargetTransformInfoPass(this));
  PM.add(createX86TargetTransformInfoPass(this));
}

> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.

This strongly suggests that no target has added a target transform info pass, so we default to NoTTI.

But it could also be that you don't have the right subtarget (in which case you need to set the right CPU, "-mcpu" in opt, when the target machine is created):

unsigned X86TTI::getRegisterBitWidth(bool Vector) const {
  if (Vector) {
    if (ST->hasAVX()) return 256;
    if (ST->hasSSE1()) return 128;
    return 0;
  }

  if (ST->is64Bit())
    return 64;
  return 32;
}

Hi Arnold,

thanks for the detailed setup. Still, I haven't figured out the right thing to do.

I would need only the native target, since all generated code will execute on the JIT execution engine (right now, the old JIT interface). There is no need for other targets.

Maybe it would be good to ask specific questions:

How do I get the triple for the native target?

How do I set up the target transform info pass?

I guess I need the target machine for all of this. 'opt' defines a GetTargetMachine() function which takes a Triple (my native triple) and calls GetTargetOptions(). This function, as implemented in 'opt', looks very static to me and surely doesn't fit all machines. Do I need this for my setup?

Frank

Frank,

I am not familiar with the JITs.

Basically, somehow you need to create the TargetMachine. There is the TargetRegistry, which can be queried for the Target (for example, using TargetRegistry::lookupTarget) with the triple/march. The Target can then create a target machine for you.
The Triple is typically picked up from the IR file (Module->getTargetTriple()) and can be overridden using -march/-mtriple.

I think the (old) JIT engine just uses the triple of the host llvm was compiled on (sys::getProcessTriple(), see lib/ExecutionEngine/TargetSelect.cpp).

After you have the target you use it to create the target machine:

  Target *Target = … lookup the target
  TargetMachine *TM = Target->createTargetMachine( …);

Using this target machine you can then add the target transform info pass to a pass manager by calling:

  TM->addAnalysisPasses(FunPassManager); // This will add the target’s “TargetTransformInfo” implementation if it has one.

Hi Frank,

> I would need only the native target since all generated code will
> execute on the JIT execution engine (right now, the old JIT
> interface). There is no need for other targets.

I am not familiar with the JIT infrastructure, I am afraid. I think the old JIT's native target is just the host that llvm was compiled on.

> Maybe it would be good to ask specific questions:
>
> How do I get the triple for the native target?
>
> How do I set up the target transform info pass?
>
> I guess I need the target machine for all of this. 'opt' defines the
> GetTargetMachine() function which takes a Triple (my native triple).
> It calls GetTargetOptions(). This function, as implemented in 'opt',
> looks very static to me and surely doesn't fit all machines. Do I
> need this for my setup?

Yes, you need the target machine. There ought to be a JIT API that returns the target machine (since the JIT needs to create one when it creates machine code). But since I am not familiar with the JITs, I am afraid I can't help you further than this.

Frank,

I ran into the same issue getting started with the JIT execution engine and
vectorization. You might find this email thread useful:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-May/062039.html

My code for this looks like:

InitializeNativeTarget();
Module *module = new Module("foo", getGlobalContext());
// Use the host process triple so the right target is picked up.
module->setTargetTriple(sys::getProcessTriple());
EngineBuilder engineBuilder(module);
// Select the host CPU so the subtarget features (SSE/AVX) are known.
engineBuilder.setMCPU(sys::getHostCPUName());
engineBuilder.setEngineKind(EngineKind::JIT);
engineBuilder.setOptLevel(CodeGenOpt::Aggressive);
ExecutionEngine *executionEngine = engineBuilder.create();

For a more complete example, starting at line 1175:

https://github.com/biometrics/likely/blob/gh-pages/likely.cpp

Hope that helps!
-Josh
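
A sketch of how this setup could be combined with Arnold's earlier advice for the pass-manager case. This is an assumption-laden fragment, not tested code: EngineBuilder::selectTarget() existed on the old JIT, but check your LLVM version:

```
// Sketch only: reuse the target machine the JIT would select for the host.
TargetMachine *TM = engineBuilder.selectTarget();
FunctionPassManager FPM(module);
if (TM)
  TM->addAnalysisPasses(FPM);        // adds BasicTTI/X86TTI for the host
FPM.add(createLoopVectorizePass());
… // run FPM over your functions
```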