Vectorizer using Instruction, not opcodes

Hi folks,

I've been thinking about how to implement some of the costs, and there are a lot of instructions whose cost depends on the instructions around them. Casts are one obvious case, since arithmetic and memory instructions can sometimes cast values for free.

The cost model receives Opcodes, which lose the information about the history of the values being vectorized, and I thought we could pass the whole Instruction instead. But that is the easy part… Below are some ideas; I'd love to hear from you which make sense and which don't…

A. How to find context?

Suppose you have a cast instruction, and you want to find out if there is an arithmetic operation before (or after) it on any of the values it touches…

%3 = sext %type1 %2 to %type2

You could iterate through all uses of %3 and %2 and see if any of them is a mul/add/sub, and return 'true', so that the cost would add (or not) an extra cycle.
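
A minimal sketch of that walk, using LLVM's C++ API (isArith and hasArithmeticUser are made-up helpers, not existing LLVM functions; modern header names assumed):

#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// True if this user is one of the arithmetic ops we care about.
static bool isArith(const User *U) {
  const auto *BO = dyn_cast<BinaryOperator>(U);
  if (!BO)
    return false;
  unsigned Op = BO->getOpcode();
  return Op == Instruction::Add || Op == Instruction::Sub ||
         Op == Instruction::Mul;
}

// True if the cast's result (%3) or its operand (%2) feeds an
// add/sub/mul, i.e. the cases where the cast might be free.
static bool hasArithmeticUser(const CastInst *Cast) {
  for (const User *U : Cast->users()) // uses of %3
    if (isArith(U))
      return true;
  if (const auto *Src = dyn_cast<Instruction>(Cast->getOperand(0)))
    for (const User *U : Src->users()) // uses of %2
      if (isArith(U))
        return true;
  return false;
}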

Problems:

  1. This space is multi-dimensional, i.e. there are many types in and many out, and the cost table is not linear in any of the dimensions (hence, cost tables). One idea is to create a multi-dimensional cost table, but that could get confusing really quickly.

  2. We're essentially reproducing the logic of InstCombine, which is neither sensible nor appropriate.

Any ideas on direction for this one?

B. How to find resulting code?

Another alternative is to run some lower-level passes on the code, to reduce the chances that we're guessing in the dark, but that would be extremely expensive. Maybe we could run the vectorizer only at the very end and, only if the code changed, re-run the clean-up passes. I'm not convinced…

C. Settle for guess in the dark

We can continue guessing in the dark, but do it consistently. This is our current approach, and I'm wondering if it makes sense to deviate from it.

My main issue is that the costs of ARM instructions are almost random when seen from an IR perspective; we're bound to make as many mistakes as not.

Another issue is that the number of vector IR instructions (extract/select) is much larger than in the resulting vector code, and that can count negatively against the benefits of vectorizing.

It’s also quite possible that I’m over-thinking…

Thoughts?

cheers,
--renato

Hi all,

My take on this is that, as you state below, at the IR level we are only roughly estimating cost, at best (or we would have to lower the code and then estimate cost, which is something we don't want to do).

I would propose estimating the "worst case costs" and seeing how far we get with this. My rationale here is that we don't want vectorization to decrease performance relative to scalar code. If we miss an opportunity that is bad, but not as bad as degrading relative to scalar code.

For cases where this approach breaks really badly we could consider adding a specialized API or parameters (like the type of a user/use). But we should do so only as a last resort, backed by actual code that would benefit from it.
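
A minimal sketch of what "worst case" costing could look like (hypothetical helper and cost table, not an existing TTI hook): when a cast's cost depends on context we cannot see, charge the most expensive variant.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/CodeGen/ValueTypes.h"
#include <algorithm>

using namespace llvm;

struct CastCostEntry {
  int ISD;       // e.g. ISD::SEXT
  MVT Dst, Src;  // destination and source types
  unsigned Cost; // cost of one possible lowering
};

// All context-dependent lowerings of the same cast live in the table;
// without context, return the worst one, so vectorization never looks
// cheaper than it will actually be.
static unsigned getWorstCaseCastCost(int ISD, MVT Dst, MVT Src,
                                     ArrayRef<CastCostEntry> Table) {
  unsigned Worst = 0;
  for (const CastCostEntry &E : Table)
    if (E.ISD == ISD && E.Dst == Dst && E.Src == Src)
      Worst = std::max(Worst, E.Cost);
  return Worst;
}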

- Arnold

Very sensible, more or less what I had in mind. I think we could go one
step further and just get some high-level decisions like: is this cast
associated with an (any) arithmetic operation? It won't be perfect, but it
adds a bit more information at a very reduced price. Though, this would
require us to pass the Instruction, not the Opcode.

> Do you have an example where this is happening?

The example below is not stopping the vectorizer, but it does add a lot of
costs where there are none...

** C code:

int direct (int k) {
  int i;
  int a[256], b[256], c[256];

  for (i=0; i<256; i++){
    a[i] = b[i] * c[i];
  }
  return a[k];
}

** ASM vectorized result:

        adr r5, .LCPI0_0
        vdup.32 q9, r1
        vld1.64 {d16, d17}, [r5, :128]
        add r1, r1, #4
        vadd.i32 q8, q9, q8
        cmp r3, r1
        vmov.32 r5, d16[0]
        add r6, lr, r5, lsl #2
        add r7, r2, r5, lsl #2
        vld1.32 {d16, d17}, [r6]
        add r5, r4, r5, lsl #2
        vld1.32 {d18, d19}, [r7]
        vmul.i32 q8, q9, q8
        vst1.32 {d16, d17}, [r5]
        bne .LBB0_2

** Vectorized IR (just the loop):

vector.body: ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = extractelement <4 x i32> %induction, i32 0
  %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
  %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
  %3 = extractelement <4 x i32> %induction, i32 1
  %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
  %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
  %6 = extractelement <4 x i32> %induction, i32 2
  %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
  %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
  %9 = extractelement <4 x i32> %induction, i32 3
  %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
  %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
  %12 = extractelement <4 x i32> %induction, i32 0
  %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
  %14 = getelementptr i32* %13, i32 0
  %15 = bitcast i32* %14 to <4 x i32>*
  %wide.load = load <4 x i32>* %15, align 4
  %16 = extractelement <4 x i32> %induction, i32 0
  %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16
  %18 = insertelement <4 x i32*> undef, i32* %17, i32 0
  %19 = extractelement <4 x i32> %induction, i32 1
  %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19
  %21 = insertelement <4 x i32*> %18, i32* %20, i32 1
  %22 = extractelement <4 x i32> %induction, i32 2
  %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22
  %24 = insertelement <4 x i32*> %21, i32* %23, i32 2
  %25 = extractelement <4 x i32> %induction, i32 3
  %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25
  %27 = insertelement <4 x i32*> %24, i32* %26, i32 3
  %28 = extractelement <4 x i32> %induction, i32 0
  %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28
  %30 = getelementptr i32* %29, i32 0
  %31 = bitcast i32* %30 to <4 x i32>*
  %wide.load3 = load <4 x i32>* %31, align 4
  %32 = mul nsw <4 x i32> %wide.load3, %wide.load
  %33 = extractelement <4 x i32> %induction, i32 0
  %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33
  %35 = insertelement <4 x i32*> undef, i32* %34, i32 0
  %36 = extractelement <4 x i32> %induction, i32 1
  %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36
  %38 = insertelement <4 x i32*> %35, i32* %37, i32 1
  %39 = extractelement <4 x i32> %induction, i32 2
  %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39
  %41 = insertelement <4 x i32*> %38, i32* %40, i32 2
  %42 = extractelement <4 x i32> %induction, i32 3
  %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42
  %44 = insertelement <4 x i32*> %41, i32* %43, i32 3
  %45 = extractelement <4 x i32> %induction, i32 0
  %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45
  %47 = getelementptr i32* %46, i32 0
  %48 = bitcast i32* %47 to <4 x i32>*
  store <4 x i32> %32, <4 x i32>* %48, align 4
  %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256>
  %index.next = add i32 %index, 4
  %51 = icmp eq i32 %index.next, %end.idx.rnd.down
  br i1 %51, label %middle.block, label %vector.body

** Cost analysis:

Cost Model: Found an estimated cost of 1 for instruction: %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
Cost Model: Found an estimated cost of 1 for instruction: %0 = extractelement <4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction: %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
Cost Model: Found an estimated cost of 1 for instruction: %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
Cost Model: Found an estimated cost of 1 for instruction: %3 = extractelement <4 x i32> %induction, i32 1
Cost Model: Unknown cost for instruction: %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
Cost Model: Found an estimated cost of 1 for instruction: %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
Cost Model: Found an estimated cost of 1 for instruction: %6 = extractelement <4 x i32> %induction, i32 2
Cost Model: Unknown cost for instruction: %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
Cost Model: Found an estimated cost of 1 for instruction: %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
Cost Model: Found an estimated cost of 1 for instruction: %9 = extractelement <4 x i32> %induction, i32 3
Cost Model: Unknown cost for instruction: %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
Cost Model: Found an estimated cost of 1 for instruction: %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
Cost Model: Found an estimated cost of 1 for instruction: %12 = extractelement <4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction: %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
Cost Model: Unknown cost for instruction: %14 = getelementptr i32* %13, i32 0

and so on...

From: "Renato Golin" <renato.golin@linaro.org>
To: "Arnold Schwaighofer" <aschwaighofer@apple.com>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>, "Nadav Rotem" <nrotem@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Monday, February 4, 2013 1:38:03 PM
Subject: Re: Vectorizer using Instruction, not opcodes

> For cases where this approach breaks really badly we could consider
> adding a specialized API or parameters (like the type of a
> user/use). But we should do so only as a last resort, backed by
> actual code that would benefit from it.

> Very sensible, more or less what I had in mind. I think we could go
> one step further and just get some high-level decisions like: is
> this cast associated with an (any) arithmetic operation? It won't be
> perfect, but it adds a bit more information at a very reduced price.
> Though, this would require us to pass the Instruction, not the
> Opcode.

We probably don't want to end up in a situation where we're speculatively creating instructions just to test their cost using the cost model. For simple things, passing an existing instruction and its widening factor will work, but I worry that won't really be sufficiently general. Would a tree of <opcode, return-type, input-type>s work? How deep would it need to be? We also need to decide on some self-consistent way of dealing with cost folding. What I mean is: if an instruction can be folded with its use, do we register that cost saving for the use or for the operand? We'd also need to pass hasOneUse() information.
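
One possible shape for such a tree node (a sketch only; no such type exists in LLVM):

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Type.h"

struct CostTreeNode {
  unsigned Opcode;   // e.g. Instruction::SExt
  llvm::Type *RetTy; // return type of this node
  llvm::SmallVector<const CostTreeNode *, 2> Inputs; // depth-limited subtrees
  bool HasOneUse;    // folding hint, per the hasOneUse() point above
};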

-Hal

The loop vectorizer does not estimate the cost of vectorization by looking at the IR you list below. It does not vectorize and then run the CostAnalysis pass; it estimates the cost itself, before it even performs the vectorization.

The way it works is that it looks at all the scalar instructions and asks: what is the cost if I execute this scalar instruction as a vector instruction? Therefore, it will not consider any of the insert/extractelement instructions you see below (note that they will all go away anyway).

If you run clang++ -O3 -mllvm -debug-only=loop-vectorize you will see which instructions it considers.

E.g.:

LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds [256 x i32]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %0 = load i32* %arrayidx, align 4, !tbaa !0
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx2 = getelementptr inbounds [256 x i32]* @c, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %1 = load i32* %arrayidx2, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 1 For instruction: %mul = mul nsw i32 %1, %0
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx4 = getelementptr inbounds [256 x i32]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %mul, i32* %arrayidx4, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next = add i64 %indvars.iv, 1
LV: Found an estimated cost of 0 for VF 1 For instruction: %lftr.wideiv = trunc i64 %indvars.iv.next to i32
LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond = icmp ne i32 %lftr.wideiv, 256
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond, label %for.body, label %for.end
LV: Scalar loop costs: 6.
LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds [256 x i32]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction: %0 = load i32* %arrayidx, align 4, !tbaa !0
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx2 = getelementptr inbounds [256 x i32]* @c, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction: %1 = load i32* %arrayidx2, align 4, !tbaa !0
LV: Found an estimated cost of 2 for VF 2 For instruction: %mul = mul nsw i32 %1, %0
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx4 = getelementptr inbounds [256 x i32]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction: store i32 %mul, i32* %arrayidx4, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next = add i64 %indvars.iv, 1
LV: Found an estimated cost of 0 for VF 2 For instruction: %lftr.wideiv = trunc i64 %indvars.iv.next to i32
LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond = icmp ne i32 %lftr.wideiv, 256
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond, label %for.body, label %for.end

The corresponding code is in LoopVectorizationCostModel::getInstructionCost(). Basically, it just takes the scalar instruction and its type, creates a vector type, and asks: how much would an instruction of that vector type cost? (Slight simplification.)
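
A heavily simplified sketch of that logic (approximate 2013-era API; the real function special-cases many more opcodes):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

static unsigned getInstructionCostSketch(Instruction *I, unsigned VF,
                                         const TargetTransformInfo &TTI) {
  // Widen the scalar type by the vectorization factor.
  Type *VectorTy = (VF == 1) ? I->getType()
                             : VectorType::get(I->getType(), VF);
  switch (I->getOpcode()) {
  case Instruction::Add:
  case Instruction::Mul:
    // Ask the target what the widened instruction would cost.
    return TTI.getArithmeticInstrCost(I->getOpcode(), VectorTy);
  // ... loads, stores, casts, etc. are handled similarly ...
  default:
    return 1;
  }
}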

> The loop vectorizer does not estimate the cost of vectorization by looking
> at the IR you list below. It does not vectorize and then run the
> CostAnalysis pass; it estimates the cost itself, before it even performs
> the vectorization.

Ok, I see my mistake. I thought that they were being taken into account
(even before being created).

> (note that they will all go away anyway).

Thus, my worry. ;)

> The corresponding code is in
> LoopVectorizationCostModel::getInstructionCost(). Basically, it just takes
> the scalar instruction and its type, creates a vector type, and asks: how
> much would an instruction of that vector type cost? (Slight simplification.)

And then divide by VF in the end. Ok, things are making more sense now. ;)
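
Roughly, the profitability check then looks like this (a simplified sketch of the comparison in LoopVectorizationCostModel):

// Cost of the vector loop, spread over the VF scalar iterations it covers.
float VectorCost = expectedCost(VF) / (float)VF;
float ScalarCost = expectedCost(1);
bool Profitable = VectorCost < ScalarCost;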

Thanks!
--renato

Hi Hal,

From: "Renato Golin" <renato.golin@linaro.org>
To: "Arnold Schwaighofer" <aschwaighofer@apple.com>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>, "Nadav Rotem" <nrotem@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
Sent: Monday, February 4, 2013 1:38:03 PM
Subject: Re: Vectorizer using Instruction, not opcodes

> For cases where this approach breaks really badly we could consider
> adding a specialized API or parameters (like the type of a
> user/use). But we should do so only as a last resort, backed by
> actual code that would benefit from it.

> Very sensible, more or less what I had in mind. I think we could go
> one step further and just get some high-level decisions like: is
> this cast associated with an (any) arithmetic operation? It won't be
> perfect, but it adds a bit more information at a very reduced price.
> Though, this would require us to pass the Instruction, not the
> Opcode.

> We probably don't want to end up in a situation where we're speculatively creating instructions just to test their cost using the cost model. For simple things, passing an existing instruction and its widening factor will work, but I worry that won't really be sufficiently general. Would a tree of <opcode, return-type, input-type>s work? How deep would it need to be?

Agreed, if we decide on more information than just type and opcode we will need some abstract representation.

> We also need to decide on some self-consistent way of dealing with cost folding. What I mean is: if an instruction can be folded with its use, do we register that cost saving for the use or for the operand? We'd also need to pass hasOneUse() information.

I think this is probably going too far; I don't think we want to model all of target lowering in the cost model. Any attempt would be a lie anyway ;). I don't think we should model at such a fine granularity at the IR level, because any attempt at doing so is bound to fail.

- Arnold

> Would a tree of <opcode, return-type, input-type>s work? How deep would it
> need to be?

> Agreed, if we decide on more information than just type and opcode we will
> need some abstract representation.

I didn't want to create yet-another-instruction-type, but I now agree that
passing the Instruction is not an option. I think a tree would be bound to
be overused, since you never know beforehand how much is enough, and
passing too much would penalise the operations that didn't even need it in
the first place.

What I was thinking was something flat and target-independent, like:

enum VectFlags {
  ConstantShift  = 0x0000001,
  VariableShift  = 0x0000002,
  ...
  ArithmeticCast = 0x0000100,
  MemoryCast     = 0x0000200,
  ...
};

and the user (the vectorizers) can set a few flags where appropriate and
pass the flags along (which can also be carried on, cached, or stored
somewhere else, like metadata).

getCastInstructionCost(opcode, dstTy, srcTy, Flags) {
  // Context flag set by the caller, as above.
  isArith = Flags & ArithmeticCast;

  costtable = {
    { ISD::SEXT, MVT::Foo, MVT::Bar1, isArith + 1 },
    { ISD::SEXT, MVT::Foo, MVT::Bar2, isArith },
    { ISD::SEXT, MVT::Foo, MVT::Bar3, 1 },
    { ISD::SEXT, MVT::Foo, MVT::Bar4, 0 },
    ...
  }
}
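
A hypothetical caller in the vectorizer would then look something like this (castFeedsArithmetic is a made-up context query; getCastInstructionCost is the API proposed above):

unsigned Flags = 0;
if (castFeedsArithmetic(Cast)) // e.g. walks the users, as in (A)
  Flags |= ArithmeticCast;
unsigned Cost = TTI.getCastInstructionCost(Opcode, DstTy, SrcTy, Flags);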

> We also need to decide on some self-consistent way of dealing with cost
> folding. What I mean is: if an instruction can be folded with its use, do
> we register that cost saving for the use or for the operand? We'd also
> need to pass hasOneUse() information.

Cost folding is a big problem per se. I agree with Arnold that it may be
going too far (as I was going with the Instruction argument).

cheers,
--renato

Hi,

The other thing I've thought about is that the "total cost of conversions"
ought to go through some sort of non-linear mapping based on the number of
instructions, so that the same "conversion cost" is weighted less for a long
inner loop than for a short one, since the more instructions there are, the
more an out-of-order CPU can slot them into otherwise unused issue slots. I
can't think of a way of figuring out such a mapping other than empirically.
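
Purely as an illustration of the shape such a mapping could take (K is a made-up tuning constant that would have to be fitted empirically):

// Damp the conversion cost as the loop body grows: short loops pay the
// full cost, long loops pay progressively less.
double weightedConversionCost(double ConvCost, unsigned NumLoopInsts) {
  const double K = 16.0; // hypothetical, empirically fitted constant
  return ConvCost * K / (K + NumLoopInsts);
}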

Cheers,
Dave

Given the amount of uncertainty in these OOO guesses, I don't think we'd get
anything worth trying, even empirically. The noise will always outweigh
the signal.

You can normally save a few cycles on a static micro-benchmark and think
you're in control, but you normally don't evaluate the exact impact that
the same guess had on other benchmarks or real code. Since this is
auto-vectorization, it'll be hard to compare, and even harder to evaluate
the overall impact on everything else.

--renato

Hi Renato,