Loop unrolling opportunity in SPEC's libquantum with profile info

I am starting to use the sample profiler to analyze new performance
opportunities. The loop unroller has popped up in several of the
benchmarks I'm running. In particular, libquantum. There is a ~12%
opportunity when the runtime unroller is triggered.

This helps functions like quantum_sigma_x
(http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00149).
The function accounts for ~20% of total runtime. By allowing the
runtime unroller, we can speed up the program by about 12%.

I have been poking at the unroller a little bit. Currently, the
runtime unroller is only triggered by a special flag or if the target
states it in the unrolling preferences. We could also consult the
block frequency information here. If the loop header has a higher
relative frequency than the rest of the function, then we'd enable
runtime unrolling.

Chandler also pointed me at the vectorizer, which has its own
unroller. However, the vectorizer only unrolls enough to serve the
target; it's not as general as the runtime-triggered unroller. From
what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
AVX targets). Additionally, the vectorizer only unrolls to aid
reduction variables. When I forced the vectorizer to unroll these
loops, the performance effects were nil.

I'm currently looking at changing LoopUnroll::runOnLoop() to consult
block frequency information for the loop header to decide whether to
try runtime triggers for loops that don't have a constant trip count
but could be partially peeled.

Does that sound reasonable?

Thanks. Diego.

From: "Diego Novillo" <dnovillo@google.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Cc: nadav@apple.com
Sent: Wednesday, January 15, 2014 6:13:27 PM
Subject: [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info


Chandler also pointed me at the vectorizer, which has its own
unroller. However, the vectorizer only unrolls enough to serve the
target, it's not as general as the runtime-triggered unroller. From
what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
avx targets). Additionally, the vectorizer only unrolls to aid
reduction variables. When I forced the vectorizer to unroll these
loops, the performance effects were nil.

It may be worth noting that the vectorizer's unrolling is modulo unrolling (in the sense that the iterations are maximally intermixed), and so is bound by register pressure considerations (especially in the default configuration, where CodeGen does not make use of AA, and so often cannot 'fix' an expensive unrolling that has increased register pressure too much).

The generic unroller, on the other hand, does concatenation unrolling, which has different benefits.

I'm currently looking at changing LoopUnroll::runOnLoop() to consult
block frequency information for the loop header to decide whether to
try runtime triggers for loops that don't have a constant trip count
but could be partially peeled.

Does that sound reasonable?

This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.

The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

-Hal

I just also want to point out that we should really be *vectorizing* this
loop as well. It's a great candidate for it AFAICS....


This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.

The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many
slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will
eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

Right. If I force the runtime unroller, I get a mixed bag of speedups
and slowdowns. Additionally, code size skyrockets. By using it only on
the functions that have hot loops (as per the profile), we only unroll
those that make a difference. In the case of libquantum, there is a
grand total of 3 loops that need to be runtime unrolled.

Diego.

Hi Diego!

Thanks for looking at this!

Chandler also pointed me at the vectorizer, which has its own
unroller. However, the vectorizer only unrolls enough to serve the
target, it’s not as general as the runtime-triggered unroller. From
what I’ve seen, it will get a maximum unroll factor of 2 on x86 (4 on
avx targets). Additionally, the vectorizer only unrolls to aid
reduction variables.

The vectorizer has a heuristic that decides when to unroll. One of its rules is that reductions are profitable to unroll; another is that small loops should also be unrolled.

When I forced the vectorizer to unroll these
loops, the performance effects were nil.

Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I wonder if 'size' is typically high or low.

Thanks,
Nadav

I am starting to use the sample profiler to analyze new performance
opportunities. The loop unroller has popped up in several of the
benchmarks I'm running. In particular, libquantum. There is a ~12%
opportunity when the runtime unroller is triggered.

Pardon my ignorance, but what exactly does "runtime unroller" mean? In
particular the "runtime" part of it. Just from the name I'm imagining
JIT'ing an unrolled version on the fly, or choosing an unrolled version at
runtime, but neither of those interpretations seems likely.

-- Sean Silva


Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I
wonder if 'size' is typically high or low.

Yeah, can you produce a histogram of the values of `reg->size`? (or provide
the raw data for me to analyze?).

-- Sean Silva

From: "Sean Silva" <silvas@purdue.edu>
To: "Diego Novillo" <dnovillo@google.com>
Cc: nadav@apple.com, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, January 15, 2014 9:38:32 PM
Subject: Re: [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info


Pardon my ignorance, but what exactly does "runtime unroller" mean?
In particular the "runtime" part of it.

He's referring to the code in lib/Transforms/Utils/LoopUnrollRuntime.cpp -- which can be enabled by using the -unroll-runtime flag. The 'runtime' refers to the fact that the trip count is not known at compile time.

-Hal

No. The vectorizer stated that it wasn't going to bother with the loop
because it wasn't profitable. Specifically:

LV: Checking a loop in "quantum_sigma_x"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: Found a write-only loop!
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Widest type: 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
%for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %state =
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
%indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction: %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction: %xor =
xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction: store i64
%xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction: %4 =
trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 1 For instruction: %cmp =
icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1
%cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Scalar loop costs: 5.
LV: Found an estimated cost of 0 for VF 2 For instruction:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
%for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %state =
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
%indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction: %xor =
xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction: store i64
%xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction: %4 =
trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 2 For instruction: %cmp =
icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1
%cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Vector loop of width 2 costs: 7.
LV: Selecting VF = : 1.
LV: The target has 16 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #5 Interval # 2
LV(REG): At #6 Interval # 2
LV(REG): At #7 Interval # 2
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 3
LV(REG): LoopSize: 9
LV: Found a vectorizable loop (1) in gates.ll
LV: Unroll Factor is 1
LV: Vectorization is possible but not beneficial.

I poked briefly at the vectorizer code to see if there is anything
that the profile data could've told it, but this loop did not meet the
requirements for unrolling. And even if it did, the trip count is not
constant and the unroll factor used by the vectorizer is pretty low.
So, even if we vectorized it (or parts of it) I don't think the
speedup would be significant.

What really helps this loop is to peel it a few times and do the
remaining iterations in the loop.

Diego.

reg->size is a runtime constant. It depends on the inputs to the
program, but once set it's always the same value. For the runs I was
comparing it's 260,000 or so.

Diego.

Hi Diego,

It looks like the problem is with the code in the vectorizer that tries to estimate the most profitable vectorization factor:

LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61

It looks like a cost model problem. The vectorizer thinks that loading %3 (above) is non-consecutive and would require scatter/gather. Is that correct? I wonder what SCEV is reporting. Is there an index overflow problem that is preventing us from loading consecutive elements?

Thanks,
Nadav


Oh, yes, yes.

reg->node[i].state ^= ...

We are accessing a struct with multiple members. This would require strided access because we need to skip the other members of the struct.

Thanks,
Nadav

Yes, I forgot to mention that. The access is non-consecutive:

      for(i=0; i<reg->size; i++)
         {
             /* Flip the target bit of each basis state */
             reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
         }

The code is writing to the 'state' field. The data structures look like this:

typedef struct quantum_matrix_struct quantum_matrix;

struct quantum_reg_node_struct
{
  COMPLEX_FLOAT amplitude; /* alpha_j */
  MAX_UNSIGNED state; /* j */
};

typedef struct quantum_reg_node_struct quantum_reg_node;

/* The quantum register */

struct quantum_reg_struct
{
  int width; /* number of qubits in the qureg */
  int size; /* number of non-zero vectors */
  int hashw; /* width of the hash array */
  quantum_reg_node *node;
  int *hash;
};

If you do the trick of writing to a separate array, then the loop can
be vectorized.

Diego.


If you do the trick of writing to a separate array, then the loop can
be vectorized.

Do you mean that we should change the layout of the array? Or is there another trick?

No, sorry. I meant that if I change the code to instead do the xor
operation into a temp array, then I expect the loop to be vectorized.
But the overhead would likely negate the vectorization benefits
(haven't tried it, though). Incidentally, GCC gives up on vectorizing
this loop for essentially the same reason. It decides that the access
pattern is too complex to bother.

Diego.


If you do the trick of writing to a separate array, then the loop can
be vectorized.

Or use scatter-gather.

-- Sean Silva

Vectorization and partial unrolling (aka runtime unrolling) for ILP should be the same pass. The profitability analysis required in each case is very closely related, and you never want to do one before or after the other. The analysis makes sense even for targets without vector units. The "vector unroller" has an extra restriction (unlike the LoopUnroll pass) in that it must be able to interleave operations across iterations. This is usually a good thing to check before unrolling, but the compiler's dependence analysis may be too conservative in some cases.

Currently, the cost model is conservative w.r.t. unrolling because we don't want to increase code size. But minimally, we should unroll until we can saturate the resources/ports; e.g., a loop with a single load should be unrolled x2 so we can do two loads per cycle. If you can come up with improved heuristics without generally impacting code size, that's great.

Where we’re currently looking for a constant trip count to avoid excessive unrolling, we could now look at profiled trip count if it isn’t constant.

-Andy

In addition to tuning the cost model, I found that the vectorizer does not even choose to get that far into its analysis for some loops that I need unrolled. In this particular case, there are three loops that need to be unrolled to get the performance I'm looking for. Of the three, only one gets far enough in the analysis to decide whether we unroll it or not.

But I found a bigger issue. The loop optimizers run under the loop pass manager (I am still trying to wrap my head around that; I find it very odd and have not convinced myself why there is a separate manager for loops). Inside the loop pass manager, I am not allowed to call the block frequency analysis. Any attempt I make at scheduling BF analysis sends the compiler into an infinite loop during initialization. Chandler suggested a way around the problem. I'll work on that first.

Oh, code size will always go up. That's pretty much unavoidable when you decide to unroll. The trick here is to only unroll select loops.

The profiler does not tell you the trip count. What it will do is cause the loop header to be excessively heavy wrt its parent in the block frequency analysis. In this particular case, you get something like:

Notice how the head of the loop has weight 682, which is 682x the weight of its parent (the function entry, since this is an outermost loop). With static heuristics, this ratio is significantly lower (about 3x). When we see this, we can decide to unroll the loop.

Diego.


Hi,
please remove me from this thread; I think you meant to send it to Nadav Rotem.

Nadav Wurembrand
Apple Inc., HDC | Mail: nadav@apple.com | Mobile: +972-54-975-5016