Landing my new development on the trunk ...

Here is the patch for the new Operator Strength Reduction optimization
pass that I have written. The bulk of the code is in

lib/Transforms/Scalar/OperatorStrengthReduce.cpp

The optimization is based on the algorithm described in the following paper:

Keith D. Cooper, L. Taylor Simpson, and Christopher A. Vick, "Operator
Strength Reduction," ACM Transactions on Programming Languages and
Systems (TOPLAS), vol. 23, no. 5, pp. 603-625, September 2001.

I have embedded the paper's pseudocode in comment blocks within the code.

The algorithm finds reduction opportunities in both array accesses and
explicit multiplications within loops.
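
To illustrate the idea at the source level, here is a hand-written
before/after sketch (illustrative only, not output of the pass, which
works on LLVM IR; the function and variable names are made up):

/* Before: every iteration computes i * stride (a multiplication). */
long sum_strided(const long *a, unsigned long n, unsigned long stride)
{
  long sum = 0;
  unsigned long i;
  for (i = 0; i < n; ++i)
    sum += a[i * stride];
  return sum;
}

/* After (conceptually): the multiplication is replaced by a new
 * induction variable 'idx' that is kept equal to i * stride by an
 * addition on every trip through the loop. */
long sum_strided_reduced(const long *a, unsigned long n, unsigned long stride)
{
  long sum = 0;
  unsigned long i, idx = 0;
  for (i = 0; i < n; ++i, idx += stride)
    sum += a[idx];
  return sum;
}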

Next, I plan on writing the regression tests.

thanks,
Brian

Brian West <bnwest <at> rice.edu> writes:

> I am currently writing a new optimization pass for LLVM, based on the
> paper "Operator Strength Reduction" (Cooper, Simpson, and Vick, 2001). I
> noticed from the Developer Policy page that my code will need to be
> reviewed before it is committed to the trunk.
>
> The Policy page also indicated that all "major" changes should be
> announced before coding and should be done as a series of incremental
> changes. I am not sure whether a new stand-alone optimization pass
> counts as a major change or not.
>
> I have run my pass against the test suite (as described here: "How to
> test my pass"
> http://comments.gmane.org/gmane.comp.compilers.llvm.devel/32875). I
> have fixed all of the compilation errors and all but one of the
> execution errors.
>
> I have made a pass over my code to conform to the LLVM coding standards.
>
> I have not written the regression tests yet.
>
> I also still need to work on the performance, but even as-is it is
> faster than the existing -loop-reduce pass.
>
> My question is how to proceed. What steps do I need to take to get my
> code reviewed and committed to the trunk?
>
> thanks,
> Brian West

Brian West <bnwest <at> rice.edu> writes:

> [attachment: osr.patch (79.4 KB)]

Comments:
1. Please don't top-post.
2. Instruction::getOpcodeName instead of your OpcodeToString function.
3. LLVM already has a significant amount of infrastructure for loop
passes; why does this pass have its own code for finding loops?
4. What's the use case for this pass? Can this pass improve
performance of generated code beyond the existing -loop-reduce pass?
If not, could it usefully act as a faster replacement for -loop-reduce
for fast code generation? (Perhaps someone doing fast code generation
with LLVM could comment on whether -loop-reduce is useful, and whether
the performance is an issue.)

-Eli

Eli Friedman <eli.friedman <at> gmail.com> writes:

> Comments:
> 1. Please don't top-post.
> 2. Instruction::getOpcodeName instead of your OpcodeToString function.

Can do. I did not see getOpcodeName() used in lib/Target/CBackend/CBackend.cpp, so I wrote my own. FWIW, I am only using OpcodeToString() for debug output.

> 3. LLVM already has a significant amount of infrastructure for loop
> passes; why does this pass have its own code for finding loops?

I saw the loop infrastructure for CFG loops. This algorithm finds loops in the data flow (more precisely: strongly-connected components in the SSA-graph), for example:

bb5:
  %20 = add nsw i32 %i.0, 1
  br label %bb6

bb6:
  %i.0 = phi i32 [ 0, %entry ], [ %20, %bb5 ]

There is a data flow loop between %20 and %i.0. The OSR paper has a nice figure showing data flow loops.

Here is a small excerpt from the OSR paper:

"OSR is driven by a simple depth-first search of the SSA-graph, using Tarjan’s strongly-connected component finder. The SCC-finder lets OSR discover induction variables in topological order and process them as they are discovered. As the SCCs are discovered, they are processed by a set of mutually-recursive routines that test for induction variables and region constants, and then reduce the appropriate operations."

> 4. What's the use case for this pass? Can this pass improve
> performance of generated code beyond the existing -loop-reduce pass?
> If not, could it usefully act as a faster replacement for -loop-reduce
> for fast code generation? (Perhaps someone doing fast code generation
> with LLVM could comment on whether -loop-reduce is useful, and whether
> the performance is an issue.)

-loop-reduce (LSR) only looks at implicit multiplication in array accesses. The following code has an explicit multiplication within a loop that LSR will not reduce:

extern float
loop4(float* a, int start, int stop)
{
  int i;
  float sum;
  float *pEntry;
  char *p;
  unsigned long offset;
  const unsigned int float_size = sizeof(float);

  sum = 0.0;
  p = (char*)a;
  for (i = start - 1; i < stop; ++i) {
    offset = i * float_size;
    pEntry = (float*) (p + offset);
    sum += *pEntry;
  }

  return sum;
}
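
For comparison, here is the reduced form I would expect, hand-written at the source level (the pass itself rewrites the IR, so take this only as a sketch of the effect; overflow corner cases of the original are glossed over):

extern float
loop4_reduced(float* a, int start, int stop)
{
  int i;
  float sum;
  float *pEntry;
  char *p;
  unsigned long offset;
  const unsigned int float_size = sizeof(float);

  sum = 0.0;
  p = (char*)a;
  offset = (start - 1) * float_size;   /* computed once, before the loop   */
  for (i = start - 1; i < stop; ++i) {
    pEntry = (float*) (p + offset);
    sum += *pEntry;
    offset += float_size;              /* addition replaces i * float_size */
  }

  return sum;
}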

OSR also finds reductions in some esoteric array accesses that LSR misses. I found these LSR misses while writing less-than-real-world code to break OSR. I can include these examples if requested.

For array accesses, OSR and LSR produce similar LLVM IR, and I am guessing that their execution times should be similar.

Empirically, the OSR optimization is compile-time faster than LSR. I have also noticed that LSR has more "analysis" requirements: Induction Variable Users, Natural Loop Information, Canonicalize natural loops, and Scalar Evolution Analysis. Both OSR and LSR require the Dominator Tree Construction analysis pass.

I did not mention in the original email (and should have) that OSR needs -instcombine to be run after it for cleanup. Also, -licm, -reassociate, -gvn, and -sccp can be enabling optimizations for OSR.


Eli,

Thanks for your comments.

Brian West

Brian West <bnwest <at> rice.edu> writes:

> > 3. LLVM already has a significant amount of infrastructure for loop
> > passes; why does this pass have its own code for finding loops?
>
> I saw the loop infrastructure for CFG loops. This algorithm finds loops in
> the data flow (more precisely: strongly-connected components in the
> SSA-graph) [...]

Ah, okay, that makes sense. I didn't really read the algorithm too carefully...

> > 4. What's the use case for this pass? Can this pass improve
> > performance of generated code beyond the existing -loop-reduce pass?
> > [...]
>
> For array accesses, OSR and LSR produce similar LLVM IR, and I am guessing
> that their execution times should be similar.

For normal compilation, LSR runs rather late (with the code generation
passes), and has some target-specific knowledge. We really need
benchmarks here, I think.

> Empirically, the OSR optimization is compile-time faster than LSR. I have
> also noticed that LSR has more "analysis" requirements: Induction Variable
> Users, Natural Loop Information, Canonicalize natural loops, and Scalar
> Evolution Analysis. Both OSR and LSR require the Dominator Tree Construction
> analysis pass.

My primary concern here is that if this code gets committed without
anyone interested in actually using it, it will just end up orphaned,
so there's no point to contributing it.

> I did not mention in the original email (and should have) that OSR needs
> -instcombine to be run after it for cleanup. Also, -licm, -reassociate,
> -gvn, and -sccp can be enabling optimizations for OSR.

Hmm... perhaps that could be partially fixed using the InstSimplify
infrastructure.

-Eli

Eli Friedman <eli.friedman <at> gmail.com> writes:

> > Empirically, the OSR optimization is compile-time faster than LSR. [...]
>
> My primary concern here is that if this code gets committed without
> anyone interested in actually using it, it will just end up orphaned,
> so there's no point to contributing it.

This work is part of the PACE project (http://pace.rice.edu/). My team has
funding for another three years and is committed to maintaining this code.

> > I did not mention in the original email (and should have) that OSR needs
> > -instcombine to be run after it for cleanup. [...]
>
> Hmm... perhaps that could be partially fixed using the InstSimplify
> infrastructure.

OSR can produce induction variables that are not used. -instcombine finds the
unused induction variables and deletes them (which is super cool btw :)). I am
unsure if InstructionSimplify can fix that.
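
To make the cleanup issue concrete, here is a hand-written source-level sketch (hypothetical names) of how the original induction variable can end up dead once both the address computation and the loop exit test have been rewritten in terms of the new induction variable (the paper's linear function test replacement):

unsigned sum4(const unsigned char *p, unsigned n)
{
  unsigned sum = 0;
  unsigned i;
  unsigned offset = 0;
  unsigned limit = n * 4;        /* loop bound rewritten once, up front */

  /* After reduction, 'offset' does all the work; 'i' is updated but
   * never read again, so a cleanup pass like -instcombine (with its
   * dead-code removal) can delete it. */
  for (i = 0; offset != limit; ++i, offset += 4)
    sum += p[offset];
  return sum;
}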


thanks,
Brian

Brian West <bnwest <at> rice.edu> writes:

> > My primary concern here is that if this code gets committed without
> > anyone interested in actually using it, it will just end up orphaned,
> > so there's no point to contributing it.
>
> This work is part of the PACE project (http://pace.rice.edu/). My team has
> funding for another three years and is committed to maintaining this code.

Okay, good to know.

> > Hmm... perhaps that could be partially fixed using the InstSimplify
> > infrastructure.
>
> OSR can produce induction variables that are not used. -instcombine finds the
> unused induction variables and deletes them (which is super cool btw :)). I am
> unsure if InstructionSimplify can fix that.

Oh... no, you would need instcombine for that. However, how hard
would it be to just avoid creating unused induction variables?
instcombine doesn't normally run after LSR.

-Eli

Eli is right. We do need to see some benchmark numbers and to understand how the pass will fit into the target-independent optimizer. While we encourage contributions, we typically don't commit new passes unless they introduce new functionality that has active clients. It would also help if you could provide compile-time numbers.

Evan