Landing my new development on the trunk ...

Eli Friedman <eli.friedman <at> gmail.com> writes:
> >> > I did not mention in the original email (and should have) that OSR needs
> >> > -instcombine to be run after it for cleanup. Also -licm, -reassociate, -gvn
> >> > and -sccp can be enabling optimizations for OSR.
> >>
> >> Hmm... perhaps that could be partially fixed using the InstSimplify
> >> infrastructure.
> >
> > OSR can produce induction variables that are not used. -instcombine finds the
> > unused induction variables and deletes them (which is super cool btw :)). I am
> > unsure if InstructionSimplify can fix that.
>
> Oh... no, you would need instcombine for that. However, how hard
> would it be to just avoid creating unused induction variables?
> instcombine doesn't normally run after LSR.

Short answer is no.

The OSR adds new induction variables based on existing variables. That way, we
can go from induction variable i to induction variable &a[long(i)] (aka, a +
long(i)). There may be some intermediate induction variables that assist in i
reaching (a + long(i)), like long(i), e.g.

i -> long(i) -> (a + long(i))

After creating (a + long(i)), induction variable long(i) may be unused,
depending on whether long(i) can participate in creating another induction
variable. Thus, it is not known until the algorithm is done which of the
created induction variables are unused.
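To make the chain concrete, here is a hypothetical source-level sketch (made-up function names; OSR actually operates on LLVM IR, not C) of a loop before and after the reduction:

```c
#include <assert.h>

/* Before OSR: each trip recomputes the address a + long(i) from i. */
long sum_indexed(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[(long)i];          /* long(i) feeds the address a + long(i) */
    return s;
}

/* After OSR: the address a + long(i) is itself an induction variable p.
 * The intermediate IV long(i) now has no remaining use, so it is dead --
 * but that is only known once the whole chain has been built. */
long sum_reduced(const int *a, int n) {
    long s = 0;
    const int *p = a;             /* p tracks a + long(i) directly */
    for (int i = 0; i < n; i++, p++)
        s += *p;
    return s;
}
```

Both functions compute the same sum; the point is only that long(i) survives as a dead intermediate unless something cleans it up afterwards.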

FWIW I noticed that other optimizations (as seen in StandardPasses.h) are
followed by -instcombine for cleanup. I thought requiring or suggesting running
-instcombine after -osr would be SOP.

FWIW2 our intent in creating OSR was not to replace LSR, just provide an
alternative strength reduction optimization.

> -Eli

Brian

Eli Friedman <eli.friedman <at> gmail.com> writes:

>>>>> I did not mention in the original email (and should have) that OSR needs
>>>>> -instcombine to be run after it for cleanup. Also -licm, -reassociate, -gvn
>>>>> and -sccp can be enabling optimizations for OSR.
>>>>
>>>> Hmm... perhaps that could be partially fixed using the InstSimplify
>>>> infrastructure.
>>>
>>> OSR can produce induction variables that are not used. -instcombine finds the
>>> unused induction variables and deletes them (which is super cool btw :)). I am
>>> unsure if InstructionSimplify can fix that.

>> Oh... no, you would need instcombine for that. However, how hard
>> would it be to just avoid creating unused induction variables?
>> instcombine doesn't normally run after LSR.
>
> Short answer is no.
>
> The OSR adds new induction variables based on existing variables. That way, we
> can go from induction variable i to induction variable &a[long(i)] (aka, a +
> long(i)). There may be some intermediate induction variables that assist in i
> reaching (a + long(i)), like long(i), e.g.
>
> i -> long(i) -> (a + long(i))
>
> After creating (a + long(i)), induction variable long(i) may be unused,
> depending on whether long(i) can participate in creating another induction
> variable. Thus, it is not known until the algorithm is done which of the
> created induction variables are unused.

Sure, but you know which induction variables you created; you can just
zap the unused ones at the end of the pass, no?

> FWIW I noticed that other optimizations (as seen in StandardPasses.h) are
> followed by -instcombine for cleanup. I thought requiring or suggesting running
> -instcombine after -osr would be SOP.

Strength reduction is sort of a special case; it runs extremely late
because it interferes with other optimizations.

> FWIW2 our intent in creating OSR was not to replace LSR, just provide an
> alternative strength reduction optimization.

Sure, but if it can't be used as an alternative to LSR, how would it be used?

-Eli

> Sure, but you know which induction variables you created; you can just
> zap the unused ones at the end of the pass, no?

This is feasible. We would have to collect more information during the OSR pass proper and add logic to clean up at the end.
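A minimal sketch of what that bookkeeping might look like (a toy model with made-up types, not the actual LLVM pass code): record each IV the pass creates along with its operand link and use count, then delete zero-use IVs at the end, iterating because deleting one IV can make its operand dead too.

```c
#include <assert.h>

/* One record per induction variable the pass created. */
typedef struct {
    int operand;   /* index of the IV this one was built from, or -1 */
    int uses;      /* number of remaining users */
    int deleted;
} IV;

/* Delete dead IVs at the end of the pass. Deleting an IV decrements its
 * operand's use count, so repeat until a fixed point is reached. */
static int zap_unused(IV ivs[], int n) {
    int removed = 0, changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++) {
            if (!ivs[i].deleted && ivs[i].uses == 0) {
                ivs[i].deleted = 1;
                removed++;
                changed = 1;
                if (ivs[i].operand >= 0)
                    ivs[ivs[i].operand].uses--;
            }
        }
    }
    return removed;
}
```

For the chain i -> long(i) -> (a + long(i)): if the final IV turns out to be unused, zapping it makes long(i) dead as well, while i survives because the loop still uses it.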

>> FWIW I noticed that other optimizations (as seen in StandardPasses.h) are
>> followed by -instcombine for cleanup. I thought requiring or suggesting running
>> -instcombine after -osr would be SOP.
>
> Strength reduction is sort of a special case; it runs extremely late
> because it interferes with other optimizations.

>> FWIW2 our intent in creating OSR was not to replace LSR, just provide an
>> alternative strength reduction optimization.
>
> Sure, but if it can't be used as an alternative to LSR, how would it be used?

The PACE project intends to use LLVM to perform lower level optimizations in its compiler. OSR would be another optimization that it would have in its bag of tricks. Remember that OSR finds reductions that LSR does not.

For code generation, the PACE compiler can use the LLVM C Backend (plus the native compiler) or the LLVM native backend (if available) or just the native compiler. Thus, when the PACE compiler uses LLVM, LSR may not be run.

I suspect that there may be other LLVM C Backend users who could use OSR as well, right?

> -Eli

Brian

>> Sure, but you know which induction variables you created; you can just
>> zap the unused ones at the end of the pass, no?
>
> This is feasible. We would have to collect more information during the OSR
> pass proper and add logic to clean up at the end.

>>> FWIW I noticed that other optimizations (as seen in StandardPasses.h) are
>>> followed by -instcombine for cleanup. I thought requiring or suggesting running
>>> -instcombine after -osr would be SOP.
>>
>> Strength reduction is sort of a special case; it runs extremely late
>> because it interferes with other optimizations.

>>> FWIW2 our intent in creating OSR was not to replace LSR, just provide an
>>> alternative strength reduction optimization.
>>
>> Sure, but if it can't be used as an alternative to LSR, how would it be used?
>
> The PACE project intends to use LLVM to perform lower level
> optimizations in its compiler. OSR would be another optimization that
> it would have in its bag of tricks. Remember that OSR finds reductions
> that LSR does not.

LSR's primary goal is to reduce integer register pressure, with
strength reduction being just one of its available tools. Also, it aggressively
uses the target addressing modes (which is awkward in LLVM IR, since the
addressing modes aren't explicit, but that's another story). Without
a Target, it currently assumes it can use register+register addressing,
which is probably pretty bad when used with the C backend. We
can probably fix this, if you're interested.

A big downside of the current LSR algorithm is that it's slow. I had initially
hoped that some of the heuristics would protect it better, but the problem is
more complex than I had expected. I haven't done any measurements,
but it's likely that OSR is faster, which may interest some people regardless
of how the output compares.

> For code generation, the PACE compiler can use the LLVM C Backend (plus
> the native compiler) or the LLVM native backend (if available) or just
> the native compiler. Thus, when the PACE compiler uses LLVM, LSR may
> not be run.

LSR can be run from opt, though as I mentioned above the current
target-independent addressing mode assumptions may need to be tuned.

> I suspect that there may be other LLVM C Backend users who could use OSR
> as well, right?

Not many people use the C backend.

Dan

Dan Gohman <gohman@apple.com> writes:

>> I suspect that there may be other LLVM C Backend users who could use OSR
>> as well, right?
>
> Not many people use the C backend.

Au contraire (just had to jump in). :) Bugpoint uses cbe so there are
at least some people like me that use it to debug compiler problems.
I use it when there is no "good" llc to compare to, which unfortunately
happens more often than not here.

                           -Dave

> A big downside of the current LSR algorithm is that it's slow. I had initially
> hoped that some of the heuristics would protect it better, but the problem is
> more complex than I had expected. I haven't done any measurements,
> but it's likely that OSR is faster, which may interest some people regardless
> of how the output compares.

A few years ago, I implemented OSR (with some slight modifications) in GCC, though it was never committed to mainline (it's on a branch somewhere). It was significantly faster than ivopts (which does what you guys are using LSR for), and found more cases than ivopts did. I just never integrated the same target-dependent stuff, so it never made it into mainline.

Note that as written in the paper, OSR is pretty target-independent because of the order of processing. It expects to do its processing on each SCC as it completes the SCC, so it doesn't gather all the possible things it could do before doing them and then decide what is best.

It is also possible for a “do everything” OSR to completely blow up register pressure if there are a number of conditional iv updates + operations based on them in the loop, since it will have to generate a new variable for each of these cases that will end up live over the entire loop.

So i think you may see good things if you took the OSR code and used it as a basis for LSR.

There is one thing that the original paper, the original MSCP implementation (too bad the links to this point to ftp.cs.rice.edu, which no longer works; the web files were a great implementation resource), and my GCC implementation all did, which is LFTR (Linear Function Test Replacement). LFTR after OSR can help reduce register pressure, since it enables eliminating the IVs that no longer serve any useful purpose. I don't see any implementation of it in this code.

--Dan

Dan,

LFTR (Linear Function Test Replacement) was mentioned in the original paper. I considered including LFTR with OSR, but decided to get OSR to trunk first and then add LFTR (-lftr) as a separate pass later. The LLVM development documentation suggests that new work be committed piecemeal over time.

LLVM does have an optimization pass, -instcombine, which will delete unused induction variables. I recommend that -instcombine be run after OSR. It is my understanding that LFTR would attempt to remove induction variables whose only use is to participate in an end-of-loop test condition.

thanks for your comments,
Brian West

>> A big downside of the current LSR algorithm is that it's slow. I had initially
>> hoped that some of the heuristics would protect it better, but the problem is
>> more complex than I had expected. I haven't done any measurements,
>> but it's likely that OSR is faster, which may interest some people regardless
>> of how the output compares.
>
> There is one thing that the original paper, the original MSCP implementation (too bad the links to this point to ftp.cs.rice.edu, which no longer works; the web files were a great implementation resource), and my GCC implementation all did, which is LFTR (Linear Function Test Replacement). LFTR after OSR can help reduce register pressure, since it enables eliminating the IVs that no longer serve any useful purpose. I don't see any implementation of it in this code.
>
> --Dan

> Dan,
>
> LFTR (Linear Function Test Replacement) was mentioned in the original paper. I considered including LFTR with OSR, but decided to get OSR to trunk first and then add LFTR (-lftr) as a separate pass later.

I'm not sure why you'd add it as a separate pass; it is about 80-150 lines of code, and adding it as a separate pass requires you to do things like induction variable detection etc. all over again.

See http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01035.html: record_edge, apply_lftr_edge, follow_lftr_edge and perform_lftr.

> The LLVM development documentation suggests that new work be committed piecemeal over time.

Sure, but that doesn’t mean you should commit something that is going

> LLVM does have an optimization pass, -instcombine, which will delete unused induction variables.

LFTR does not delete unused IVs directly; it does reductions to transform the IV into something else.
instcombine could only do this if it knew the sequence of reductions we applied to strength-reduce the IV in the first place, or if it computed equivalence of IVs itself.
Neither of these is a cheap operation.

> I recommend that -instcombine be run after OSR.

In general, there is a careful balance between leaving the IR in a state that requires expensive cleanup, and doing some of that cleanup yourself where it’s cheap and easy.
If you were to simply fall on the extreme of running cleanup passes after every optimization, your compiler would be much slower.

> It is my understanding that LFTR would attempt to remove induction variables whose only use is to participate in an end-of-loop test condition.

Well, no: any IV whose only use is a comparison and a linear function of an existing IV. This is mostly end-of-loop test conditions, but you'd be surprised where else this pops up.

Logging or progress tracking, for example, where you do things like

if (i % 100 == 0) {
printf(".");
}

etc
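For the common end-of-loop case, here is a hypothetical source-level sketch (made-up function names; LFTR really rewrites the IR-level compare) of what the replacement buys you: the exit test on i is rewritten as an equivalent test on the strength-reduced pointer, after which i is dead and can be deleted.

```c
#include <assert.h>

/* Before LFTR: after strength reduction the pointer p does the real work,
 * but i survives solely to drive the i < n exit test. */
long sum_pre_lftr(const int *a, int n) {
    long s = 0;
    const int *p = a;
    for (int i = 0; i < n; i++, p++)   /* i is otherwise unused */
        s += *p;
    return s;
}

/* After LFTR: the test i < n is replaced by the equivalent linear-function
 * test p < a + n, and i can now be deleted outright. */
long sum_post_lftr(const int *a, int n) {
    long s = 0;
    for (const int *p = a; p < a + n; p++)
        s += *p;
    return s;
}
```

The second loop needs one fewer live register across the loop body, which is the register-pressure benefit described above.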

Anyway, it looks like the consensus so far is that you need to produce some compile-time and benchmark numbers showing OSR is worth it as its own pass, as opposed to replacing/augmenting the LSR implementation that exists now with it.

--Dan

> Sure, but that doesn't mean you should commit something that is going

Sigh, this got cut at the last second by bad shift selection on my part.

It said:

Sure, but that often means committing stuff as multiple patches right after each other more than committing a pass that has no clients/use for months.