Disable loop unroll pass

Hi,

We have a target with hardware support for zero-overhead loops. Currently, we cannot detect them because the loop unroller unrolls them before they reach codegen. Looking at its implementation, it seems to decide whether unrolling is profitable based on certain parameters.

Given that building zero-cost loops is based on more or less the same constraints as the loop unroll pass, I wonder whether it is reasonable to add yet another target hook to prevent loop unrolling (something like hasZeroOverheadLooping or hasZeroCostLooping) for targets that support zero-cost looping.

Does Hexagon provide the same loop support? How have you addressed this?

Ivan

Yes, Hexagon has hardware support for loops: you set up the loop start address and the number of iterations, indicate where the loop ends, and the hardware executes the code repeatedly until the count goes down to zero.

I'm not aware of any specific changes that have been made to the bitcode-level loop unroller in connection with the hardware loop support. I just glanced over the unrolling code and I didn't see anything that would be aimed at helping the hardware-based loop generation.

We have a pass in our backend that converts "normal" loops into hardware-based loops, and it can handle loops that have been unrolled (but not completely). If a loop has been completely unrolled, it is straight-line code and it will not be "re-rolled".

-Krzysztof

I just wanted to add to Krzysztof's response. I'm not sure whether you're
referring to the case where a loop with a compile-time trip count is completely
unrolled, or to a loop with a run-time trip count, which would be partially
unrolled. For Hexagon, if we partially unroll a loop, we'd also like to use
our hardware loop instructions. That is, unrolling and hardware loops are
complementary.

Thanks,
-- Brendon

From: "Ivan Llopard" <ivanllopard@gmail.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, November 21, 2012 10:31:07 AM
Subject: [LLVMdev] Disable loop unroll pass

Hi,

We have a target with hardware support for zero-overhead loops.
Currently, we cannot detect them because the loop unroller unrolls
them before they reach codegen. Looking at its implementation, it
seems to decide whether unrolling is profitable based on certain
parameters.

Given that building zero-cost loops is based on more or less the same
constraints as the loop unroll pass, I wonder whether it is reasonable
to add yet another target hook to prevent loop unrolling (something
like hasZeroOverheadLooping or hasZeroCostLooping) for targets that
support zero-cost looping.

Ivan,

Please feel free to extend the ScalarTargetTransformInfo interface (in include/llvm/TargetTransformInfo.h) to provide target-customizable parameters to the loop unroller. This is on my TODO list, but if you'd like to work on this, that would be great.
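
Something along these lines might be a reasonable shape for it (a rough sketch only; the struct and hook names here are hypothetical, nothing like this exists in the tree yet):

    // Rough sketch (hypothetical names): the kind of addition to
    // ScalarTargetTransformInfo that would let a target tune the loop unroller
    // instead of switching it off entirely.
    struct UnrollingPreferences {
      unsigned Threshold = 150; // max estimated size of the unrolled loop body
      unsigned Count = 0;       // preferred unroll count; 0 = keep the pass's choice
      bool AllowPartial = true; // permit partial/runtime unrolling
    };

    class ScalarTargetTransformInfoSketch {
    public:
      virtual ~ScalarTargetTransformInfoSketch() = default;
      // A target with zero-overhead loops could shrink Threshold or force
      // Count = 1 here rather than disabling unrolling outright.
      virtual void getUnrollingPreferences(UnrollingPreferences &) const {}
    };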

Are there any cases in which loop unrolling is beneficial on your target?

-Hal

Hi Brendon, Krzysztof,

Thanks for your responses.

I just wanted to add to Krzysztof's response. I'm not sure whether you're
referring to the case where a loop with a compile-time trip count is completely
unrolled, or to a loop with a run-time trip count, which would be partially
unrolled. For Hexagon, if we partially unroll a loop, we'd also like to use
our hardware loop instructions. That is, unrolling and hardware loops are
complementary.

I'm not sure they are completely complementary. What about static trip counts? It is always profitable to use hardware loops for them. My question is: are the estimates that trigger loop unrolling realistic in the presence of zero-cost loops?

Given that loops get unrolled when the user is not optimizing for code size but wants faster code instead, hardware loops satisfy both goals, size and speed. In addition, they are always discovered for static trip counts.

Thanks,
-- Brendon


From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu] On
Behalf Of Krzysztof Parzyszek
Sent: Wednesday, November 21, 2012 10:29 AM
To: llvmdev@cs.uiuc.edu
Subject: Re: [LLVMdev] Disable loop unroll pass

Does Hexagon provide the same loop support? How have you addressed this?

Yes, Hexagon has hardware support for loops: you set up the loop start
address and the number of iterations, indicate where the loop ends, and the
hardware executes the code repeatedly until the count goes down to zero.

Same here!

I'm not aware of any specific changes that have been made to the
bitcode-level loop unroller in connection with the hardware loop support. I
just glanced over the unrolling code and I didn't see anything that would be
aimed at helping the hardware-based loop generation.

We have a pass in our backend that converts "normal" loops into
hardware-based loops, and it can handle loops that have been unrolled (but
not completely). If a loop has been completely unrolled, it is
straight-line code and it will not be "re-rolled".

Thanks for this detailed information.

Ivan

Hi Hal,

From: "Ivan Llopard" <ivanllopard@gmail.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, November 21, 2012 10:31:07 AM
Subject: [LLVMdev] Disable loop unroll pass

Hi,

We have a target with hardware support for zero-overhead loops.
Currently, we cannot detect them because the loop unroller unrolls
them before they reach codegen. Looking at its implementation, it
seems to decide whether unrolling is profitable based on certain
parameters.

Given that building zero-cost loops is based on more or less the same
constraints as the loop unroll pass, I wonder whether it is reasonable
to add yet another target hook to prevent loop unrolling (something
like hasZeroOverheadLooping or hasZeroCostLooping) for targets that
support zero-cost looping.

Ivan,

Please feel free to extend the ScalarTargetTransformInfo interface (in include/llvm/TargetTransformInfo.h) to provide target-customizable parameters to the loop unroller. This is on my TODO list, but if you'd like to work on this, that would be great.

Sure! I'll propose a patch ASAP.

Are there any cases in which loop unrolling is beneficial on your target?

I'd say it's always beneficial to emit hardware loops whenever possible, for either static or dynamic trip counts, whether we are optimizing for smaller or faster code.

Ivan

Hi, Ivan:

     My $0.02: having hasZeroCostLooping() disable unrolling does not seem
appropriate for other architectures, at least the one I worked on before.

    You mentioned:
>Currently, we cannot detect them because the loop unroller unrolls them
>before they reach codegen. Looking at its implementation, it seems to decide
>whether unrolling is profitable based on certain parameters.

   Could you please articulate why CG fails to recognize it?
  I remember that in gcc, recognizing hw loops is done in an RTL pass, and in
Open64, one student(?) added some stuff in Scalar Opt, instead of CodeGen, just
for HW loops. I recall there is only one reason that sounds valid -- preventing
the loop from becoming too big to fit in the HW constraints.

    The cost implied by hasZeroCostLoop() highly depends on the underlying architecture;
therefore the higher level opts don't know how to utilize this interface for cost modeling.
Maybe we can add a pretty vague interface, say
    hw-please-advice-unrolling-factor(the loop, current-unrolling-factor),
to encapsulate whatever reasons the arch might have to curtail aggressive unrolling?

    I'm an LLVM newbie, so don't take my words seriously.

Have a happy holiday!

Shuxin

Even if hardware loops are zero-cost, it doesn't mean we should never
unroll loops. It just means loop unrolling should be less aggressive,
because avoiding loop iterations isn't itself a performance benefit.
Our current loop unrolling pass doesn't have appropriate heuristics to
detect that, but it's worth keeping in mind while planning the
appropriate interface.

-Eli

Hi Shuxin, Eli,

Hi, Ivan:

    My $0.02: having hasZeroCostLooping() disable unrolling does not seem
appropriate for other architectures, at least the one I worked on before.

I appreciate your feedback. Could you give an example where building a hw loop is not appropriate for your target?

   You mentioned:
>Currently, we cannot detect them because the loop unroller unrolls them
>before they reach codegen. Looking at its implementation, it seems to decide
>whether unrolling is profitable based on certain parameters.

  Could you please articulate why CG fails to recognize it?

Well, just because the loop unrolling pass runs before the CG is called.

I remember that in gcc, recognizing hw loops is done in an RTL pass, and in
Open64, one student(?) added some stuff in Scalar Opt, instead of CodeGen, just
for HW loops. I recall there is only one reason that sounds valid -- preventing
the loop from becoming too big to fit in the HW constraints.

It sounds very similar to our implementation. We've implemented the hw loop builder at the IR level, just before isel, with new intrinsics that provide hw loop semantics. While the intrinsics may look a bit tricky and additional isel code is needed to recognize them, the builder benefits from the existing scalar evolution functionality to detect trip counts. Therefore, it's based on the same information as the loop unroller but, for architectural reasons, we have stronger constraints: e.g. we cannot build hw loops from loops with multiple exits.

The loop topology is important and our hw loop builder depends on it. I agree that hasZeroCostLoop may seem too restrictive.
What about something like hasZeroCostLoopTopology(Loop *L, unsigned TripCount) to complement the first one?
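
As a rough sketch of how such hooks could be queried (hypothetical names and glue code only; ScalarEvolution is used the same way our builder already detects trip counts):

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/Analysis/ScalarEvolution.h"
    using namespace llvm;

    // Sketch of the two hooks discussed above (hypothetical, not existing API).
    struct HWLoopTargetInfo {
      virtual ~HWLoopTargetInfo() = default;
      // Coarse capability query: does the target have zero-overhead loops at all?
      virtual bool hasZeroCostLooping() const { return false; }
      // Finer query: can this particular loop, with this trip count, become one?
      // (e.g. reject multiple-exit loops or oversized bodies)
      virtual bool hasZeroCostLoopTopology(Loop *, unsigned) const { return false; }
    };

    // How the unroller (or a hw loop builder) might consult the hooks; SE is the
    // pass's ScalarEvolution result.
    static bool isHWLoopCandidate(Loop *L, ScalarEvolution &SE,
                                  const HWLoopTargetInfo &TLI) {
      if (!TLI.hasZeroCostLooping())
        return false;
      BasicBlock *Exiting = L->getExitingBlock(); // null if the loop has multiple exits
      if (!Exiting)
        return false;
      unsigned TripCount = SE.getSmallConstantTripCount(L, Exiting);
      return TLI.hasZeroCostLoopTopology(L, TripCount);
    }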

   The cost implied by hasZeroCostLoop() highly depends on the underlying architecture;
therefore the higher level opts don't know how to utilize this interface for cost modeling.
Maybe we can add a pretty vague interface, say
   hw-please-advice-unrolling-factor(the loop, current-unrolling-factor),
to encapsulate whatever reasons the arch might have to curtail aggressive unrolling?

There are already some internal parameters in the loop unroller to drive the heuristics. We use -unroll-count to skip unrolling.
But someone may want to enable unrolling even if the target says otherwise. IMHO, each target could provide internal flags to disable hw loop building and let the unroller work "normally".
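
For example, something as simple as this would do (the flag name is made up for illustration):

    #include "llvm/Support/CommandLine.h"
    using namespace llvm;

    // Hypothetical backend-local flag: lets a user turn hardware-loop generation
    // off so the unroller's normal heuristics apply again.
    static cl::opt<bool> DisableHWLoops(
        "mytarget-disable-hwloops", cl::Hidden, cl::init(false),
        cl::desc("Disable hardware loop generation for this target"));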

Ivan

I am the designer of the Open64 hwloop structure, but I am not a student.

Hope the following helps:

To transform a loop into a hwloop, we need help from the optimizer. For example, we transform

   while (k3 >= 10) {
     sum += k1;
     k3--;
   }

into the form:

   zdl_loop(k3 - 9) {
      sum += k1;
   }

So, we introduce a new ZDLBR whirl (Open64 optimizer intermediate representation) operator, which represents the loop in whirl as:
LABEL L2050 0 {line: 0}
LOOP_INFO 0 1 1
   I4I4LDID 73 <1,2,.preg_I4> T<4,.predef_I4,4> # k3
   I4I4LDID 77 <1,2,.preg_I4> T<4,.predef_I4,4> # <preg> 
 END_LOOP_INFO
   I4I4LDID 74 <1,2,.preg_I4> T<4,.predef_I4,4> # k1
   I4I4LDID 75 <1,2,.preg_I4> T<4,.predef_I4,4> # sum
  I4ADD
 I4STID 75 <1,2,.preg_I4> T<4,.predef_I4,4> # sum {line: 5}
 ZDLBR L2050 {line: 0}

Then we let cg do the rest. Such a design keeps the general operations abstract in the optimizer, while the target-specific part stays in cg as a simulated op until cg loop optimization has finished. We implemented multi-level nested hwloops with this approach. GCC's doloop expansion does the same, we believe.

For more details, please take a look at

http://wiki.open64.net/index.php/Zero_Delay_Loop

Thanks
Gang

On 2012-11-22, at 19:00, Ivan Llopard <ivanllopard@gmail.com> wrote:

I appreciate your feed-back. Could you give an example where building a hw loop is not appropriate for your target?

In my case, unrolling and hw loops are orthogonal. So long as a loop is countable and its size doesn't exceed
some threshold, it can be converted into a hw loop. So unrolling the loop is still desirable.

One benefit of unrolling is to expose inter-iteration redundancies, which the downstream redundancy
elimination can *easily* take care of. We would otherwise have to resort to inter-iteration
redundancy eliminators to remove such redundancies the *HARD* way, which is sometimes impossible.

As far as I know, LLVM doesn't have inter-iteration redundancy elimination (like predictive commoning),
and it doesn't have scalar replacement to promote subscripted variables to registers (i.e. the 2nd
load in a[i] ... a[i-n] is a load from a register; GVN can promote a[i-n] to a register only if n==1).
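
A small example of the kind of loop I mean (just a sketch, names made up):

    // Each a[i-1] below is the a[i] loaded one iteration earlier; keeping it in
    // a register across iterations needs a scalar-replacement style transform.
    int sum_adjacent(const int *a, int n) {
      int s = 0;
      for (int i = 1; i < n; ++i)
        s += a[i] + a[i - 1];
      return s;
    }

    // After unrolling by 2 the reuse of a[i] is within a single iteration, so an
    // ordinary redundancy eliminator (e.g. GVN) can keep it in a register.
    int sum_adjacent_unrolled_by_2(const int *a, int n) {
      int s = 0;
      int i = 1;
      for (; i + 1 < n; i += 2) {
        int t = a[i];
        s += t + a[i - 1];
        s += a[i + 1] + t; // a[i] reused directly instead of being reloaded
      }
      for (; i < n; ++i)   // remainder iteration
        s += a[i] + a[i - 1];
      return s;
    }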

Thanks
Shuxin

Hi, Gang:

     I remember there were dissenting voices when you checked in the code.
I agree with them, although I didn't reply to your mail on Open64's mailing list.

   The transformation you illustrate involves two operations:
   1) promoting the WHILE-loop into a DO-loop (i.e. a non-countable loop into a countable one);
   2) getting rid of the trip-count dec/inc and compare.

  1) is irrelevant to HW loops; any scalar optimizer should handle it.
It is not difficult at all to handle 2) in CodeGen, and it is unnecessary
to introduce an operator just for that purpose.

Shuxin

Hi Shuxin,

Promoting a while-loop to a do-loop is the job of loop induction recognition, not of this transformation. We do the scalar transform for hwloops in the optimizer because it is troublesome to discriminate the trip-counting code from the real production code and do the elimination in cg; we would have to write customized code to handle this general task in every target. So we take help from the optimizer's DCE and keep the trip-count code hidden in the emitted whirl, which greatly simplifies the design, especially the interaction with cg unrolling. You can see it in the code: we added validity-check functionality, yet the code shrank and became more stable.

Gang

On 2012-11-23, at 3:17, Shuxin Yang <shuxin.llvm@gmail.com> wrote:

Hi, Gang:

   I don't want to discuss Open64 internals on the LLVM mailing list. Let us focus only on the design per se.
This mail and your previous mail combined give me the impression that:

    the only reason you introduced a specific operator for HW loops in Scalar Opt is that
you have a hard time figuring out the trip count in CodeGen.

    This might be true for Open64's CodeGen (I don't want to discuss this issue on this mailing list), but
in general it is not true for other compilers.

    I'm dubious about "it greatly simplifies the design". The downstream passes need to be fully aware
of this new operator, which doesn't make things any simpler.

Thanks
Shuxin

Hi Shuxin,

Hi, Gang:

  I don't want to discuss Open64 internals on the LLVM mailing list. Let us focus only on the design per se.
This mail and your previous mail combined give me the impression that:

   the only reason you introduced a specific operator for HW loops in Scalar Opt is that
you have a hard time figuring out the trip count in CodeGen.

   This might be true for Open64's CodeGen (I don't want to discuss this issue on this mailing list), but
in general it is not true for other compilers.

In LLVM, both the IR and the machine code (up to the reg-alloc pipeline) are in SSA form, so trip-count detection is almost the same on both sides.

   I'm dubious about "it greatly simplifies the design". The downstream passes need to be fully aware
of this new operator, which doesn't make things any simpler.

I think we are going off-topic. I'm not proposing new LLVM instructions to introduce hw loop semantics. What I'd like to do is find a good interface between the loop unroller and the targets to prevent loop unrolling.
hasZeroCostLoop() and hasLoopZeroCostTopology(Loop *L, unsigned TripCount) have been proposed so far.

Ivan

Hi, Ivan:

   Sorry for deviating from the topic a bit. As I told you before, I'm an LLVM newbie, so I cannot
give you a conclusive answer as to whether the proposed interface is OK or not.

   My personal opinion on these two interfaces is summarized below:

- hasZeroCostLoop()
    pro: it clearly states the HW support.
    con: Having zero-cost loops doesn't imply the benefit a HW loop could achieve.
            It is not clear whether HW loops conflict with unrolling etc., so the optimizer
            has no idea how to use this interface. If you just call this interface to disable
            unrolling, that would be overkill on some archs that have HW support, since
            on such archs HW loops and unrolling are orthogonal.

  - hasLoopZeroCostTopology(Loop *L, unsigned TripCount)
       Why the trip count? It can be derived from the loop itself.
       Which optimizer will call this interface?

  I would suggest the following interface:

    /// Get the unrolling factor of the given *INNERMOST* loop from the HW's perspective.
    /// Note: this interface is for innermost loops only. Getting the factor
    /// for unroll-and-jam should go through another interface.
    virtual int getHwUnrollFactor(Loop *, unsigned CurUnrollingFactor) {
        // By default, no objection to the proposed unrolling factor.
        return CurUnrollingFactor;
    }

   I think this interface would completely shield the "zero-cost loop" notion from the higher-level
optimizer, and you can certainly achieve whatever you want in your virtual-function
implementation.
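
The caller side could then be as simple as the following (hypothetical glue, not existing unroller code):

    #include "llvm/Analysis/LoopInfo.h"

    // Minimal stand-in for the interface proposed above.
    struct TargetLoopInfoSketch {
      virtual ~TargetLoopInfoSketch() = default;
      virtual int getHwUnrollFactor(llvm::Loop *, unsigned CurUnrollingFactor) {
        return CurUnrollingFactor; // default: no objection to the proposed factor
      }
    };

    // The unroller computes its usual count, then lets the target veto or adjust
    // it. A target that wants the loop kept intact for a HW loop returns 1; one
    // where HW loops and unrolling are orthogonal just returns the input.
    static unsigned decideUnrollCount(llvm::Loop *L, unsigned HeuristicCount,
                                      TargetLoopInfoSketch &TLI) {
      int Factor = TLI.getHwUnrollFactor(L, HeuristicCount);
      return Factor > 0 ? static_cast<unsigned>(Factor) : 1;
    }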

    How does this sound to you? Eli?

Thanks
Shuxin

I overlooked the following comment in your previous mail. Please ignore my previous mail.

>There are already some internal parameters in the loop unroller to drive the heuristics. We use -unroll-count to skip unrolling.
>But someone may want to enable unrolling even if the target says otherwise. IMHO, each target could provide internal flags to disable hw