fix for loop scale limiting in BFI

I've been trying to get rid of the loop scale limiting problem during
BFI. Initially, this was saturating frequencies to the max side of the
scale, so a double nested loop would get max frequencies in all the
blocks (e.g., llvm/test/CodeGen/X86/lsr-i386.ll). This made the inner
loop no hotter than the outer loop, so block placement would not
bother aligning them.

In convertFloatingToInteger() we are scaling the BFI frequencies so
they fit an integer. The function tries to choose a scaling factor and
warns about being careful so RA doesn't get confused. It chooses a
scaling factor of 1 / Min, which almost always turns out to be 1. This
was causing me grief in the double nested loop case because the inner
loop had a freq of about 6e20 while the outer blocks had a frequency
of 2e19. With a scaling factor of 1, we were saturating everything to
UINT64_MAX.

I changed it so it uses a scaling factor that puts the frequencies in
[1, UINT32_MAX], but only if the Max frequency is outside that range.
This is causing two failures in the testsuite, which seem to be caused
by RA spilling differently. I believe that in CodeGen/X86/lsr-i386.ll
we are hoisting into the wrong loop now, but I'm not sure.

The other failure, in CodeGen/Thumb2/v8_IT_5.ll, is different block
placement, which is a bit odd. The frequencies given by my changes are
certainly different, but the body of the loop is given a
disproportionately larger frequency than the others (much like in the
original case). Though, I think what's going on here is that my
changes are causing the smaller frequencies to be saturated down to 1:

Original:
float-to-int: min = 0.0000004768367035, max = 2047.994141, factor = 16777232.0

Printing analysis 'Block Frequency Analysis' for function 't':
  block-frequency-info: t
   - entry: float = 1.0, int = 16777232
   - if.then: float = 0.0000009536743164, int = 16
   - if.else: float = 0.9999990463, int = 16777216
   - if.then15: float = 0.0000009536734069, int = 16
   - if.else18: float = 0.9999980927, int = 16777200
   - if.then102: float = 0.0000009536734069, int = 16
   - cond.true10.i: float = 0.0000004768367035, int = 8
   - t.exit: float = 0.0000009536734069, int = 16
   - if.then115: float = 0.4999985695, int = 8388592
   - if.else145: float = 0.2499992847, int = 4194296
   - if.else163: float = 0.2499992847, int = 4194296
   - while.body172: float = 2047.994141, int = 34359672832

0001-Remove-4-096-loop-scale-limitation.patch (11 KB)

How about only removing the scaling limit when PGO is on? I don’t see the need for this change without PGO.

David

This is what my patch does, and it's getting into issues. With the
scaling limit gone, the frequencies propagated overflow 64bits, so
they need to be scaled down to a 64bit space.

To be on the safe side, my patch is mapping them down to a 32bit
space, but I am squishing them too much on the lower end. So regions
of the CFG that before had distinct temperatures are now showing up
with frequency == 1.

I need a better smoother for the mapping from the Scale64 floats down
to 64bit (or 32bit) integers.

Diego.

This all makes sense to me.

I think we should do at least one thing and maybe two things:

1) I wonder if there is a reason that we *need* to scale down to a 32-bit
max? Even if we increase the max though, I suspect we'll still see cases
where this doesn't work well. It's interesting that we hit 32-bit limits so
quickly when using synthetic loop estimates.

2) I think we should use a non-linear scale once the range exceeds [1,
MAX]. We were already doing some of this and I think we just need to do
more. I could see either a quadratic or exponential scaling technique
working well... Or one of the interpolation scaling techniques. I think
we'll have to experiment to identify the best one. In particular, I'm
imagining we'll want to play with both quadratic and exponential scales in
both directions: compressing the high frequencies and compressing the low
frequencies. And if neither is good, we may need an interpolation that
compresses both the high and low frequencies.

(Naturally, I'm interested in other ideas for how to handle this as well...)

errr "... and I think we just need to keep doing some or do it
differently." I'm bad at words today.

> > How about only removing the scaling limit when PGO is on? I don't see
> > the need for this change without PGO.

> This is what my patch does, and it's getting into issues. With the
> scaling limit gone, the frequencies propagated overflow 64bits, so
> they need to be scaled down to a 64bit space.

What I mean is: only do this when real profile data is used? I don't see
that the patch has that check.

> To be on the safe side, my patch is mapping them down to a 32bit
> space, but I am squishing them too much on the lower end. So regions
> of the CFG that before had distinct temperatures are now showing up
> with frequency == 1.

> I need a better smoother for the mapping from the Scale64 floats down
> to 64bit (or 32bit) integers.

This seems to show another weakness of the block frequency propagation --
it cannot handle infinite loops. We need to think about how to handle it.

David

I think we should just have a synthetic "max" and not balloon the range due
to it. Then we can pin the max at some fixed multiple of the actual max
when scaling it down.

> > > How about only removing the scaling limit when PGO is on? I don't see
> > > the need for this change without PGO.

> > This is what my patch does, and it's getting into issues. With the
> > scaling limit gone, the frequencies propagated overflow 64bits, so
> > they need to be scaled down to a 64bit space.

> What I mean is: only do this when real profile data is used? I don't see
> that the patch has that check.

Well, no. It's not really possible at the moment to determine whether
the branch weights you get are coming from real profiles or not.

It has no way of knowing and it would actually produce the same
results. This patch works inside the frequency propagator. We could
have a flag that tells the analysis about it, but there is no way of
doing it automatically.

> > To be on the safe side, my patch is mapping them down to a 32bit
> > space, but I am squishing them too much on the lower end. So regions
> > of the CFG that before had distinct temperatures are now showing up
> > with frequency == 1.

> > I need a better smoother for the mapping from the Scale64 floats down
> > to 64bit (or 32bit) integers.

> This seems to show another weakness of the block frequency propagation --
> it cannot handle infinite loops. We need to think about how to handle it.

It can, actually. On an infinite loop, the mass out of the loop is 0,
the mass in the back edge is infinite.

Diego.

> I've been trying to get rid of the loop scale limiting problem during
> BFI. Initially, this was saturating frequencies to the max side of the
> scale, so a double nested loop would get max frequencies in all the
> blocks (e.g., llvm/test/CodeGen/X86/lsr-i386.ll). This made the inner
> loop no hotter than the outer loop, so block placement would not
> bother aligning them.

> In convertFloatingToInteger() we are scaling the BFI frequencies so
> they fit an integer. The function tries to choose a scaling factor and
> warns about being careful so RA doesn't get confused. It chooses a
> scaling factor of 1 / Min, which almost always turns out to be 1.

This is not necessarily true. If there is any conditional code in an
acyclic region, the scale factor will be greater than 1.

> This was causing me grief in the double nested loop case because the
> inner loop had a freq of about 6e20 while the outer blocks had a
> frequency of 2e19. With a scaling factor of 1, we were saturating
> everything to UINT64_MAX.

> I changed it so it uses a scaling factor that puts the frequencies in
> [1, UINT32_MAX], but only if the Max frequency is outside that range.
> This is causing two failures in the testsuite, which seem to be caused
> by RA spilling differently. I believe that in CodeGen/X86/lsr-i386.ll
> we are hoisting into the wrong loop now, but I'm not sure.

> The other failure, in CodeGen/Thumb2/v8_IT_5.ll, is different block
> placement, which is a bit odd. The frequencies given by my changes are
> certainly different, but the body of the loop is given a
> disproportionately larger frequency than the others (much like in the
> original case). Though, I think what's going on here is that my
> changes are causing the smaller frequencies to be saturated down to 1:

All these problems are related to infinite loops.

David

>
>
>>
>> > How about only removing the scaling limit when PGO is on? I don't see
>> > the
>> > need for this change without PGO.
>>
>> This is what my patch does, and it's getting into issues. With the
>> scaling limit gone, the frequencies propagated overflow 64bits, so
>> they need to be scaled down to a 64bit space.
>
>
> What I mean is: only do this when real profile data is used? I don't see
> that the patch has that check.

> Well, no. It's not really possible at the moment to determine whether
> the branch weights you get are coming from real profiles or not.

> It has no way of knowing and it would actually produce the same
> results. This patch works inside the frequency propagator. We could
> have a flag that tells the analysis about it, but there is no way of
> doing it automatically.

We will be able to tell very soon once the MD_count is introduced to
represent entry count.

>
>>
>>
>> To be on the safe side, my patch is mapping them down to a 32bit
>> space, but I am squishing them too much on the lower end. So regions
>> of the CFG that before had distinct temperatures are now showing up
>> with frequency == 1.
>>
>> I need a better smoother for the mapping from the Scale64 floats down
>> to 64bit (or 32bit) integers.
>
>
>
> This seems to show another weakness of the block frequency propagation --
> it cannot handle infinite loops. We need to think about how to handle it.

> It can, actually. On an infinite loop, the mass out of the loop is 0,
> the mass in the back edge is infinite.

It can, but it will be utterly wrong with PGO. No infinite loop is
actually infinite; it will end somewhere, somehow. The real profile data
will capture the real count, but the propagated value will be so big (and
so wrong) that it will make the 'profile' look really skewed.

David

For PGO, there is a solution to it.

See small example:

#include <stdlib.h>

int g;

__attribute__((noinline)) void foo()
{
  g++;
  if (g == 1000)
    exit(0);
}

int main()
{
  while (1) {
    foo();
  }
}

The profile count for the loop is generated, but Clang simply cannot find a conditional branch instruction to attach the MD_prof data to. Simply put, this can be fixed by allowing MD_prof to be attached to the unconditional branch in the infinite loop (such that taken_weight/exit_weight represents the trip count). Block frequency propagation can then read the trip count for the infinite loop from the metadata.

We probably need a different bug tracking the issue.

David

Your change to `convertFloatingToInteger()` looks basically correct to
me. If we need to saturate at one end or the other, it makes more sense
to be saturating at the bottom.

I think you can pick a higher number than UINT32_MAX. I'd start with
something like 2^60 and lower it (or fix backend optimizations to
handle bigger numbers) from there if you hit problems.

I'm also not sure the logic leftover for shifting the scale up makes
sense after your change. Maybe something like this:

    constexpr unsigned MaxBits = 60;
    unsigned SpreadBits = (Max / Min).lg();
    Scaled64 ScalingFactor;
    if (SpreadBits <= MaxBits - 3)
      ScalingFactor = Min.inverse() << 3; // Make the minimum 8.
    else if (SpreadBits <= MaxBits)
      ScalingFactor = Min.inverse(); // Make the minimum 1.
    else
      ScalingFactor = Scaled64(1, MaxBits) / Max; // Saturate at 1.

Infinite loops are a corner case, so I think we should just handle them
specially. One way to do this is to change `computeLoopScale()` to
give an arbitrary finite scale to infinite loops:

    // Choose an arbitrary, finite loop scale for infinite loops.
    constexpr LoopScale Infinite(64, 0);

    // LoopScale == 1 / ExitMass
    // ExitMass == HeadMass - BackedgeMass
    BlockMass ExitMass = BlockMass::getFull() - Loop.BackedgeMass;

    // Block scale stores the inverse of the scale.
    Loop.Scale = ExitMass.isEmpty() ? Infinite
                                    : ExitMass.toScaled().inverse();

How much does the exact loop scale matter anyway for infinite loops?

Special casing infinite loops sounds reasonable for now. Maybe just stick to the current 4096 scale limit for them. Infinite loops do exist (e.g. in server code), but I don’t yet have data to show how important it is to handle them more precisely.

thanks,

David

My preference remains to handle infinite loops as a symbolic entity, and
then pick a value for it only when mapping to the fixed range, and then we
should pick an arbitrary scale higher than the next-highest non-infinite
frequency. I agree that 4096 seems as good of a choice as any other.

Thanks. Special casing infinite loops was the problem here. Saturating
everything else down to 1 in normal circumstances is not a big deal in
the cases I tried.

I've sent http://reviews.llvm.org/D8718 for review. The change should
be performance neutral, but I'm running a batch of SPEC and internal
tests to make sure.

Thanks. Diego.
