Register pressure mechanism in PRE or Smarter rematerialization/split/spiller/coalescing ?

I thought about a little bit more, I think adding Register pressure control in your patch or PRE may be the only choice.

Because at least for this case I am looking at, what your patch did is created more relatively complex long live range, rematerialization is not smart enough to undo your change or at least without a lot of work, coalescing only create even longer live range not shorter, Spiller can't help since it's the Spiller created Spill/Reloads due to high register pressure, Splitting can shorten the live ranges, but I don't think it can handle your case without a lot of work.

Suggestions are welcome.

1. As I mentioned, it simply fixes a bug in implementation of one of
the two PRE's LLVM has. It does not change the PRE algorithm or add
anything to it. The code had a bug. I fixed the bug :P. PRE is
*not even adding more code in this case*. The code is already there.
  All it is doing is inserting a phi node. If you transformed your
code to use memory, and reverted my patch, you'd get the same result,
because Load PRE will do the same thing. It's what PRE does.

2. GCC and other compilers have PRE's literally the same thing my
patch does (you are welcome to verify, i wrote GCC's :P), and
apparently are smart enough to handle this in RA. So i'm going to
suggest that it is, in fact, possible to do so, and i'm going to
further suggest that if we want to match their performance, we need to
be able to do the same. You can't simply "turn down" any optimization
that RA may have to deal with. It usually doesn't work in practice.
This is one of the reasons good RA is so hard.

3. As I also mentioned, register pressure heuristics in PRE simply do
not work. They've been tried. By many. With little to no success.

PRE is too high in the stack of optimizations to estimate register
pressure in any sane fashion. It's pretty much a fools errand. You
can never tune it to do what you want. *Many* have tried.

Your base level problem here is that all modern PRE algorithms (except
for min-cut PRE, as I mentioned), are based on a notion of lifetime
optimality. That is, they extend lifetimes as minimally as possible to
still eliminate a given redundancy. Ours does the same.

However, this is not an entirely useful metric. Optimizing for some
other metric is what something like min-cut PRE lets you do.
But even then, register pressure heuristics are almost certainly not
the answer.

4. This was actually already discussed when the patch was submitted,
and the consensus was "we should just fix RA". Feel free to look at
the discussion 5 months ago.

I would suggest, if you want to fix this, you take the approach that
was discussed then.

I thought about a little bit more, I think adding Register pressure control in your patch or PRE may be the only choice.

Because at least for this case I am looking at, what your patch did is created more relatively complex long live range, rematerialization is not smart enough to undo your change or at least without a lot of work, coalescing only create even longer live range not shorter, Spiller can't help since it's the Spiller created Spill/Reloads due to high register pressure, Splitting can shorten the live ranges, but I don't think it can handle your case without a lot of work.

1. As I mentioned, it simply fixes a bug in implementation of one of
the two PRE's LLVM has. It does not change the PRE algorithm or add
anything to it. The code had a bug. I fixed the bug :P. PRE is
*not even adding more code in this case*. The code is already there.
   All it is doing is inserting a phi node. If you transformed your
code to use memory, and reverted my patch, you'd get the same result,
because Load PRE will do the same thing. It's what PRE does.

2. GCC and other compilers have PRE's literally the same thing my
patch does (you are welcome to verify, i wrote GCC's :P), and
apparently are smart enough to handle this in RA. So i'm going to
suggest that it is, in fact, possible to do so, and i'm going to
further suggest that if we want to match their performance, we need to
be able to do the same. You can't simply "turn down" any optimization
that RA may have to deal with. It usually doesn't work in practice.
This is one of the reasons good RA is so hard.

3. As I also mentioned, register pressure heuristics in PRE simply do
not work. They've been tried. By many. With little to no success.

PRE is too high in the stack of optimizations to estimate register
pressure in any sane fashion. It's pretty much a fools errand. You
can never tune it to do what you want. *Many* have tried.

Your base level problem here is that all modern PRE algorithms (except
for min-cut PRE, as I mentioned), are based on a notion of lifetime
optimality. That is, they extend lifetimes as minimally as possible to
still eliminate a given redundancy. Ours does the same.

However, this is not an entirely useful metric. Optimizing for some
other metric is what something like min-cut PRE lets you do.
But even then, register pressure heuristics are almost certainly not
the answer.

4. This was actually already discussed when the patch was submitted,
and the consensus was "we should just fix RA". Feel free to look at
the discussion 5 months ago.

+1. If you can show a common pattern created by PRE which our regalloc doesn't handle well, please file a bug.

Hi, Daniel:

Thanks a lot for detailed background information, we are willing to provide the right fix, however it will take time, do you mind if you forward me the discussion you had 5 months ago? I may not be able to access it since I only joined ellvmdev list this week.

I did some performance measurement last night, some of our critical benchmark degraded up to 30% with your patch, so we have to turn it off for now at least.

I posted patch to add a debug option (off by default), so we could turn it off with that option, could you please review it and commit it for me if possible? I don't have commit right yet, will ask soon.
http://reviews.llvm.org/D11234

Thanks again.

Lawrence Hu

IMHO, This doesn't make a lot of sense to turn off this part on it's own.
I would just use the enable-pre flag to turn off scalar PRE, as it
will cause the same issue in other cases as well.
Is there some reason you aren't just doing that?
I suspect if this is a performance win, that would be as well.

Also note that you will have the same problem as GVN/EarlyCSE/etc
becomes smarter, as these are full redundancies being eliminated (IE
there is no insertion happening). It just happens that PRE notices
them and GVN doesn't, because GVN is dominator based and PRE is not.
A slightly smarter GVN/EarlyCSE would do the same thing.

Given what you are saying, you are also suggesting we are not
rematerializing addressing computations where it is cheaper to do so.
That seems pretty critical to good RA :stuck_out_tongue:

Ugh, actually, it should be a win with the following change:

diff --git a/lib/Transforms/Scalar/GVN.cpp b/lib/Transforms/Scalar/GVN.cpp
index 2c47a8a..a3387e3 100644
--- a/lib/Transforms/Scalar/GVN.cpp
+++ b/lib/Transforms/Scalar/GVN.cpp
@@ -1767,7 +1767,7 @@ bool GVN::processNonLocalLoad(LoadInst *LI) {
   }

   // Step 4: Eliminate partial redundancy.
- if (!EnablePRE || !EnableLoadPRE)
+ if (!EnableLoadPRE)
     return false;

   return PerformLoadPRE(LI, ValuesPerBlock, UnavailableBlocks);

This will disable Scalar PRE without disabling load PRE.

(note, again, however, that load PRE can create exactly the same GEP
situation you are referring to)

Given what you are saying, you are also suggesting we are not
rematerializing addressing computations where it is cheaper to do so.
That seems pretty critical to good RA :stuck_out_tongue:

Yep, about 5 months ago I had a conversation about this too… it may even be the one you’re referencing. Our remat is really conservative - it only rematerializes values that have zero input operands (move immediate only, essentially).

James

I thought we could remat explicitly invariant loads as well?

Hi, Daniel:

Thanks, I tried that patch you provided, it is better than just disabling your previous patch, it has more improvements than degradations.

Do you want to post that patch or you want me to do that?

Regards

Lawrence Hu

From: "Philip Reames" <listmail@philipreames.com>
To: "James Molloy" <james@jamesmolloy.co.uk>, "Daniel Berlin"
<dberlin@dberlin.org>, "Lawrence" <lawrence@codeaurora.org>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Wednesday, July 15, 2015 4:43:51 PM
Subject: Re: [LLVMdev] Register pressure mechanism in PRE or Smarter
rematerialization/split/spiller/coalescing ?

> > Given what you are saying, you are also suggesting we are not

> rematerializing addressing computations where it is cheaper to do
> so.

> That seems pretty critical to good RA :stuck_out_tongue:

> Yep, about 5 months ago I had a conversation about this too... it
> may
> even be the one you're referencing. Our remat is really
> conservative
> - it only rematerializes values that have zero input operands (move
> immediate only, essentially).

I thought we could remat explicitly invariant loads as well?

We can. See TargetInstrInfo::isReallyTriviallyReMaterializableGeneric in lib/CodeGen/TargetInstrInfo.cpp -- What we can't do, for example, it rematerialize small sequences of instructions, such as those that form constants out of smaller immediates.

-Hal

Looks like everyone think we should improve rematerialization, I will keep an eye on it, possible provide a patch to expand it in the future whenever our release features are out of door.

Regards

Lawrence Hu

Hi, Daniel:

Something interesting, even though your patch performed better for our precheckin perf run, it doesn't help the test I was looking at, for my case, revert your patch give the best results.

I will revert internally for now, and looking for better solution in the future.

Regards

Lawrence Hu

That should be literally impossible, which makes me think something
was tested wrong

The second patch i posted disables scalar pre (assuming you use
-disable-pre) but not load pre.

Since the patch you reverted touched only scalar pre, disabling scalar
pre should *also* do the same thing.

That should be literally impossible, which makes me think something
was tested wrong

The second patch i posted disables scalar pre (assuming you use
-disable-pre) but not load pre.

Since the patch you reverted touched only scalar pre, disabling scalar
pre should *also* do the same thing.

Unless that benchmark is relying on *some* scalar PRE happening, but not "too much". i.e. it could be a really really fragile benchmark which seems to fit the description.

Sorry, yes, let me rephrase: It should not be having performance loss
due to the same root cause.
The failure mode described was GEP live ranges being extended due to
new phis that PRE was inserting.
That should not be happening, because it can't be doing that anymore :slight_smile:

It would be helpful if you can file a bug with a test case.

David

I have confirmed I didn't make any mistake, opened Bug 24171 - [AArch64] 403050abcc091260be2e8f58484e7a39c0782b47 increased Spill/Reload/ by 5 times (edit) for it.

Regards

Lawrence Hu

+1

I have a patch I’m currently working on to make rematerialization better - a testcase for this would be very valuable.

Cheers,

James