[RFC] Register Rematerialization (remat) Extension

Hello Developers,

I am working with some of my batchmates to improve register rematerialization (remat) in LLVM.
We want to remat live ranges made of multiple instructions.

To motivate our proposal, here is a simple example that the current remat implementation does not cover:

$ cat ~/tmp/tl.c
void foo(long);
void bar() {
  for (int i = 0; i < 1600; ++i)
    foo(3494348345984503943);
}

$ clang -O3 -S -o - ~/tmp/tl.c -target powerpc64

...
# BB#0: # %entry
...
        lis 3, 12414
        ori 3, 3, 27470
        sldi 3, 3, 32
        oris 3, 3, 35809
        ori 30, 3, 20615
...
.LBB0_1: # %for.body
        mr 3, 30
        bl foo
...

There is a sequence of instructions used to materialize the constant: the first
one (the lis) is trivially rematerializable, and the others depend only on that one
and have no side effects. If we otherwise needed to spill the constant, we might
wish to move the entire set of instructions that compute the value into the loop body.
(Many thanks to Hal Finkel for this example and head start.)
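
As a quick cross-check (purely illustrative, not part of the proposal), the constant
can be reassembled from the immediates in the sequence above:

    // Standalone check: the five instructions build
    // 3494348345984503943 (0x307E6B4E8BE15087) piece by piece.
    #include <cassert>
    #include <cstdint>

    int main() {
      uint64_t r3 = static_cast<uint64_t>(12414) << 16;  // lis  3, 12414
      r3 |= 27470;                                        // ori  3, 3, 27470
      r3 <<= 32;                                          // sldi 3, 3, 32
      r3 |= static_cast<uint64_t>(35809) << 16;           // oris 3, 3, 35809
      uint64_t r30 = r3 | 20615;                          // ori 30, 3, 20615
      assert(r30 == 3494348345984503943ULL);
      return 0;
    }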

We are following a very old but effective paper, "Rematerialization":
http://dl.acm.org/citation.cfm?id=143143 [1]

This extension should especially improve code quality for RISC backends such as
PowerPC, MIPS, ARM, AArch64, etc.

Here is a tentative approach (based on the above-mentioned paper and the current remat code) that I would like to follow.

Please share your views, because this may be totally the wrong direction. Also, I will
be happy if this gets into mainline LLVM, but if the community does not want
to make remat heavier, then please guide me from the perspective of my class project.

1 ) LLVM MI is already in SSA form before register allocation, so I think we do not need to build an SSA graph and convert it back after the optimization completes, as described in [1].

2 ) We would like to add a pass similar to SCCP.cpp (Sparse Conditional Constant
Propagation, based on Wegman and Zadeck's work, http://dl.acm.org/citation.cfm?id=103136) as described in [1]. This pass will be scheduled to run before register allocation.

3 ) The output of the pass added in step 2 will be a map from each def to the instruction pointers that can be used to remat the given live range. The map will contain live ranges defined by a single instruction as well as by multiple instructions (see the sketch after this list).

4 ) The remat APIs defined in LiveRangeEdit.cpp will use the analysis from this map
when a spill is required during register allocation.

5 ) The remat transformation APIs such as rematerializeAt() will be taught to remat
live ranges consisting of multiple instructions too.

6 ) A cost analysis will be required to decide between remat and spill. This should be based on at least two factors: register pressure and spill cost.
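
Purely as an illustration of step 3, here is a rough sketch of the kind of map we
have in mind; the type names are simplified stand-ins, not existing LLVM classes:

    // Hypothetical sketch of the analysis result from step 3: for each
    // rematerializable virtual register, record the set of instructions
    // whose re-execution recomputes its value.
    #include <map>
    #include <vector>

    struct MachineInstrStub;    // stand-in for a machine instruction
    using Register = unsigned;  // stand-in for a virtual register number

    struct RematInfo {
      // Instructions, in program order, that together materialize the value;
      // a single entry corresponds to the trivial case remat handles today.
      std::vector<MachineInstrStub *> DefSequence;
      // Estimated cost of re-executing the sequence, to be weighed against
      // the spill/reload cost in step 6.
      unsigned RematCost = 0;
    };

    // Built before register allocation, queried from LiveRangeEdit
    // (steps 4 and 5) when a spill candidate is considered.
    using RematMap = std::map<Register, RematInfo>;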

A few points:

LiveIntervals maintains a quasi-SSA form via VNInfo. It does not allow efficient def-use queries, but use-def is there, which is all that you should need.

It would be great to have better remat during regalloc, but please try to avoid building additional state that needs to be maintained.

-Andy


OT: Instead of viewing this as a *single* remat problem, can you view this
as a series of remat problems? If each instruction can be moved while only
increasing the live range of a single register and shrinking the live range
of another, then moving each instruction one by one and iterating gives the
same effect. Note that this only (obviously) works for single use defs.
Extending it to multiple uses would require a more complicated cost model
as you point out.
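
A very rough sketch of that chain-walking step, with hypothetical stand-in types
rather than real LLVM APIs:

    // Starting from the instruction that defines the value we would otherwise
    // spill, walk up its single-use, side-effect-free dependencies; each such
    // instruction can be sunk to the use one at a time, trading one live
    // range for another.
    #include <vector>

    struct InstrStub {
      bool SideEffectFree = false;         // safe to re-execute at the use
      bool HasSingleUse = false;           // its def is used exactly once
      InstrStub *OnlyDependency = nullptr; // sole instruction it depends on
    };

    std::vector<InstrStub *> collectSinkableChain(InstrStub *Def) {
      std::vector<InstrStub *> Chain;
      for (InstrStub *I = Def; I && I->SideEffectFree; I = I->OnlyDependency) {
        Chain.push_back(I);
        // Stop once we reach a value with multiple uses; going further needs
        // the more complicated cost model mentioned above.
        if (I->OnlyDependency && !I->OnlyDependency->HasSingleUse)
          break;
      }
      return Chain;
    }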

Hello Philip,

Your idea seems effective and can be tried out with a few changes. But
correct me if I got this wrong: by multiple uses do you mean the same remat
sequence being rematerialized again? In that case we can run the analysis
again to decide which instructions need to be rematerialized, or we could
add a simple caching layer which holds the analysis result for one
compilation unit.
Any thoughts?


This sounds overly complex. Can you implement this without needing the
new side structure? Maintaining extra state and keeping it up to date is
expensive. (From a maintenance and code complexity perspective.)

Currently I don't think we have any facility to annotate instructions, so
to keep the analysis available until register allocation we would need an
immutable pass.


I also only see a narrow and specific remat cost problem in the example: is it cheaper to issue a chain of instructions rather than a fill? And for this, use-def is sufficient.

You proposed a fairly complex scheme. The question then is always: is it worth it? To answer that you would need to investigate and break down the current remat problems (spills but should remat, remats but should spill, should remat at a different location, etc.), e.g. for the LLVM test-suite, and show that your algorithm solves the most important ones.

Hi,

I’ve been looking at this myself for ARM, and came up with a much simpler solution: lower immediate materializations to a post-RA pseudo and expand the chain of materialization instructions after register allocation / remat. Remat only sees one instruction with no dependencies.

Did you look down this route and discount it?

Cheers,

James

The idea seems sound, but do you really have a CPU in which such a complex rematerialization is better than an L1 cache load from the stack frame?

lis 3, 12414
ori 3, 3, 27470
sldi 3, 3, 32
oris 3, 3, 35809
ori 30, 3, 20615

I'm not familiar with modern PPC64, but it seems like a loss on the PPC G5, and, from the docs I quickly found (2-cycle latency on dependent integer ALU ops), on POWER8 too.

OK, maybe (if I didn’t screw up the rldimi):

lis 3, 12414
lis 30, 35809
ori 3, 3, 27470
ori 30, 30, 20615

rldimi 30, 3, 32, 0

Or is there something that optimizes such constant-building sequences?

> Did you look down this route and discount it?

No, actually I am not very familiar with this topic, so I mostly rely on
the research papers available.
Your idea seems like a simple and good solution, but I am not sure whether
it can cover every possible case.

Do you have a patch for this? I could work on it to make it work for other
architectures where this would be beneficial.

Sincerely,
Vivek

Hi Vivek,

This is the way all targets deal with simple rematerialization involving several instructions in LLVM AFAIK.

Basically, the target defines a pseudo instruction that encodes this sequence of instructions and expands it after register allocation. This is a case-by-case thing; there is no patch that can be generalized to other targets.

For instance, look at the expansion of AArch64::MOVi64imm.
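
Schematically (with made-up names, not the actual target code), the shape is
something like this:

    // Schematic only: a target pseudo such as "LI64 <reg>, <imm64>" is kept
    // as a single, dependency-free instruction through register allocation,
    // so the existing trivial-remat logic can clone it; only afterwards is
    // it split into the real materialization sequence.
    #include <cstdint>
    #include <vector>

    struct AsmInstr { const char *Mnemonic; unsigned Reg; uint16_t Imm; };

    // Post-RA expansion of the pseudo, using the physical register that the
    // allocator assigned (values match the PPC64 sequence from the example).
    std::vector<AsmInstr> expandLI64Pseudo(unsigned PhysReg, uint64_t Imm) {
      return {
          {"lis",  PhysReg, static_cast<uint16_t>(Imm >> 48)},
          {"ori",  PhysReg, static_cast<uint16_t>(Imm >> 32)},
          {"sldi", PhysReg, 32},
          {"oris", PhysReg, static_cast<uint16_t>(Imm >> 16)},
          {"ori",  PhysReg, static_cast<uint16_t>(Imm)},
      };
    }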

The bottom line is, our rematerialization scheme is currently limited, but I am not sure your proposal gets us beyond what we already support.

Cheers,
Q.

From: "Bruce Hoult" <bruce@hoult.org>
To: "vivek pandya" <vivekvpandya@gmail.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Hal Finkel"
<hfinkel@anl.gov>, "Matthias Braun" <matze@braunis.de>
Sent: Monday, September 19, 2016 9:10:17 AM
Subject: Re: [llvm-dev] [RFC] Register Rematerialization (remat)
Extension

> The idea seems sound, but do you really have a CPU in which such a
> complex rematerialization is better than an L1 cache load from the
> stack frame?

I don't know about the G5, but I did some experiments on the P8 and I was unable to distinguish the performance of a load vs. the materialization sequence. This matches my expectations: The load should have a load-to-use latency of 3 cycles. The materialization-sequence instructions can issue together, each instruction has only a 1-cycle latency with forwarding (and 2 can execute per cycle), and the height of the dependency chain is 3 instructions. All other things being roughly equal, keeping pressure off of the memory subsystem tends to trump other concerns.

-Hal

From: "Quentin Colombet via llvm-dev" <llvm-dev@lists.llvm.org>
To: "vivek pandya" <vivekvpandya@gmail.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Nirav Rana"
<h2015087@pilani.bits-pilani.ac.in>, "Matthias Braun"
<matze@braunis.de>
Sent: Monday, September 19, 2016 1:27:10 PM
Subject: Re: [llvm-dev] [RFC] Register Rematerialization (remat)
Extension

> The bottom line is, our rematerialization scheme is currently
> limited, but I am not sure your proposal gets us beyond what we
> already support.

I might have misunderstood the proposal, but why do you say that? The problem is not limited to constants (as perhaps evidenced by Ivan Baev's talk at the 2014 dev meeting). One basic thing we should get, for example, is:

q = ...;
r = ...;
for (...) {
  // complicated stuff
  foo(q, r, q - r); // should prefer to remat (q-r) here instead of spilling.
}

Also, this is all perhaps related to https://llvm.org/bugs/show_bug.cgi?id=25373#c9

Thanks again,
Hal


> I might have misunderstood the proposal, but why do you say that? The
> problem is not limited to constants (as perhaps evidenced by Ivan Baev's
> talk at the 2014 dev meeting).

I agree, but this is not what Vivek is aiming at, is he?

Hello Quentin,

I am certainly looking to implement the best that can be achieved, and that will also require a cost-estimation module which can address the mentioned bug. However, I am still looking for a simpler (or less heavy) approach than what I proposed, so that whatever I implement can be sent upstream.

Sincerely,
Vivek