Instruction bundles before RA: Rematerialization

Hi,

We have a new BE for a VLIW-like processor and I'm currently working on instruction bundles. Ideally, I'd like to have bundles *before* RA to model certain constraints, e.g. the exposed one by Tzu-Chien a while ago in his thread http://lists.cs.uiuc.edu/pipermail/llvmdev/2005-September/004798.html

In order to build bundles, we have added a new bottom-up MIScheduler, right after reg coalescing, which behaves much like ScheduleDAGVLIW but without hazard recognizing. Due to some tricky instructions, we cannot schedule on the DAG. Bundles are built at exitRegion() in the scheduling process and the live interval information is updated correctly. After this, the RA is aware of bundles, at least from a LiveInterval point of view, and I had some problems regarding the rematerialization.

AFAIK, the RA cannot remat if the target instruction is not the bundle's header.
For this, I rather need a light bundle representation, or no bundle at all, so it can remat the right instruction with one condition: the remated location should preserve bundles. Nevertheless, for many other actions like LI splitting, the current representation works very well.

In general, I want the RA to preserve bundles as well as to be able to model the constraint presented above in the thread.
Should I fix the rematerialization code to look for the right instruction into the current bundle ?
What's the direction you are following about instruction bundling ?

We would like to follow llvm evolutions and preserve the core implementations as much as possible.

Thanks in advance,

Ivan

We have a new BE for a VLIW-like processor and I'm currently working on
instruction bundles. Ideally, I'd like to have bundles *before* RA to
model certain constraints, e.g. the exposed one by Tzu-Chien a while ago
in his thread
http://lists.cs.uiuc.edu/pipermail/llvmdev/2005-September/004798.html

The bundle support in the tree should handle constraints like that. The register allocator basically sees bundles as single instructions when computing interference.

In order to build bundles, we have added a new bottom-up MIScheduler,
right after reg coalescing, which behaves much like ScheduleDAGVLIW but
without hazard recognizing. Due to some tricky instructions, we cannot
schedule on the DAG. Bundles are built at exitRegion() in the scheduling
process and the live interval information is updated correctly. After
this, the RA is aware of bundles, at least from a LiveInterval point of
view, and I had some problems regarding the rematerialization.

AFAIK, the RA cannot remat if the target instruction is not the bundle's
header.
For this, I rather need a light bundle representation, or no bundle at
all, so it can remat the right instruction with one condition: the
remated location should preserve bundles. Nevertheless, for many other
actions like LI splitting, the current representation works very well.

In general, I want the RA to preserve bundles as well as to be able to
model the constraint presented above in the thread.
Should I fix the rematerialization code to look for the right
instruction into the current bundle ?
What's the direction you are following about instruction bundling ?

Remat is a problem because it breaks the abstraction of treating whole bundles as instructions. There are two issues:

- Should we rematerialize the full bundle defining a value, or should we try to dig out the 'real' def inside the bundle?

- After remat, we run dead code elimination in case the original def has no more uses. Again, should DCE erase a single instruction from inside a bundle, or should we erase the full bundle once all its defs are dead?

Ideally, I would like to treat the inside of a bundle like a black box that only the target understands. I am not too happy about erasing instructions inside bundles without some help from the target.

How do you need remat to work?

/jakob

Hi Jakob,

2012/6/6 Jakob Stoklund Olesen <stoklund@2pi.dk>

We have a new BE for a VLIW-like processor and I’m currently working on
instruction bundles. Ideally, I’d like to have bundles before RA to
model certain constraints, e.g. the exposed one by Tzu-Chien a while ago
in his thread
http://lists.cs.uiuc.edu/pipermail/llvmdev/2005-September/004798.html

The bundle support in the tree should handle constraints like that. The register allocator basically sees bundles as single instructions when computing interference.

In order to build bundles, we have added a new bottom-up MIScheduler,
right after reg coalescing, which behaves much like ScheduleDAGVLIW but
without hazard recognizing. Due to some tricky instructions, we cannot
schedule on the DAG. Bundles are built at exitRegion() in the scheduling
process and the live interval information is updated correctly. After
this, the RA is aware of bundles, at least from a LiveInterval point of
view, and I had some problems regarding the rematerialization.

AFAIK, the RA cannot remat if the target instruction is not the bundle’s
header.
For this, I rather need a light bundle representation, or no bundle at
all, so it can remat the right instruction with one condition: the
remated location should preserve bundles. Nevertheless, for many other
actions like LI splitting, the current representation works very well.

In general, I want the RA to preserve bundles as well as to be able to
model the constraint presented above in the thread.
Should I fix the rematerialization code to look for the right
instruction into the current bundle ?
What’s the direction you are following about instruction bundling ?

Remat is a problem because it breaks the abstraction of treating whole bundles as instructions. There are two issues:

  • Should we rematerialize the full bundle defining a value, or should we try to dig out the ‘real’ def inside the bundle?

The second option is the preferred one for us. The remated instruction should create a new bundle which may be merged with other bundles later by custom passes. The same condition applies for copy insertions.

  • After remat, we run dead code elimination in case the original def has no more uses. Again, should DCE erase a single instruction from inside a bundle, or should we erase the full bundle once all its defs are dead?

In our particular case, the RA can freely remove instructions from a bundle without any special constraint. We don’t need to wait till all defs inside a bundle are dead, which in some situations it may never happen.

Ideally, I would like to treat the inside of a bundle like a black box that only the target understands. I am not too happy about erasing instructions inside bundles without some help from the target.

How do you need remat to work?

As described above.

I understand your concern, other architectures may have stronger constraints or use bundles for other purposes (ARM?) which makes the task non-trivial.
We have a very particular VLIW architecture with complex operators but fewer constraints regarding the packet building and mangling. We need remat to work because spill is forbidden for a certain reg class.

Do you plan to add more hooks in TargetLowering to remat?

I should probably voice our point of view as well… Hexagon is another VLIW target with “non standard” demands for bundling.

I think Jacob has summarized current view of bundles as “black box” rather precise, but I should say that our view of bundles is way more fluid and open than that.

To avoid going into lengthy discussion, let me just say – bundling for us is not a single occurrence, but rather a multi step process, with individual instructions moved between bundles (and potentially between blocks) multiple times. Current infrastructure does not support this kind of freedom – seemingly more out of philosophy rather than from engineering point of view.

Taking on Ivan’s question about remat, I would probably prefer affected bundles generically “disassembled” (if possible) with liveness correctly updated, with a following (machine specific pass) being able to re-bundle. This of cause not always possible since some parallel semantics are not being expressible in serial code, but those rare “swap” like cases could be left intact (bundled). As far as I understand remat should never come “inside” such circular dependency without breaking it, so it should work just fine.

Generally as far as I concern, there is no way “generic” (platform independent) code can add instructions to bundles optimally. In the given example it is a job for Post RA scheduler… but before this could happen, a lot of infrastructural changes should be put in place.

Sergei

I agree, there are too many ways of modeling stuff with bundles. That is why I took the philosophical stance of treating bundles as black boxes during RA. I think the target should be involved whenever bundles are formed, and we shouldn’t delete instructions from inside bundles without permission from the target.

I think we need to tweak some of the TargetInstrInfo hooks to make bundle remat possible. I would like your input.

Rematerialization has multiple steps:

  1. Feasibility. RA knows the bundle defining a given SSA value of a virtual register. It calls TII.isTriviallyReMaterializable() to determine if the defining instruction can (and should) be rematerialized. See LiveRangeEdit::anyRematerializable().

  2. Feasibility at desired location. LiveRangeEdit::canRematerializeAt() then checks that the instruction can be rematerialized at the new location. This can fail if the instruction depends on virtual register values that are not available at the new location.

  3. Remat. LiveRangeEdit::rematerializeAt() calls TII.reMaterialize() (sic) to insert a copy of the defining instruction at the new location.

  4. Shrink original live range. The original live range may be smaller after some uses have been rematerialized. This may discover dead defs if there are no remaining uses.

  5. LiveRangeEdit::eliminateDeadDefs() erases the dead defs.

It looks like each of these steps require some help from the target if they are to handle bundles.

/jakob

Jakob,

Please see my comments below. Hope this helps.

Sergei

Jakob,

Please see my comments below. Hope this helps.

Sergei

Hi again!

Hi Sergei, Jakob,

Thanks for your comments !

Jakob,

Please see my comments below. Hope this helps.

Sergei

--

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

*From:*Jakob Stoklund Olesen [mailto:stoklund@2pi.dk]
*Sent:* Thursday, June 07, 2012 1:02 PM
*To:* Sergei Larin
*Cc:* 'Ivan Llopard'; 'LLVM Developers Mailing List'
*Subject:* Re: [LLVMdev] Instruction bundles before RA: Rematerialization

Generally as far as I concern, there is no way “generic” (platform
independent) code can add instructions to bundles optimally

I agree, there are too many ways of modeling stuff with bundles. That
is why I took the philosophical stance of treating bundles as black
boxes during RA. I think the target should be involved whenever
bundles are formed, and we shouldn't delete instructions from inside
bundles without permission from the target.

I also agree. Adding instructions into a bundle strongly depends on the
target and may be a quite complex task, sometimes too complex to be done
at allocation time if allocation speed is an issue. Removing them should
relax resource constraints in almost every conventional VLIW target and
it's something that the RA can potentially handle in a simple way by
*consulting the target before*.

I think we need to tweak some of the TargetInstrInfo hooks to make
bundle remat possible. I would like your input.

Rematerialization has multiple steps:

1. Feasibility. RA knows the bundle defining a given SSA value of a
virtual register. It calls TII.isTriviallyReMaterializable() to
determine if the defining instruction can (and should) be
rematerialized. See LiveRangeEdit::anyRematerializable().

*/[Larin, Sergei] Obviously if you treat bundle as a black box this
does not make much sense… or isTriviallyReMaterializable() should be
able to parse bundle and produce compound answer. Problem is – you
will not be able to find many opportunities like that. On the other
hand if you detect a _single_ instruction as a remat candidate inside
a bundle, you might chose to dissolve (disassemble) the bundle (if
possible, as I said before) and treat new serial group as you normally
would for remat. There should be a platform dependent pass to rebundle
this new serial sequence again, but even if it is not done, but if the
dissolution was performed properly, this will be a performance, not a
correctness issue. As a side note, you might chose simply remove
desired instruction from the bundle (often it is possible and trivial
to do without affecting semantics) and proceed as described above.
_BUT_ instruction removal does need back end support. Example:/*

*/{ /*

*/r0 = add (r0, r1);/*

*/P0 = cmp (r0.new, r0);/*

*/}/*

*/The r0.new means that the new value of r0 is used (reg file bypass
in the same cycle). You can see all possible implications of this. To
offload this mental logic to the back end, we need a utility of form
CanMoveMIBeforeBundle(MI, BundleHeader)/ CanMoveMIAfterBundle(MI,
BundleHeader)/ MoveMIBeforeBundle(MI, BundleHeader)/
MoveMIAfterBundle(MI, BundleHeader). Calling this repeatedly should
achieve desired effect – remove what could be removed, and live what
needs to remain bundled intact. The move utility can change
instruction properly for each target./*

That's explains the hook isLegalToPruneDependencies() in the packetizer :-).

I don't see why that case cannot be handled by the existent model and
needs that kind of API's. For example, if 'add' needs to be rematted,
it's possible as long as all its operands are available at the remat
position. The same holds for 'cmp'. The original 'add' cannot be removed
because of its internal read and the RA should consult the target before.
Anyway, I think we both agree that instruction removal is a
target-specific task. It also remains simple compared to the insertion
of instructions into a bundle.

I rushed my answer, I apologize. I had virtual registers dancing in my head instead of those you put in the example!

Please, correct me if I'm wrong: as long as the second operand of 'cmp' has not been allocated (and if it's available of course), it can be remated. Otherwise, it's tied to the bundle. Is that correct?

Adding instructions into a bundle strongly depends on the target and may be a quite complex task, sometimes too complex to be done at allocation time if allocation speed is an issue.

Yes, we won't be adding instructions to bundles during RA. RA is already inserting copies and spill code as free-standing instructions between bundles. The target can add a post-RA bundling pass to mop up all that stuff if required.

[Larin, Sergei] At this point you need to update liveness on bundle level, and then update global picture. Updating liveness on bundle level also might need help from the back end. See the above example with .new, and you can easily imagine local defs/kills inside a bundle that should not even be visible outside the black box. As of now I consider this mechanism somewhat broken on trunk (it is overly pessimistic)… but API in this case is rather straightforward.

I thought internals def/use were already modelled, is it right ?

Yes. MO.isInternalRead() models exactly that.

4. Shrink original live range. The original live range may be smaller after some uses have been rematerialized. This may discover dead defs if there are no remaining uses.

IMHO, as long as internal defs/uses are taken into account, I don't see any particular problem.

It is easy to recognize that a virtual register has no uses outside the bundle, but we need help from the target to determine if that means some instructions are dead and can be deleted, and if so, which instructions to delete.

Target-independent code should never be deleting instructions from inside bundles without the target's say so.

It is not enough to check for internal uses. There could be other reasons to keep an instruction in a bundle.

/jakob

Ivan, Jacob,

  Yes, you are right here. In this trivial example the bundle could be
legally dissolved in the following way:

{
  r0 = add (r0, r1);
  p0 = cmp (r0.new, r2);
}

...becomes the following (equivalent) sequence:

  r0 = add (r0, r1);
  p0 = cmp (r0, r2); <<< no .new

But this bundle:

{
  r0 = add (r0, r1);
  p0 = cmp (r0.new, r0);
}

Could only be replaced with something like this:

r3 = r0; <<< temp reg
r0 = add (r0, r1);
p0 = cmp (r0, r3);

which of cause is a bit trickier after RA...

> In our BE, a DFA to model
> resource constraints is not enough and a wrapper has been created
> which adds complexity to the process.

Exact same here.

> I thought internals def/use were already modelled, is it right ?

Yes, as Jakob noted in his answer "MO.isInternalRead() models exactly that."
But in the code base I am on at the moment (at about 3.1 branch time), the
internal dead def is exposed to external live list, which is not wrong, but
conservative. In fairness I just observed the fact, but did not have a
chance to determine the root cause. It well might be that I am still missing
some back end hooks...

> How can be dead defs eliminated by calling CanMoveMIBeforeBundle() ?

Jacob says: " It is easy to recognize that a virtual register has no uses
outside the bundle, but we need help from the target to determine if that
means some instructions are dead and can be deleted, and if so, which
instructions to delete."

  Here is what I was trying to say - I want to split the decision in two
steps - If a dead instruction is detected inside a bundle, try to move that
instruction out of the packet/bundle. This is where target gives its "OK".
If that is legal, then "regular" DCE actions could be taken against that
candidate instruction. This seems to simplify the task considerably and
require least amount of massaging to the current implementation.

  As a summary - we all agree that more BE hooks and utilities are needed
for bundle support. I am also very happy that we are discussing this ahead
of time :slight_smile:

Sergei