RFC: Machine Instruction Bundle

Machine Instruction Bundle in LLVM

Hi all,

There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.

Design Criteria

  1. The bundle representation must be lightweight. We cannot afford to add significant memory or compile-time overhead.
  2. It must be flexible enough to represent more than VLIW bundles. It should be useful for representing arbitrary sequences of instructions that must be scheduled as a unit, e.g. ARM Thumb2 IT blocks, Intel compare + branch macro-fusion, or instruction sequences that are currently modeled as pseudo instructions and expanded late.
  3. Minimize the amount of change required in the LLVM code generator, especially in target-independent passes. It must minimize code duplication (i.e. we don’t want code snippets that search for bundle start / end, like all the code in the backend that skips over DBG_VALUE).
  4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don’t want MachineBasicBlock to keep a list of MachineInstrBundles, since that would require a massive amount of code change. So what are the choices?

Bundle Representation

  1. A nested MachineInstr: This is the most natural representation (meaning it looks most like the real HW bundle). It has the nice property that most passes do not have to check whether a MI is a bundle. The concern is that this can add significant memory overhead if it means adding an ilist or SmallVector field to keep the bundled MIs.
  2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very lightweight. However, it requires many passes to check whether a MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

Evan,

I will need to comprehend it better, but one small comment right away…

Didn't we discuss one more option for bundle implementation: a global cycle ID? We would add an unsigned int field to the MI definition representing the “global scheduling cycle”. All MIs with the same global cycle value belong to one group/packet; zero means an unscheduled MI.

That is lightweight and position independent (meaning that if instructions are added or reordered by a pass that does not care about packets/groups, the original grouping can easily be restored; with a single-bit marking, a newly inserted instruction breaks the encoding, or a special API has to be used to avoid that). The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal.

This way we could also effectively represent NOP bundles (no bundle for a certain cycle value), i.e. VLIW cycles with no instructions issued, or easily estimate scheduling depth, etc.

I am not voting here in any way, I just wanted to bring it up for discussion.

Sergei Larin

… and yes, one more thing. On some architectures it might be desirable to know the order of instructions in the packet. That is a bit trickier….


Bundle *      (a MI with special opcode “Bundle”)
    MI *
    MI *
    MI        (no bit, this is the end of the bundle)

Bundle *      (a new bundle)
    MI *
    MI

I think you should turn the glue bit upside down, so it means “This instruction is glued to its predecessor” instead of “This instruction is glued to its successor”. That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don’t.

That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set:


→ Bundle      (a MI with special opcode “Bundle”)
      MI *
      MI *    (last MI in bundle)

→ MI          (no bit, this is a top-level unbundled MI)

→ MI          (unbundled MI)

→ Bundle      (a new bundle)
      MI *
      MI *

What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

/jakob

Evan Cheng <evan.cheng@apple.com> writes:

2. It must be flexible enough to represent more than VLIW bundles. It
should be useful to represent arbitrary sequence of instructions that
must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare +
branch macro-fusion, or random instruction sequences that are
currently modeled as pseudo instructions that are expanded late.

This is really important. This kind of support could also aid
scheduling to make sure some later pass doesn't screw up a carefully
crafted schedule.

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

This all looks fairly reasonable to me. I just have one question. How
much machine-level information do we have at the point where bundling
occurs? For example, can we figure out instruction sizes exactly? This
kind of thing is important for certain kinds of scheduling.

If we don't have that information at this point, it might be
advantageous to delay the final packetization until later, once we're at
a point where we have all of the useful information we might want.

                         -Dave


Bundle *      (a MI with special opcode “Bundle”)
    MI *
    MI *
    MI        (no bit, this is the end of the bundle)

Bundle *      (a new bundle)
    MI *
    MI

I think you should turn the glue bit upside down, so it means “This instruction is glued to its predecessor” instead of “This instruction is glued to its successor”. That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don’t.

That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set:


→ Bundle      (a MI with special opcode “Bundle”)
      MI *
      MI *    (last MI in bundle)

→ MI          (no bit, this is a top-level unbundled MI)

→ MI          (unbundled MI)

→ Bundle      (a new bundle)
      MI *
      MI *

Ok. Makes sense.

What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

That’s invalid. If you use the “top-level MIs only” iterator, then it’s not legal to assign a bundled instruction to it.

Evan

Evan,

  I will need to comprehend it better, but one small comment right away…

Did we not discuss one more option for bundle implementation – global cycle ID. We would add an unsigned int field to MI

No, I don't remember discussing this.

definition representing “global scheduling cycle”. All MIs with the same global cycle value belong to one group/packet. Zero means unscheduled MI.

  That is lightweight and position independent (meaning that if instructions are added or reordered by a pass that does not care about packets/groups, the original grouping can easily be restored; with a single-bit marking, a newly inserted instruction breaks the encoding, or a special API has to be used to avoid that). The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal.

Sorry, I'm not seeing the advantages of this. It actually requires an additional field to keep the id, and all the passes that can potentially move code around, break up MBBs, etc. will have to maintain it.

Evan

… and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier….

Isn't that just the order of the instructions in the list? I don't see anything that prevents getting the order of instructions. It might require iterating over the MIs in the packet, but for targets like VLIW the number of instructions should be small. Also note that passes which require this kind of information should aggregate and maintain the data themselves. Information like the schedule cycle belongs in a "schedule unit", not in the MI.

Evan

Evan Cheng <evan.cheng@apple.com> writes:

2. It must be flexible enough to represent more than VLIW bundles. It
should be useful to represent arbitrary sequence of instructions that
must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare +
branch macro-fusion, or random instruction sequences that are
currently modeled as pseudo instructions that are expanded late.

This is really important. This kind of support could also aid
scheduling to make sure some later pass doesn't screw up a carefully
crafted schedule.

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
  3a. De-ssa (2-address, phi elim)
  3b. Coalescing
  3c. Pre-scheduling packetization (optional)
  3d. Pre-RA scheduling (or integrated packetization)
  3e. Post-scheduling packetization (optional)
  3f. Actual register allocation
  3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

This all looks fairly reasonable to me. I just have one question. How
much machine-level information do we have at the point where bundling
occurs? For example, can we figure out instruction sizes exactly? This
kind of thing is important for certain kinds of scheduling.

That depends on the target.

If we don't have that information at this point, it might be
advantageous to delay the final packetization until later, once we're at
a point where we have all of the useful information we might want.

Targets are allowed to delay packetization until later. The only restriction is that packets cannot be finalized until register allocation is done, because we do not want to add virtual register defs and uses to bundle instructions.

Evan

Hi,

I'm glad to see some action with regard to static instruction
scheduling and VLIW support in LLVM. I have some questions and
remarks which might not be relevant as I'm not totally familiar
with the current code generation framework of LLVM nor your plan.

2. It must be flexible enough to represent more than VLIW bundles. It should be
useful to represent arbitrary sequence of instructions that must be scheduled as
a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
instruction sequences that are currently modeled as pseudo instructions that are
expanded late.

The concept of a "VLIW bundle" is to mark a set of instructions that
should/could be executed in *parallel*: a static parallel instruction
schedule for a single instruction cycle, that is.

In other words, with a VLIW target a bundle might not be just "an atomic,
possibly sequentially executed chunk of instructions" or "a set of
instructions that can be executed in parallel but also sequentially".
In some architectures, the sequential execution might break the schedule
due to visible function unit pipeline latencies and no hardware interlocking.

Is it wise to mix the two concepts of "parallel instructions" and the looser
"instructions that should be executed together"? The "parallel semantics"
imply changes to how scheduling is done (the earliest/latest cycle where an
instruction can be scheduled) and also, e.g., to the register allocator's live
ranges (if allocating registers on "packetized", i.e. parallel, code).

Moreover, the definition of a VLIW parallel bundle implies that there cannot be
any "intra-bundle dependencies"; otherwise those instructions could not actually be executed in parallel.

For example, looking at your example of a bundle with "intra-bundle
dependencies":

It's important to understand that this is a proposal for a code generator IR change, not for specific register allocator or scheduler passes. Passes that want to understand more about the internals of instruction bundles are free to design their own data structures. It is imperative that we do not add multiple constructs for various types of bundles; that would add significant overhead for the rest of the code generator.

I have discussed the proposal with the current code owners of the register allocator and instruction scheduler. This is a proposal that will work with the existing infrastructure (or modules being planned for the near future).

Evan

Hi Evan,

Thanks for sending out the proposal. This will, of course, be very important for the Hexagon backend. I have to study the proposal in more detail, but overall it looks good to me.

I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion).

My team will be happy to help with the tasks and specifically with the packet-formation pass. We already have a pass that forms packets for Hexagon and we can generalize the pass and make it machine-independent.

-Anshu

Yes it is. Or it could be. All I am saying is that if we make this implied assumption (that the order of instructions in the list == the semantic order in the bundle), it needs to be explicitly stated. Otherwise a pass developer could treat a bundle as a truly “parallel” entity, potentially causing nasty things.

Sergei Larin

Let me add some information about how the register allocator handles this extension.

First of all, I agree it is important that this can be used for more than VLIW targets. Besides the uses you mention, I would like to use bundles for DBG_VALUE instructions so we can avoid silly code like this:

  // Skip over DBG_VALUE instructions when picking an insertion point:
  MachineBasicBlock::iterator InsertPos = mi;
  while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
    --InsertPos;
  // ...and again when computing the range of instructions to move:
  MachineBasicBlock::iterator From = KillMI;
  MachineBasicBlock::iterator To = llvm::next(From);
  while (llvm::prior(From)->isDebugValue())
    --From;

I also think bundles can be used for implementing parallel copies.


Thinking about it, this probably wouldn't work for pre-RA passes that can delete code. They could inadvertently delete DBG_VALUE instructions.

/jakob

This sounds like a simple and good solution to the "parallel bundle" semantic
worries I had.

What about instruction scheduling? Has anyone thought about how/if isched could work with parallel bundles? That is, to create the bundles (referred to as "packing" by some) in the first place (before and/or after RA).

I'm not familiar enough with the current LLVM isched work, so I do not know
if the current algorithms could be used for static scheduling for VLIWs.
Is one really required to write a separate "packer" that just tries to pack
sequential(?) instructions into bundles without trying to reorder them to
improve the packing density?

The way I see it, "packing" is just instruction scheduling with an additional
"what can be in a single bundle and in what order" constraint, producing fixed
cycles (bundles) for the instructions as an outcome. The bundle constraint
(the wide instruction template(s) supported by the machine) can be
implemented with a state machine approach or just a resource table.

VLIW scheduling also requires multiplicity for the FU resource constraints.
Can those be described in the current TableGen format? Clustered (or in
general, not-fully-connected) VLIWs also require connectivity information
(which FUs are connected to which RFs) but that can be modeled with
register classes, I've been told.

> What about instruction scheduling? Has anyone thought how/if isched could work 
> with parallel bundles? That is, to create the bundles (referred to as "packing"
> by some) the first place (before and/or after RA).
> ...
> The bundle constraint (the wide instruction template(s) supported by the machine)
> can be implemented with the state machine approach or just a resource table.

We have thought about it from a VLIW perspective. Yes, you can integrate scheduling with packetization. I committed a DFA-based packetization mechanism last week, and it can be used by the scheduler to make more intelligent decisions for a VLIW target. The flow that Evan described is flexible enough to accommodate both an integrated scheduler and pre- and post-scheduling packetizers. Some VLIW targets may need all three phases; others can use some combination of the three.

-Anshu


Indeed, there are strict VLIW architectures out there and VLIW architectures that leverage some aspects of conventional architectures, sometimes taking advantage of how the microarchitecture was implemented.

Hi Evan,

I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.

I would not integrate the packetizer into the register allocator super pass, since that would reduce the flexibility for back-end developers to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with the PHI and COPY nodes that are eliminated by the RA; the packetizer can simply put each PHI and COPY instruction into its own single-instruction bundle.

However, the post-RA scheduler, in combination with a VLIW-aware hazard recognizer, can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications to the post-RA scheduler classes are necessary to support virtual registers.

I also would not include packet finalization in the register allocator super pass, since the subsequent prolog/epilog code insertion (PEI) pass also adds extra instructions into the instruction list. So I would run packet finalization after prolog/epilog code insertion. Both the RA and PEI can add their instructions as single-instruction bundles that packet finalization can then merge into larger bundles. For packet finalization it also makes sense to perform post-RA VLIW scheduling.

Timo

Hi Evan/Evan.
This is great! I’ve been experimenting on a VLIW target by adding a FinalizeBundle loop and a combination of the Scoreboard / Schedule hazard recognizer classes.

The only question nagging me now is how to pass the scheduling information (or the flags) down to PEI, to choose the correct opcode. I am currently hacking a few bits off MachineInstr.Flags, but somebody suggested avoiding this and recommended adding a dummy operand instead. I’m not sure that is the correct way. Any comments on that?

Thanks.

Girish.