[RFC] Bundling support in the PostRA Scheduler

Hi,

I'm working on a custom top-down post RA scheduler which builds bundles at the same time for our VLIW processor. I've borrowed most of the implementation from the resource priority queue implemented for the existent VLIW scheduler but applied to the context of MI scheduling. Basically, instructions that are likely to be bundled must be scheduled first (i.e. get higher priority).
This work should integrate very well with the current infrastructure and I'd like to contribute with a patch to add bundling capabilities to the current post RA scheduler if the LLVM community is interested :slight_smile: (May Hexagon need it as well?). It would also be a great opportunity for us to get feedback from the community about this.

We have a non-interlocked processor which relies on the post ra scheduler to emit cycle-accurate bundles (valid bundles without incurring in structural hazards). The construction of bundles outside the scope of post RA scheduling will require structural hazard information to work properly for processors without pipeline interlocks. For example, we can discover that an instruction can fit into the current packet (following a schema of linear code scanning, just like the current DFAPacketizer does) while in fact it cannot because of structural hazards. The two terms are strongly connected and necessary to build valid packets.
The concerns are mainly based on our non-interlocked processor, where cycle-accurate bundle emission is necessary. Other approaches/ideas are very welcome.
Do you have any plan for adding a more robust bundler into the current infrastructure ?

Ivan

Hi,

I'm working on a custom top-down post RA scheduler which builds bundles
at the same time for our VLIW processor. I've borrowed most of the
implementation from the resource priority queue implemented for the
existent VLIW scheduler but applied to the context of MI scheduling.
Basically, instructions that are likely to be bundled must be scheduled
first (i.e. get higher priority).
This work should integrate very well with the current infrastructure and
I'd like to contribute with a patch to add bundling capabilities to the
current post RA scheduler if the LLVM community is interested :slight_smile: (May
Hexagon need it as well?). It would also be a great opportunity for us
to get feedback from the community about this.

This could potentially be useful for the R600 backend, which targets VLIW
GPUs (AMD HD2XXX-HD6XXX), so I would be very interested in this.

-Tom

Hi,

I'm working on a custom top-down post RA scheduler which builds bundles
at the same time for our VLIW processor. I've borrowed most of the
implementation from the resource priority queue implemented for the
existent VLIW scheduler but applied to the context of MI scheduling.
Basically, instructions that are likely to be bundled must be scheduled
first (i.e. get higher priority).
This work should integrate very well with the current infrastructure and
I'd like to contribute with a patch to add bundling capabilities to the
current post RA scheduler if the LLVM community is interested :slight_smile: (May
Hexagon need it as well?). It would also be a great opportunity for us
to get feedback from the community about this.

We have a non-interlocked processor which relies on the post ra
scheduler to emit cycle-accurate bundles (valid bundles without
incurring in structural hazards). The construction of bundles outside
the scope of post RA scheduling will require structural hazard
information to work properly for processors without pipeline interlocks.
For example, we can discover that an instruction can fit into the
current packet (following a schema of linear code scanning, just like
the current DFAPacketizer does) while in fact it cannot because of
structural hazards. The two terms are strongly connected and necessary
to build valid packets.
The concerns are mainly based on our non-interlocked processor, where
cycle-accurate bundle emission is necessary. Other approaches/ideas are
very welcome.

We would be interested in this since we are targeting a non-interlocked
VLIW processor as well. Currently we are using a custom post RA
scheduler implemented in a pre emit pass. However, we would like to
integrate better with and reuse more of the current LLVM infrastructure
such as the post RA scheduler framework and bundling support.

Jordy

Hi Ivan,

Your description sounds fine to me. I assume you are totally decoupled from what LLVM currently calls the "postRA" scheduling pass. Hopefully you don't need anything in PostRASchedulerList.cpp.

Running your bundler as a preEmit pass is the cleanest approach. But if need be, we can support preRA bundling at the time the MachineScheduler currently runs (if enabled). TargetPassConfig allows you to substitute your own pass in place of MachineScheduler. Passes that run after MachineScheduler are intended to support instruction bundles. This feature is not extensively tested, so anyone taking this approach would need to work with backend maintainers to get things fixed.

-Andy

Ivan,

  IMHO the current bundle infrastructure is in need of some additional
features - not only bundle handling utilities, but also support for liveness
computation, several (key) code transformation passes etc. (see my recent
post for conditional defs for instance). Does your back-end perform any
substantial transformations to the code _after_ the second pass (postRA)
scheduler and bundle formation/finalization? Does any of them might demand
bundle decomposition? If so, I really wonder how do you plan to address
that.

  We also have an MI based version of the current VLIW scheduler, that I am
planning to bring up to date with the tip, and upstream ASAP. But it is
preRA scheduler, and it does not preserve bundles (though it constructs them
and easily could keep them around). The issue is what I have said above,
many current passes do not understand bundles (well or at all) and accurate
liveness update has been an issue for me recently.

  I also tried to mess with PostRA scheduler to achieve similar goals, only
to find out that all the additional dependencies after RA make it virtually
impossible to produce high quality schedule, and obviously it is too late at
that point to address reg pressure via scheduling techniques, so I have put
that project on the back burner for a while.

  I guess what I am saying is that you are not alone in this :slight_smile: ...and we do
need to formulate our requirements clearly... and repeat them regularly :slight_smile:
Hope this helps.

Sergei

Hi,

   I also tried to mess with PostRA scheduler to achieve similar goals, only
to find out that all the additional dependencies after RA make it virtually
impossible to produce high quality schedule, and obviously it is too late at
that point to address reg pressure via scheduling techniques, so I have put
that project on the back burner for a while.

Has anyone studied how much work it would be to implement
an integrated allocator/scheduler in LLVM now? It should be
an efficient solution to the VLIW RA/scheduling phase ordering
problem, but of course it's more complex to implement and
modularize. (I'm unsure if it's possible to exploit the current
register allocator code easily with it).

Another solution (which we use in TCE) is to use register renaming.
That is, rename schedule-restricting antideps introduced by
the pre-RA "on the fly" during scheduling. It's sort of a middle
ground solution that allows modularizing the proper RA to
a separate pass cleanly. However, its proper implementation goes
step-by-step towards the integrated allocator (smarter the renamer
becomes, more it looks like a proper RA).

BR,

Hi all,

Thanks for your feed-backs :slight_smile:

@Andrew: In fact, I've reused most of the postRA list scheduler code and the resource priority queue. Every time it needs to move forward, either because of a res hazard (HazardRec) or an invalid combination of instructions in the current packet (DFA), it closes the current bundle and advances to the next cycle. The non-interlocked nature of our processor forces the bundling logic to live with the scheduling logic. We cannot build bundles without the scoreboard.

I also tried to build bundles as a preRA pass in order to reduce the register pressure (so the RA will take full advantage of the vliw architecture). I've ran into some problems such as the re-materialization one that we have discussed some time ago (http://llvm.1065342.n5.nabble.com/Instruction-bundles-before-RA-Rematerialization-td45900.html) and the liveness re-computation while moving MI's into packets, where we have contributed with a patch. Other problems are related to our specific BE implementation which doesn't allow us to get good performances with the preRA bundler. The preRA bundling forces a starting point from which the postRA bundler must start with which may or may not be the optimal point. Without bundles before RA, the register pressure will be higher but the postRA bundler will get the freedom it needs to build better bundles. It seems to have a trade-off between reg pressure and bundling capabilities. We have chosen the latter.

@Sergei: It's good to see that you are working on it also :-). ATM, we don't do any transformation which may affect bundles. The postRA scheduler will schedule and packetize all MI's then we run the packetFinalization pass. Bundle decomposition is somewhat complex in non-interlocked processors where the DFA is not enough to rebuild them.

@Pekka: We don't care about anti-deps as far as the dependent MI's can fit into the same bundle. There are anti-dep breakers in llvm and the requested one runs together with the postRA scheduler.

The changes I'd like to propose are mainly based in:
- Adapting the current resource priority queue to work with MI's.
- And either add a new postRA SchedBundler or modify the existent one.

But I think I should wait until Sergei send upstream his MI based scheduler. Sergei, are you working on some resource priority queue at MI level?

Ivan

Hi Pekka,

Has anyone studied how much work it would be to implement an integrated
allocator/scheduler in LLVM now?

  Not to my knowledge.

Another solution (which we use in TCE) is to use register renaming.

  You do it in LLVM? Do you plan to upstream it?

Also, I do not know your target/goal, but do you look at global scheduling
at all?

Thanks.

Sergei

Sergei, are you working on some resource priority queue at
MI level?

Yes.

Sergei

Another solution (which we use in TCE) is to use register renaming.

   You do it in LLVM? Do you plan to upstream it?

Not in LLVM but with a custom scheduler framework.

Our target is quite unique (see the blog post for an example)
therefore it's not completely clear (yet) how to map it to the LLVM
codegen/machineinstructions the best way while still exploiting the
unique optimizations of TTA. I've been following the activity on VLIW
support of LLVM in the hopes we can someday move more towards LLVM code
base.

I didn't know about the "post-RA sched. antidep breakers" in LLVM codegen
of which idea sound similar to what we do.

Also, I do not know your target/goal, but do you look at global scheduling
at all?

Only somewhat at this point (branch delay slot filling from the
target BB). Region scheduling and speculation is something that is
interesting but it hasn't been the "biggest itch" yet in our projects
therefore not implemented yet. I see efficient predication support as
highest priority (e.g. hyperblocks) for VLIWs.

BR,