Register data flow commits

Hi,

I commited several patches today that implement a framework that enables data-flow optimizations on the post-RA (post-SSA) representation. I thought I'd say a few words about what it is. The code is currently under lib/Target/Hexagon, but there intent for it from the beginning was to be target-independent.

The motivation for it came from our (Hexagon) experience with customer applications. Customers have reported several cases of suboptimal code, often in the form of "redundant register assignment", which was typically a result of the register allocation sequence (phi node elimination, etc). Because of the origin of these issues the usual SSA-based optimizations were not able to deal with it, and doing data-flow optimizations after RA was in itself a non-trivial task. Hence the idea of having a framework that would make such optimizations easier to implement.

The data-flow graph that is the backbone of the framework resembles SSA, i.e. it does contain phi nodes, and the like, but it is not a pure SSA, since it uses physical registers, which can have any number of definitions. The graph itself is described in detail in the comments in RDFGraph.h.

The intended use scenario is:
1. Build the data flow graph.
2. Perform a sequence of data-flow optimizations and update the graph in the process.
3. Update liveness.

Currently, only two optimizations are implemented:
- copy propagation, and
- dead code elimination.

The copy propagation algorithm is very simple at the moment and can only handle COPY instructions (it calls MachineInstr::isCopy to check if an instruction is a "copy"). It could easily be extended (via target hooks) to recognize more advanced examples (such as reg+imm(0), for example, or "combine" instructions in case of Hexagon). It also does not update the DFG, which is the biggest issue right now. At the moment, the Hexagon code that uses it simply rebuilds the DFG after copy propagation, but this will need to change in the near future.

The dead code elimination provides two functions: collect dead nodes and erase nodes. The simplest form of use would be to call the first, then the second (there are comments in the sources that talk about it). The Hexagon used has an extra step, where it looks for dead address updates in post-increment loads/stores.

The liveness computation uses an SSA-based algorithm, although it's not exactly the same as the one from the paper referenced in the comment in the source. The main difference is that the RDF's DFG contains links between defs, whereas the SSA form assumed in the paper only connects defs to uses. The extra step in RDF is to identify phi nodes that do not reach any non-phi uses, which adds a bit of an extra cost.

The only limitation that I am aware of, that could affect non-Hexagon targets is that the code does not handle regmasks. There is nothing special about them, it's just that Hexagon does not use regmasks and so they didn't show up during the development. There is nothing particularly hard about adding support for them.

The code was only tested on Hexagon. I haven't tried using it with any other target.

-Krzysztof

I am surprised that the regalloc passes (in broader sense of phi elimination, two address elimination, copy coalescing + regalloc) produce opportunities for copy propagation and dead code elimination. I believe that in principle we should be able to avoid creating dead code and unnecessary copies (however I am aware of a few shortcomings in this area when sub registers are involved).

Can you give some examples on situations where you need such post-RA cleanups?

- Matthias

The regalloc passes have indeed improved significantly since the last release of our (Hexagon) compiler. I ran a few tests and I didn't see cases where dead code would be a direct outcome of the regalloc process. There are, on the other hand, cases of instruction chains scattered over a loop (a fairly large one in the specific case I was just looking at), whose result is not ultimately used, and which end up getting removed. I cannot post the code, since it came from a customer application, plus the dead code was not related to regalloc.

To clarify---although this code does perform a "cleanup" right now, the ultimate intent to make post-RA optimizations easier. We do have cases, where there are register copies inside loops, which could be eliminated (possibly via copy-ins/copy-outs outside the loop), or cases where rearranging the registers could result in more compact code (i.e. through forming register pairs). All of these could potentially be quite useful for us.

I'll let you know if I see examples that could be improved in the regalloc passes themselves.

-Krzysztof