Whole-function isel

I know that this has been discussed (at least in passing) a few times on the list, but I couldn’t locate a bug for it. Have any architectural plans been made for it?

Are there architectural roadblocks with the current LLVM infrastructure that will complicate the process? What has demotivated the implementation of this so far (is it not that big a benefit on important targets, too much time/effort for the payoff)?

In toying with some code generators I’ve been frustrated in getting the code gen prepare pass + isel to implement the heuristics I need to match some addressing modes and I believe that whole-function isel might be the answer.

I know that this has been discussed (at least in passing) a few times on the list, but I couldn't locate a bug for it. Have any architectural plans been made for it?

Funny you bring this up. Evan and I were tossing around crazy ideas about this just today. If you're interested, maybe we should get together for lunch or coffee or something.

Are there architectural roadblocks with the current LLVM infrastructure that will complicate the process? What has demotivated the implementation of this so far (is it not that big a benefit on important targets, too much time/effort for the payoff)?

There is a huge payoff to doing this or even making progress on it. It is an elegant solution to many issues. There are two current problems:

1) There are some scalability issues (in compile time) with large selectiondags. Roman has been tackling this recently, so this will hopefully be a non-issue really soon. In any case, it would be very reasonable to continue doing single-MBB selectiondags when in "-fast" mode (i.e. -O0) and do whole-function ones when the compile time is worth it (-O2+ etc).

2) The bigger issue is one of scheduling. The great thing about whole function isel is that nothing up to scheduling needs to care about what block something is in (as long as chain nodes pin side-effecting instrs to their block, of course). Things like machine-code loop invariant code motion, CSE of values from different blocks, sinking and many other details are easily handled, and directly in the selection dags: this is really cool! :) The problem is that *something* has to decide what block to put all the operations in. The most logical thing to do here is to have a pass, either part of sched or right before it, that assigns a block to each node (each of which is a target instr node) and then have sched remain single-bb-at-a-time (for now; it could obviously be improved later to work on traces etc).

The problem is that in full generality, this is a problem very similar to PRE: assigning blocks to operations optimally is complex and somewhat subtle. An excellent place to start looking at this is Cliff Click's thesis, which is available online (and incidentally inspired a lot of selection dag design). In his thesis, he has a hacky solution that is probably good enough, certainly to start out with.
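To make that flavor of heuristic concrete, here is a tiny sketch (toy data structures and made-up names, not LLVM's actual API): place each node in the nearest common dominator of all the blocks that use it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy dominator tree: idom[b] is the immediate dominator of block b,
// block 0 is the entry (idom[0] == 0), and parents are numbered before
// their children. All names here are made up for illustration.
struct DomTree {
  std::vector<int> idom;
  std::vector<int> depth;

  explicit DomTree(std::vector<int> parents) : idom(std::move(parents)) {
    depth.assign(idom.size(), 0);
    for (std::size_t b = 1; b < idom.size(); ++b)
      depth[b] = depth[idom[b]] + 1;
  }

  // Nearest common dominator of two blocks (LCA in the dominator tree).
  int commonDominator(int a, int b) const {
    while (depth[a] > depth[b]) a = idom[a];
    while (depth[b] > depth[a]) b = idom[b];
    while (a != b) { a = idom[a]; b = idom[b]; }
    return a;
  }
};

// The "good enough" placement heuristic: put a node in the nearest
// common dominator of all the blocks that use it.
int placeNode(const DomTree &dt, const std::vector<int> &useBlocks) {
  int block = useBlocks.front();
  for (std::size_t i = 1; i < useBlocks.size(); ++i)
    block = dt.commonDominator(block, useBlocks[i]);
  return block;
}
```

This ignores execution frequency entirely (it can hoist things into cold paths), which is part of why optimal placement is the PRE-like problem described above.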

The reason this hasn't been done so far is that we (people at apple) haven't been able to justify the "big project" of tackling this when there are still tons of little projects that are lower hanging for our needs. We will almost certainly do this someday (and would have already done it if it were blocking some project of ours), but would love to have someone work on it in the shorter term. The thing that Evan and I were discussing today is that there are a variety of ways to get intermediate results without tackling the full generality of the problem:

1) One annoying problem for targets like PPC and ARM (which promote i8/i16 to i32) can be solved with a really simple hack. Consider a switch on i16 that codegens to a branch tree. This turns into a bunch of small blocks: typically the first zexts the value and then does an unsigned comparison against the range, and each subsequent block does an unsigned comparison against its subportion of the range. The problem is that right now, the subsequent comparisons just see that the i16 value is promoted, which means the top bits are unknown. This means that each block ends up doing tons of redundant "and"s to clear out the top bits before its comparison. This is obviously a special case of a general problem: uses of values in blocks other than the def's block can't see properties of the value.

This (and similar problems) boils down to not knowing info about a cross-block promoted value. This sort of thing can be solved with a very simple hack by adding a few maps to SDISel. The idea is that, in ISel, each cross-block CopyToReg for a vreg can call ComputeMaskedBits and ComputeSignBits and record the known live-out bit+sign information for the vreg. Later, if someone calls ComputeMaskedBits or ComputeSignBits and it recurses up to a CopyFromReg from the vreg, the information can be provided from its definition.
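A sketch of what one of those maps could look like (made-up names; real known-bits analysis also tracks known-one bits and sign bits, which is omitted here to keep the sketch small):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Miniature stand-in for known-bits tracking: a bit set in `zero` means
// that bit of the value is known to be 0.
struct KnownBitsInfo {
  std::uint32_t zero = 0;
};

// Hypothetical side table added to SDISel: known live-out bits per vreg,
// recorded when the cross-block CopyToReg in the defining block is selected.
static std::map<unsigned, KnownBitsInfo> LiveOutInfo;

// Selecting the def: a zext from i16 means the top 16 bits of the
// promoted i32 value are known zero on exit from the block.
void recordZExt16LiveOut(unsigned vreg) {
  LiveOutInfo[vreg].zero = 0xFFFF0000u;
}

// Querying from a use in another block: the redundant "and val, mask"
// can be dropped iff every bit the mask would clear is already known zero.
bool canOmitMask(unsigned vreg, std::uint32_t mask) {
  auto it = LiveOutInfo.find(vreg);
  if (it == LiveOutInfo.end())
    return false;  // def not selected yet: conservatively keep the "and"
  return (~mask & ~it->second.zero) == 0;
}
```

Note that a query made before the defining block has been selected conservatively fails, which is exactly the order-of-selection sensitivity mentioned below.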

This elegantly and cheaply solves the problem. The reason I call it a hack is that it means you get different code for a function based on the order that blocks are selected (i.e. it is better to compile the defs of values before the uses), which is annoying, but actually probably just fine in practice. Codegen remains deterministic of course.

2) An intermediate step to whole function isel is to do region based isel with regions limited by what makes the sched problem simple. One very interesting class of region is a Single-Entry-Multiple-Exit region. It would be straightforward to break each function into SEME regions and select them one at a time, allowing them to compile to multiple MBBs.

In addition to the obvious dag combiner and isel improvements that would come from larger regions, SEME regions encapsulate a number of interesting things (such as switch lowering). Having SEME regions would allow us to remove the block juggling in SDISel: switches would just lower to multiple blocks in one dag. SEME regions are also totally trivial for the sched problem: because your region is just a tree, all you have to do is move a computation "up" the tree towards the root to the place that dominates all its uses. Another interesting thing about SEME regions is that you don't have to worry about phi nodes etc.
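As a sketch of one trivially simple way to form such regions (hypothetical code; it assumes blocks are numbered so predecessors come before their successors, i.e. reverse postorder, and that back edges have been dropped): start a region at the entry and at every merge point, and let each other block join its unique predecessor's region. Each region then has a single entry and its blocks form a tree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy CFG: preds[b] lists the predecessors of block b; block 0 is the
// entry. Returns, for each block, the id of its region's entry block.
std::vector<int> partitionSEME(const std::vector<std::vector<int>> &preds) {
  std::vector<int> region(preds.size());
  for (std::size_t b = 0; b < preds.size(); ++b) {
    if (preds[b].size() != 1)
      region[b] = static_cast<int>(b);  // entry or merge point: new region
    else
      region[b] = region[preds[b][0]];  // join the unique predecessor
  }
  return region;
}
```

On the switch example above (entry branching to three blocks that each return) this yields one region for the whole function; a diamond that reconverges puts the merge block in a region of its own.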

OTOH, SEME regions are still limiting, so we would eventually want to do whole function (or at least "arbitrary region") isel. SEME regions do handle "sinking" well, but don't handle LICM (because you can't have a whole loop in the region) and don't handle the diamond pattern that "select" lowers to when a target doesn't support cmov. Wouldn't it be nice if legalize could lower select directly for targets (by inserting blocks into the DAG), instead of having to do the "custom sched inserter" hack? :) Wouldn't it be nice if dag combine could actually itself form cmovs from diamond shapes in the cfg? :)

At the end of the day, attacking a subclass of the "big problem" is a really interesting way (to me) to incrementally tackle this problem, and I strongly encourage anyone interested in this to start with SEME regions first. This allows us to figure out things like "how do we represent the start of a bb" (probably a new chained node), "how does the sched pass set the block that an instruction goes in" (probably a new field in SDNode), and allows us to do some cleanups (e.g. switch lowering) without tackling some of the other hard problems. Once SEME regions are working really well, we can tackle the rest of the hard problems, because they are obviously worthwhile doing eventually.

In toying with some code generators I've been frustrated in getting the code gen prepare pass + isel to implement the heuristics I need to match some addressing modes and I believe that whole-function isel might be the answer.

I would love to see any progress in this area. It is clearly an important thing to tackle, and it is blocking other interesting improvements in the code generator. It would also allow us to eliminate a significant amount of weirdness that exists to hack around this (e.g. switch lowering). That said, even when we *can* do whole function isel, I think we should keep the ability to handle subregions of the function for -O0, at least until SD stuff scales perfectly :).

-Chris

Looks like this suite is pretty empty :)

May I delete it or should it be beefed up some time?

Cheers,

  Gabor

Chris,

Chris Lattner wrote:

I would love to see any progress in this area. It is clearly an important thing to tackle, and it is blocking other interesting improvements in the code generator. It would also allow us to eliminate a significant amount of weirdness that exists to hack around this (e.g. switch lowering).

we've been working on a whole-function instruction selector here at the vienna university of technology in the recent past. our approach
considers ssa graphs and is based on a problem transformation to a
specialized quadratic assignment problem (pbqp). in contrast to previous
work [1], the technique is flexible enough to cope with general DAG
patterns such as pre/postincrement or divmod patterns.
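as a toy illustration of the objective (made-up names; brute force over a tiny instance, whereas a real solver applies graph-reduction rules instead of enumerating): each node picks one alternative, picking alternative a at node i costs nodeCost[i][a], and each edge adds a matrix cost for the pair of choices at its endpoints.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// An edge of the PBQP graph: cost[choice(u)][choice(v)] is added to the
// objective for the pair of alternatives chosen at its endpoints.
struct PBQPEdge {
  int u, v;
  std::vector<std::vector<int>> cost;
};

// Brute-force minimization of the PBQP objective (exponential, so toy
// sizes only).
int solvePBQP(const std::vector<std::vector<int>> &nodeCost,
              const std::vector<PBQPEdge> &edges) {
  const std::size_t n = nodeCost.size();
  std::vector<std::size_t> choice(n, 0);
  int best = std::numeric_limits<int>::max();
  for (;;) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) total += nodeCost[i][choice[i]];
    for (const PBQPEdge &e : edges) total += e.cost[choice[e.u]][choice[e.v]];
    if (total < best) best = total;
    // Advance to the next assignment, odometer-style.
    std::size_t i = 0;
    while (i < n && ++choice[i] == nodeCost[i].size()) choice[i++] = 0;
    if (i == n) return best;
  }
}
```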

the instruction selector is a drop-in replacement for the original implementation (llvm 2.1). we've used the ARM backend for evaluation and obtained quite encouraging results: speedups are up to 10% for SPEC/Mibench and up to 57% for simple loop kernels. the compile time increase is about a factor of 2.

the paper has been accepted for this year's LCTES conference (june 12-13, tucson, az). i'm afraid i cannot post it on the list but i'm happy to send a preliminary version to anybody who's interested.

the implementation is not yet as efficient as it could be (e.g., we maintain an additional datastructure for the ssa graph along with the selection DAG) and i'm afraid there are licensing issues that do not allow me to directly post or contribute the code. however, i'm happy to share further experimental results and discuss the approach in case somebody is interested.

Please delete it, that pass is long gone.

-Chris

Chris,

Chris Lattner wrote:

I would love to see any progress in this area. It is clearly an
important thing to tackle, and it is blocking other interesting
improvements in the code generator. It would also allow us to
eliminate a significant amount of weirdness that exists to hack around
this (e.g. switch lowering).

we've been working on a whole-function instruction selector here at the
vienna university of technology in the recent past. our approach
considers ssa graphs and is based on a problem transformation to a
specialized quadratic assignment problem (pbqp). in contrast to previous
work [1], the technique is flexible enough to cope with general DAG
patterns such as pre/postincrement or divmod patterns.

the instruction selector is a drop-in replacement for the original
implementation (llvm 2.1). we've used the ARM backend for evaluation and
obtained quite encouraging results: speedups are up to 10% for
SPEC/Mibench and up to 57% for simple loop kernels. the compile time
increase is about a factor of 2.

Very nice!

the paper has been accepted for this year's LCTES conference (june
12-13, tucson, az). i'm afraid i cannot post it on the list but i'm
happy to send a preliminary version to anybody who's interested.

the implementation is not yet as efficient as it could be (e.g., we
maintain an additional datastructure for the ssa graph along with the
selection DAG) and i'm afraid there are licensing issues that do not
allow me to directly post or contribute the code. however, i'm happy to
share further experimental results and discuss the approach in case
somebody is interested.

That's unfortunate. What kind of licensing issues are there?

Evan

Evan Cheng wrote:

the implementation is not yet as efficient as it could be (e.g., we
maintain an additional datastructure for the ssa graph along with the
selection DAG) and i'm afraid there are licensing issues that do not
allow me to directly post or contribute the code. however, i'm happy to
share further experimental results and discuss the approach in case
somebody is interested.

That's unfortunate. What kind of licensing issues are there?

i'm working in a christian doppler laboratory. projects are funded partially by industry and the developed code "belongs" to the partner company unless they decide to release it. i would have to talk to them and see what i can do. furthermore, there's a generic solver that has been implemented by bernhard scholz (university of sydney) [1]. it's free for research purposes but would have to be re-licensed for llvm. however, the latter would be no problem if you just want to play with it.

note that the implementation is pretty stable but certainly not ready for prime time. several parts would have to be revised and/or replaced, and some design decisions would probably be different for a production-quality system vs. a research prototype.

Evan Cheng wrote:

That's unfortunate. What kind of licensing issues are there?

i've talked to our supporting company and they agreed to release the code to interested parties. it's not a copyleft license but the code can be used freely for private and research purposes.

be warned that the code is merely a prototype implementation and not ready for inclusion in LLVM. it also requires a cost-augmented graph grammar. for our ARM implementation, we could derive a basic set of rules directly from the LLVM tablegen description, but additional handwritten rules (e.g., replacing the c++ written address mode selection) were necessary. the implementation has been ported from another private backend, which is still evident in some places.

if you're still interested in the code, i could prepare a small package to play with along with some useful instructions. if you think that a PBQP based approach would be useful for LLVM and has a chance for inclusion, i would be willing to bring the code in shape and make changes where necessary.

Evan Cheng wrote:

That's unfortunate. What kind of licensing issues are there?

i've talked to our supporting company and they agreed to release the
code to interested parties. it's not a copyleft license but the code
can be used freely for private and research purposes.

be warned that the code is merely a prototype implementation and not
ready for inclusion in LLVM. it also requires a cost augmented graph
grammar. for our ARM implementation, we could derive a basic set of
rules directly from the LLVM tablegen description, but additional
handwritten rules (e.g., replacing the c++ written address mode
selection) were necessary. the implementation has been ported from
another private backend which is still evident in some places.

if you're still interested in the code, i could prepare a small package
to play with along with some useful instructions. if you think that a
PBQP based approach would be useful for LLVM and has a chance for
inclusion, i would be willing to bring the code in shape and make
changes where necessary.

Sounds like it's probably not going to be part of llvm then. :(

Are there parts that might be introduced back?

Thanks,

Evan

Evan Cheng schrieb:

i've talked to our supporting company and they agreed to release the
code to interested parties. it's not a copyleft license but the code
can be used freely for private and research purposes.

be warned that the code is merely a prototype implementation and not
ready for inclusion in LLVM. it also requires a cost augmented graph
grammar. for our ARM implementation, we could derive a basic set of
rules directly from the LLVM tablegen description, but additional
handwritten rules (e.g., replacing the c++ written address mode
selection) were necessary. the implementation has been ported from
another private backend which is still evident in some places.

if you're still interested in the code, i could prepare a small package
to play with along with some useful instructions. if you think that a
PBQP based approach would be useful for LLVM and has a chance for
inclusion, i would be willing to bring the code in shape and make
changes where necessary.

Sounds like it's probably not going to be part of llvm then. :(

Are there parts that might be introduced back?

i guess my last message was slightly misleading. the license just allows interested people to play with the code and do some experiments. if you like the method and decide that that's what you want for LLVM, then i'm confident that licensing issues will not be an obstacle and the code can be released under the usual llvm release license.

i'll try to do some minor cleanups and send you the code in the next few days.

cheers,

I thought I'd share a little bit of progress I made this weekend. I've gotten the first interesting test-case (a simple switch) through hyperblock-based DAGISel, and there's a pretty picture too! Each part of the switch is emitted directly into the DAG, rather than being deferred.

This is the function:
define i32 @foo(i32 %x, i32 %z) nounwind {
entry:
  switch i32 %x, label %UnifiedReturnBlock [
    i32 0, label %bb
    i32 1, label %bb5
  ]
bb:                                       ; preds = %entry
  %tmp4 = mul i32 %z, %x                  ; <i32> [#uses=1]
  ret i32 %tmp4
bb5:                                      ; preds = %entry
  %tmp8 = add i32 %z, %x                  ; <i32> [#uses=1]
  ret i32 %tmp8
UnifiedReturnBlock:                       ; preds = %entry
  ret i32 %z
}

And here's the X86 DAG right before it gets fed one BB at a time into pre-RA scheduling. Each color is a different basic block.

Cheers!

simple_switch.pdf (32.3 KB)

Very nice! Why did you decide on hyperblocks instead of SEME regions, and how are you forming the blocks?

Evan

This is really cool!

-Chris

My mistake Evan, they are SEME regions. I had a chat with a friend about hyper-blocks the other day and got confused in my email. =)