I know that this has been discussed (at least in passing) a few times on the list, but I couldn't locate a bug for it. Have any architectural plans been made for it?
Funny you bring this up. Evan and I were tossing around crazy ideas about this just today. If you're interested, maybe we should get together for lunch or coffee or something.
Are there architectural roadblocks with the current LLVM infrastructure that will complicate the process? What has demotivated the implementation of this so far (is it not that big a benefit on important targets, too much time/effort for the payoff)?
There is a huge payoff to doing this or even making progress on it. It is an elegant solution to many issues. There are two current problems:
1) There are some scalability issues (in compile time) with large selectiondags. Roman has been tackling this recently, so this will hopefully be a non-issue really soon. In any case, it would be very reasonable continue doing single MBB selectiondags when in "-fast" mode (i.e. -O0) and do whole function ones when compile time is worth it (-O2+ etc).
2) The bigger issue is one of scheduling. The great thing about whole function isel is that nothing up to scheduling needs to care about what block something is in (as long as chain nodes pin side-effecting instrs to their block of course). Things like machine-code loop invariant code motion, CSE of values from different blocks, sinking and many other details are easily handled, and directly in the selection dags: this is really cool!
The problem is that *something has to decide what block to put all the operations in. The most logical thing to do here is to have a pass either part of sched or right before it that assigns a block to each node (which is a target instr node) and then have sched remain a single-bb at a time (for now, it could obviously be improved later to work on traces etc).
The problem is that in full generality, this is a problem very similar to PRE: assigning blocks to operations optimally is complex and somewhat subtle. An excellent place to start looking at this is Cliff Click's thesis, which is available online (and incidentally inspired a lot of selection dag design). In his thesis, he has a hacky solution that is probably good enough, certainly to start out with.
The reason this hasn't been done so far is that we (people at apple) haven't been able to justify the "big project" of tackling this when there are still tons of little projects that are lower hanging for our needs. We will almost certainly do this someday (and would have already done it if it were blocking some project of ours), but would love to have someone work on it in the shorter term. The thing that Evan and I were discussing today is that there are a variety of ways to get intermediate results without tackling the full generality of the problem:
1) One annoying problem for targets like PPC and ARM (which promote i8/i16 to i32) can be solved with a really simple hack. Consider a switch on i16 that codegens to a branch tree. This turns into a bunch of small blocks: the first zext's the value then does an unsigned comparison against the range typically, and each subsequent block does an unsigned comparison against the subportion of the range. The problem is that right now, the subsequent comparisons just see that the i16 value is promoted, which means the top bits are unknown. This means that each block ends up doing tons of redundant "and"s to clear out the top bits before their comparison. This is obviously a special case of a general problem: uses of values in blocks that aren't the def can't see properties of the value.
This (and similar problems) boil down to not knowing info about a cross-block promoted value. This sort of thing can be solved with a very simple hack by adding a few maps to SDISel. The idea is that, in ISel, each cross-block CopyToReg for a Vreg can call ComputeMaskedBits and ComputeSignBits and record the known live out bit+sign information for the vreg. Later, if someone calls ComputeMaskedBits or ComputeSignBits and it recurses up to a CopyFromReg from the vreg, the information can be provided from its definition.
This elegantly and cheaply solves the problem. The reason I call it a hack is that it means you get different code for a function based on the order that blocks are selected (i.e. it is better to compile the defs of values before the uses), which is annoying, but actually probably just fine in practice. Codegen remains deterministic of course.
2) An intermediate step to whole function isel is to do region based isel with regions limited by what makes the sched problem simple. One very interesting class of region is a Single-Entry-Multiple-Exit region. It would be straight-forward to break each function into SEME regions and select them one at a time, allowing them to compile to multiple MBBs.
In addition to the obvious dag combiner and isel improvements that would come from larger regions, SEME regions encapsulate a number of interesting things (such as switch lowering). Having SEME regions would allow us to remove the block juggling in SDISel: switches would just lower to multiple blocks in one dag. SEME regions are also totally trivial for the sched problem: because your region is just a tree, all you have to do is move a computation "up" the tree towards the root to the place that dominates all its uses. Another interesting thing about SEME regions is that you don't have to worry about phi nodes etc.
OTOH, SEME regions are still limiting, so we would eventually want to do whole function (or at least "arbitrary region") isel. SEME regions do handle "sinking" well, but don't handle LICM (because you can't have a whole loop in the region) and don't handle the diamond pattern than "select" lowers to when a target doesn't support cmov. Wouldn't it be nice if legalize could lower select directly for targets (by inserting blocks into the DAG), instead of having to do the "custom sched inserter" hack?
Wouldn't it be nice if dag combine could actually itself form cmovs from diamond shapes in the cfg? 
At the end of the day, attacking a subclass of the "big problem" is a really interesting way (to me) to incrementally tackle this problem, and I strongly encourage anyone interested in this to start with SEME regions first. This allows us to figure out things like "how do we represent the start of a bb" (probably a new chained node), "how does the sched pass set the block that an instruction goes in" (probably a new field in SDNode), and allows us to do some cleanups (e.g. switch lowering) without tackling some of the other hard problems. Once SEME regions are working really well, we can tackle the rest of the hard problems, because they are obviously worthwhile doing eventually.
In toying with some code generators I've been frustrated in getting the code gen prepare pass + isel to implement the heuristics I need to match some addressing modes and I believe that whole-function isel might be the answer.
I would love to see any progress in this area. It is clearly an important thing to tackle, and it is blocking other interesting improvements in the code generator. It would also allow us to eliminate a significant amount of weirdness that exists to hack around this (e.g. switch lowering). That said, even when we *can* do whole function isel, I think we should keep the ability to handle subregions of the function for -O0, at least until SD stuff scales perfectly :).
-Chris