[global-isel] Proposal for a global instruction selector

I am hoping that this proposal will generate a lot of feedback, and there are many different topics to discuss. When replying to this email, please change the subject header to something more specific, but keep the [global-isel] tag.

Thanks,
/jakob

Proposal for a Global Instruction Selector

It is becoming evident that we need a replacement for the SelectionDAG-based instruction selector. This proposal describes the architecture of a global instruction selector based on the MachineInstr intermediate representation.

A new instruction selector is a large project that will probably take a couple of years to complete. This proposal describes only the desired final design; it does not describe the path of incremental development we should follow to get there.

This is a strawman design which shouldn’t be seen as a final proposal. I am hoping for a lot of community feedback, and I don’t expect we’ll have a final design before we actually begin implementing it.

Why not legalize LLVM IR?

It’s been proposed to legalize LLVM IR before instruction selection, see http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/061567.html. I recommend reading the mailing list thread debating the proposal, but let me summarize the reasons for legalizing MI instead of LLVM IR:

  • ABI boundaries are represented at too high a level in LLVM IR. By ABI boundaries, I mean function call and return, landing pads, and inline assembly.

    LLVM IR doesn’t make it clear which function arguments go in registers and which go on the stack. Stack frame and call frame maintenance instructions are not represented in LLVM IR. Byval arguments are a big mystery, and some arguments are actually passed in multiple registers, depending on their type.

    While LLVM IR could be extended to represent all these things, it would pollute the clean design of LLVM IR, and it is not the right level of abstraction for this IR.

  • The getelementptr instruction is too abstract. We want pointer arithmetic to be exposed as normal integer arithmetic in a type system without pointers. This would require a lot of awkward pointer-to-integer conversions in LLVM IR.

  • The LLVM IR type system is too high-level. We would need to get rid of first-class aggregates and pointers. This is possible, but it would require lots of bitcasts and other noisy instructions that are really no-ops as far as instruction selection is concerned.

  • We want LLVM IR to remain readable by future versions of the compiler, and we want target authors to be free to rearrange their designs as a target evolves. Requiring target authors to autoupgrade old idioms (e.g., the equivalent of X86ISD nodes) isn’t reasonable.

Motivation

Everybody loves to hate SelectionDAG, but it is still useful to make its shortcomings explicit. These are some of the goals for a new instruction selector architecture:

  • We want a global instruction selector.

    SelectionDAG operates on a basic block at a time, and we have been forced to implement a number of hacks to work around that. For example, most of CodeGenPrepare is moving instructions around to make good local instruction selection more likely. Legalization of switches and selects must be done either before or after instruction selection because it requires creating new basic blocks.

    A number of passes running after instruction selection are also mostly about cleaning up after the single-basic-block selector. This includes MachineCSE, MachineLICM, and the peephole pass.

  • We want a faster instruction selector.

    The SelectionDAG process is quite heavyweight because it uses continuous CSE, a whole new IR, and a mandatory scheduling phase to linearize the DAG. By selecting directly to MI, we can avoid one IR translation phase. By using a linearized IR, scheduling becomes optional.

  • We want a shared code path for fast and good instruction selection.

    Currently, the fast instruction selector used for -O0 builds is a completely separate code path. This is not healthy because it increases the likelihood of bugs in the fast path that were not present in the slow path.

    It would be better if the -O0 instruction selector were a trimmed down version of the full instruction selector.

  • We want an IR that represents ISA concepts better.

    The SelectionDAG IR is very good at representing LLVM IR directly, but as the code is lowered to model target machine concepts, weird hacks are often required. This is evident in the excessive number of SDNodes required to represent a function call, and in the many custom ISD nodes that targets need to define.

    In many cases, custom target code knows exactly which instructions it wants to produce, and the IR should make it possible and easy to just emit the desired instructions directly. The MI intermediate representation isn’t perfect either, and we should plan some MI improvements as well.

    The SelectionDAG concept of legal types and their mapping to a single register class often causes problems. In some cases, it is necessary to lie about value types, just to get the instruction selector to do the right thing.

  • We want a more configurable instruction selector.

    Weird targets have weird requirements, and it should be possible for targets to inject new passes into the instruction selection process. Sometimes, it may even be required to replace a standard pass.

Overall design

The global instruction selector is implemented as a series of machine function passes. Targets can inject their own passes or replace standard passes, using the existing mechanism for configuring code generation pipelines.

The MI intermediate representation is extended with a set of abstract target-independent opcodes very similar to the ISD opcodes used by SelectionDAG. Virtual registers can be typed so they have an EVT instead of a register class.

The standard passes are:

  1. MI builder.
  2. Switch lowering.
  3. Iterative legalization and cleanup.
  4. Register bank selection.
  5. Common subexpression elimination.
  6. Instruction selection.

I’ll describe these passes in more detail below.

MI builder

The MI builder pass translates LLVM IR to MI IR, much like the current SelectionDAGBuilder phase. Like SelectionDAGBuilder, this pass will:

  • Expand LLVM IR first-class aggregate types into their constituent parts. This also includes splitting load, store, phi, and select instructions.
  • Replace pointer types with target-specific integers.
  • Expand getelementptr instructions into integer adds and multiplies (see the sketch after this list).
  • Map LLVM instructions to target-independent abstract MI opcodes.
  • Lower ABI boundaries with help from target hooks.
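
To make the getelementptr expansion concrete, here is a sketch in the borrowed LLVM IR syntax used for the MI examples later in this proposal; the exact opcode spellings are illustrative:

; LLVM IR: index into an array of { i32, float } structs, then field 1.
;   %p = getelementptr { i32, float }* %base, i32 %i, i32 1
; After the MI builder (pointers are now target-sized integers, here i32):
%off:i32 = mul i32 %i, 8        ; scale the index by the 8-byte struct size
%tmp:i32 = add i32 %base, %off
%p:i32   = add i32 %tmp, 4      ; constant offset of field 1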

Unlike SelectionDAGBuilder, this pass will:

  • Create a 1-1 mapping of the LLVM IR CFG. No new basic blocks are created.
  • Preserve switch instructions.
  • Preserve i1 logic instructions.
  • Not use MERGE_VALUES instructions. We’ll use a value map that can handle aggregates.

The aggregate type expansion creates value types that can all be represented by EVTs, and MachineRegisterInfo will be extended to allow virtual registers to have an EVT instead of a register class. EVTs are all the integer, floating point, and vector types from LLVM IR.

The ABI boundary lowering requires types to be broken down further into ‘legal types’ that can be mapped to registers. The secondary breakdown is currently handled by TargetLowering::LowerCallTo() calling getRegisterType() and getNumRegisters(). Most ABIs are defined in terms of C types, not LLVM IR types, so there is a close connection between the C frontend and the ABI lowering code in the instruction selector. It would be a good idea to have the ABI lowering code work independently of the type system used during instruction selection. This is made possible by having ABI lowering be part of the LLVM IR to MI translation process.
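
As a sketch of what a lowered ABI boundary might look like for a call returning i64 under an ARM-style calling convention (the register assignments and the BL operand form here are illustrative, not a committed design):

; LLVM IR:
;   %r = call i64 @f(i32 %a)
; After ABI lowering in the MI builder:
%R0 = COPY %a                         ; first integer argument goes in r0
BL @f, implicit %R0, implicit-def %R0, implicit-def %R1
%rl:i32 = COPY %R0                    ; the i64 result comes back in r0/r1
%rh:i32 = COPY %R1
%r:i64 = build_pair %rl, %rh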

Switch lowering

SelectionDAGBuilder lowers switches immediately because the CFG can’t be changed during instruction selection. This isn’t required with a global instruction selector, so switch lowering can be its own pass, simplifying the design of the MI builder.

The switch lowering pass converts switch instructions into a combination of branch trees and jump tables. A default target-independent implementation is possible, but targets may want to override this pass. For example, the ARM target could try to create jump tables that would work well with the TBB/TBH instructions to reduce code size.

It is convenient to lower switches in the MI intermediate representation because it already has an object type for jump tables.
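
A sketch of the lowering, again borrowing LLVM IR syntax; the br_jt opcode and the jump-table object are illustrative:

; Before:
;   switch i32 %x, label %def [ i32 0, label %a
;                               i32 1, label %b
;                               i32 2, label %c ]
; After switch lowering (a new dispatch block is created):
  %in:i1 = icmp ult i32 %x, 3
  br i1 %in, label %dispatch, label %def

dispatch:
  br_jt i32 %x, jump-table.0          ; jump-table.0 = [ %a, %b, %c ]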

Legalization

SelectionDAG’s concept of legal types isn’t clearly defined. It seems to be a bit random which operations must be supported before a type can be considered legal. We’ll define legal types precisely:

A type is considered legal if it can be loaded into and stored from an allocatable register class. (And all allocatable register classes must support copy instructions.)

We don’t require any supported operations other than load and store for a type to be legal, and all types that can be loaded and stored are legal. This means that most targets will have a much larger set of legal types than they do today. Also note that we don’t require targets to designate a register class to use for each legal type; in fact, TableGen can compute the legal type set automatically. Register classes can be inferred from the selected instructions; MachineRegisterInfo::recomputeRegClass() already knows how to do that.

With a much larger set of legal types, a separate type legalization phase becomes superfluous. The operation legalizer must be able to do anything the type legalizer can do anyway, so the type legalizer isn’t adding any value.

The legalizer will work bottom-up and iteratively. As illegal instructions are visited, the legalizer will apply transformations that make the current instruction ‘more legal’. Each transformation is allowed to modify a whole chain of single-use instructions for efficiency, but it is also allowed to modify only the current instruction in hard cases. The instructions created don’t need to be legal; the legalizer iterates until the current instruction is legal before it moves on.
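
For example, a single legalization step on a 32-bit target might split an illegal select i64 into legal i32 pieces, using the same illustrative extract_element/build_pair spellings as the examples later in this proposal:

; Illegal: no 64-bit select on this target.
%r:i64 = select i1 %c, i64 %a, i64 %b
; One legalization step produces:
%al:i32 = extract_element i64 %a, 0
%ah:i32 = extract_element i64 %a, 1
%bl:i32 = extract_element i64 %b, 0
%bh:i32 = extract_element i64 %b, 1
%rl:i32 = select i1 %c, i32 %al, i32 %bl
%rh:i32 = select i1 %c, i32 %ah, i32 %bh
%r:i64  = build_pair %rl, %rh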

The set of new instructions created while legalizing a single instruction is fed through an instruction simplifier that cleans up redundancies. This replaces DAGCombine.

Clearly, more details are required here, and community input would be much appreciated. Some specific questions that need answering are:

  • What does ‘more legal’ mean? Can we restrict the possible legalization transformations so the iterative process is guaranteed to make progress?

  • A long time ago, SelectionDAG had a single legalization pass. It was broken into multiple passes because the code got too complicated. Is the iterative approach proposed here good enough to keep the code complexity under control?

    In my opinion, it’s the value mapping that gets you. If the legalization can be broken into small well-defined transformations with no shared state, such as a value map, then there is no reason for the code to become complicated.

    Compare this to the live range splitting implemented in the register allocator. Splitting a live range can be fairly complicated, but it is implemented as a transaction that transforms valid IR to valid IR. We no longer have the side maps with deferred spill code and so on; there is no shared state between live range splits other than the IR. This makes the problem manageable.

  • DAGCombine is supposed to clean up legalization products, but today it seems to be accumulating some transformations that probably belong in InstCombine. Since it is running so late in the pass pipeline, its canonicalizing approach is causing problems for targets that are trying to arrange the IR optimally. Optimal and canonical are not always the same thing.

    Can we get away with a gentler instruction simplifier that is more focused on cleaning up after legalization?

  • Does it still work if we constrain the instruction simplifier to creating legal operations?

Register bank selection

Many instruction set architectures have multiple register banks. X86 has three: integer, vector, and x87 floating point registers. (Four if you count MMX registers as a separate bank.) Blackfin and m68k have separate pointer and data register banks, etc. It is also common to have the same operations available in multiple register banks. For example, most ISAs with a vector register bank support bitwise and/or/xor operations in both the integer and vector register banks.

SelectionDAG uses the type system to select register banks; each legal type is mapped to a single register class. This mapping causes a lot of strange modeling problems in targets, and it often leads to spurious cross-class copies with values bouncing back and forth between register banks. See for example the strange occurrences of v1i64 in the ARM target. It is also very difficult to make efficient use of operations that are available in multiple register banks.

The global instruction selector will assign virtual registers to register banks explicitly, not by using the type system. The set of register banks is typically small (2-3) and defined by the target. Modelling register banks explicitly makes it possible to move operations between register banks to minimize cross-bank copies, which are often expensive. SPARC even requires cross-bank copies to go through memory, as does x86 in some cases.

The register bank selection pass computes the optimal bank assignment and inserts copy instructions when a value needs to cross banks. Sometimes, it may be beneficial to have the same value available in two register banks simultaneously; this can also be represented with cross-bank copy instructions. The bank selection can also be affected by register pressure concerns. On x86-64, for example, many i32 values could be moved to the SSE registers, freeing up the integer registers.

The interaction between legalization and bank selection is not clear yet. For example, i64 would be a legal type on ARM with and/or/xor/add/sub operations supported in a D-register. Some i64 operations are not supported in D-registers, and it would be necessary to expand those i64 operations to i32 operations in the GPR bank. When that happens, it is most likely preferable to ‘legalize’ more connected i64 operations so they can be moved to the GPR bank as well.

Instruction selection

The instruction selection pass replaces most target-independent instructions with real target opcodes, and it ensures that all virtual registers have register classes instead of EVTs. Some target-independent instructions, like COPY, are not lowered until after register allocation.

SelectionDAG instruction selection is controlled only by expression types, and the selected instructions are expected to use the unique register banks selected by the type:

(operation, type) --> opcode

We’re representing register banks explicitly, and many operations are available in multiple banks, so the register bank needs to be part of the instruction selection:

(operation, type, bank) --> opcode

On the other hand, when types are not used to select register banks, it becomes really difficult to explain the difference between load i32 and load f32. The hardware doesn’t care either; it simply knows how to load 32 bits into a given register. We can use a three-level hierarchical type system to better describe this:

  1. Bit width. Operations like load, store, select, and the bitwise and/or/xor only depend on the number of bits in the operands. Their effect is independent of the vector lane structure and whether lanes are ints or floats.

  2. Vector lanes + lane width. Operations like shufflevector and insertelement depend on the vector topology of the operand types, but they don’t care if vector lanes are floats or ints.

  3. Full EVTs. Finally, arithmetic instructions actually depend on the full type of their operands. It is worth noting, though, that the int/float distinction is already redundantly encoded in the operations. LLVM IR has separate add and fadd instructions, for example.

The instruction selection will only use the relevant parts of the operand type, depending on the operation being matched. It will consider load i32, load f32, and load v2i16 to be simply 32-bit wide loads. The target is likely to have multiple 32-bit load instructions. The explicit register bank is used to pick the right one.
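
On ARM, for instance, the two 32-bit load selections might look like this (a hypothetical rendering of the mapping above, using the real ldr and vldr mnemonics):

(load, 32 bits, IntBank)   --> ldr  rN, [...]
(load, 32 bits, FloatBank) --> vldr sN, [...]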

Apart from the way operations and types are matched, the instruction selection algorithm is a lot like the current SelectionDAG algorithm. The process is less transactional than it is in SelectionDAG. Specifically, targets are allowed to insert pre-selected instructions at any time.

The instruction selection algorithm is global, which means it can look upwards in the dominator tree when matching patterns. A cost model is required to determine when it is a good idea to fold instructions outside the current loop. The same applies to matching instructions with multiple uses.

Ramifications

This proposal goes a lot further than simply implementing the same instruction selector on a new intermediate representation. It’s worthwhile to point out some of the effects of the proposed design.

More legal types

The new definition of a legal type is very permissive, and there are going to be many more legal types in most targets. It is also worth noting that the legality of a type is a function of the type’s bit size only. In other words, if f64 is a legal type, so are i64, v2f32, and even v64i1. On the ARM target, for example, these types would be legal:

  • All 8-bit types via ldrb/strb to GPR. (i8, v1i8, v2i4, v4i2, v8i1)
  • All 16-bit types via ldrh/strh to GPR. (i16, f16, v1i16, v2i8, …)
  • All 32-bit types via ldr/str to GPR and vldr/vstr to SPR.
  • All 64-bit types via ldrd/strd to GPRPair and vldr/vstr to DPR.
  • All 128-bit types via vld1/vst1 to DPair.
  • All 192-bit types via vld1/vst1 to DTriple.
  • All 256-bit types via vld1/vst1 to DQuad.

This larger set of legal types also makes it easier to handle things like extractelement <8 x i8> which currently produces an illegal type and is thus obfuscated by the type legalizer.

An i8 live range could exist in an ARM function as long as the value is only loaded and stored. We could even use a register class with a single-byte spill size. Different spill sizes for the same registers are already supported; that is how the x86 target holds 32-bit floats in the 128-bit xmm registers. If the i8 value has any illegal arithmetic instructions, the legalizer will expand it to i32, and the expansion should probably be propagated to other basic blocks and threaded through phis.

Most of CodeGenPrepare can go away

The instruction selector will be able to match patterns across basic blocks by looking up the dominator tree. This means that CodeGenPrepare no longer needs to sink GEPs and compares down to their uses. CodeGenPrepare has a lot of addressing mode matching code which can simply be deleted.
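
For example, the selector could fold an address computation from a dominating block directly into a load’s addressing mode, the very case CodeGenPrepare handles today by sinking the arithmetic (a sketch; the folded ldr is the intended ARM-style result):

entry:
  %a:i32,IntBank = add i32 %base, 16
  br label %use

use:
  %v:i32,IntBank = load %a, align 4   ; selects as: ldr rV, [rBase, #16]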

Other transformations that currently live in CodeGenPrepare could probably be moved to the MI representation, although it isn’t quite clear if we would want to do that.

Better modeling of ISA concepts

Try compiling this function for SPARC64:

float f(int *p) {
  return *p;
}

It’s a simple load and integer-to-float conversion which produces this IR:

define float @f(i32* nocapture %p) {
entry:
  %l = load i32* %p, align 4
  %conv = sitofp i32 %l to float
  ret float %conv
}

SelectionDAG generates this SPARC64 assembly:

ld [%i0], %l0
st %l0, [%fp+2043]
ld [%fp+2043], %f0
fitos %f0, %f1

The integer value %l is first loaded into an integer register (%l0) and then bitcast to a float so it can be used by the fitos instruction, which reads an integer in %f0 and writes its float result to %f1. The SPARC target is essentially creating a DAG representing this IR:

define float @f(i32* nocapture %p) {
entry:
  %l = load i32* %p, align 4
  %l2 = bitcast i32 %l to float
  %conv = sitofp float %l2 to float
  ret float %conv
}

The bitcast instruction really represents a cross-bank register copy, and the sitofp instruction no longer typechecks; it really does read an integer from a floating point register, but SelectionDAG has no way of representing that. The type labels are no longer representing types, they are purely used to select register banks. The %l2 is not a float, and it isn’t being interpreted as a float by the fitos instruction. It’s an integer in the floating point register bank.

By modeling register banks better, we don’t have to lie to the instruction selector or depend on its type checking being defunct, and we get better results for free. Initially, the MI builder would produce code like this, with default register bank labels derived from value types:

define float @f(i32* nocapture %p) {
entry:
  %l:IntBank = load i32* %p, align 4
  %conv:FloatBank = sitofp i32 %l to float
  ret float %conv
}

The legalizer recognizes that SPARC only supports sitofp in the FloatBank register bank, so it inserts a cross-bank copy:

define float @f(i32* nocapture %p) {
entry:
  %l:IntBank = load i32* %p, align 4
  %l2:FloatBank = COPY %l
  %conv:FloatBank = sitofp i32 %l2 to float
  ret float %conv
}

Finally, the bank selection pass can eliminate the cross-bank copy by moving the %l value to the floating point register bank:

define float @f(i32* nocapture %p) {
entry:
  %l:FloatBank = load i32* %p, align 4
  %conv:FloatBank = sitofp i32 %l to float
  ret float %conv
}

Now we have an i32 load into the floating point register bank, which is fine; the instruction selector simply has a pattern that matches any 32-bit load targeting the floating point bank, and we get this code:

ld [%i0], %f0
fitos %f0, %f1

Better-tuned targets like ARM already have peephole patterns that handle code like this, but it shouldn’t be necessary to manually fix all these cases. The ISA fully supports integers in the ‘floating point’ register banks, so the instruction selector should be able to model that as a first class concept.

Fragile legal types

The new model allows legal types with very few legal operations, and this creates extra challenges for the legalizer. Some legal types are ‘fragile’ in the sense that it can be beneficial to split even legal operations to avoid too many cross-bank copies. Consider this ARM program:

void dot(uint64_t *a, uint64_t *p, uint64_t *q, unsigned n) {
  uint64_t s = 0;
  while (n--)
    s += *p++ * *q++;
  *a = s;
}

which produces this IR after the MI builder has expanded GEPs: (I am borrowing LLVM IR syntax to describe the MI IR.)

define void @dot(i32 %a, i32 %p, i32 %q, i32 %n) {
entry:
  %sum0:i64,VecBank = const i64 0
  %cmp1 = icmp eq i32 %n, 0
  br i1 %cmp1, label %exit, label %loop

loop:
  %iv:i32,IntBank = phi i32 [ %iv2, %loop ], [ %n, %entry ]
  %p1:i32,IntBank = phi i32 [ %p2, %loop ], [ %p, %entry ]
  %q1:i32,IntBank = phi i32 [ %q2, %loop ], [ %q, %entry ]
  %sum1:i64,VecBank = phi i64 [ %sum2, %loop ], [ %sum0, %entry ]
  %iv2:i32,IntBank = add i32 %iv, -1
  %p2:i32,IntBank = add i32 %p1, 8
  %q2:i32,IntBank = add i32 %q1, 8
  %lp:i64,VecBank = load %p1, align 4
  %lq:i64,VecBank = load %q1, align 4
  %prod:i64,VecBank = mul i64 %lp, %lq
  %sum2:i64,VecBank = add i64 %prod, %sum1
  %cmp2 = icmp eq i32 %iv2, 0
  br i1 %cmp2, label %exit, label %loop

exit:
  %s:i64,VecBank = phi i64 [ %sum0, %entry ], [ %sum2, %loop ]
  store i64 %s, i32 %a, align 4
  ret void
}

These instructions are all legal on ARM, except for the mul i64 instruction, which the legalizer will split and move to the integer register bank:

define void @dot(i32 %a, i32 %p, i32 %q, i32 %n) {
entry:
  %sum0:i64,VecBank = const i64 0
  %cmp1 = icmp eq i32 %n, 0
  br i1 %cmp1, label %exit, label %loop

loop:
  %iv:i32,IntBank = phi i32 [ %iv2, %loop ], [ %n, %entry ]
  %p1:i32,IntBank = phi i32 [ %p2, %loop ], [ %p, %entry ]
  %q1:i32,IntBank = phi i32 [ %q2, %loop ], [ %q, %entry ]
  %sum1:i64,VecBank = phi i64 [ %sum2, %loop ], [ %sum0, %entry ]
  %iv2:i32,IntBank = add i32 %iv, -1
  %p2:i32,IntBank = add i32 %p1, 8
  %q2:i32,IntBank = add i32 %q1, 8
  %lp:i64,VecBank = load %p1, align 4
  %lq:i64,VecBank = load %q1, align 4

  %lpl:i32,IntBank = extract_element i64 %lp, 0
  %lph:i32,IntBank = extract_element i64 %lp, 1
  %lql:i32,IntBank = extract_element i64 %lq, 0
  %lqh:i32,IntBank = extract_element i64 %lq, 1

  %prodl:i32,IntBank, %prodh0:i32,IntBank = umul_lohi i32 %lpl, %lql
  %prodh1:i32,IntBank = mul i32 %lpl, %lqh
  %prodh2:i32,IntBank = mul i32 %lph, %lql
  %prodh3:i32,IntBank = add i32 %prodh0, %prodh1
  %prodh:i32,IntBank = add i32 %prodh2, %prodh3

  %prod:i64,VecBank = build_pair %prodl, %prodh

  %sum2:i64,VecBank = add i64 %prod, %sum1
  %cmp2 = icmp eq i32 %iv2, 0
  br i1 %cmp2, label %exit, label %loop

exit:
  %s:i64,VecBank = phi i64 [ %sum0, %entry ], [ %sum2, %loop ]
  store i64 %s, i32 %a, align 4
  ret void
}

All operations are now legal in their respective register banks, and an -O0 build would probably stop here. However, this code is quite expensive to execute because of the many cross-bank copies, and the bank selector pass will clean that up. First, the two 64-bit loads are moved to the integer register bank where their values are needed. Second, the legal add i64 operation is split so it can also be moved to the integer register bank, and the split is threaded through the %sum1 phi to eliminate all cross-bank copies:

define void @dot(i32 %a, i32 %p, i32 %q, i32 %n) {
entry:
  %sum0l:i32,IntBank = const i32 0
  %sum0h:i32,IntBank = const i32 0
  %cmp1 = icmp eq i32 %n, 0
  br i1 %cmp1, label %exit, label %loop

loop:
  %iv:i32,IntBank = phi i32 [ %iv2, %loop ], [ %n, %entry ]
  %p1:i32,IntBank = phi i32 [ %p2, %loop ], [ %p, %entry ]
  %q1:i32,IntBank = phi i32 [ %q2, %loop ], [ %q, %entry ]
  %sum1l:i32,IntBank = phi i32 [ %sum2l, %loop ], [ %sum0l, %entry ]
  %sum1h:i32,IntBank = phi i32 [ %sum2h, %loop ], [ %sum0h, %entry ]
  %iv2:i32,IntBank = add i32 %iv, -1
  %p2:i32,IntBank = add i32 %p1, 8
  %q2:i32,IntBank = add i32 %q1, 8
  %lp:i64,IntBank = load %p1, align 4
  %lq:i64,IntBank = load %q1, align 4

  %lpl:i32,IntBank = extract_element i64 %lp, 0
  %lph:i32,IntBank = extract_element i64 %lp, 1
  %lql:i32,IntBank = extract_element i64 %lq, 0
  %lqh:i32,IntBank = extract_element i64 %lq, 1

  %prodl:i32,IntBank, %prodh0:i32,IntBank = umul_lohi i32 %lpl, %lql
  %prodh1:i32,IntBank = mul i32 %lpl, %lqh
  %prodh2:i32,IntBank = mul i32 %lph, %lql
  %prodh3:i32,IntBank = add i32 %prodh0, %prodh1
  %prodh:i32,IntBank = add i32 %prodh2, %prodh3

  %sum2l:i32,IntBank, carry = addc i32 %prodl, %sum1l
  %sum2h:i32,IntBank = adde %prodh, %sum1h

  %cmp2 = icmp eq i32 %iv2, 0
  br i1 %cmp2, label %exit, label %loop

exit:
  %sl:i32,IntBank = phi i32 [ %sum0l, %entry ], [ %sum2l, %loop ]
  %sh:i32,IntBank = phi i32 [ %sum0h, %entry ], [ %sum2h, %loop ]
  %s:i64,IntBank = build_pair %sl, %sh
  store i64 %s, i32 %a, align 4
  ret void
}

The bank selection pass will need to weigh the cost of an extra adde instruction against the cost of a cross-bank copy. Sometimes, like in this example, it is preferable to split legal operations. On the other hand, if it weren’t for the multiplication, the entire function would run in the vector register bank. A single instruction can cause entire expression webs to migrate.

[snip]

DAGCombine is supposed to clean up legalization products, but today it seems to be accumulating some transformations that probably belong in InstCombine. Since it is running so late in the pass pipeline, its canonicalizing approach is causing problems for targets that are trying to arrange the IR optimally. Optimal and canonical are not always the same thing.

Can we get away with a gentler instruction simplifier that is more focused on cleaning up after legalization?

From what I've observed, there tends to be a lot to clean up after legalization, especially because of the complexity of expanding GEP and all kinds of vector operations. Can you provide some examples of simplifications in DAGCombine that you think we won't need with this new infrastructure?

What might be better is to put some abstract interface between instcombine and the IR, so that instcombine can be run on these pseudo-MIs as well as on IR.

Thanks again,
Hal

I am hoping that this proposal will generate a lot of feedback, and there are many different topics to discuss. When replying to this email, please change the subject header to something more specific, but keep the [global-isel] tag.

Subject changed, but I'm not sure if it helps...

Overall, I really like this design (and would have had a much easier 18 months if we'd had it two years ago). Comments inline.

Why not legalize LLVM IR?

It's been proposed to legalize LLVM IR before instruction selection, see http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/061567.html. I recommend reading the mailing list thread debating the proposal, but let me summarize the reasons for legalizing MI instead of LLVM IR:

  • ABI boundaries are represented at too high a level in LLVM IR. By ABI boundaries, I mean function call and return, landing pads, and inline assembly.

LLVM IR doesn't make it clear which function arguments go in registers and which go on the stack. Stack frame and call frame maintenance instructions are not represented in LLVM IR. Byval arguments are a big mystery, and some arguments are actually passed in multiple registers, depending on their type.

While LLVM IR could be extended to represent all these things, it would pollute the clean design of LLVM IR, and it is not the right level of abstraction for this IR.

There is a lot of black magic required for ABI support. I can read the SysV ABI supplement for a given architecture and understand how, from my source language, types should be represented and where function arguments should go. There is then almost no documentation on how this maps to the IR, so the only way to find out when writing a new front end for an existing back end is to see what clang generates. The 'clean' design of the IR here is not very clean, as the IR contains something that is neither what the back end nor the front end wants and has a non-obvious mapping to either.

  • The getelementptr instruction is too abstract. We want pointer arithmetic to be exposed as normal integer arithmetic in a type system without pointers. This would require a lot of awkward pointer-to-integer conversions in LLVM IR.

For our back end, we absolutely do not want a type system without pointers - not only do we have separate registers for pointers, we also have separate instructions for manipulating them, and representing this in SelectionDAG requires some very ugly hacks. This is also the case for several DSPs, which have separate address and data registers, and for upcoming Intel chips with MPX, which have 4 hardware bounds registers that need keeping in sync with pointers. Something like GEP is actually a very good model for these, because it allows the same bounds register to be used for a set of pointer operations. However, it would be fine to have a single pointer type that was independent of its pointee type (or a small set of different-sized ones, which will be useful for several GPUs) and a lower-level GEP that was just a byte offset.

For the architectures that don't distinguish between pointers and integers, there should be some option of lowering to integer types; however, the information that a particular register contains an address may still be useful.

Motivation

Everybody loves to hate SelectionDAG, but it is still useful to make its shortcomings explicit. These are some of the goals for a new instruction selector architecture:

I would add:

It is very difficult to extend SelectionDAG / TableGen with new value types that are not very similar to existing ones. For example, adding a new kind of vector is easy, but adding a fat pointer representation is not really possible.

The aggregate type expansion creates value types that can all be represented by EVTs, and MachineRegisterInfo will be extended to allow virtual registers to have an EVT instead of a register class. EVTs are all the integer, floating point, and vector types from LLVM IR.

That all sounds very nice and clean. I'd also like to represent pointers (including their address space, but not their pointee type) in this level and then have an optional pass (or a parameter for this one) that lowers pointer operations to integers.

The ABI boundary lowering requires types to be broken down further into 'legal types' that can be mapped to registers. The secondary breakdown is currently handled by TargetLowering::LowerCallTo() calling getRegisterType() and getNumRegisters(). Most ABIs are defined in terms of C types, not LLVM IR types, so there is a close connection between the C frontend and the ABI lowering code in the instruction selector. It would be a good idea to have the ABI lowering code work independently of the type system used during instruction selection. This is made possible by having ABI lowering be part of the LLVM IR to MI translation process.

This seems cleaner. I wonder if there is a way of making the C ABI more explicit to front-end authors at this point.

The global instruction selector will assign virtual registers to register banks explicitly, not by using the type system. The set of register banks is typically small (2-3) and defined by the target. Modelling register banks explicitly makes it possible to move operations between register banks to minimize cross-bank copies, which are often expensive. SPARC even requires cross-bank copies to go through memory, as does x86 in some cases.

And on some ARM chips (e.g. Cortex A8), the via-memory path is faster than the register-register path in some cases.

The interaction between legalization and bank selection is not clear yet. For example, i64 would be a legal type on ARM with and/or/xor/add/sub operations supported in a D-register. Some i64 operations are not supported in D-registers, and it would be necessary to expand those i64 operations to i32 operations in the GPR bank. When that happens, it is most likely preferable to 'legalize' more connected i64 operations so they can be moved to the GPR bank as well.

How would you model the microMIPS/Thumb-1/RISC-V 16-bit (and, in some cases, x86) idea that a denser encoding can be used if you restrict register accesses to the bottom half of the register set? The registers are still the same bank, and the set of operations is the same, but in some cases much more expensive (having to switch in and out of microMIPS mode, for example, is very expensive). This problem hits multiple layers in the lowering process. I don't have a good solution for it (nor a strong requirement for it, but if there's a design that will make it easier to solve in the future then that would be nice, as we and the people at Berkeley are going to be doing a reasonable amount of work with variable-length encodings over the coming years).

Instruction selection

The instruction selection pass replaces most target-independent instructions with real target opcodes, and it ensures that all virtual registers have register classes instead of EVTs. Some target-independent instructions, like COPY, are not lowered until after register allocation.

What is a target-independent instruction? Does a register-register add count? It would be nice if target-specific machine instructions could have enough metadata that some MI passes could become target independent. For example, I'm not sure how many people have implemented a delay slot filler so far (I know of 4, but there could be more) and it would be great if we could have one generic one and enough metadata added to the instruction definitions that it could detect branches with nops in delay slots and determine which instructions were safe to move there.

  • Full EVTs. Finally, arithmetic instructions actually depend on the full type of their operands. It is worth noting, though, that the int/float distinction is already redundantly encoded in the operations. LLVM IR has separate add and fadd instructions, for example.

And a separate GEP for pointer arithmetic...

David

I thoroughly agree that a more configurable instruction selector is desirable and that DAGCombine's canonicalisation results in less than optimal instruction selection for some targets, such as the microcontrollers I'm working on.

I'd be happy to contribute effort to this task if and when it comes to pass.

Steve

This design does indeed appear generally cleaner than the current state of the world. What follows is a bit long, so here's the tl;dr version:
You *really* want to support address types in your type system, even if you expand GEP-like operations into open arithmetic. This lets you produce good code for a wide variety of odd-ball architectures with relative ease. Failing to support address types makes it really difficult to support those odd-ball architectures.

  • The getelementptr instruction is too abstract. We want pointer arithmetic to be exposed as normal integer arithmetic in a type system without pointers. This would require a lot of awkward pointer-to-integer conversions in LLVM IR.

For our back end, we absolutely do not want a type system without pointers - not only do we have separate registers for pointers, we also have separate instructions for manipulating them, and representing this in SelectionDAG requires some very ugly hacks. This is also the case for several DSPs, which have separate address and data registers, and for upcoming Intel chips with MPX, which have 4 hardware bounds registers that need keeping in sync with pointers. Something like GEP is actually a very good model for these, because it allows the same bounds register to be used for a set of pointer operations. However, it would be fine to have a single pointer type that was independent of its pointee type (or a small set of different-sized ones, which will be useful for several GPUs) and a lower-level GEP that was just a byte offset.

For the architectures that don't distinguish between pointers and integers, there should be some option of lowering to integer types; however, the information that a particular register contains an address may still be useful.

At Tartan Labs, we found it very useful to represent three different kinds of information: the type of each value, the result context of each operation, and whether operations required checks (or not).

The type system should absolutely know about addresses! At Tartan, the backend (optimizer & code-gen) concept of "type" included more kinds of types than those described by the OP. Specifically, when compiling for various DSPs and other unusual architectures, we found it was crucial to know not only which values were "addresses", but also what specific kind of address they were. Specific kinds of addresses were used to support machines with multiple address spaces, fat pointers of various kinds (e.g., [pointer-to-array, pointer-to-bounds-info], [pointer, capability], or even [pointer, capability, bounds-info] tuples) and so on.

Consider, for example, a machine with a split Harvard architecture, where there are separate buses for the Code, Data-A and Data-B address spaces. Given a "pointer" value, you must emit different instructions to load from one of those spaces; the same bit-pattern in the value references completely different memory. On other architectures, you also had to use different registers to access the different address spaces. And you only get good performance by keeping all the code and data buses busy simultaneously.

The machine with fat-pointers-with-capabilities required the use of both special instructions and special registers for operations on those pointer-with-capa values. Using ordinary (e.g., non-capa) instructions or registers cleared the capability bits and so wiped out access to the protected resources. Generating good code for architectures like these requires richer knowledge of the types of the values the compiler is working with.

Result context may not matter for a 3-address IR such as LLVM uses.
The simplest result context system we used had three values: address, value, and flow. This mattered because we had a tree-based IR, in which expression temporaries were implicit. So we needed context information to guide allocation of expression temporaries. Given a 3-address IR, I suspect that at least address and value context are redundant with the type information for the destination.

That leaves only "flow" as an interesting concept. It was useful to know when the result of an operation would be used for control flow (rather than being stored in some other variable). This helped us make better use of condition codes (for machines that have them), or even of multiple condition code registers. It also helped guide flow-targeted computations and temporaries to register/instruction combinations that also set condition codes (when relevant). The usefulness of flow context across a very broad mix of architectures may suggest that one or more "flow"-like types could be useful.

Finally, checked operations.
If you ever expect to support one or more languages that do overflow checking on arithmetic (or pointer deref, or array bounds, or address-sanitization, or ...), you may find it useful to know whether particular operations (a) require checks, (b) require lack-of-checks, or (c) don't care about checks -- all of these from the viewpoint of language semantics. In addition, when checks are required, it's really useful to distinguish the cases where the result of the check is statically known (e.g., guaranteed to succeed or to fail) -- we discovered this latter property in the optimizer, as part of removing unnecessary checks. The obvious alternative here is to require front-ends that need checks to generate the obvious IR to perform said check. However, it's rather more difficult to generate efficient code this way. For C & C++, I believe that unsigned requires lack-of-checks (because it's defined to wrap), but signed integers don't care for most implementations (wrapping or not is undefined).

Dean Sutherland
(a Tartan Labs alumnus)

There is no doubt that we need a lot of cleanup after legalization.

The problem is with the approach taken by DAGCombine, which is a canonicalizer more than it is an optimizer. Specifically, it will reassociate expressions without achieving anything by the transformation.

LSR sometimes tries to make code look right for post-increment addressing mode patterns, but DAGCombine messes things up before the instruction selector gets to see them.

Thanks,
/jakob

We already support Thumb-2 quite well with the CostPerUse field in register descriptors. The register allocator tries to use the cheap half of the register bank more often to allow the compact encoding to be used as much as possible. This also applies to the x86-64 REX prefix.

Having to switch CPU modes mid-function sounds more like a research project. I suppose you could model MIPS/microMIPS as two separate register banks, but you would at least need a custom bank selector pass.

Thanks,
/jakob

I am a bit skeptical about the need for pointer types; I don’t think they are the right level of abstraction for an instruction selector. The processors I know about with separate address and data registers (Blackfin, m68k, TriCore) would be modeled perfectly by the register bank labels in the proposal.

Won’t that work for your case?

Thanks,
/jakob

I like the idea of sharing code between IR and MI passes through an abstract interface. I think that later stages in the IR pipeline also need an instruction optimizer instead of a canonicalizer.

An alternative approach would be to describe these transformations in a DSL instead of C++.

Thanks,
/jakob

It might work, although given that our fat pointers are not integers I'm not sure how you think we should represent them. We can represent them as i256, but that would be very misleading as they don't have the majority of integer operations defined on them, and we can't materialise a pointer from an integer, only from an existing pointer and an offset. We ideally want address operations to always be of the form base + offset, where the offset is in an integer register and moves, but the base is not. Currently, we are generating a lot of increment-base + load sequences and have a pre-RA pass that tries to turn these back into load base+offset instructions.

Modelling the more complex addressing mode (something like a GEP) directly, and then expanding the ones that can't be expressed with the current target seems cleaner.

The related issue of the pattern matching in TableGen not being able to cope with multiple valid pointer types is also something I've not seen addressed in your proposal.

I'm also not sure how MPX would be represented in this model, where every pointer has a base and a length in one register set and a cursor in another.

David

Hi Jakob,

Thanks for creating this interesting proposal.
Let me just comment on this part:

What might be better is to put some abstract interface between instcombine and the IR, so that instcombine can be run on these pseudo-MIs as well as on IR.

I like the idea of sharing code between IR and MI passes through an abstract interface. I think that later stages in the IR pipeline also need an instruction optimizer instead of a canonicalizer.

An alternative approach would be to describe these transformations in a DSL instead of C++.

*Legalization*
- What does 'more legal' mean? Can we restrict the possible legalization transformations so the iterative process is guaranteed to make progress?

I've been thinking for a while that we could express the instcombine transformations in a DSL, and then generate the C++ code from there. The advantage is that if we formalize the LLVM IR, then we can check the correctness of all the instcombine transformations for "free".
Moreover, instcombine has also suffered from infinite loops in the past (because canonicalization did not make progress for some inputs), which is also your concern with legalization of MI. We have algorithms to prove termination of rewriting systems, so I believe we could also prove progress for both instcombine and MI legalization.

I'm mixing MI legalization and instcombine, since I think that the correctness and progress checking technology that we would need is exactly the same.
I wouldn't mind working on this project. But the time-frame on my side is academic, which may mean 1 or 2 years to completion.

Nuno

This sounds promising. But we have some requirements that textbook rewriting systems can't handle:

- Expressions are DAGs, not trees.
- Targets can add custom rewriting rules and override standard rules.
- Rules will have predicates. Some predicates are static and only depend on the subtarget; some predicates will depend on the pattern being matched.
- The system won't be convergent if you ignore all the predicates.

So the question is, given those requirements, can we still prove termination?

Thanks,
/jakob

Yeah, that's probably not a good idea to model as i256.

I would like an extensible system that can do more than just adding pointers to the type system. I think it should be possible for targets to pre-select instructions in custom passes. If you know the instruction you want, just insert it directly. That shouldn't be hard to handle for the target-independent code.

I also think it would make sense to allow virtual registers that have a register class instead of an (EVT, Bank) tuple before instruction selection. That would probably make sense for something like PowerPC condition registers; there is no reason to invent fake types for that.

If Nuno can pull off a rewriting system with a target-extensible rule set, it probably wouldn't be too hard to allow custom opaque types either.

Thanks,
/jakob

Sounds good to me.

David

Just a comment:
If you believe you can represent all instcombine transformations in a DSL, you can then transform them to C++. Why not work out a method for checking correctness in the C++? Then you would not have to convert everything to the DSL, and could run the checks on the current C++. The C++ code should be able to be written in a static-analysis-friendly way so that we could use static analysis to prove no infinite loops.

James

[snip]

This sounds promising. But we have some requirements that textbook rewriting systems can't handle:

- Expressions are DAGs, not trees.
- Targets can add custom rewriting rules and override standard rules.
- Rules will have predicates. Some predicates are static and only depend on the subtarget; some predicates will depend on the pattern being matched.
- The system won't be convergent if you ignore all the predicates.

So the question is, given those requirements, can we still prove termination?

Uhm, I'm not sure about the custom target rules, since most likely a bunch of them will need to be specified in C++, and then they will be a black box to the tool...

Regarding the predicates, how complicated are these? Aren't these expressions mostly in linear integer arithmetic? Or can you have arbitrary calls to complex C++ code?

Nuno

[snip]

Just a comment:
If you believe you can represent all instcombine transformations in a DSL, you can then transform them to C++. Why not work out a method for checking correctness in the C++? Then you would not have to convert everything to the DSL, and could run the checks on the current C++. The C++ code should be able to be written in a static-analysis-friendly way so that we could use static analysis to prove no infinite loops.

I believe the advantage of writing down the optimizations in a DSL is that the representation is way more succinct. The code in instcombine is a bit boring and repetitive (and huge!).
We could, however, try to automatically convert from C++ to the DSL. Not impossible, AFAICT.
Proving absence of loops is a bit more complicated than traditional static analysis. It often requires things like synthesis of ranking functions (to prove that the transition relation is well-founded), which is not trivial.

Nuno

[snip]

I believe the advantage of writing down the optimizations in a DSL is that the representation is way more succinct. The code in instcombine is a bit boring and repetitive (and huge!).
We could, however, try to automatically convert from C++ to the DSL.
And, as per the purpose of the original thread ;) -- we'd like to apply the same transformation logic (or some significant part of it) both to IR and also to these global-isel pseudo-MI instructions post-legalization.

-Hal

This sounds promising. But we have some requirements that textbook rewriting systems can’t handle:

  • Expressions are DAGs, not trees.
  • Targets can add custom rewriting rules and override standard rules.
  • Rules will have predicates. Some predicates are static and only depend on the subtarget; some predicates will depend on the pattern being matched.
  • The system won’t be convergent if you ignore all the predicates.

So the question is, given those requirements, can we still prove termination?

Uhm, I’m not sure about the custom target rules, since most likely a bunch of them will need to be specified in C++, and then they will be a black box to the tool…

I think that SelectionDAG forces you to drop to C++ too soon, but we’ll always need that possibility.

Regarding the predicates, how complicated are these? Aren’t these expressions mostly in linear integer arithmetic? Or can you have arbitrary calls to complex C++ code?

There are too many static predicates, so it isn’t feasible to enumerate all possible combinations, but we can expose their relationships to the tool as boolean expressions. For example, X86 currently has:

def HasCMov : Predicate<"Subtarget->hasCMov()">;
def NoCMov : Predicate<"!Subtarget->hasCMov()">;

It would be simple to define:

def NoCMov : Predicate<(not HasCMov)>;

And the tool would be able to infer that the predicates are disjoint. Similarly:

def : Always<(or HasSSE1, (not HasSSE2))>;

can be used to infer an implication (here, that HasSSE2 implies HasSSE1).

Many pattern predicates could be expressed in a DSL, but we should still make it possible to use C++ code.

Thanks,
/jakob

[snip]

Ok, thanks a lot for the examples! After the vacation, I'll discuss this topic with a few colleagues here.

Nuno