predicated execution

hi,

we're trying to come up with a static compiler based on llvm for a 4-way vliw architecture with full support for predicated execution. a distinguishing feature is the omission of flag registers. instead, conditions can be paired with a particular instruction within the same bundle.
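just to make the bundle/predicate pairing concrete, here is a rough c++ model of what i mean (all of these types are made up purely for illustration, they are not llvm classes):

```cpp
#include <array>
#include <optional>

// Hypothetical model, not LLVM classes: every slot of a 4-wide bundle can
// carry its own guard, so there is no shared flag register.
struct Predicate {
  unsigned Reg;   // register holding the guard value
  bool Negated;   // execute the slot only if the guard is false
};

struct Slot {
  unsigned Opcode = 0;              // 0 stands in for a nop here
  std::optional<Predicate> Guard;   // empty == unconditional execution
};

struct Bundle {
  std::array<Slot, 4> Slots;        // 4-way vliw: up to four ops issue together
};
```

the point is just that the guard travels with the individual slot rather than with a global flags register.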

a performance-critical issue will be the proper use of predicated execution. if-conversion can either be performed early in the code generation process [1] (exposing larger basic blocks to the optimizers) or deferred until code generation is almost complete [2].

for the time being, i'm planning to go with the second approach and have a late optimization pass over the selected machine instructions that
a] preliminarily schedules and bundles the selected instructions
b] speculatively executes instructions in the predecessor block if there are unused resources
c] converts a block B into an appropriately predicated version (eliminating branches) if it's profitable for the particular architecture (a rough sketch of such a pass follows below).
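the sketch promised above: this is roughly how i imagine the pass against a present-day llvm tree. only the MachineFunctionPass skeleton is actual llvm api, the three hooks are placeholders for the target-specific work:

```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

using namespace llvm;

namespace {
// Late pass over the selected machine instructions; the hooks below are
// placeholders for the target-specific implementations of steps (a)-(c).
struct LatePredicationPass : public MachineFunctionPass {
  static char ID;
  LatePredicationPass() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    bool Changed = false;
    for (MachineBasicBlock &MBB : MF) {
      scheduleAndBundle(MBB);                   // (a) preliminary schedule/bundle
      Changed |= speculateIntoPredecessor(MBB); // (b) fill unused slots upstream
      Changed |= ifConvertIfProfitable(MBB);    // (c) predicate & drop branches
    }
    return Changed;
  }

  // Placeholders -- the real versions would consult the machine model.
  void scheduleAndBundle(MachineBasicBlock &) {}
  bool speculateIntoPredecessor(MachineBasicBlock &) { return false; }
  bool ifConvertIfProfitable(MachineBasicBlock &) { return false; }
};
} // end anonymous namespace

char LatePredicationPass::ID = 0;
```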

this approach will require only minor extensions to the existing class hierarchy to represent instruction bundles and predicates. additionally, it appears more reasonable to decide what to execute conditionally late in the compilation process, when we can easily account for architectural constraints, than to perform if-conversion early and undo it if necessary. also, previous work [2] indicates that this approach retains almost all opportunities for if-conversion present in the input program.

from what i've seen so far (i'm not yet fully familiar with llvm), there's almost no existing support for predicated execution. however, architectures such as arm and ia64 should see a significant speedup from it. are people already working on such enhancements, or are there any short-term plans? also, does anybody have strong objections to the approach outlined above? any comments are very welcome!

cheers and thanks in advance,

hi,

we're trying to come up with a static compiler based on llvm for a 4-
way vliw architecture with full support for predicated execution. a
distinguishing feature is the omission of flag registers. instead,
conditions can be paired with a particular instruction within the
same bundle.

Ok.

a performance critical issue will be proper use of predicated
execution. if-conversion can either be performed early in the code
generation process [1] (exposing larger basic blocks to the
optimizers) or deferred until code generation is almost
complete [2].

Option 1 makes a lot of sense; most predication-aware compilers go this route. I haven't read the paper, but I am guessing their system rediscovers the CFG etc. at link time, reverse-compiling assembly code into some internal representation? That is an interesting approach for a company that serves clients who do not provide their source code.

If you use LLVM and you build option 1, you can do something like option 2 without a lot of horribleness. :-) LLVM does link-time optimization at the bytecode level.

for the time being, i'm planning to go with the second approach and
have a late optimization pass over the selected machine instructions
that
a] preliminarily schedules and bundles the selected instructions
b] speculatively executes instructions in the predecessor block if
there are unused resources
c] converts blocks B into a appropriately predicated version
(eliminating branches) if it's profitable for the particular
architecture.

Ok. It doesn't sound like option 2 though (it has nothing to do with LTO). Item (b) sounds like meld scheduling; is that what you are considering?

this approach will require only some minor extensions to the existing
class hierarchy to represent instruction bundles and predicates.
additionally, it appears to be more reasonable to make the decision
what to execute conditionally late in the compilation process when we
can easily account for architectural constraints than to perform
if-conversion early and undo it if necessary. also, previous work [2]
indicates that this approach retains almost all opportunities for
if-conversion present in the input program.

Sounds reasonable, especially given the current LLVM infrastructure. The current register allocator is not predication-aware, therefore if-converting too aggressively would over-subscribe the available resources. Once you have the exact instruction count and register requirements, you should be able to make a better if-conversion decision.
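To make the over-subscription point concrete, here is a made-up example (plain C++, not tied to any target): once both arms are predicated instead of branched around, their temporaries are live at the same time.

```cpp
// Before if-conversion only one arm executes, so the arms' temporaries can
// share registers.
int before(bool c, int x, int y) {
  int t;
  if (c)
    t = x * 3 + 1;    // arm 1 values live only on this path
  else
    t = y * 5 - 2;    // arm 2 values live only on this path
  return t;
}

// Roughly what the predicated form computes (written as plain C++ just to
// show the liveness; the select stands in for the two guarded writes to t).
int after(bool c, int x, int y) {
  int a = x * 3 + 1;  // arm 1, now always computed
  int b = y * 5 - 2;  // arm 2, now always computed; a and b are live together
  return c ? a : b;
}
```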

from what i've seen so far (i'm not yet fully familiar with llvm),
there's almost no existing support for predicated execution. however,
architectures such as arm and ia64 should show a significant speedup.
are there already people working on those enhancements or are there
any short-term plans? also, does anybody have strong objections
against the approach outlined above? any comments are kindly welcome!

I don't know enough about your target architecture to critique your choice. This isn't the "traditional" flow for a fully predicated VLIW model, but it's likely you can get it up and running in a reasonable time frame.

LLVM does not have much support for predicated execution. You can introduce a predicate register class, but the backend passes would not treat it any differently.
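Just to illustrate the gap, the kind of target interface a predication-aware code generator would want might look something like the following. This is an invented sketch, not existing LLVM API:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

using namespace llvm;

// Invented interface -- NOT existing LLVM API -- showing the kind of
// questions a predication-aware backend would need the target to answer.
struct PredicationInfo {
  virtual ~PredicationInfo() = default;

  // Can this instruction be guarded by a predicate at all?
  virtual bool isPredicable(const MachineInstr &MI) const = 0;

  // Rewrite MI so that it only commits its result when Pred is true. On the
  // architecture described above, Pred would name an ordinary register read
  // within the same bundle rather than a condition-code flag.
  virtual bool predicateInstruction(
      MachineInstr &MI, const SmallVectorImpl<MachineOperand> &Pred) const = 0;
};
```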

However, you are in luck! It's extremely likely we will be starting on predication work for the ARM backend in the very near future. ARM has an extremely limited predication model (just condition codes), so it's not yet clear how much of our work will be applicable to your needs. But we'll try to keep our work as target independent as possible.

Evan

Evan,

thanks for your detailed answer!

a performance critical issue will be proper use of predicated
execution. if-conversion can either be performed early in the code
generation process [1] (exposing larger basic blocks to the
optimizers) or deferred until code generation is almost
complete [2].

Option 1 makes a lot of sense, most predication aware compilers go
this route. I haven't read the paper. But I am guessing their system
is rediscovering CFG etc. at link time, reverse compiling assembly
code into some internal representation? That is an interesting
approach for a company that serves clients who do not provide
their source code.

If you use LLVM and you build option 1, you can do something like
option 2 without a lot of horribleness. :-) LLVM does link time
optimization at bytecode level.

right, i agree option 1 might be preferable in many respects. however, it would involve major changes to various components (instruction selection, register allocation, probably various optimization passes), and i don't have the time for that now. also, it would be very hard to predict at the llvm level what the code will look like after code generation and register allocation, i.e., aggressive if-conversion and hyperblock formation would probably require partial reverse if-conversion to be effective.

for the time being, i'm planning to go with the second approach and
have a late optimization pass over the selected machine instructions
that
a] preliminarily schedules and bundles the selected instructions
b] speculatively executes instructions in the predecessor block if
there are unused resources
c] converts a block B into an appropriately predicated version
(eliminating branches) if it's profitable for the particular
architecture.

Ok. It doesn't sound like option 2 though (it has nothing to do with
LTO). Item (b) sounds like meld scheduling; is that what you are
considering?

not exactly, and yes, it has nothing to do with LTO. the idea is just to move instructions into the predecessor block (if there is only one) when they can execute there (conditionally) for free, i.e., when there are unused slots/nops that can be replaced with the instruction in question. this decreases the size of the blocks handled in (c).
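in code, the check i have in mind is roughly the following (it reuses the made-up Bundle/Slot/Predicate structs from my first mail and is not llvm api):

```cpp
#include <vector>

// Reuses the made-up Bundle/Slot/Predicate structs from the first mail; a
// "block" here is simply a sequence of bundles plus its unique predecessor.
struct Block {
  std::vector<Bundle> Bundles;
  Block *SinglePredecessor = nullptr;  // null if there is more than one
};

// Try to move the first instruction of B into a free (nop) slot of the last
// bundle of its single predecessor, guarded by the condition under which B
// executes. Returns true if something was hoisted.
bool hoistIntoPredecessor(Block &B, Predicate GuardOfB) {
  Block *Pred = B.SinglePredecessor;
  if (!Pred || Pred->Bundles.empty() || B.Bundles.empty())
    return false;

  Bundle &Last = Pred->Bundles.back();
  for (Slot &S : Last.Slots) {
    if (S.Opcode != 0)                 // 0 stands in for a nop; slot is in use
      continue;
    Slot Moved = B.Bundles.front().Slots[0];
    Moved.Guard = GuardOfB;            // executes conditionally "for free"
    S = Moved;
    // a real pass would also erase the moved slot from B and re-check
    // dependences and resource constraints before committing to this
    return true;
  }
  return false;
}
```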

However, you are in luck! It's extremely likely we will be starting
on predication work for the ARM backend in the very near future. ARM
has an extremely limited predication model (just condition codes), so
it's not yet clear how much of our work will be applicable to your
needs. But we'll try to keep our work as target independent as possible.

that's what i was hoping. for the time being, i need a quick implementation that gives me an idea of the performance that can be reached with the existing infrastructure (our target doesn't exist in silicon yet, but we do have a simulator). i'll report first results as they become available.

cheers,