llvm, gpu execution environments

I'm interested in understanding the extent of the assumptions which llvm makes about the types of hardware it is capable of targeting.

In particular, I'm investigating a proposal by Zack Rusin to use llvm as the shader compilation engine within Mesa, targeting GPU backends.

I'm aware of the Apple GLSL compiler, and also I've seen the Vector LLVA paper. However, I'm not sure that either of these quite bridges the gap to the execution environment provided by modern GPUs.

Though there are a couple of open questions, I'll pick the most obvious one:

It seems that LLVA and by extension Vector-LLVA assumes that looping and branching control flow can be expressed in terms of a simple "br" branch operation.

Typically GPU environments cannot provide such a facility, as they tend to run 16, 32 or 64 SIMD threads, all with the same program counter. Though this is a wide vector environment, each of the threads is typically a scalar program, and at any branch point some of those threads may take the branch and some may not. So, to provide dynamic branching facilities in this environment, you end up with per-channel execution masks, and opcodes like "IF", "THEN", and "ELSE" which manipulate those per-channel masks and use stack semantics for pushing and popping masks to emulate nested control structures.
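
In rough pseudo-C++ (all names invented; this isn't any particular GPU's ISA), the mask-stack mechanics look something like this:

```cpp
// Rough sketch (invented names) of the mask-stack semantics described
// above: IF pushes a narrowed execution mask, ELSE inverts it within
// the enclosing mask, ENDIF pops back out. One bit per SIMD channel.
#include <cstdint>
#include <vector>

using Mask = uint32_t;  // e.g. a 32-wide SIMD machine

struct MaskStack {
  std::vector<Mask> stack{~Mask(0)};  // start with all channels live

  Mask current() const { return stack.back(); }

  // IF: only channels whose condition bit is set stay live in the THEN clause.
  void execIf(Mask cond) { stack.push_back(current() & cond); }

  // ELSE: switch to the channels that failed the condition, restricted
  // to the mask that was live when the IF was entered.
  void execElse() {
    Mask thenMask = stack.back();
    stack.pop_back();
    stack.push_back(stack.back() & ~thenMask);
  }

  // ENDIF: pop back to the enclosing mask; nesting is just a deeper stack.
  void execEndIf() { stack.pop_back(); }
};
```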

This is probably all very familiar to anybody who's thought about SIMD program execution. But it means that GPUs and low-level GPU abstractions tend not to have branch instructions.

The question, then, is to what extent it is possible to target this type of execution environment with LLVM and the LLVA/Vector-LLVA ISAs?

Is it necessary (or feasible) to try to analyse LLVA programs and extract IF/THEN/ELSE semantics from a set of arbitrary branch instructions?

Is it possible to extend LLVA with these 'high-level' control flow instructions and end up generating those instead of branches, and if so how does that affect the rest of LLVM?

Is it for some reason just not feasible at all?

Keith

I'm interested in understanding the extent of the assumptions which llvm makes about the types of hardware it is capable of targeting.

Different pieces of the compiler make different assumptions. In particular, the code generator we ship is good for targeting certain classes of devices, but isn't fully general (it doesn't help if you're synthesizing a netlist from llvm, for example).

In particular, I'm investigating a proposal by Zack Rusin to use llvm as
the shader compilation engine within Mesa, targeting GPU backends.

Ok

It seems that LLVA and by extension Vector-LLVA assumes that looping and
branching control flow can be expressed in terms of a simple "br" branch
operation.

LLVA is not a part of LLVM, so I won't answer for it.

Typically GPU environments cannot provide such a facility, as they tend
to run 16, 32 or 64 SIMD threads, all with the same program counter.
Though this is a wide vector environment, each of the threads is
typically a scalar program, and at any branch point some of those
threads may take the branch and some may not. So, to provide dynamic
branching facilities in this environment, you end up with per-channel
execution masks, and opcodes like "IF", "THEN", and "ELSE" which
manipulate those per-channel masks and use stack semantics for pushing
and popping masks to emulate nested control structures.

Right, it's basically a form of predication.

This is probably all very familiar to anybody who's thought about SIMD
program execution. But it means that GPUs and low-level GPU
abstractions tend not to have branch instructions.

The question, then, is to what extent it is possible to target this type
of execution environment with LLVM and the LLVA/Vector-LLVA ISAs?
Is it necessary (or feasible) to try to analyse LLVA programs and
extract IF/THEN/ELSE semantics from a set of arbitrary branch instructions?
Is it possible to extend LLVA with these 'high-level' control flow
instructions and end up generating those instead of branches, and if so
how does that affect the rest of LLVM?

The code generator and llvm should be able to handle this just fine, with only minimal extensions.

Basically, you want to model this as predicated execution, and you want the code generator to predicate away as many branches etc as possible.
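
To make that concrete, here's a sketch of what flattening a branch into predicated form means (names invented for illustration): both sides of a two-way branch get executed, and a per-channel select keeps the right result.

```cpp
// Sketch of if-conversion (invented names). Per channel, the branchy
// program "if (x > 0) r = a; else r = b;" becomes branch-free code:
// evaluate both sides, then select per channel under a predicate.
#include <array>

constexpr int kWidth = 32;  // SIMD width, one lane per channel
using Vec = std::array<float, kWidth>;

Vec flattenedSelect(const Vec &x, const Vec &a, const Vec &b) {
  Vec r{};
  for (int i = 0; i < kWidth; ++i) {
    bool p = x[i] > 0.0f;    // the predicate stands in for the branch
    r[i] = p ? a[i] : b[i];  // per-channel select; no control flow left
  }
  return r;
}
```

In LLVM IR terms, this is roughly turning a br over a diamond into a select.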

One observation can be made though: there will always be some programs that you can't map onto the hardware. For example, if you don't have branches, you can't do loops that execute for a variable number of iterations.

As such, I'd structure the compiler as a typical code generator with an early predication pass that flattens branches. If you get to the end of the codegen and have some dynamic branches left, you detect the error condition and reject the shader from the hardware path (so you have to emulate it in software).

Does this make sense?

-Chris

Chris Lattner wrote:

I'm interested in understanding the extent of the assumptions which llvm makes about the types of hardware it is capable of targeting.

Different pieces of the compiler make different assumptions. In particular, the code generator we ship is good for targeting certain classes of devices, but isn't fully general (it doesn't help if you're synthesizing a netlist from llvm, for example).

In particular, I'm investigating a proposal by Zack Rusin to use llvm as
the shader compilation engine within Mesa, targeting GPU backends.

Ok

It seems that LLVA and by extension Vector-LLVA assumes that looping and
branching control flow can be expressed in terms of a simple "br" branch
operation.

LLVA is not a part of LLVM, so I won't answer for it.

OK, I guess I misunderstood the papers I pulled down - my impression was that at some stage programs in llvm would be represented in LLVA.

What, out of interest, is the relationship between LLVM and LLVA?

Typically GPU environments cannot provide such a facility, as they tend
to run 16, 32 or 64 SIMD threads, all with the same program counter.
Though this is a wide vector environment, each of the threads is
typically a scalar program, and at any branch point some of those
threads may take the branch and some may not. So, to provide dynamic
branching facilities in this environment, you end up with per-channel
execution masks, and opcodes like "IF", "THEN", and "ELSE" which
manipulate those per-channel masks and use stack semantics for pushing
and popping masks to emulate nested control structures.

Right, it's basically a form of predication.

Yes, a set of high-level-ish instructions layered on dynamic predication to give a very close match to the dynamic branching and looping constructs we're all familiar with.

It's initially quite odd to see something like "ELSE" or "WHILE" as hardware opcodes, but it makes sense under the hood.

This is probably all very familiar to anybody who's thought about SIMD
program execution. But it means that GPUs and low-level GPU
abstractions tend not to have branch instructions.

The question, then, is to what extent it is possible to target this type
of execution environment with LLVM and the LLVA/Vector-LLVA ISAs?
Is it necessary (or feasible) to try to analyse LLVA programs and
extract IF/THEN/ELSE semantics from a set of arbitrary branch instructions?
Is it possible to extend LLVA with these 'high-level' control flow
instructions and end up generating those instead of branches, and if so
how does that affect the rest of LLVM?

The code generator and llvm should be able to handle this just fine, with only minimal extensions.

Basically, you want to model this as predicated execution, and you want the code generator to predicate away as many branches etc as possible.

One observation can be made though: there will always be some programs that you can't map onto the hardware. For example, if you don't have branches, you can't do loops that execute for a variable number of iterations.

Actually, you can: there is a program counter, and the loop keeps executing until the execution mask reaches zero. Likewise, branches are dynamic.
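
Concretely (invented names again, just to sketch the semantics): the loop body keeps re-executing while any channel is still live, and each channel's own exit condition retires its bit from the mask.

```cpp
// Sketch (invented names): a mask-driven LOOP/ENDLOOP. The program
// counter cycles through the body until every channel's bit has been
// cleared by its own exit test, i.e. a variable trip count per channel.
#include <cstdint>

using Mask = uint32_t;  // one bit per SIMD channel

template <typename Body, typename Done>
void execLoop(Mask live, Body body, Done done) {
  while (live != 0) {  // ENDLOOP: branch back while any channel is live
    for (int ch = 0; ch < 32; ++ch) {
      if (live & (Mask(1) << ch)) {
        body(ch);                    // body runs only for live channels
        if (done(ch))                // per-channel exit test (BREAK)
          live &= ~(Mask(1) << ch);  // retire this channel
      }
    }
  }
}
```

The whole group keeps looping until the slowest channel is done, but per-channel trip counts vary freely.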

As such, I'd structure the compiler as a typical code generator with an early predication pass that flattens branches. If you get to the end of the codegen and have some dynamic branches left, you detect the error condition and reject the shader from the hardware path (so you have to emulate it in software).

The hardware *does* support dynamic branching, and looping, provided it is expressed in IF/THEN/ELSE, LOOP/BREAK/CONTINUE/ENDLOOP type instructions. Even CALL/RETURN. The only thing it can't do is execute something like "GOTO" or "BRANCH" dynamically.

Does this make sense?

Yes, but at slight cross-purposes.

There are no cases where compilation should fail to produce a hardware executable result, within the constraints of the high-level language we are compiling. Dynamic branches and looping are entirely within the capability of the hardware, provided they are expressed in terms of the hardware IF/THEN/ELSE, LOOP/ENDLOOP, etc, opcodes.

But it seems like my initial understanding of the intermediate representation within llvm is incorrect & I probably should just dive into the source to figure out what's going on.

My concern was that llva throws away the information that I'd probably need to reconstruct these high-level opcodes required by the hardware - if the code generator can come in at a higher level while that information still exists, then a lot of things get easier.

Keith

LLVA is a research project developing a more complete virtual architecture, starting with the LLVM ISA as a basis. This enables us to use the LLVM infrastructure directly in our work. The vector LLVA work was a short-term project led by Rob Bocchino and me, developing a general set of vector extensions to the instruction set, but we are no longer working on this.

Our main focus in the LLVA project today is on a virtual machine approach for enforcing memory safety and security. We call this Secure Virtual Architecture (SVA).

--Vikram
http://www.cs.uiuc.edu/~vadve
http://llvm.org

It seems that LLVA and by extension Vector-LLVA assumes that looping and
branching control flow can be expressed in terms of a simple "br" branch
operation.

LLVA is not a part of LLVM, so I won't answer for it.

OK, I guess I misunderstood the papers I pulled down - my impression was
that at some stage programs in llvm would be represented in LLVA.

What, out of interest, is the relationship between LLVM and LLVA?

Vikram already answered this, but LLVA is a research project that uses LLVM. LLVM does already have vector support in place.

Basically, you want to model this as predicated execution, and you want
the code generator to predicate away as many branches etc as possible.

One observation can be made though: there will always be some programs
that you can't map onto the hardware. For example, if you don't have
branches, you can't do loops that execute for a variable number of
iterations.

Actually, you can - there is a program counter, the loop keeps executing
until the execution mask reaches zero. Likewise branches are dynamic.

Ok, but can you support 4-level-deep loops with arbitrary indexed loads in them?

As such, I'd structure the compiler as a typical code generator with an
early predication pass that flattens branches. If you get to the end of
the codegen and have some dynamic branches left, you detect the error
condition and reject the shader from the hardware path (so you have to
emulate it in software).

The hardware *does* support dynamic branching, and looping, provided it
is expressed in IF/THEN/ELSE, LOOP/BREAK/CONTINUE/ENDLOOP type
instructions. Even CALL/RETURN. The only thing it can't do is execute
something like "GOTO" or "BRANCH" dynamically.

Ahh, ok. Very interesting.

Does this make sense?

Yes, but at slight cross-purposes.

There are no cases where compilation should fail to produce a hardware
executable result, within the constraints of the high-level language we
are compiling. Dynamic branches and looping are entirely within the
capability of the hardware, provided they are expressed in terms of the
hardware IF/THEN/ELSE, LOOP/ENDLOOP, etc, opcodes.

Okay.

But it seems like my initial understanding of the intermediate
representation within llvm is incorrect & I probably should just dive
into the source to figure out what's going on.

Always good :-)

My concern was that llva throws away the information that I'd probably need to reconstruct these high-level opcodes required by the hardware - if the code generator can come in at a higher level while that information still exists, then a lot of things get easier.

s/llva/llvm/ But yes, you're right. Reconstructing loops etc from LLVM is actually really easy, but we don't have good support for it in the code generator yet. This is a desired area of extension that we'd like to do at some point, see http://llvm.org/PR1353
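
For illustration, here is a minimal sketch of recovering nested loop structure with LLVM's LoopInfo analysis. The exact class and method names here are an assumption on my part, so check them against the tree rather than taking this verbatim:

```cpp
// Minimal sketch: recovering nested natural-loop structure from LLVM IR
// via LoopInfo. Class/method names are assumed, not guaranteed to match
// every version of the tree; treat this as illustrative only.
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

static void printLoopTree(const Loop *L) {
  // One natural loop maps to one LOOP/ENDLOOP pair in a structured
  // backend; getLoopDepth() says how deeply to nest them.
  errs().indent(2 * (L->getLoopDepth() - 1))
      << "loop header: " << L->getHeader()->getName()
      << " (depth " << L->getLoopDepth() << ")\n";
  for (const Loop *Sub : L->getSubLoops())
    printLoopTree(Sub);
}

void dumpLoops(Function &F) {
  DominatorTree DT(F);  // LoopInfo is built from the dominator tree
  LoopInfo LI(DT);
  for (const Loop *L : LI)  // top-level loops; recurse for sub-loops
    printLoopTree(L);
}
```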

-Chris

Chris Lattner wrote:

It seems that LLVA and by extension Vector-LLVA assumes that looping and
branching control flow can be expressed in terms of a simple "br" branch
operation.

LLVA is not a part of LLVM, so I won't answer for it.

OK, I guess I misunderstood the papers I pulled down - my impression was
that at some stage programs in llvm would be represented in LLVA.

What, out of interest, is the relationship between LLVM and LLVA?

Vikram already answered this, but LLVA is a research project that uses LLVM. LLVM does already have vector support in place.

Basically, you want to model this as predicated execution, and you want
the code generator to predicate away as many branches etc as possible.

One observation can be made though: there will always be some programs
that you can't map onto the hardware. For example, if you don't have
branches, you can't do loops that execute for a variable number of
iterations.

Actually, you can - there is a program counter, the loop keeps executing
until the execution mask reaches zero. Likewise branches are dynamic.

Ok, but can you support 4-level-deep loops with arbitrary indexed loads in them?

Yes, absolutely.

There may be limits to the depth of the stacks that underlie the predication mechanism, and there may be special actions required when those stacks are exhausted, but this type of detail varies from GPU to GPU.

The hardware we have access to (Intel i965) just needs some hand-holding when the predication stacks max out.

Some GPUs may use an entirely different internal architecture and implementation details like stacks and predication may not apply. But the evidence from benchmarks of dynamic branching behaviour, etc, on those GPUs suggests things must be pretty similar.

As such, I'd structure the compiler as a typical code generator with an
early predication pass that flattens branches. If you get to the end of
the codegen and have some dynamic branches left, you detect the error
condition and reject the shader from the hardware path (so you have to
emulate it in software).

The hardware *does* support dynamic branching, and looping, provided it
is expressed in IF/THEN/ELSE, LOOP/BREAK/CONTINUE/ENDLOOP type
instructions. Even CALL/RETURN. The only thing it can't do is execute
something like "GOTO" or "BRANCH" dynamically.

Ahh, ok. Very interesting.

Does this make sense?

Yes, but at slight cross-purposes.

There are no cases where compilation should fail to produce a hardware
executable result, within the constraints of the high-level language we
are compiling. Dynamic branches and looping are entirely within the
capability of the hardware, provided they are expressed in terms of the
hardware IF/THEN/ELSE, LOOP/ENDLOOP, etc, opcodes.

Okay.

But it seems like my initial understanding of the intermediate
representation within llvm is incorrect & I probably should just dive
into the source to figure out what's going on.

Always good :-)

My concern was that llva throws away the information that I'd probably need to reconstruct these high-level opcodes required by the hardware - if the code generator can come in at a higher level while that information still exists, then a lot of things get easier.

s/llva/llvm/ But yes, you're right. Reconstructing loops etc from LLVM is actually really easy, but we don't have good support for it in the code generator yet. This is a desired area of extension that we'd like to do at some point, see http://llvm.org/PR1353

OK, I'll take a look.

Thanks,

Keith