Folding vector instructions

Alex wrote:

Hello.

Sorry, I am not sure whether this question should go to the llvm or the
mesa3d-dev mailing list, so I am posting it to both.

I am writing an LLVM backend for a modern graphics processor whose ISA is
very similar to that of Direct3D.

I am reading the code in the Gallium3D driver in a mesa3d branch, which
converts shader programs (TGSI tokens) to LLVM IR.

For shader instructions that have a direct counterpart in LLVM IR, the
conversion is trivial:

<code>
llvm::Value * Instructions::mul(llvm::Value *in1, llvm::Value *in2)
{
   return m_builder.CreateMul(in1, in2, name("mul")); // m_builder is a llvm::IRBuilder
}
</code>

However, the special instructions, like "min", cannot be mapped directly to
LLVM IR; the conversion involves 'extract-element' on the vectors, creating
less-than compares, creating 'select' instructions, and creating
'insert-element' instructions.

<code>
llvm::Value * Instructions::min(llvm::Value *in1, llvm::Value *in2)
{
   // generate LLVM extract-element for each component
   std::vector<llvm::Value*> vec1 = extractVector(in1);
   std::vector<llvm::Value*> vec2 = extractVector(in2);

   Value *xcmp = m_builder.CreateFCmpOLT(vec1[0], vec2[0], name("xcmp"));
   Value *selx = m_builder.CreateSelect(xcmp, vec1[0], vec2[0], name("selx"));

   Value *ycmp = m_builder.CreateFCmpOLT(vec1[1], vec2[1], name("ycmp"));
   Value *sely = m_builder.CreateSelect(ycmp, vec1[1], vec2[1], name("sely"));

   Value *zcmp = m_builder.CreateFCmpOLT(vec1[2], vec2[2], name("zcmp"));
   Value *selz = m_builder.CreateSelect(zcmp, vec1[2], vec2[2], name("selz"));

   Value *wcmp = m_builder.CreateFCmpOLT(vec1[3], vec2[3], name("wcmp"));
   Value *selw = m_builder.CreateSelect(wcmp, vec1[3], vec2[3], name("selw"));

   // generate LLVM insert-element to rebuild the result vector
   return vectorFromVals(selx, sely, selz, selw);
}
</code>

Eventually all of this should be folded back into a single 'min' instruction
during codegen, so I wonder whether having the conversion generate a simple
'call' to a 'min' function would make instruction selection easier (no
folding and no complicated pattern matching in the instruction selection
DAG).

I don't have experience with the new vector instructions in LLVM, and
perhaps that's why folding the swizzle and writemask seems complicated to
me.

Thanks.

I hope marcheu sees this too.

Um, I was thinking that we should eventually create intrinsic functions
for some of the commands, like LIT, that might not be
single-instruction, but that can be lowered eventually, and for commands
like LG2, that might be single-instruction for shaders, but probably not
for non-shader chipsets.

Unfortunately, I'm still learning LLVM, so I might be completely and
totally off-base here.

Out of curiosity, which chipset are you working on? R600? NV50?
Something else?

~ C.

However, the special instructions, like "min", cannot be mapped directly to
LLVM IR; the conversion involves 'extract-element' on the vectors, creating
less-than compares, creating 'select' instructions, and creating
'insert-element' instructions.

Using scalar operations obviously works, but will probably produce very inefficient code. One positive thing is that all target-specific operations of supported vector ISAs (Altivec and SSE[1-4] currently) are exposed either through LLVM IR ops or through target-specific builtins/intrinsics. This means that you can get access to all the crazy SSE instructions, but it means that your codegen would have to handle this target-specific code generation.
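
To make that concrete, here is a minimal sketch of emitting one of those
target-specific intrinsics through the C++ API instead of open-coding the
compare/select sequence. The intrinsic ID (SSE's minps) is real, but
m_module is an assumed member, and the exact getDeclaration/CreateCall
signatures vary between LLVM releases:

<code>
// Hypothetical sketch: call the SSE minps intrinsic directly. Assumes
// m_module is the current llvm::Module*; the getDeclaration and
// CreateCall signatures differ between LLVM versions.
llvm::Function *minps = llvm::Intrinsic::getDeclaration(
    m_module, llvm::Intrinsic::x86_sse_min_ps);
llvm::Value *args[] = { in1, in2 };  // both <4 x float>
llvm::Value *result = m_builder.CreateCall(minps, args, args + 2,
                                           name("min"));
</code>

The obvious cost, as noted above, is that this ties the generated IR to one
target.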

The direction we're going is to expose more and more vector operations in LLVM IR. For example, compares and select are currently being worked on, so you can do a comparison of two vectors which returns a vector of bools, and use that as the compare value of a select instruction (selecting between two vectors). This would allow implementing min and a variety of other operations and is easier for the codegen to reassemble into a first-class min operation etc.

I don't know what the status of this is, I think it is partially implemented but may not be complete yet.
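
If and when that support lands, the scalarized min from the first message
would collapse to two IRBuilder calls. A sketch, assuming vector-typed fcmp
and select are accepted by the builder:

<code>
// Sketch of min on whole <4 x float> values, assuming the in-progress
// vector compare/select support described above: the fcmp yields a
// vector of bools that selects per element.
llvm::Value * Instructions::min(llvm::Value *in1, llvm::Value *in2)
{
   llvm::Value *cmp = m_builder.CreateFCmpOLT(in1, in2, name("mincmp"));
   return m_builder.CreateSelect(cmp, in1, in2, name("minsel"));
}
</code>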

I don't have experience with the new vector instructions in LLVM, and
perhaps that's why folding the swizzle and writemask seems complicated to
me.

We have really good support for swizzling operations already with the shufflevector instruction. I'm not sure about writemask.
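
For example, a .yzwx swizzle is a single shufflevector whose constant mask
lists the source lanes. A sketch in the style of the code above; the
constant-construction API varies across LLVM versions, and swizzleYZWX is a
hypothetical helper:

<code>
// Hypothetical helper: the swizzle src.yzwx as one shufflevector.
// The mask is a constant vector of i32 lane indices into src.
llvm::Value * Instructions::swizzleYZWX(llvm::Value *src)
{
   std::vector<llvm::Constant*> mask;
   mask.push_back(llvm::ConstantInt::get(llvm::Type::Int32Ty, 1)); // y
   mask.push_back(llvm::ConstantInt::get(llvm::Type::Int32Ty, 2)); // z
   mask.push_back(llvm::ConstantInt::get(llvm::Type::Int32Ty, 3)); // w
   mask.push_back(llvm::ConstantInt::get(llvm::Type::Int32Ty, 0)); // x
   return m_builder.CreateShuffleVector(
       src, llvm::UndefValue::get(src->getType()),
       llvm::ConstantVector::get(mask), name("yzwx"));
}
</code>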

Um, I was thinking that we should eventually create intrinsic functions
for some of the commands, like LIT, that might not be
single-instruction, but that can be lowered eventually, and for commands
like LG2, that might be single-instruction for shaders, but probably not
for non-shader chipsets.

Sure, it would be very reasonable to make these target-specific builtins when targeting a GPU, the same way we have target-specific builtins for SSE.

-Chris

Well, scalar is surely an option we're aiming at. NV50 or even your
regular FPU are examples of fully scalar architectures. As for SSE
generation, it was solved by using horizontal parallelism (i.e.
processing four fragments or vertices at once) instead of vertical
parallelism. Sadly this doesn't work with GPUs.

So what remains are chips that are natively vector GPUs. The question
is more whether we'll be able to have llvm build up vector
instructions from scalar ones, and from my limited testing with SSE
and simple test programs it seemed to work, so I suppose the same can
be obtained from GPU targets.

Stephane

>> However, the special instructions, like "min", cannot be mapped directly
>> to LLVM IR; the conversion involves 'extract-element' on the vectors,
>> creating less-than compares, creating 'select' instructions, and
>> creating 'insert-element' instructions.

Using scalar operations obviously works, but will probably produce
very inefficient code. One positive thing is that all target-specific
operations of supported vector ISAs (Altivec and SSE[1-4] currently)
are exposed either through LLVM IR ops or through target-specific
builtins/intrinsics. This means that you can get access to all the
crazy SSE instructions, but it means that your codegen would have to
handle this target-specific code generation.

I think Alex was referring here to an AOS layout, which is not ready at all.
The currently supported one is the SOA layout, which eliminates scalar
operations.

The direction we're going is to expose more and more vector operations
in LLVM IR. For example, compares and select are currently being
worked on, so you can do a comparison of two vectors which returns a
vector of bools, and use that as the compare value of a select
instruction (selecting between two vectors). This would allow
implementing min and a variety of other operations and is easier for
the codegen to reassemble into a first-class min operation etc.

I don't know what the status of this is, I think it is partially
implemented but may not be complete yet.

Ah, that's good to know!

>> I don't have experience with the new vector instructions in LLVM, and
>> perhaps that's why folding the swizzle and writemask seems complicated
>> to me.

We have really good support for swizzling operations already with the
shufflevector instruction. I'm not sure about writemask.

With SOA they're rarely used (essentially never, unless we "kill" a pixel):
[4 x <4 x float>] {{xxxx, yyyy, zzzz, wwww}, {xxxx, yyyy, zzzz, wwww}, ...},
so with SOA both shuffles and writemasks come down to a simple selection of
the element within the array (whether that will be good or bad is yet to be
seen, based on the code in the GPU LLVM backends that we'll have).
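
To illustrate why (a hypothetical sketch; SoaReg is not an actual Gallium
type): with SOA a register is four channel vectors, so a swizzle never emits
a shuffle, it only reorders which vector feeds the next operation:

<code>
// Hypothetical sketch: in SOA form a "register" is four channel
// vectors (xxxx, yyyy, zzzz, wwww), so src.yzwx is just a reordering
// of pointers -- no shuffle or mask instructions are emitted.
struct SoaReg {
   llvm::Value *chan[4]; // chan[0]=xxxx ... chan[3]=wwww
};

SoaReg swizzleYZWX(const SoaReg &src)
{
   SoaReg dst;
   dst.chan[0] = src.chan[1]; // y
   dst.chan[1] = src.chan[2]; // z
   dst.chan[2] = src.chan[3]; // w
   dst.chan[3] = src.chan[0]; // x
   return dst;
}
</code>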

Sure, it would be very reasonable to make these target-specific
builtins when targeting a GPU, the same way we have target-specific
builtins for SSE.

Actually currently the plan is to have essentially a "two pass" LLVM IR. I
wanted the first one to never lower any of the GPU instructions so we'd have
intrinsics or maybe even just function calls like gallium.lit, gallium.dot,
gallium.noise and such. Then gallium should query the driver to figure out
which instructions the GPU supports and runs our custom llvm lowering pass
that decomposes those into things the GPU supports. Essentially I'd like to
make as many complicated things in gallium as possible to make the GPU llvm
backends in drivers as simple as possible and this would help us make the
pattern matching in the generator /a lot/ easier (matching gallium.lit vs 9+
instructions it would be decomposed to) and give us a more generic GPU
independent layer above. But that hasn't been done yet, I hope to be able to
write that code while working on the OpenCL implementation for Gallium.
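
A minimal sketch of what such a lowering pass could look like; DriverCaps
and expandLit are hypothetical, and the FunctionPass constructor and
registration details differ between LLVM versions:

<code>
#include "llvm/Pass.h"
#include "llvm/Function.h"
#include "llvm/Instructions.h"

struct DriverCaps { bool hasLit; };           // hypothetical driver query result
llvm::Value *expandLit(llvm::CallInst *call); // hypothetical: builds the open-coded LIT

// Hypothetical sketch of the lowering pass described above: find calls
// to gallium.* functions the GPU does not support and replace them with
// expansions built from simpler operations.
struct GalliumLoweringPass : public llvm::FunctionPass {
   static char ID;
   DriverCaps caps;

   GalliumLoweringPass() : llvm::FunctionPass(ID) {} // ctor varies by LLVM version

   virtual bool runOnFunction(llvm::Function &F) {
      std::vector<llvm::CallInst*> worklist;
      for (llvm::Function::iterator bb = F.begin(); bb != F.end(); ++bb)
         for (llvm::BasicBlock::iterator i = bb->begin(); i != bb->end(); ++i)
            if (llvm::CallInst *call = llvm::dyn_cast<llvm::CallInst>(&*i))
               if (llvm::Function *callee = call->getCalledFunction())
                  if (callee->getName() == "gallium.lit" && !caps.hasLit)
                     worklist.push_back(call);
      for (unsigned n = 0; n < worklist.size(); ++n) {
         worklist[n]->replaceAllUsesWith(expandLit(worklist[n]));
         worklist[n]->eraseFromParent();
      }
      return !worklist.empty();
   }
};
</code>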

z

However, the special instructions, like "min", cannot be mapped directly to
LLVM IR; the conversion involves 'extract-element' on the vectors, creating
less-than compares, creating 'select' instructions, and creating
'insert-element' instructions.

Using scalar operations obviously works, but will probably produce
very inefficient code. One positive thing is that all target-specific
operations of supported vector ISAs (Altivec and SSE[1-4] currently)
are exposed either through LLVM IR ops or through target-specific
builtins/intrinsics. This means that you can get access to all the
crazy SSE instructions, but it means that your codegen would have to
handle this target-specific code generation.

I think Alex was referring here to an AOS layout, which is not ready at all.
The currently supported one is the SOA layout, which eliminates scalar
operations.

Ok!

Sure, it would be very reasonable to make these target-specific
builtins when targeting a GPU, the same way we have target-specific
builtins for SSE.

Actually currently the plan is to have essentially a "two pass" LLVM IR. I
wanted the first one to never lower any of the GPU instructions so we'd have
intrinsics or maybe even just function calls like gallium.lit, gallium.dot,
gallium.noise and such. Then gallium should query the driver to figure out
which instructions the GPU supports and runs our custom llvm lowering pass
that decomposes those into things the GPU supports.

That makes a lot of sense. Note that there is no reason to use actual LLVM intrinsics for this: naming them "gallium.lit" is just as good as "llvm.gallium.lit" for example.
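
A sketch of that approach, assuming m_module holds the current llvm::Module,
m_vecType is the <4 x float> type, and src is the operand (all three are
hypothetical members/values); getOrInsertFunction's return type also differs
between LLVM releases:

<code>
// Hypothetical sketch: declare gallium.lit as a plain external function
// and call it -- no intrinsic machinery needed.
std::vector<const llvm::Type*> params(1, m_vecType);   // one <4 x float> operand
llvm::FunctionType *litType =
   llvm::FunctionType::get(m_vecType, params, false);  // returns <4 x float>
llvm::Constant *lit =
   m_module->getOrInsertFunction("gallium.lit", litType);
llvm::Value *result = m_builder.CreateCall(lit, src, name("lit"));
</code>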

Essentially I'd like to
make as many complicated things in gallium as possible to make the GPU llvm
backends in drivers as simple as possible and this would help us make the
pattern matching in the generator /a lot/ easier (matching gallium.lit vs 9+
instructions it would be decomposed to) and give us a more generic GPU
independent layer above. But that hasn't been done yet, I hope to be able to
write that code while working on the OpenCL implementation for Gallium.

Makes sense. For the more complex functions (e.g. texture lookup) you can also just compile C code to LLVM IR and use the LLVM inliner to inline the code if you prefer.
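
For instance (a sketch, not the actual Gallium setup): compile the
texture-lookup helper to IR offline, link its module into the shader module,
and then run the standard inliner so the calls disappear:

<code>
#include "llvm/PassManager.h"
#include "llvm/Transforms/IPO.h"

// Hypothetical sketch: after linking the precompiled helper module into
// the shader module, run LLVM's function inliner over it. Pass-manager
// setup varies across LLVM versions.
void inlineHelpers(llvm::Module *shaderModule)
{
   llvm::PassManager pm;
   pm.add(llvm::createFunctionInliningPass());
   pm.run(*shaderModule);
}
</code>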

-Chris

Zack Rusin wrote:

Sure, it would be very reasonable to make these target-specific
builtins when targeting a GPU, the same way we have target-specific
builtins for SSE.

Actually currently the plan is to have essentially a "two pass" LLVM IR. I
wanted the first one to never lower any of the GPU instructions so we'd have
intrinsics or maybe even just function calls like gallium.lit, gallium.dot,
gallium.noise and such. Then gallium should query the driver to figure out
which instructions the GPU supports and runs our custom llvm lowering pass
that decomposes those into things the GPU supports. Essentially I'd like to
make as many complicated things in gallium as possible to make the GPU llvm
backends in drivers as simple as possible and this would help us make the
pattern matching in the generator /a lot/ easier (matching gallium.lit vs 9+
instructions it would be decomposed to) and give us a more generic GPU
independent layer above. But that hasn't been done yet, I hope to be able to
write that code while working on the OpenCL implementation for Gallium.

Um, whichever. Honestly, I'm gonna do s/R300VS/R300FS/g on my current
work, commit it, and then forget about it for the next two months while I
get a pipe working. I've got a skeleton that does nothing, and I won't
do anything else until we're solid on how to proceed. I'm definitely not
very experienced in this area, so I defer to you all.

R300 Radeons have insts that operate on vectors, and insts that operate
only on the .w of each operand. I don't know how to best represent them.

So far, the strange (read: non-LLVM) things seem to be:

- No pointers.
- No traditional load and store concepts.
- Only one type, v4f32.
- No modifiable stack, no frame pointers, no calling conventions.
- No variable-length loops.

I can tell you for sure that the ATI HLSL compiler unwinds and unrolls
everything, so that they don't have to deal with call and ret. Other
than that, I don't know how to handle this stuff.

~ C.

Chris Lattner wrote:

The direction we're going is to expose more and more vector operations in
LLVM IR. For example, compares and select are currently being worked on,
so you can do a comparison of two vectors which returns a vector of bools,
and use that as the compare value of a select instruction (selecting between
two vectors). This would allow implementing min and a variety of other
operations and is easier for the codegen to reassemble into a first-class
min operation etc.

Given the motivation of making it easier for the codegen to reassemble
first-class operations, do you also mean that there will be vector versions
of add, sub, and mul, which are usually supported by many vector GPUs?

Stephane Marchesin wrote:

So what remains are chips that are natively vector GPUs. The question
is more whether we'll be able to have llvm build up vector
instructions from scalar ones

The reason I started this thread was to look for some example code doing
this. Do we already have any backend in LLVM doing this? It does not seem
easy to me.

Zack Rusin wrote:

I think Alex was referring here to an AOS layout, which is not ready at all.
Actually currently the plan is to have essentially a "two pass" LLVM IR. I
wanted the first one to never lower any of the GPU instructions so we'd have
intrinsics or maybe even just function calls like gallium.lit, gallium.dot,
gallium.noise and such. Then gallium should query the driver to figure out
which instructions the GPU supports and runs our custom llvm lowering pass
that decomposes those into things the GPU supports.

If I understand correctly, that is to say that Gallium will dynamically
build a lowering pass by querying the capabilities (instructions supported
by the GPU)? Wouldn't it be a better approach to have a lowering pass for
each GPU that Gallium simply uses?

Essentially I'd like to
make as many complicated things in gallium as possible to make the GPU llvm
backends in drivers as simple as possible and this would help us make the
pattern matching in the generator /a lot/ easier (matching gallium.lit vs 9+
instructions it would be decomposed to) and give us a more generic GPU
independent layer above. But that hasn't been done yet, I hope to be able to
write that code while working on the OpenCL implementation for Gallium.

This two-pass approach is what I am taking now to write the compiler for a
GPU (sorry, but I am not allowed to reveal its name).

I don't work on Gallium directly. I am writing a frontend which converts
vs_3_0 to LLVM IR; that's why I reference both the SOA and AOS code. I think
the NDA will allow me (to be confirmed) to contribute only this frontend,
but neither the LLVM backend nor the lowering pass for this GPU.

What do you plan to do with the SOA and AOS paths in Gallium?

(1) Will they eventually be developed independently, so that for a
scalar/SIMD GPU the SOA path is used to generate LLVM IR, and for a vector
GPU the AOS path is used?

(2) At present the difference between the SOA and AOS paths is not only the
layout of the input data. The AOS path seems more complete to me, though
Rusin has said that it's not ready at all and not used in Gallium. Is there
a plan to merge/add support for functions/branches and LLVM IR
extract/insert/shuffle to the SOA code?

By the way, is there any open source frontend which converts GLSL to LLVM IR?

Alex.

Yep, LLVM already fully supports those; just use the normal add/sub/mul etc. operations. If unsupported by a target, they are converted to scalar operations, or to vectors of a different type where possible.
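
So, for example, a TGSI MAD can stay on whole vectors end to end; a sketch
in the style of the mul conversion at the top of the thread (note that newer
LLVM releases split the floating-point forms into CreateFMul/CreateFAdd):

<code>
// Sketch: TGSI MAD (dst = src0 * src1 + src2) as two vector operations.
// No extract/insert needed; everything stays <4 x float>.
llvm::Value * Instructions::mad(llvm::Value *in1, llvm::Value *in2,
                                llvm::Value *in3)
{
   llvm::Value *mul = m_builder.CreateMul(in1, in2, name("mul"));
   return m_builder.CreateAdd(mul, in3, name("mad"));
}
</code>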

I'll let others respond to the Gallium-specific portions of your email,

-Chris

Zack Rusin wrote:
> I think Alex was referring here to an AOS layout, which is not ready at
> all.
> Actually currently the plan is to have essentially a "two pass" LLVM IR.
> I wanted the first one to never lower any of the GPU instructions so we'd
> have intrinsics or maybe even just function calls like gallium.lit,
> gallium.dot, gallium.noise and such. Then gallium should query the driver
> to figure out which instructions the GPU supports and runs our custom
> llvm lowering pass that decomposes those into things the GPU supports.

If I understand correctly, that is to say that Gallium will dynamically
build a lowering pass by querying the capabilities (instructions supported
by the GPU)? Wouldn't it be a better approach to have a lowering pass for
each GPU that Gallium simply uses?

The whole point of Gallium is to make driver development as simple as
possible. So while it's certainly harder to write this code in a way that
could be generic, it's essentially what Gallium is all about, and it's at
least worth a try.

What do you plan to do with the SOA and AOS paths in Gallium?

For now we need to figure out whether we need all the layouts or whether one
is enough for all the backends.

(1) Will they eventually be developed independently, so that for a
scalar/SIMD GPU the SOA path is used to generate LLVM IR, and for a vector
GPU the AOS path is used?

Well, they're all connected, so developing them independently would be hard.
As mentioned above, depending on what's going to happen, either we'll let
the drivers ask for the layout they want to work with, or we'll decide to
use one layout everywhere.

(2) At present the difference between the SOA and AOS paths is not only the
layout of the input data. The AOS path seems more complete to me, though
Rusin has said that it's not ready at all and not used in Gallium. Is there
a plan to merge/add support for functions/branches and LLVM IR
extract/insert/shuffle to the SOA code?

I wrote both, so I can tell you they're both far from usable. It looks like
Stephane and Corbin are rocking right now, but the infrastructure code in
Gallium needs a lot of love. We have a lot of choices to make over the next
few months, and obviously all the paths (assuming those will be "paths" and
not a "path") will require feature parity.

By the way, is there any open source frontend which converts GLSL to LLVM
IR?

Yes, there is one at:
http://cgit.freedesktop.org/~zack/mesa.git.old/log/?h=llvm
but it's also not complete.

z