PTX builtin functions.

Dear Justin,

I am trying to add support for some OpenCL builtin functions to
the PTX backend.
The attached file is the first stub of a patch for the fmax
builtin function.

The test case I am trying is the following:

define ptx_device float @f(float %x, float %y) {
entry:
  %z = call float @fmax(float %x, float %y)
  ret float %z
}

declare float @fmax(float, float)

But at the moment llc crashes saying that "calls are not supported".
This does not happen with LLVM builtins like llvm.sqrt.f32.

Can you please give me a hint on what I am missing, or some general
advice on how to add builtin functions?

Thank you in advance,

Alberto.

fmax_stub.patch (1.92 KB)

First off, thanks for helping to improve the PTX back-end!

There are really two main issues here. First, OpenCL built-in functions do not belong in the PTX back-end. These will be implemented in the libclc library (http://www.pcc.me.uk/~peter/libclc). The back-end will only implement PTX intrinsics, which may be used by the OpenCL built-in functions in libclc. However, this particular function (max) corresponds to a PTX instruction, so it makes sense to implement it as an intrinsic in the back-end.

Second, intrinsic functions require a bit more work. You’re off to a great start, but intrinsics are implemented a bit differently. It looks like LLVM does not have a max intrinsic, so we’ll need to create one. Have a look at include/llvm/IntrinsicsPTX.td. This file defines the PTX-specific intrinsics. You can add an intrinsic for max here, and then implement a pattern-match in the PTXInstrInfo.td file. There is no need to create a new SDNode type for intrinsics, unless they require some special handling in the C++ code, which I do not see being the case here.

When you define a new intrinsic, use the following template as a name: int_ptx_max. This will define the LLVM intrinsic as @llvm.ptx.max(). Please follow the same convention when naming the __builtin_* function.
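The naming convention can be illustrated with a small C++ sketch. This is not LLVM code: `intrinsic_ir_name` is a hypothetical helper that only mimics the default mapping from a TableGen record name to an IR intrinsic name (drop the `int_` prefix, turn underscores into dots, prepend `llvm.`).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not an LLVM API): derive the IR-level intrinsic name
// from a TableGen record name, e.g. "int_ptx_max" -> "llvm.ptx.max".
// Returns an empty string if the record name lacks the "int_" prefix.
std::string intrinsic_ir_name(const std::string &record_name) {
  const std::string prefix = "int_";
  if (record_name.size() <= prefix.size() ||
      record_name.compare(0, prefix.size(), prefix) != 0)
    return "";
  std::string name = "llvm." + record_name.substr(prefix.size());
  for (char &c : name)
    if (c == '_')
      c = '.';
  return name;
}

// Usage example with record names taken from this thread.
void naming_examples() {
  assert(intrinsic_ir_name("int_ptx_max") == "llvm.ptx.max");
  assert(intrinsic_ir_name("int_ptx_max_signed") == "llvm.ptx.max.signed");
  assert(intrinsic_ir_name("ptx_max").empty());
}
```

The same underscore-to-dot rule is why a record called int_ptx_max_signed would surface in IR as llvm.ptx.max.signed.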

Which version of LLVM are you using? Calls to PTX device functions have been implemented for a little while now, so I’m surprised to see the "calls are not supported" error. Perhaps it’s because the fmax function is not declared as ptx_device.

Sorry, there’s a typo in my previous message. The intrinsic pattern matching goes in PTXIntrinsicInstrInfo.td, not PTXInstrInfo.td.

Hi Justin,

Attached you will find the patch for the integer max instruction.
The multiclass PTX_INTRINSIC_INT3 in PTXIntrinsicInstrInfo.td
is almost an exact copy of PTX_INT3 in PTXInstrInfo.td; maybe
a shared version of this class could be defined in a separate file.

I’m copying llvmdev. We should keep discussions like this on the list for the benefit of others.

We can probably factor out a generic description, or even just use the PTX_INT3 multiclass directly. The PTXIntrinsicInstrInfo.td file is included by PTXInstrInfo.td, so anything defined in PTXInstrInfo.td is available in PTXIntrinsicInstrInfo.td.

Do you agree with this approach?
Also, do you think that a class like PTX_INTRINSIC_INT3_SIGNED
(a clone of PTX_INT3_SIGNED) is required?

Yes, I believe we should split these into signed and unsigned variants. The results of max/min operations can definitely be different depending on whether the operands are signed or unsigned. Since this information is not encoded in LLVM types, we may want to create two versions for each integer type; something like:

i32 @llvm.ptx.max.signed.i32(i32, i32)
i32 @llvm.ptx.max.unsigned.i32(i32, i32)

Otherwise, the patch looks good.
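To make the signed/unsigned point concrete, here is a minimal C++ sketch (not backend code; all function names are made up for illustration). It shows that the same pair of 32-bit patterns yields a different maximum depending on how the bits are interpreted, which is why a single untyped max intrinsic would be ambiguous.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a signed value without relying on
// implementation-defined integer conversions.
static std::int32_t as_signed(std::uint32_t bits) {
  std::int32_t value;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}

// Maximum when both patterns are read as signed integers.
// Returns the winning bit pattern.
std::uint32_t max_signed_bits(std::uint32_t a, std::uint32_t b) {
  return as_signed(a) > as_signed(b) ? a : b;
}

// Maximum when both patterns are read as unsigned integers.
std::uint32_t max_unsigned_bits(std::uint32_t a, std::uint32_t b) {
  return a > b ? a : b;
}

// Usage example: 0xFFFFFFFF is -1 when signed, 4294967295 when unsigned.
void signedness_example() {
  const std::uint32_t a = 0xFFFFFFFFu;
  const std::uint32_t b = 1u;
  assert(max_signed_bits(a, b) == 1u);
  assert(max_unsigned_bits(a, b) == 0xFFFFFFFFu);
}
```

Since an LLVM i32 carries no signedness, the choice between these two behaviours has to be encoded in the intrinsic name.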

I always forget "Reply to All".

I agree in principle, but PTX_INT3 cannot be reused directly: my class
PTX_INTRINSIC_INT3 works with an Intrinsic and not with an SDNode, as PTX_INT3 does.
PTX_INTRINSIC_INT3 also requires the type of
the immediate to appear in the pattern, e.g. (i32 imm:$b).

Yes, separate signed and unsigned variants are the only way.

Alright, I’m fine with keeping PTX_INTRINSIC_INT3 as a separate multiclass.

A couple more comments:

  1. Please make sure to set TargetPrefix="ptx" for the intrinsics (probably best in the multiclass; see PTXReadSpecialRegisterIntrinsic_r32).
  2. I’m not sure how to define a GCCBuiltin for an intrinsic that can take multiple types, but it’s probably worth looking into so we can expose this intrinsic to Clang.

OK, I will set TargetPrefix="ptx" for the intrinsics.

Defining a GCCBuiltin for a multi-type intrinsic could be an issue. I looked
for something similar in other backends and found no examples. It may be
worth asking on the mailing list explicitly.
The only fallback that I see is to define every intrinsic explicitly
for every data type,
but this would prevent using the multiclass to define the patterns.

Bye.

Alberto,
The AMDIL backend solves your problem with intrinsic overloading this way:
def int_AMDIL_mad : GCCBuiltin<"__amdil_mad">, TernaryIntFloat;

Where TernaryIntFloat is defined as:
class TernaryIntFloat :
          Intrinsic<[llvm_anyfloat_ty], [LLVMMatchType<0>,
          LLVMMatchType<0>, LLVMMatchType<0>], []>;

This allows us to write a multi-def for int_AMDIL_mad like so:
defm MAD : TernaryIntrinsicFloat<IL_OP_MAD, int_AMDIL_mad>;

Where TernaryIntrinsicFloat is defined as:
multiclass TernaryIntrinsicFloat<ILOpCode opcode, Intrinsic intr>
{
  def _f32 : ThreeInOneOut<opcode, (outs GPRF32:$dst),
      (ins GPRF32:$src, GPRF32:$src2, GPRF32:$src3),
      !strconcat(opcode.Text, " $dst, $src, $src2, $src3"),
      [(set GPRF32:$dst,
          (intr GPRF32:$src, GPRF32:$src2, GPRF32:$src3))]>;
  def _v2f32 : ThreeInOneOut<opcode, (outs GPRV2F32:$dst),
      (ins GPRV2F32:$src, GPRV2F32:$src2, GPRV2F32:$src3),
      !strconcat(opcode.Text, " $dst, $src, $src2, $src3"),
      [(set GPRV2F32:$dst,
          (intr GPRV2F32:$src, GPRV2F32:$src2, GPRV2F32:$src3))]>;
...
}

Now, this doesn't completely work, because LLVM does not allow overloading of intrinsic values, so a little coding is needed in the *IntrinsicInfo class.
AMD always encodes builtin names as __amdil_mad_f32, __amdil_mad_v2f32, __amdil_mad_v4f32, etc.
So in the function "*IntrinsicInfo::lookup_name", when attempting to find out which intrinsic the function maps to, the AMDIL backend strips off the type and then looks up just '__amdil_mad'.

This is how you can do intrinsic overloading in LLVM.

Hope this helps,
Micah
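The suffix-stripping lookup described above can be sketched in a few lines of C++. This is only an illustration of the idea, not the AMDIL implementation: `IntrinsicID`, `strip_type_suffix` and `lookup_builtin` are hypothetical names, and the suffix list covers only the three types mentioned in the message.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical intrinsic identifiers for this sketch.
enum class IntrinsicID { NotIntrinsic, Mad };

// Remove one trailing type suffix ("_f32", "_v2f32", "_v4f32") if present.
std::string strip_type_suffix(const std::string &name) {
  static const std::vector<std::string> suffixes = {"_v4f32", "_v2f32", "_f32"};
  for (const std::string &suffix : suffixes) {
    if (name.size() > suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0)
      return name.substr(0, name.size() - suffix.size());
  }
  return name;
}

// Map a type-suffixed builtin name to the single overloaded intrinsic.
IntrinsicID lookup_builtin(const std::string &name) {
  static const std::unordered_map<std::string, IntrinsicID> table = {
      {"__amdil_mad", IntrinsicID::Mad}};
  const auto it = table.find(strip_type_suffix(name));
  return it == table.end() ? IntrinsicID::NotIntrinsic : it->second;
}

// Usage example: all typed spellings resolve to the same intrinsic.
void lookup_examples() {
  assert(lookup_builtin("__amdil_mad_f32") == IntrinsicID::Mad);
  assert(lookup_builtin("__amdil_mad_v2f32") == IntrinsicID::Mad);
  assert(lookup_builtin("__amdil_mad_v4f32") == IntrinsicID::Mad);
  assert(lookup_builtin("__amdil_unknown_f32") == IntrinsicID::NotIntrinsic);
}
```

The table holds one entry per overloaded intrinsic, so adding a new vector width only means extending the suffix list.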

Thank you Micah, it really does.

At the moment the PTX backend does not have a PTXIntrinsicInfo class;
the only backend that has one is MBlaze.
If Justin agrees with the approach, I will look into how to generate the
PTXGenIntrinsics.inc file (I am still learning TableGen)
required by PTXIntrinsicInfo and write the lookup method.

Cheers,

Alberto

Looks good to me. For OpenCL support in clang, we definitely need the built-in function support. And the total number of intrinsics like this should be relatively minimal.

One thing I forgot to mention: once these are implemented, it may be worth adding instruction selection patterns that collapse icmp/fcmp and select pairs into max/min whenever it makes sense.
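As a rough illustration of what such a pattern has to recognize, the following C++ sketch classifies a compare-plus-select pair over a toy representation. Everything here is hypothetical (`Pred`, `CmpSelect`, `classify`); operands are plain integer IDs rather than LLVM values, and only integer predicates are handled, since floating-point comparisons would need separate care for NaN.

```cpp
#include <cassert>

// Toy stand-ins for a few integer comparison predicates.
enum class Pred { SGT, SLT, UGT, ULT };
enum class MinMax { None, SMax, SMin, UMax, UMin };

// select (icmp pred cmp_lhs, cmp_rhs), true_val, false_val
// Operands are identified by integer IDs in this sketch.
struct CmpSelect {
  Pred pred;
  int cmp_lhs, cmp_rhs;
  int true_val, false_val;
};

// Decide whether the pair computes a min or max of the compared operands.
MinMax classify(const CmpSelect &cs) {
  const bool same = cs.true_val == cs.cmp_lhs && cs.false_val == cs.cmp_rhs;
  const bool swapped = cs.true_val == cs.cmp_rhs && cs.false_val == cs.cmp_lhs;
  if (!same && !swapped)
    return MinMax::None;
  const bool is_signed = cs.pred == Pred::SGT || cs.pred == Pred::SLT;
  const bool greater = cs.pred == Pred::SGT || cs.pred == Pred::UGT;
  // "a > b ? a : b" and "a < b ? b : a" both pick the larger operand.
  const bool picks_larger = (greater == same);
  if (is_signed)
    return picks_larger ? MinMax::SMax : MinMax::SMin;
  return picks_larger ? MinMax::UMax : MinMax::UMin;
}

// Usage example with operand IDs 0 and 1 (2 is an unrelated value).
void classify_examples() {
  assert(classify({Pred::SGT, 0, 1, 0, 1}) == MinMax::SMax);
  assert(classify({Pred::ULT, 0, 1, 0, 1}) == MinMax::UMin);
  assert(classify({Pred::SLT, 0, 1, 1, 0}) == MinMax::SMax);
  assert(classify({Pred::SGT, 0, 1, 2, 1}) == MinMax::None);
}
```

The predicate's signedness selects between the signed and unsigned instruction, which ties in with the signed/unsigned intrinsic split discussed earlier in the thread.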

Hi Justin,

Sorry for the delay, I have been busy.

Micah's proposal requires moving the definitions of the intrinsics
from include/llvm/IntrinsicsPTX.td to lib/Target/PTX/PTXIntrinsics.td,
thus allowing the generation of the file PTXGenIntrinsics.inc, which
will be included by PTXIntrinsicInfo.cpp.
This is quite a big modification. Do you agree with it,
or do you have a better solution?

Also, I don't know yet how to make LLVM recognize the intrinsics
defined in lib/Target/PTX/PTXIntrinsics.td; the only other
backend that does so is MBlaze.

A tentative patch is attached.

Bye,
Alberto

max_builtin.patch (21.1 KB)

I’m opposed to this, mainly because we need the intrinsic definitions to be defined during LLVM IR optimization and not just at code-gen time. This is particularly important for pure intrinsics, like llvm.ptx.read.tid.x(), where the optimizers can fold multiple calls to these functions into a single call. Without the intrinsic definitions in include/llvm/IntrinsicsPTX.td, this optimization would be illegal.

At the moment, I’m not seeing a clean solution to this. Overloading the intrinsics by writing custom code in PTXIntrinsicInfo.h/.cpp is only a partial solution, with the problems mentioned above. In my mind, the cleanest solution would be to just write out explicit intrinsics for each possible type. We can still use multiclasses to an extent:

multiclass PTXBinaryIntrinsic<string prefix> {
  def _u16 : Intrinsic<[llvm_i16_ty], [llvm_i16_ty, llvm_i16_ty], [IntrNoMem]>,
             GCCBuiltin<!strconcat(prefix, "_u16")>;
  // Repeat for s16, u32, s32, u64, s64, f32, f64
}

defm int_ptx_mad : PTXBinaryIntrinsic<"__builtin_ptx_mad">;

It’s not the cleanest, but it gets the job done (unless I’m missing something).
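For reference, a defm joins its own name with each def name inside the multiclass, so the per-type approach produces one record and one builtin per suffix. The C++ sketch below merely enumerates the names that would result for the eight suffixes listed in the comment; `expand_per_type` is a hypothetical helper, not part of TableGen or LLVM.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: list the (record name, builtin name) pairs that a
// per-type multiclass expansion would produce for the given prefixes.
std::vector<std::pair<std::string, std::string>>
expand_per_type(const std::string &record_prefix,
                const std::string &builtin_prefix) {
  static const char *const suffixes[] = {"_u16", "_s16", "_u32", "_s32",
                                         "_u64", "_s64", "_f32", "_f64"};
  std::vector<std::pair<std::string, std::string>> out;
  for (const char *suffix : suffixes)
    out.emplace_back(record_prefix + suffix, builtin_prefix + suffix);
  return out;
}

// Usage example with the prefixes from the defm line.
void expansion_example() {
  const auto defs = expand_per_type("int_ptx_mad", "__builtin_ptx_mad");
  assert(defs.size() == 8);
  assert(defs.front().first == "int_ptx_mad_u16");
  assert(defs.front().second == "__builtin_ptx_mad_u16");
  assert(defs.back().first == "int_ptx_mad_f64");
}
```

Each record is a distinct, non-overloaded intrinsic, so no custom name lookup is needed, at the cost of eight definitions per operation.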

It is my understanding that all you need to do is specify let isTarget = 1 in your .td file and it will generate target specific intrinsics. This should allow you to keep the IntrinsicsPTX.td file in the same location.

Micah

So we keep the intrinsics defined in include/llvm/IntrinsicsPTX.td? How do we then get at the generated files in the PTXIntrinsicInfo class in the back-end?

What exactly does isTarget do? It seems to remove a lot of the intrinsic information in the Intrinsics.gen file, but I can’t find any documentation on it.

If I do so, something strange happens.
If I add these 5 lines to include/llvm/IntrinsicsPTX.td:

let TargetPrefix = "ptx", isTarget = 1 in {
  def int_ptx_max_signed : Intrinsic<[llvm_anyint_ty],
                                     [LLVMMatchType<0>, LLVMMatchType<0>],
                                     [IntrNoMem, Commutative]>;
}

I get the following compilation error:

In file included from MBlazeIntrinsicInfo.cpp:99:
MBlazeGenIntrinsics.inc: In function ‘llvm::FunctionType*
getType(llvm::LLVMContext&, unsigned int)’:
MBlazeGenIntrinsics.inc:651: error: ‘Tys’ was not declared in this scope

That is because the generated file MBlazeGenIntrinsics.inc contains a reference
to the PTX intrinsics. The error is due to the fact that the MBlaze intrinsics
are not overloaded and therefore the variable Tys is not defined.

I am not sure whether this is a limitation of the MBlaze backend or of the PTX backend.

Anyway, I noticed isTarget (I included it in my previous patch), but I thought
it only worked for the XXXIntrinsics.td file.

Alberto

isTarget means that the intrinsics are generated in the ###GenIntrinsic.inc file instead of in the Intrinsics.gen file. Because I don’t use the high level IntrinsicsPTX.td file, I’m not sure how exactly to access it if the isTarget is specified there.