Changes to the PTX calling conventions

Currently, PTX has its own calling conventions, which are split into kernel/device. The AMDIL backend requires very similar calling conventions, and I was wondering if we could change the calling conventions from PTX_* to something more generic?

Maybe just Kernel/Device? Or would it be preferable to add a new calling convention that is unique for each target, even though it duplicates functionality?

Thanks,

Micah

I don’t see any reason why a generic calling convention would not work. We could do something like cl_device/cl_kernel. I hate to introduce OpenCL terms into a back-end where OpenCL is just one consumer, but it does map cleanly to the architecture model. Or perhaps something more generic like gpu_device/gpu_global.

[Villmow, Micah] Yeah, but this should apply to more than just GPUs. For example, AMD's OpenCL CPU implementation could utilize the calling conventions, along with projects like Ocelot that have the device-only vs. host/device differentiation. Maybe just device/host is good enough?

Thanks,
Micah

From: Justin Holewinski [mailto:justin.holewinski@gmail.com]
Sent: Tuesday, December 13, 2011 9:48 AM
To: Villmow, Micah
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] Changes to the PTX calling conventions

Device/host just seems vague. Maybe we could create a set of specific conventions, one set for OpenCL: cl_device/cl_kernel, and another set for general accelerators, e.g. accel_device/accel_global.

[Villmow, Micah] Yeah, that is true. What about leaving the calling convention alone for 'device' and just having a calling convention for 'kernel' (i.e., functions callable from another device)? The normal calling conventions handle calls from the same device, but there is no calling convention that handles functions that are callable from a separate device. This would handle the CPU/GPU and accelerator cases. That, I believe, is the fundamental difference between the two calling conventions that OpenCL uses.

Thanks,
Micah

From: Justin Holewinski [mailto:justin.holewinski@gmail.com]
Sent: Tuesday, December 13, 2011 10:50 AM
To: Villmow, Micah
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] Changes to the PTX calling conventions

You mean having no calling convention for device functions, and a new, common calling convention for kernels?

While this would work in practice, my issue with this approach is that it goes against the LLVM reference:

**"`ccc`" - The C calling convention**:
This calling convention (the default if no other calling convention is specified) matches the target C calling conventions. This calling convention supports varargs function calls and tolerates some mismatch in the declared prototype and implemented declaration of the function (as does normal C).

Our devices do not really have a “C calling convention,” so the default does not make much sense. However, I have no objection to modifying the documentation to state that the C calling convention is the default for targets that support that convention.

I think that we should have the default calling convention map
to *something* on every target. On PTX, the ptx_device calling
convention makes sense.

The reason I would like the C calling convention to be supported is
that it allows us to write callable generic functions in LLVM bitcode.
In libclc I had to write identity wrappers for each target [1] around
functions implemented in LLVM bitcode just to be able to call them,
and it would be much more convenient if I didn't have to do this for
every target.

Thanks,
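
The idea of letting the default convention map to something on every target can be sketched as a small lookup. This is a hypothetical Python model, not LLVM code: the table `DEFAULT_CC`, the function `resolve_calling_conv`, and the `x86_64` entry are assumptions made for illustration; only the mapping of the default to `ptx_device` on PTX comes from the discussion.

```python
# Hypothetical model: what the default convention ("ccc") lowers to per target.
# Only the "ptx" -> "ptx_device" entry comes from the discussion; the rest is assumed.
DEFAULT_CC = {
    "ptx": "ptx_device",
    "x86_64": "platform C ABI",
}


def resolve_calling_conv(target: str, declared: str = "ccc") -> str:
    """Return the convention a back-end would use for a function.

    A function that keeps the default ("ccc") gets whatever the target maps
    the default to; an explicit convention is passed through unchanged.
    """
    if declared != "ccc":
        return declared
    if target not in DEFAULT_CC:
        raise ValueError(f"target {target!r} does not define a default convention")
    return DEFAULT_CC[target]


# A generic function declared with the default convention resolves on PTX
# without a per-target wrapper; an explicit convention is left alone.
print(resolve_calling_conv("ptx"))                # ptx_device
print(resolve_calling_conv("ptx", "ptx_kernel"))  # ptx_kernel
```

Under this model a function written once in generic bitcode needs no per-target identity wrapper, because every target supplies its own meaning for the default.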

Hi all,

You mean having no calling convention for device functions, and a new, common
calling convention for kernels?

I think this might make sense.

One major issue with OpenCL C (and I suppose CUDA) kernels that some
fail to see is that the functions are "directly callable"
(just by choosing the correct calling convention) in general only for
SIMT/SPMD-style machines (like NVIDIA's and I suppose AMD's GPUs).

For the MIMD (with possible SIMD/vector extensions) CPU architectures
you need to transform the kernel function to a "work group function"
so it retains its parallel work item semantics whenever the kernel is
to be called with more than one parallel work item.

The transformation is not completely trivial due to the work
group (WG) barrier semantics. You can have barriers inside for-loops,
conditional blocks, etc. which makes it a more difficult compilation
problem than "just adding a loop around the whole WI kernel function".
Converting the "single WI kernel semantics" to work group
functions statically while avoiding threads for WI execution
is the main point of complexity the pocl project [1] has to go
through.

For OpenCL compilation I think it's common to inline everything to
the kernel functions so the "device functions" usually just disappear.
This makes sense for SIMT and also when you do vectorization across
WIs of a WG, or in general want to improve the DLP/ILP of the kernel.
That said, you might not want to fully inline with all targets
(e.g. with a CPU with SIMD + OoOE you might want to reduce the icache
footprint and not inline).

Therefore, the kernel functions in this sense are different from the
device functions and at least the metadata that marks the kernels is
still needed. In pocl the OpenCL compilation is now enabled for all
(CPU) targets supported by LLVM solely depending on the kernel metadata.
In case *only* the kernel functions are marked with this calling
convention, the kernel metadata might not be needed. But you still
might need the calling convention for the device functions if you
assume that they are not always inlined.

[1] https://launchpad.net/pocl

Best regards,
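
The work group function transformation described above can be illustrated with a toy model. This is a minimal Python sketch under stated assumptions, not pocl's implementation: the kernel, the function names, and the single top-level barrier are all invented for illustration. It shows why one loop around the whole work item (WI) body is not enough: the code after the barrier reads a value written by a neighbouring WI, so every WI must finish the first region before any WI starts the second.

```python
# Toy kernel, per work item with local id `lid`:
#     tmp[lid] = data[lid] * 2
#     barrier()
#     out[lid] = tmp[lid] + tmp[(lid + 1) % n]

def region_before_barrier(lid, data, tmp):
    tmp[lid] = data[lid] * 2


def region_after_barrier(lid, tmp, out):
    n = len(tmp)
    out[lid] = tmp[lid] + tmp[(lid + 1) % n]


def work_group_function(data):
    """Run the kernel for a whole work group without threads.

    The body is split at the barrier and each region gets its own loop
    over all work items, which preserves the barrier semantics.
    """
    n = len(data)
    tmp = [0] * n
    out = [0] * n
    for lid in range(n):
        region_before_barrier(lid, data, tmp)
    # Barrier point: every work item has completed the first region.
    for lid in range(n):
        region_after_barrier(lid, tmp, out)
    return out


def naive_loop(data):
    """Incorrect variant: a single loop around the whole work item body."""
    n = len(data)
    tmp = [0] * n
    out = [0] * n
    for lid in range(n):
        region_before_barrier(lid, data, tmp)
        region_after_barrier(lid, tmp, out)  # may read tmp not yet written
    return out


print(work_group_function([1, 2, 3, 4]))  # [6, 10, 14, 10]
print(naive_loop([1, 2, 3, 4]))           # [2, 4, 6, 10]
```

The sketch only covers a barrier at the top level of the kernel. Barriers inside loops or conditional blocks cannot be handled by splitting the body once, which is where the real compilation effort goes.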

2011/12/14 Pekka Jääskeläinen <pekka.jaaskelainen@tut.fi>

Hi all,

You mean having no calling convention for device functions, and a new, common
calling convention for kernels?

I think this might make sense.

To be clear, I do like the idea of using the default calling convention for device functions. My hesitation is from the LLVM specification that says the default calling convention is the C calling convention, which supports varargs. If the spec is changed to make the supported features of the C calling convention dependent on the target, then I’m fine with this.

Any core LLVM devs have any issues with this?

We absolutely cannot rely on inlining. An OpenCL front-end is only one possible consumer of the PTX back-end, and general PTX supports recursion which cannot always be inlined.

I would favor calling conventions over metadata for the simple reason that this maps more cleanly to the device model. Device and kernel functions are represented differently in PTX, including (sometimes) the way parameters are passed.

For the record, marking the kernels with "calling conventions" instead
of metadata is also fine for the pocl use case. It's enough if there is a way
to differentiate OpenCL C kernels from the "device functions" for the reason
I discussed in the previous email. That is, from the pocl point of view we just
need a way to pick the "host-callable" kernel functions, as they need
special treatment before they can be called (like a C function).

BTW what about the other OpenCL data like required_wg_size which
affect the possible "kernel treatment" of pocl and can be converted to some
special instructions (I suppose) for the SIMT targets? Currently only the
TCE target in Clang adds metadata for the required_wg_size kernel
attribute (as we need it in "offline compilation") but IMHO that could be
useful in general, as a default metadata (to enable its support in pocl
for all targets, for example).
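
To make the question concrete, here is a rough Python sketch of how a consumer might use such metadata if it were recorded for every target. Everything here is hypothetical: the `KERNEL_METADATA` table, the kernel names, and both helper functions are assumptions for illustration, and the key is spelled `reqd_work_group_size` after the OpenCL C attribute that the thread refers to as required_wg_size.

```python
from typing import Dict, Optional, Tuple

Size3 = Tuple[int, int, int]

# Hypothetical per-kernel metadata, as a front-end might record it.
KERNEL_METADATA: Dict[str, Dict[str, Size3]] = {
    "scale": {"reqd_work_group_size": (64, 1, 1)},
    "copy": {},  # no required size: the launch decides
}


def static_trip_count(kernel: str) -> Optional[int]:
    """Work items per group if known at compile time, otherwise None."""
    required = KERNEL_METADATA[kernel].get("reqd_work_group_size")
    if required is None:
        return None
    x, y, z = required
    return x * y * z


def work_items_per_group(kernel: str, requested: Size3) -> int:
    """Validate a launch against the metadata and return the group size."""
    required = KERNEL_METADATA[kernel].get("reqd_work_group_size")
    if required is not None and requested != required:
        raise ValueError(f"{kernel}: requested {requested}, required {required}")
    x, y, z = requested
    return x * y * z


print(static_trip_count("scale"))  # 64
print(static_trip_count("copy"))   # None
```

With a known size, a work group function could use constant loop bounds over the work items; without it, the bounds are only known at launch time.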

Ideally, we would need some standard way of representing this in Clang. The back-end would then need to convert it to whatever form the target OpenCL run-time expects.

This is a question for cfe-dev.

Hi all,

Remember that OpenCL kernels are also callable from inside other
kernels. It is not a big deal though, as calling conventions in LLVM
IR are just markers for code generation; they do not have any
effect before that (AFAIK).

What is needed is a way to differentiate at the LLVM IR level between:
1) Normal functions
2) Functions callable from outside and inside (OpenCL kernels would fall
   in this category).
3) Functions callable only from outside (if there is such a case; I am
   not so familiar with CUDA, so I do not know if such functions exist in
   CUDA).

At least 1 and 2 are needed for OpenCL. Whether this is done with calling
conventions, metadata, or attributes does not make much of a
difference in practical terms. Code generation can apply different
calling conventions based on metadata/attributes, and can also detect
the kernels based on calling conventions, so the options are
interchangeable.
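
The claim that the markers are interchangeable can be shown with a small model. This is an illustrative Python sketch only: the `Function` record, the marker names `"kernel"` and `"default"`, and the helper functions are hypothetical and do not correspond to LLVM classes or APIs.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Function:
    name: str
    calling_conv: str = "default"   # hypothetical marker names
    metadata: Tuple[str, ...] = ()


def kernels_by_calling_conv(funcs: List[Function]) -> List[str]:
    """Detect kernels from the calling convention alone."""
    return [f.name for f in funcs if f.calling_conv == "kernel"]


def kernels_by_metadata(funcs: List[Function]) -> List[str]:
    """Detect kernels from metadata alone."""
    return [f.name for f in funcs if "kernel" in f.metadata]


def apply_calling_conv_from_metadata(func: Function) -> Function:
    """What code generation could do: pick the convention from metadata."""
    if "kernel" in func.metadata:
        return replace(func, calling_conv="kernel")
    return func


module = [
    Function("helper"),                         # category 1: normal function
    Function("vec_add", metadata=("kernel",)),  # category 2: kernel
]
lowered = [apply_calling_conv_from_metadata(f) for f in module]

print(kernels_by_metadata(module))       # ['vec_add']
print(kernels_by_calling_conv(lowered))  # ['vec_add']
```

Both views select the same functions, so the choice between them is mostly about which representation maps more cleanly to a given target.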

BTW what about the other OpenCL data like required_wg_size which
affect the possible "kernel treatment" of pocl and can be converted
to some special instructions (I suppose) for the SIMT targets?
Currently only the TCE target in Clang adds metadata for the
required_wg_size kernel attribute (as we need it in "offline
compilation") but IMHO that could be useful in general, as a default
metadata (to enable its support in pocl for all targets, for
example).

Ideally, we would need some standard way of representing this in
Clang. The back-end would then need to convert it to whatever form
the target OpenCL run-time expects.

This is an interesting point. And there might be more information
present in .cl files that needs to be transported into LLVM IR. While
the argument has been made that OpenCL "is C", so clang should
not need to generate extra stuff for OpenCL input files, the fact is
that it is not plain C. Basically there are two ways to proceed:

a) OpenCL is a C-based language (C plus additions) and clang can parse
   it, so *all* the information on the .cl file has to be present in
   LLVM IR.
b) OpenCL is just C, so clang does not need to care about extra things
   and implementations should parse .cl files to get the extra
   information, and potentially preprocess to transform the non-C
   constructs into valid C code.

Just staying in between is good for nothing. And given that clang already
has a CL mode (-x cl) that recognizes the keywords and supports the non-C
parts of OpenCL (like vector swizzle), I think (b) can be discarded right away.
But then all the info should get into LLVM IR in a generic way.

This is a question for cfe-dev.

So adding cfe-dev in copy.

BR

Carlos

(b) can also be discarded because the original OpenCL source is not always available. It is perfectly valid to compile OpenCL to a binary form (PTX in the case of NVIDIA GPUs), and then load the binary as an OpenCL program. In this case, the original .cl file may not even be available.

This is a question for cfe-dev.

So adding cfe-dev in copy.

Thanks. I forgot to add that. :)