OpenCL support

Hi Peter and Mike,

Many thanks for your comments!

> @@ -1192,6 +1192,8 @@ private:
> bool IsTrivial : 1; // sunk from CXXMethodDecl
> bool HasImplicitReturnZero : 1;
>
> + bool IsOpenCLKernel : 1;

A better way of storing this information would be to use an attribute
(Attr subclass). See the include/clang/Basic/Attr.td file.
...
We should only add a new field if we expect it to be used often.

We have considered this option but attributes have too complicated a
semantics which would make handling this qualifier unwieldy. Bit-fields
probably get stored in a 32-bit word, which means there are 20 more unused
bits in every FunctionDecl object anyway. We do use Clang attributes for
the optional function qualifiers:
   * __attribute__((vec_type_hint(<typen>)))
   * __attribute__((work_group_size_hint(X, Y, Z)))
   * __attribute__((reqd_work_group_size(X, Y, Z)))

Mike Gist wrote:

You could also consider placing all kernel functions in a 'kernel'
section, or adding a function attribute for kernels.

Unlike in Clang, function attribute bit-fields in LLVM are pretty crowded
(only couple of bits are unused?). Besides, we do not want to represent
differently the kernel qualifier and optional kernel qualifiers, which
require storing <typen> and X, Y, Z values. (I don't even want to think how
one would mangle the optional qualifiers into a kernel name and then
demangle.)

That's why we propose to use metadata for this purpose. Does anyone have a
better idea?

Peter Collingbourne wrote:

I don't think it is a good idea to use LangOptions for this.
LangOptions is an input parameter so we shouldn't modify it during
parsing/semantic analysis. Modifying the LangOptions could also affect
clients which reuse a LangOptions object expecting it to be unchanged.

We have addressed this by using an OpenCLOptions struct in Preprocessor.

Mike Gist wrote:

cl_khr_byte_addressable_store is probably a fairly well supported 1.0
extension that you could add too :slight_smile: Those will have to be disabled
by default for 1.0 anyhow.

Done (+ 32-bit atomics).

Peter Collingbourne wrote:

> @@ -230,7 +232,7 @@ KEYWORD(__func__ , KEYALL)
>
> // C++ 2.11p1: Keywords.
> KEYWORD(asm , KEYCXX|KEYGNU)
> -KEYWORD(bool , BOOLSUPPORT|KEYALTIVEC)
> +KEYWORD(bool , BOOLSUPPORT|KEYALTIVEC|KEYOPENCL)
> ...
Isn't this rendered unnecessary by this code? (from
lib/Frontend/CompilerInvocation.cpp):

> // OpenCL and C++ both have bool, true, false keywords.
> Opts.Bool = Opts.OpenCL || Opts.CPlusPlus;

Indeed. We have removed this from our patch. (The drawback is that the
user of the OpenCL mode will also need to turn on the Bool mode if she
doesn't use CompilerInvocation. For AltiVec, bool gets supported
automatically.)

Please find attached the rework of the first patch plus the second patch on
OpenCL keywords.

Many thanks,
Anton.

00002-keywords.patch (13.3 KB)

00001-kernel-extensions-pragmas.patch (17.1 KB)

Hi Anton,

Thanks for the update and the new patches.

Hi Peter and Mike,

Many thanks for your comments!

> > @@ -1192,6 +1192,8 @@ private:
> > bool IsTrivial : 1; // sunk from CXXMethodDecl
> > bool HasImplicitReturnZero : 1;
> >
> > + bool IsOpenCLKernel : 1;
>
> A better way of storing this information would be to use an attribute
> (Attr subclass). See the include/clang/Basic/Attr.td file.
> ...
> We should only add a new field if we expect it to be used often.

We have considered this option but attributes have too complicated a
semantics which would make handling this qualifier unwieldy. Bit-fields
probably get stored in a 32-bit word, which means there are 20 more unused
bits in every FunctionDecl object anyway. We do use Clang attributes for
the optional function qualifiers:
   * __attribute__((vec_type_hint(<typen>)))
   * __attribute__((work_group_size_hint(X, Y, Z)))
   * __attribute__((reqd_work_group_size(X, Y, Z)))

We may in future want to merge FunctionDecl fields into Decl, which
will give us even less space to work with. We also plan to track
qualifier locations in the AST, which would add at least 32 bits to
the size of FunctionDecl. If this were an attribute the location
could be stored as an Attr field.

There are several examples of existing attributes in the Clang source
code which you should be able to follow. Perhaps you can explain in
more detail why you cannot use attributes here?

Mike Gist wrote:
> You could also consider placing all kernel functions in a 'kernel'
> section, or adding a function attribute for kernels.
Unlike in Clang, function attribute bit-fields in LLVM are pretty crowded
(only couple of bits are unused?). Besides, we do not want to represent
differently the kernel qualifier and optional kernel qualifiers, which
require storing <typen> and X, Y, Z values. (I don't even want to think how
one would mangle the optional qualifiers into a kernel name and then
demangle.)

That's why we propose to use metadata for this purpose. Does anyone have a
better idea?

I'll reply to this in a separate mail to llvmdev+cfe-dev.

Peter Collingbourne wrote:
> I don't think it is a good idea to use LangOptions for this.
> LangOptions is an input parameter so we shouldn't modify it during
> parsing/semantic analysis. Modifying the LangOptions could also affect
> clients which reuse a LangOptions object expecting it to be unchanged.
We have addressed this by using an OpenCLOptions struct in Preprocessor.

I still think that Sema is the best place to store this information.
That's because it is the only part of the system which needs to care
about which OpenCL extensions are enabled at any given time. The way
I imagined it would work would be to add a new pair of actions to Sema
(ActOnPragmaFPContract, ActOnPragmaCLExtension perhaps?) which you
could call from your pragma handler. Those actions would then set
the appropriate Sema/OpenCLOptions fields.

Please find attached the rework of the first patch plus the second patch on
OpenCL keywords.

More detailed comments on the first patch below...

+ unsigned cl_fp_contract : 1; // OpenCL FP_CONTRACT state.

I'm not sure this should be an OpenCL-specific option. In future
we would like to support #pragma STDC FP_CONTRACT, and ideally we'd
like to be able to reuse OpenCL's FP_CONTRACT support. I'd suggest
just making this a Sema field.

+ // OpenCL1.1 enables these extensions by default
+ cl_khr_global_int32_base_atomics =
+ cl_khr_global_int32_extended_atomics =
+ cl_khr_local_int32_base_atomics =
+ cl_khr_local_int32_extended_atomics =
+ cl_khr_byte_addressable_store = 1;

I presume you misread Table 4.3 from the spec, which states
that those extensions must be supported, which is not the same
thing as being enabled. In fact, section 9.1 states:

"The initial state of the compiler is as if the directive

             #pragma OPENCL EXTENSION all : disable

was issued, telling the compiler that all error and warning reporting
must be done according to this specification, ignoring any extensions."

Are you planning to have a mechanism for specifying which extensions
are supported by a particular implementation? My initial thoughts
are that we can add a member function to TargetInfo which returns an
OpenCLOptions struct containing this information.

+ SourceLocation CLLoc = CLTok.getLocation();
+
+ Token Tok;
+ PP.Lex(Tok);
+ if (Tok.isNot(tok::identifier)) {
+ PP.Diag(Tok.getLocation(), diag::warn_pragma_expected_identifier)
+ << "OPENCL";
+ return;
+ }
+
+ // Choose between EXTENSION and FP_CONTRACT
+
+ IdentifierInfo *cmd = Tok.getIdentifierInfo();
+
+ if(cmd->isStr("FP_CONTRACT")) {

You should split this PragmaHandler into separate handlers for
#pragma OPENCL FP_CONTRACT and #pragma OPENCL EXTENSION. Then
there's no need to do the string matching yourself, and it will
be easier to reuse the FP_CONTRACT support.

Comments on the second patch to come...

Thanks,

I agree that metadata should be used for function qualifiers;
a prerequisite being support for non-discardable function-level
metadata, which would need to be added to LLVM. I'm undecided on
whether __kernel should also be represented by metadata; there is
precedent (PTX backend) for using the calling convention.

I do have a concern though with the semantics of the inliner when it
needs to inline a function with metadata. One possibility would be to
discard the callee's metadata, or somehow merge it with the caller's.
Discarding seems like the right solution for OpenCL and a good starting
point (in future we may wish to add attributes to metadata nodes like
the 'appending' linkage for globals) but sounds like something that
should be discussed first.

Thanks,

However we record the fact that a function is a kernel, the mechanism
should handle the case of a kernel calling another kernel.
Recall that a kernel called by another kernel behaves more like a
regular function. For example it doesn't have workspace iteration
automatically applied to it; rather it just adopts the work item of
the caller.

About using a calling convention to mark a function as a kernel. It
seems a handy place to hang it, but is it really exclusive of the
other calling conventions?
In particular, does that approach nicely in the case where a CPU is
running the kernels? Does that lead to special casing or duplication
in the code generator? For example, you still have to know what
"real" calling convention to use when a kernel is running on a CPU.
(Forgive my ignorance.)

thanks,
david

Mike Gist wrote:

You could also consider placing all kernel functions in a ‘kernel’
section, or adding a function attribute for kernels.
Unlike in Clang, function attribute bit-fields in LLVM are pretty crowded
(only couple of bits are unused?). Besides, we do not want to represent
differently the kernel qualifier and optional kernel qualifiers, which
require storing and X, Y, Z values. (I don’t even want to think how
one would mangle the optional qualifiers into a kernel name and then
demangle.)

That’s why we propose to use metadata for this purpose. Does anyone have a
better idea?

I agree that metadata should be used for function qualifiers;
a prerequisite being support for non-discardable function-level
metadata, which would need to be added to LLVM. I’m undecided on
whether __kernel should also be represented by metadata; there is
precedent (PTX backend) for using the calling convention.

Being discardable is a design point of metadata. You might add something else to support this, but it won’t be metadata.

Why are you trying to preserve “kernel”-ness into the LLVM IR? What semantics does it have? What does __kernel actually mean to the optimizers and code generator?

Could you just make __kernel mean “externally visible” and undecorated functions be “linkonce_odr”? If that’s not enough, could you swing it around and maintain single named metadata node with a list of functions that are marked __kernel?

Nick

Regarding linkage:

A __kernel function is externally visible. It is callable from the
user program which is logically a separate compilation unit; and from
other functions in its own compilation unit.
The non-kernel functions have private linkage, I believe: they are
only callable by other functions in the same compilation unit.

However, a __kernel behaves differently when called from the user
program vs. another function in the compilation unit. In OpenCL the
user program can invoke a kernel as an NDRange, i.e. with an implied
loop around it to iterate over an index space of 1 to 3 dimensions.
(This is the "big idea" of OpenCL). (The index values are available
in the function body from intrinsic functions get_work_dim() and
get_global_id(uint workdim).)
But that implied loop is only applied when directly called from the
user program. When a kernel is called from another kernel, it behaves
as a regular function call and just adopts the caller's index point.

The spec does not specify whether or how that implied loop is
represented in the IR. I expect most implementations don't represent
the loop explicitly.

I would be happy to see an OpenCL-specific patch that always marked
non-kernel functions with internal linkage. Then you could
distinguish the kernel/non-kernel case just by the linkage attribute.
It might be a little unclean / unorthogonal, but I think it would be
ok.

(There are also other minor differences, e.g. the behaviour of a
function-scope-local-addr-space variable in a nested kernel is
implementation-defined. See the Notes in the functionQualifiers
reference.)

david

References
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/functionQualifiers.html
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueNDRangeKernel.html
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html

However we record the fact that a function is a kernel, the mechanism
should handle the case of a kernel calling another kernel.
Recall that a kernel called by another kernel behaves more like a
regular function. For example it doesn't have workspace iteration
automatically applied to it; rather it just adopts the work item of
the caller.

About using a calling convention to mark a function as a kernel. It
seems a handy place to hang it, but is it really exclusive of the
other calling conventions?
In particular, does that approach nicely in the case where a CPU is
running the kernels? Does that lead to special casing or duplication
in the code generator? For example, you still have to know what
"real" calling convention to use when a kernel is running on a CPU.
(Forgive my ignorance.)

As with __local variables, it may be that "kernelness" cannot be
represented in a standard form in LLVM. For example on a CPU a
kernel function may have an additional parameter which is a pointer to
__local memory space, which would not be necessary on GPUs. Then in
fact you would use a standard calling convention on a CPU.

But for GPUs, I think using the calling convention is appropriate.
If we standardise the calling convention number, this can be the
default behaviour.

I would be happy to see an OpenCL-specific patch that always marked
non-kernel functions with internal linkage. Then you could
distinguish the kernel/non-kernel case just by the linkage attribute.
It might be a little unclean / unorthogonal, but I think it would be
ok.

Some OpenCL implementations (including my own) may use runtime library
functions which live in a separate compilation unit. These would
need to be marked external but of course would not be kernel functions.

Nick Lewycky wrote:

Being discardable is a design point of metadata. You might add something
else to support this, but it won't be metadata.

There's nothing intrinsic about the concept of metadata which requires
it to be discardable. In particular, if the metadata is attached to
a function, the only case I can think of where an optimiser needs
to touch the metadata is if a function with metadata is inlined.
And as I mentioned in my previous mail I don't think this will be
any trouble for OpenCL.

The __kernel attribute isn't the only attribute we need to preserve.
There are also:

__attribute__((vec_type_hint(type)))
__attribute__((work_group_size_hint(X, Y, Z)))
__attribute__((reqd_work_group_size(X, Y, Z)))

which provide hints to the code generator regarding the specific work
load of a particular kernel.

Why are you trying to preserve "kernel"-ness into the LLVM IR? What
semantics does it have? What does __kernel actually mean to the optimizers
and code generator?

For PTX, if __kernel is set on a function it needs to be codegen'd
with a specific directive which marks it as a kernel entry point.
What this actually means at a lower level I don't know (the low level
machine code representation is undocumented). I believe there is
also something similar in the AMD Stream IL.

As for the other attributes mentioned above, I don't know off-hand,
but I believe there are PTX directives for at least some of them.

Could you just make __kernel mean "externally visible" and undecorated
functions be "linkonce_odr"?

I think the semantics of undecorated functions is closer to "internal"
than "linkonce_odr" here (kernel programs shouldn't be able to
provide a definition for functions in another module, such as a
runtime library module).

If that's not enough, could you swing it around
and maintain single named metadata node with a list of functions that are
marked __kernel?

Are you saying that named metadata nodes are non-discardable? Even
if this were true, it would still be difficult to represent the other
attributes unless the metadata were attached to the function.

Thanks,

I stand corrected. :slight_smile:

thanks,
david

There may be cases where an optimization pass doesn't know how to
update metadata, so discarding it is the only way to prevent invalid
metadata while still performing the transformation.
If metadata would be attached to functions, that means passes like
dead argument elimination may need to remove it since they create a
new function (with a different type), and the metadata may depend on
the function type staying the same. For instance: the metadata may
contain references to parameter numbers for all it knows, and it won't
know how to change that to account for the removed parameters.

David Neto wrote:

However, a __kernel behaves differently when called from the user
program vs. another function in the compilation unit. In OpenCL the
user program can invoke a kernel as an NDRange, i.e. with an implied
loop around it to iterate over an index space of 1 to 3 dimensions.

I'd like to emphasize that the work group loop cannot be simply applied
around the whole kernel function due to wg barriers. This is what the passes
I mentioned in my original email to this thread are about. The loops need
to be added to the regions between barriers separately to comply with the
barrier semantics which is not completely trivial with some barrier scenarios
(e.g. barriers inside loops or conditional blocks). These loops can be
vectorized or unrolled in case the wg dimensions are known at kernel
compilation time (so called "work item merging/chaining" optimization) and
if it's beneficial on the target architecture.

Of course, some architectures do not need the loops at all due to the OpenCL
"data parallel/threading semantics" implemented in hardware with some sort
of work item/thread aware SIMD-style hardware (AFAIU this is the case with
e.g. NVIDIA GPUs).

But that implied loop is only applied when directly called from the
user program. When a kernel is called from another kernel, it behaves
as a regular function call and just adopts the caller's index point.

I think in OpenCL kernel compilation it's common to fully inline everything
to the callable kernel, thus the loops would be applied to the fully inlined
version so you don't need separate versions of the kernel functions with and
without the loops.

From: Peter Collingbourne [mailto:peter@pcc.me.uk]
Sent: 20 December 2010 20:11
As with __local variables, it may be that "kernelness" cannot be
represented in a standard form in LLVM. For example on a CPU a
kernel function may have an additional parameter which is a pointer to
__local memory space, which would not be necessary on GPUs. Then in
fact you would use a standard calling convention on a CPU.

But for GPUs, I think using the calling convention is appropriate.
If we standardise the calling convention number, this can be the
default behaviour.

I don't think we want LLVM-IR coming from an OpenCL C frontend to be
different for GPU and CPU targets. In my view, the frontend should be
parameterised by only two (more or less) parameters: bitness (32/64) and
endianness (little/big). How one can even guarantee e.g. that a calling
convention for NVIDIA GPUs is appropriate for ATI GPUs? So using calling
conventions too early on (e.g. between invoking the clBuildProgram() and
clCreateKernel() API functions) is a path to implementation divergence,
rather than standardisation.

> If that's not enough, could you swing it around and maintain single
> named metadata node with a list of functions that are marked __kernel?

This is exactly what we would like to do if metadata "non-discardability"
could be guaranteed at least until the clCreateKernel() API function is
called.

Cheers,
Anton.

On Behalf Of Anton Lokhmotov

Hi Peter,

There are several examples of existing attributes in the Clang source
code which you should be able to follow. Perhaps you can explain in
more detail why you cannot use attributes here?

We feared that too much code would be generated when using attributes. (As
you know, we must be cautious of anything negatively affecting the compiler
footprint.) However, we've re-implemented the kernel function qualifier
using an attribute instead of a field. Please review.

> + // OpenCL1.1 enables these extensions by default
> + cl_khr_global_int32_base_atomics =
> + cl_khr_global_int32_extended_atomics =
> + cl_khr_local_int32_base_atomics =
> + cl_khr_local_int32_extended_atomics =
> + cl_khr_byte_addressable_store = 1;

I presume you misread Table 4.3 from the spec, which states
that those extensions must be supported, which is not the same
thing as being enabled.

Indeed. They are disabled now.

Are you planning to have a mechanism for specifying which extensions
are supported by a particular implementation? My initial thoughts
are that we can add a member function to TargetInfo which returns an
OpenCLOptions struct containing this information.

Yes, something along these lines.

> + unsigned cl_fp_contract : 1; // OpenCL FP_CONTRACT state.

I'm not sure this should be an OpenCL-specific option. In future
we would like to support #pragma STDC FP_CONTRACT, and ideally we'd
like to be able to reuse OpenCL's FP_CONTRACT support. I'd suggest
just making this a Sema field.

...

You should split this PragmaHandler into separate handlers for
#pragma OPENCL FP_CONTRACT and #pragma OPENCL EXTENSION. Then
there's no need to do the string matching yourself, and it will
be easier to reuse the FP_CONTRACT support.

Good point. We've refactored FP_CONTRACT support into a separate
ParseFPContractPragma method. Note that in OpenCL the default value is ON,
whilst in C99 it is undefined.

Comments on the second patch to come...

I'm attaching it again for convenience...

Many thanks,
Anton.

00002-keywords.patch (13 KB)

00001-kernel-extensions-pragmas.patch (15.7 KB)

Some observations about storing the kernel information in metadata:

In our CL implementation at Ziilabs we represented all the kernel attributes as separate NamedMDNodes. The kernel name is concatenated with a set of known postfixes. For example ‘kernel foo(…)’ produces nodes with names ‘foo_is_kernel’, ‘foo_work_group_size_hint’, ‘foo_vec_type_hint’ etc. The MDNodes carry Values appropriate to the attribute type (e.g. 3 integers for the work_group_size_hint).

At any point passes (including machine passes) can lookup up whatever interesting attribute it wants and obtain the values in a fairly straightforward manner.

I think this tends to work better than storing the names of kernels in a list hanging off from one NamedMDNode:

(1) passes can just concatenate the name of the function with a postfix and look up the metadata, there’s no second phase of then searching through a list to find the name of the function it’s interested in.

(2) it’s not obvious how you’re going to extend the list approach to carry attribute values. Suppose you add support for the work_group_size_hint (it carries 3 integer values) how are you proposing that the presence of the attribute and it’s data be represented?

Thanks, Krister.

Abstract:

This message is in response to messages by Krister Wombell and Mike Gist,
and contains a question to Chris Lattner and Devang Patel.

In general metadata should be designed in such a way as to be resilient
to changes to the IR. For example, function metadata shouldn't
refer to parameters by index, for the same reason that instruction
metadata shouldn't refer to other instructions by relative offset.
If metadata about parameters needs to be stored then it should be
either attached to the parameter itself or by some other mechanism
(e.g. the llvm.dbg.value function used by the debug metadata).
The end result is that if an optimiser needs to copy a function it
can just copy the metadata also.

As with instruction metadata, if there is a reference to a global
or a non-constant value in function metadata it will be discarded if
those values are deleted. For simpler use cases (including OpenCL)
the metadata will only need to use integer constants and maybe types,
so non-discardability shouldn't be an issue, as far as I'm aware.

Thanks,

> From: Peter Collingbourne [mailto:peter@pcc.me.uk]
> Sent: 20 December 2010 20:11
> As with __local variables, it may be that "kernelness" cannot be
> represented in a standard form in LLVM. For example on a CPU a
> kernel function may have an additional parameter which is a pointer to
> __local memory space, which would not be necessary on GPUs. Then in
> fact you would use a standard calling convention on a CPU.
>
> But for GPUs, I think using the calling convention is appropriate.
> If we standardise the calling convention number, this can be the
> default behaviour.
I don't think we want LLVM-IR coming from an OpenCL C frontend to be
different for GPU and CPU targets. In my view, the frontend should be
parameterised by only two (more or less) parameters: bitness (32/64) and
endianness (little/big).

Not only sizes but alignment requirements will change between
platforms. Also, what about __local on CPU?

How one can even guarantee e.g. that a calling
convention for NVIDIA GPUs is appropriate for ATI GPUs?

We have full control over the target code generators. There's nothing
stopping us defining a specific constant representing the 'kernel'
calling convention and harmonising the GPU targets to use that
calling convention.

Thanks,

Sorry for the late reply, as I have been on vacation for awhile.

One method which I haven't seen mentioned is to separate out the kernel semantics from the function definition.

All the kernel attribute does is specify that this function is an entry point to the device from the host. So, why not just create a separate entry point that is only callable by the host and everything from the device goes to the original entry point.

For example, you have two functions and one calls the other:

kernel foo() {
}
kernel bar() {
  foo();
}

If you separate kernel function from the function body, then handling this becomes easy.

You end up with four functions:

kernel foo_kernel() {
foo();
}

foo() {
}

kernel bar_kernel() {
bar();
}

bar(){
foo();
}

Then the issue is no longer a compilation problem, but just an entry point runtime issue. Instead of calling foo(), the runtime just calls foo_kernel() which handles all of the kernel setup issues and then calls the function body itself.

This removes the need to have any metadata nodes in the IR and allows the kernel function to handle any setup issues for the specific device such as __local's, id/group calculations, memory offsets, etc... without having to impact the performance of a kernel calling another kernel.

Micah

I like this idea. I think that the entry point should keep its
original name though, while we rename the body, because the fact that
we factor out the function body seems like an implementation detail.

To a certain extent it also removes the need to attach metadata for
reqd_work_group_size etc at the function level (if required by the
target), since this information can be attached to intrinsic calls
within the entry point. Example:

define void @foo() nounwind {
entry:
  call void @llvm.opencl.reqd.work.group.size(i32 4, i32 1, i32 1)
  ; .. other setup ..
  call void @foo_kernel()
  ret void
}

define internal void @foo_kernel() nounwind {
  ; ... body ...
}

These intrinsics wouldn't necessarily expand to target code directly,
but would be used to generate something appropriate for the target in
a similar fashion to the debug metadata intrinsics. Also, by keeping
the metadata in the entry point we guarantee that no more than one
intrinsic call may appear within a function even if the inliner
is used, allowing code generators to simply search for uses of the
@llvm.opencl.reqd.work.group.size (or whatever) intrinsic to create
a mapping from functions to attributes.

Thanks,

From: Peter Collingbourne [mailto:peter@pcc.me.uk]
Sent: Tuesday, January 04, 2011 11:51 AM
To: Villmow, Micah
Cc: Anton Lokhmotov; cfe-dev@cs.uiuc.edu; llvmdev@cs.uiuc.edu
Subject: Re: [LLVMdev] Function-level metadata for OpenCL (was Re:
OpenCL support)

> Sorry for the late reply, as I have been on vacation for awhile.
>
> One method which I haven't seen mentioned is to separate out the
kernel semantics from the function definition.
>
> All the kernel attribute does is specify that this function is an
entry point to the device from the host. So, why not just create a
separate entry point that is only callable by the host and everything
from the device goes to the original entry point.
>
> For example, you have two functions and one calls the other:
>
> kernel foo() {
> }
> kernel bar() {
> foo();
> }
>
> If you separate kernel function from the function body, then handling
this becomes easy.
>
> You end up with four functions:
>
> kernel foo_kernel() {
> foo();
> }
>
> foo() {
> }
>
> kernel bar_kernel() {
> bar();
> }
>
> bar(){
> foo();
> }
>
> Then the issue is no longer a compilation problem, but just an entry
point runtime issue. Instead of calling foo(), the runtime just calls
foo_kernel() which handles all of the kernel setup issues and then
calls the function body itself.
>
> This removes the need to have any metadata nodes in the IR and allows
the kernel function to handle any setup issues for the specific device
such as __local's, id/group calculations, memory offsets, etc...
without having to impact the performance of a kernel calling another
kernel.

I like this idea. I think that the entry point should keep its
original name though, while we rename the body, because the fact that
we factor out the function body seems like an implementation detail.

[Villmow, Micah] Well, if the entry point keeps its same name, and the body is renamed, then all of the call sites must also be modified to point to the body and not the entry point. Either way is fine, as long as it is something that I think everyone can agree with.

To a certain extent it also removes the need to attach metadata for
reqd_work_group_size etc at the function level (if required by the
target), since this information can be attached to intrinsic calls
within the entry point. Example:

define void @foo() nounwind {
entry:
  call void @llvm.opencl.reqd.work.group.size(i32 4, i32 1, i32 1)
  ; .. other setup ..
  call void @foo_kernel()
  ret void
}

define internal void @foo_kernel() nounwind {
  ; ... body ...
}

These intrinsics wouldn't necessarily expand to target code directly,
but would be used to generate something appropriate for the target in
a similar fashion to the debug metadata intrinsics. Also, by keeping
the metadata in the entry point we guarantee that no more than one
intrinsic call may appear within a function even if the inliner
is used, allowing code generators to simply search for uses of the
@llvm.opencl.reqd.work.group.size (or whatever) intrinsic to create
a mapping from functions to attributes.

[Villmow, Micah] Have you had any thoughts about bringing this up with Khronos about standardizing some of these ideas/conventions between the multiple vendors that are using LLVM for their OpenCL implementations?