OpenCL support - using metadata

Hi Guy,

I would like to gather more information about how different implementations
represent OpenCL C programs in LLVM-IR in order to collate the best practice
and make it into a standard.

We at Intel are also using metadata for our implementation. This
approach is very useful for a few reasons:
First, when compiling OpenCL C code to llvm, some important information
about the kernel arguments is lost. i.e. typedefs are lost, structures
are sometimes disassembled. We're adding a string containing the
interesting part original function signature.

Hmm, if we want to standardise this, we need to define which parts of the
signature are considered "interesting".

Second, we use a global metadata node, which enumerates all the kernel
functions. This can be used to quickly enumerate the kernels in a
program, instead of passing the whole module.

We are doing exactly the same:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-December/012741.html

Third, for each kernel we hold the kernel attributes in the same
metadata: vector type hint, required workgroup size and workgroup size
hint.

Do you use one metadata node per kernel? (It seems we are doing it
differently.) Do these nodes have the same format (three fields per
kernel)? What do you use for defaults?

Last, we also hold in the metadata a list of the local variables
defined in that kernel.

Do you mean kernel-scope variables declared in the local address space? How
about kernel-scope variables declared in the constant address space? We
are bringing both classes to the program scope, prefixing them with the
kernel's name. Do you have any views on that?

Best, Anton.

Hi Anton,
You can see the current metadata we've below.
Basically, it holds a list of the kernels, as in your patch, but for each kernel it also holds some additional data in the following order: required work group size (or (0,0,0) if not defined), work group size hint (or (0,0,0) if not defined), vec type hint (as a string), the kernel's signature (as a string too), and the name of a named metadata node, that holds a list of the kernel's locals' names.
We'd also like to know how your metadata works - I'm sure we can find a common solution that will be agreed on everyone.
BTW, local variables in our solution are treated much like "static" C variables; they are promoted to the program scope with prefixes (as you might see below). According to my understanding, function scope constant variables are prohibited, the OpenCL 1.1 spec says at 6.5.3: "Variables allocated in the __constant address space can only be defined as program scope variables and are required to be initialized."

kernel __attribute__(( vec_type_hint(char2) )) void foo(global char4* in, global int16* out)
{
  local float4 ff4;
  size_t id = get_global_id(0);
  out[id] = (int16)(as_int(in[id]));
}

kernel __attribute__(( work_group_size_hint(16,1,1) )) void foo2(global char4* in, global int16* out)
{
  local float4 ff4;
  size_t id = get_global_id(0);
  out[id] = (int16)(as_int(in[id]));
}

!opencl.kernels = !{!0, !2}
!opencl_foo_locals_anchor = !{!4}
!opencl_foo2_locals_anchor = !{!5}

!0 = metadata !{void (<4 x i8> addrspace(1)*, <16 x i32> addrspace(1)*)* @foo, metadata !1, metadata !1, metadata !"char2", metadata !"char4 __attribute__((address_space(1))) *, int16 __attribute__((address_space(1))) *", metadata !"opencl_foo_locals_anchor"}
!1 = metadata !{i32 0, i32 0, i32 0}
!2 = metadata !{void (<4 x i8> addrspace(1)*, <16 x i32> addrspace(1)*)* @foo2, metadata !1, metadata !3, metadata !"", metadata !"char4 __attribute__((address_space(1))) *, int16 __attribute__((address_space(1))) *", metadata !"opencl_foo2_locals_anchor"}
!3 = metadata !{i32 16, i32 1, i32 1}
!4 = metadata !{metadata !"opencl_foo_local_ff4"}
!5 = metadata !{metadata !"opencl_foo2_local_ff4"}

Thanks
      Guy Benyei
      Intel
      SSG - MGP OpenCL Development Center

Hi Guy,

We like the idea of a single node for each kernel. (We tried to avoid
providing default information but I don't think we should try and save space
on this.)

required work group size (or (0,0,0) if not defined), work group size
hint (or (0,0,0) if not defined), vec type hint (as a string),

I think the default for vec_type_hint is "int" (6.7.2).

kernel's signature (as a string too),

We are not sure whether this is required. It seems this information can be
obtained directly from the kernel's value?

and the name of a named metadata
node, that holds a list of the kernel's locals' names.

We need to discuss a bit more __local and __constant variables.

BTW, local variables in our solution are treated much like "static" C
variables; they are promoted to the program scope with prefixes (as you
might see below).

We use no metadata for this at all and Clang's conversion for static
variables. So for:

  kernel void foo(...) { local int i; ... }

instead of:

!opencl_foo_locals_anchor = !{!4}
!4 = metadata !{metadata !"opencl_foo_local_i"}

we generate something like:

  @foo.i = internal addrspace(2) global i32 0, align 4

(I assume you use metadata in addition to this; otherwise, how would you be
able represent the type of individual __local variables?)

I see that with your representation one immediately gets hold of all names
for __local variables for a kernel. Do you think it is important?

Is there any benefit in using underscores over dots? (In certain cases,
underscores could lead to a name clash.)

According to my understanding, function scope
constant variables are prohibited, the OpenCL 1.1 spec says at 6.5.3:
"Variables allocated in the __constant address space can only be
defined as program scope variables and are required to be initialized."

This statement has been removed since revision 37. This statement is deemed
correct: "Variables inside a function or in program scope can also be
declared with the __constant address qualifier." (Note that it doesn't say
that a function must be a kernel function.)

Best wishes,
Anton.

Last December I objected to the technique of transforming a __local
variable into a global-scope static variable. [1] [2]. In a private
email, Krister later explained how it all works out ok, as long as the
back end knows about the convention and puts those variables in a
relocatable data section. So for the record I withdraw my objection.
:slight_smile:

However, this is similar to the support for address space conventions
[3]. That is, the target should advertise its support for this
convention.

cheers,
david

[1] http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-December/012448.html
[2] http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-December/012467.html
[3] http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-March/013710.html

Hi,

Last December I objected to the technique of transforming a __local
variable into a global-scope static variable. [1] [2]. In a private
email, Krister later explained how it all works out ok, as long as the

Of course I do not know the explanation of Krister in the private email
but doesn't this approach introduce problems in multithreaded
execution of work groups in a single address space machine?

That is, in case locals are converted to global-scope static variables,
one cannot execute multiple work groups in parallel in the same process
due to the shared storage locations for locals?

I understand it works nicely if you have per core local address spaces
in the machine and can execute the WGs in different cores (like in
NVIDIA GPUs I've understood), but what about the execution in a GPP
SMP multicore execution with threads?

Is this known and accepted limitation or did I just misunderstand
something (which is very likely the case)?

No, it's not obvious how it works. :slight_smile:

The front end converts "local" variables into global-scope static
variables but still retains the distinct address space. The back end
recognizes such variables as special and collects them into a
relocatable section. Accesses are generated as offsets from a base
pointer. (You can discard the address space number at this point!)
When running multiple work groups in parallel, the different work
groups are given different values for the base pointer. That is what
keeps the work groups from stomping on each other's data.

This works out even on a CPU with a single address space. Come to
think of it, it's like using the old 8086 segment registers.

The whole system works provided the convention is consistently applied
all the way from front end to the code generator. The need for
coordination implies a requirement that the Clang+LLVM target inform
the Clang front end that it can compile __local variables in this way.

thanks,
david

OK,

I think I got the basic idea...

relocatable section. Accesses are generated as offsets from a base
pointer. (You can discard the address space number at this point!)
When running multiple work groups in parallel, the different work
groups are given different values for the base pointer. That is what
keeps the work groups from stomping on each other's data.

Going into practical details a bit to ensure I understood this
correctly. Say, one implements an OpenCL kernel launcher that scales
to the number of cores at runtime and implements the actual threading,
for example, with pthreads or some other lower level threading API.

Thus, when launching N parallel work group (WG) execution to
utilize N cores, one must

1) allocate space for the local variables for each
parallel WG thread and
2) somehow pass a pointer to this space as a base value to
the launched kernel so the WG threads do not overwrite each other's
data.

Part 2) is a bit unclear to me. Is the base pointer added as an
additional parameter to the kernel function which the "launcher" can
use? Or do you assume the kernel is loaded to the host program as
a runtime lib and assume the linker does the allocation via
its relocation functionality? Thus, to launch N WG threads one needs
to load the dynlib N times?

Thanks,

OK,

I think I got the basic idea...

relocatable section. Accesses are generated as offsets from a base
pointer. (You can discard the address space number at this point!)
When running multiple work groups in parallel, the different work
groups are given different values for the base pointer. That is what
keeps the work groups from stomping on each other's data.

Going into practical details a bit to ensure I understood this
correctly. Say, one implements an OpenCL kernel launcher that scales
to the number of cores at runtime and implements the actual threading,
for example, with pthreads or some other lower level threading API.

Thus, when launching N parallel work group (WG) execution to
utilize N cores, one must

1) allocate space for the local variables for each
parallel WG thread and
2) somehow pass a pointer to this space as a base value to
the launched kernel so the WG threads do not overwrite each other's
data.

Part 2) is a bit unclear to me. Is the base pointer added as an
additional parameter to the kernel function which the "launcher" can
use? Or do you assume the kernel is loaded to the host program as
a runtime lib and assume the linker does the allocation via
its relocation functionality? Thus, to launch N WG threads one needs
to load the dynlib N times?

I haven't gone through the details of an implementation for (2), but to me
the most clear implementation is to add an implicit parameter to each
kernel to pass the base pointer.

But it's an implementation choice for the backend/runtime system.

And just to be clear: The __local storage is shared between possibly multiple
work item threads in each work group. For example, if your work group is of
size 16, then those 16 threads will simultaneously use the same base pointer
for their __local storage. (For example, that's why it's meaningful to have
barrier(CLK_LOCAL_MEM_FENCE).)

Thanks,
--
Pekka

cheers,
david

But it's an implementation choice for the backend/runtime system.

OK, got it now.

And just to be clear: The __local storage is shared between possibly multiple
work item threads in each work group. For example, if your work group is of
size 16, then those 16 threads will simultaneously use the same base pointer
for their __local storage. (For example, that's why it's meaningful to have
barrier(CLK_LOCAL_MEM_FENCE).)

Yes, this is clear. The base pointer is used to share the local memory
space between the WGs only, not between WIs inside a WG.