Integration Question

Hello,

Thanks for taking on the libclc project; it is nice to start seeing more open
source OpenCL pieces appear. I would be interested in contributing.
Forgive me, I haven’t followed all of the discussions to this point.

I have a couple of questions (based on what I see available now/project description) on how you would see libclc used by other OpenCL components.

In the OpenCL runtime (not part of your project, but a project which would be
the consumer of your project), when someone calls clCreateKernel, what artifacts from libclc would they be using? Meaning, inlined headers? Already compiled .bc files for that platform to link against? A shared/static lib for that platform to load/link into the runtime? Something else?

I’ve thought a bit/dabbled in writing something similar to libclc, and my approach was compiling a (platform specific) .bc file which was loaded by
the runtime to link kernels against, and then the linked kernels were JIT-ed by the runtime. In some sense, this is close to what Clover does, except the builtins are loaded/linked to the kernels at run time, rather than linked into the runtime and made available through symbol look-up. (One thing that I’m not sure Denis has done is symbol lookup for the overloaded functions…which involves looking up the name-mangled symbol, IIUC).

I’m interested in the approach you plan to take, and in pitching in!

Pete

Hi Pete,

Thanks for your interest in libclc. I've been thinking a lot in the
past few days about exactly which artifacts libclc should provide
and the entire compilation process.

Firstly, the declarations of builtin functions. Currently these live
in header files in libclc's include directory, with target specific
overrides possible by arranging the order of -I flags, and I intend to
keep it this way. Optionally, libclc may, as part of its compilation
process, produce a precompiled header (.pch) file for each target for
efficiency (reading one large serialised file is more efficient than
reading and parsing several small files).

Secondly, the implementation of builtin functions. This is a tricky
issue, mainly because we must support a wide variety of targets,
some of which have space restrictions and cannot support a large
runtime library contained in each executable, and we must support
inlining for efficiency and because many targets (especially GPUs)
require it. Initially I thought that the solution to this would be
to provide "static inline" function definitions in the header files.
Unfortunately I have since realised that the situation is more
complicated than that. Some builtin implementations must be written
in pure LLVM IR, because Clang currently lacks support for emitting
the necessary instructions. Some builtins use data, such as cosine
tables, which we should not duplicate in every translation unit.
As a consequence of this, the implementations of the builtins cannot
live in the header file.

Instead, the solution shall be to provide a .bc file providing all
of the builtin function implementations (similar to how you suggest
above). Clang's frontend will be modified to include support for
lazily linking bitcode modules (so that only used functions will be
loaded from the .bc and linked) before performing optimisations.
Each global in the .bc providing the builtins (this includes the
builtins themselves, plus any data they use) will use linkonce_odr
linkage. This linkage provides the same semantics as C++ "inline" --
it permits inlining, and at most one copy of the global will appear
in the final executable.

You mentioned overloaded functions. This is already handled by Clang's
IR generator. Any function marked with __attribute__((overloadable))
will have its name mangled according to the Itanium C++ ABI name
mangling rules.
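
As an illustration of what those mangled names look like, here is a minimal Python sketch of the Itanium scheme for simple builtin signatures. It is not Clang's implementation: the `mangle` and `seq_id` helpers are hypothetical, the type table is a small subset, and pointer or address-space-qualified parameters are not handled. Vector types are encoded as `Dv<width>_<element>` and a repeated vector type is replaced by a substitution reference (`S_`, `S0_`, ...), while scalar builtin types are never substituted.

```python
# Itanium type codes for a few OpenCL C scalar types (illustrative subset).
SCALAR_CODES = {
    "char": "c", "uchar": "h", "short": "s", "ushort": "t",
    "int": "i", "uint": "j", "long": "l", "ulong": "m",
    "float": "f", "double": "d",
}

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def seq_id(index):
    """Substitution reference: S_ for the first candidate, then S0_, S1_, ..."""
    if index == 0:
        return "S_"
    n, out = index - 1, ""
    while True:
        out = DIGITS[n % 36] + out
        n //= 36
        if n == 0:
            break
    return "S" + out + "_"


def mangle(name, params):
    """Mangle a function taking only scalar or vector parameters.

    params is a list of OpenCL C type names such as "float" or "float4".
    """
    out = "_Z%d%s" % (len(name), name)
    if not params:
        return out + "v"  # an empty parameter list is encoded as void
    seen = []  # vector types already emitted, in order of first appearance
    for param in params:
        base = param.rstrip("0123456789")
        width = param[len(base):]
        code = SCALAR_CODES[base]
        if not width:
            out += code  # scalar builtin types are never substituted
        elif param in seen:
            out += seq_id(seen.index(param))
        else:
            seen.append(param)
            out += "Dv%s_%s" % (width, code)
    return out


print(mangle("sin", ["float"]))
print(mangle("fma", ["float4", "float4", "float4"]))
```

A symbol resolver that wants to recognise an overloaded builtin would have to match names of this shape rather than the bare `sin` or `fma`.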

Some targets, as part of their ABI, require a specific set of external
symbols to be present in every object file, and those symbols must
appear exactly once (an example being the _global_block_offset and
other symbols used by NVIDIA's OpenCL implementation). The solution
to this would be to provide those symbols in a separate .bc file.
That file would serve a similar role to glibc's crt0.o, and would be
linked into every final executable during the final link step.

How would clients use these artifacts? Another feature of libclc
will be that clients will not need to worry about any of this.
The Clang driver will be taught to pass the necessary flags to the
Clang frontend, and the intention is that a command line such as this:

$ clang -target ptx32--nvidiacl -o file.ptx file.cl

would just work -- the semantics of such a command line driver
invocation would be equivalent to the invocation of a program which
uses the OpenCL platform layer and runtime APIs to build an OpenCL C
program with the given flags (excluding -target, -o and input files)
using clCreateProgramWithSource and clBuildProgram, and then uses
clGetProgramInfo to dump the binaries. As a side effect of this, the
implementation of clBuildProgram would be very simple -- it would only
need to invoke the driver with a few command line options in addition
to the flags provided by the user as a parameter to clBuildProgram.
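
To make the equivalence concrete, the following Python sketch shows how a clBuildProgram implementation might assemble the driver command line. It is only a sketch: the `driver_command` helper is hypothetical, it builds the argument list without running anything, and it assumes the user's build options can be split with shell-like quoting rules.

```python
import shlex


def driver_command(target, source, output, build_options=""):
    """Return the argv for a driver invocation equivalent to building `source`.

    build_options is the option string the user passed to clBuildProgram.
    """
    argv = ["clang", "-target", target, "-o", output, source]
    # The user's options are appended unchanged after the fixed arguments.
    return argv + shlex.split(build_options)


print(driver_command("ptx32--nvidiacl", "file.cl", "file.ptx",
                     "-cl-fast-relaxed-math -DN=4"))
```

With an empty option string this reproduces the command line shown above; everything target specific is left to the driver.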

Clang provides an API for invoking its driver (see the
clang::createInvocationFromCommandLine function). There may also be
a small wrapper library for clBuildProgram implementations to use,
to simplify the entire process. This could be part of libclc or
perhaps a separate project.

Thanks,

Hello,

I see, that gives me a better idea. I was working on the builtin
function support, so these issues sound familiar :).

> Firstly, the declarations of builtin functions. Currently these live
> in header files in libclc's include directory, with target specific
> overrides possible by arranging the order of -I flags, and I intend to
> keep it this way. Optionally, libclc may, as part of its compilation
> process, produce a precompiled header (.pch) file for each target for
> efficiency (reading one large serialised file is more efficient than
> reading and parsing several small files).

When you say “libclc…may produce a precompiled header…”, do you
mean “one of the artifacts built with libclc is a .pch file”? (Just clarifying).
This seems like a good idea, I haven’t looked at how clang supports
.pch files. Preliminarily, I was essentially creating a monolithic
“builtin.h” header with all of the prototypes which got inserted before
compiling the .cl files. All of the tinkering I had done was with clang
embedded as a library, rather than executed as a separate process.
At a glance, pocl executes clang as a separate process, yes?

> Secondly, the implementation of builtin functions. This is a tricky
> issue, mainly because we must support a wide variety of targets,
> some of which have space restrictions and cannot support a large
> runtime library contained in each executable, and we must support
> inlining for efficiency and because many targets (especially GPUs)
> require it. Initially I thought that the solution to this would be
> to provide "static inline" function definitions in the header files.
> Unfortunately I have since realised that the situation is more
> complicated than that. Some builtin implementations must be written
> in pure LLVM IR, because Clang currently lacks support for emitting
> the necessary instructions. Some builtins use data, such as cosine
> tables, which we should not duplicate in every translation unit.
> As a consequence of this, the implementations of the builtins cannot
> live in the header file.
>
> Instead, the solution shall be to provide a .bc file providing all
> of the builtin function implementations (similar to how you suggest
> above). Clang's frontend will be modified to include support for
> lazily linking bitcode modules (so that only used functions will be
> loaded from the .bc and linked) before performing optimisations.
> Each global in the .bc providing the builtins (this includes the
> builtins themselves, plus any data they use) will use linkonce_odr
> linkage. This linkage provides the same semantics as C++ "inline" --
> it permits inlining, and at most one copy of the global will appear
> in the final executable.

When you say clang’s frontend, does llvm-ld have support for this?
I’m less familiar with some of the link-time optimization things that
have been done. It seems that the bitcode modules could be linked
normally, and then a pass could be run to remove uncalled functions.

> You mentioned overloaded functions. This is already handled by Clang's
> IR generator. Any function marked with __attribute__((overloadable))
> will have its name mangled according to the Itanium C++ ABI name
> mangling rules.

Right, the overloaded functions are mangled. What I meant is that in Clover,
some builtins are not linked in, so when the LLVM JIT refs an unknown function,
it calls an optional function resolver, which Clover also provides. I believe
that this resolver needs to understand the mangled name, rather than the
bare name. If you look at the resolver, it currently doesn’t deal with the
overloaded builtins.
http://cgit.freedesktop.org/~steckdenis/clover/tree/src/core/cpu/builtins.cpp:416

> Some targets, as part of their ABI, require a specific set of external
> symbols to be present in every object file, and those symbols must
> appear exactly once (an example being the _global_block_offset and
> other symbols used by NVIDIA's OpenCL implementation). The solution
> to this would be to provide those symbols in a separate .bc file.
> That file would serve a similar role to glibc's crt0.o, and would be
> linked into every final executable during the final link step.

Could you explain this a bit further? I understand that some targets
may need other symbols. That’s ok.
I’m unclear as to what you mean by final executable. If I have a
file.cl with a number of kernels and support functions, the OpenCL
runtime needs to be able to execute the kernels. What executable
is coming into the picture? Or do you mean “program”?

> How would clients use these artifacts? Another feature of libclc
> will be that clients will not need to worry about any of this.
> The Clang driver will be taught to pass the necessary flags to the
> Clang frontend, and the intention is that a command line such as this:
>
> $ clang -target ptx32--nvidiacl -o file.ptx file.cl

So, I’m a little unclear as to what exactly this is going to produce.
file.ptx will have all of the .ptx assembly for all of the referenced builtins,
so it can be assembled into the executable referenced above?

> would just work -- the semantics of such a command line driver
> invocation would be equivalent to the invocation of a program which
> uses the OpenCL platform layer and runtime APIs to build an OpenCL C
> program with the given flags (excluding -target, -o and input files)
> using clCreateProgramWithSource and clBuildProgram, and then uses
> clGetProgramInfo to dump the binaries. As a side effect of this, the
> implementation of clBuildProgram would be very simple -- it would only
> need to invoke the driver with a few command line options in addition
> to the flags provided by the user as a parameter to clBuildProgram.

Ok, this gives me more of an idea where you’re headed. Thanks for the
explanation. Sounds great!

> Clang provides an API for invoking its driver (see the
> clang::createInvocationFromCommandLine function). There may also be
> a small wrapper library for clBuildProgram implementations to use,
> to simplify the entire process. This could be part of libclc or
> perhaps a separate project.
>
> Thanks,
>
> Peter

Thank you for the detailed explanation.

Pete

Hi Pete,

Sorry for the slow response.

> Firstly, the declarations of builtin functions. Currently these live
> in header files in libclc's include directory, with target specific
> overrides possible by arranging the order of -I flags, and I intend to
> keep it this way. Optionally, libclc may, as part of its compilation
> process, produce a precompiled header (.pch) file for each target for
> efficiency (reading one large serialised file is more efficient than
> reading and parsing several small files).
>
>
When you say "libclc...may produce a precompiled header...", do you
mean "one of the artifacts built with libclc is a .pch file"? (Just
clarifying).

Yes.

This seems like a good idea, I haven't looked at how clang supports
.pch files. Preliminarily, I was essentially creating a monolithic
"builtin.h" header with all of the prototypes which got inserted before
compiling the .cl files. All of the tinkering I had done was with clang
embedded as a library, rather than executed as a separate process.
At a glance, pocl executes clang as a separate process, yes?

It seems that way. libclc should be able to support both scenarios.

> Secondly, the implementation of builtin functions. This is a tricky
> issue, mainly because we must support a wide variety of targets,
> some of which have space restrictions and cannot support a large
> runtime library contained in each executable, and we must support
> inlining for efficiency and because many targets (especially GPUs)
> require it. Initially I thought that the solution to this would be
> to provide "static inline" function definitions in the header files.
> Unfortunately I have since realised that the situation is more
> complicated than that. Some builtin implementations must be written
> in pure LLVM IR, because Clang currently lacks support for emitting
> the necessary instructions. Some builtins use data, such as cosine
> tables, which we should not duplicate in every translation unit.
> As a consequence of this, the implementations of the builtins cannot
> live in the header file.
>

> Instead, the solution shall be to provide a .bc file providing all
> of the builtin function implementations (similar to how you suggest
> above). Clang's frontend will be modified to include support for
> lazily linking bitcode modules (so that only used functions will be
> loaded from the .bc and linked) before performing optimisations.
> Each global in the .bc providing the builtins (this includes the
> builtins themselves, plus any data they use) will use linkonce_odr
> linkage. This linkage provides the same semantics as C++ "inline" --
> it permits inlining, and at most one copy of the global will appear
> in the final executable.
>
>
When you say clang's frontend, does llvm-ld have support for this?

Yes, and this change would essentially be incorporating the llvm-ld
functionality into Clang. Clang wouldn't be calling out to llvm-ld,
it would be using the LLVM module linker used internally by llvm-ld
directly, saving a round trip to disk as a .bc file.

I recently implemented the frontend requirements for this in Clang (by
adding a -mlink-bitcode-file flag), and committed it today as r143314.

I'm less familiar with some of the link-time optimization things that
have been done. It seems that the bitcode modules could be linked
normally, and then a pass could be run to remove uncalled functions.

This would work right now (the name of the pass is GlobalDCE), but the
problem with this is that it would involve materialising (reading from
disk) every builtin function and then deleting the vast majority of
them (most OpenCL C programs will not use more than a few builtins).
This is undesirable from an efficiency perspective.

A patch was recently proposed to the llvm-commits mailing list to add
the lazy linking functionality to LLVM's module linker, so once it is
merged we will be able to do this. (The author, Tanya Lattner, is one
of the authors of Apple's OpenCL implementation, so it looks like Apple
is solving this issue in the same or a similar way.)
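
The difference between the two approaches can be sketched with a toy model in Python. This is not LLVM's module linker: the `lazily_link` helper and the library contents are hypothetical, and a "module" is reduced to a map from each function name to the names it references. The point is only that lazy linking walks outwards from what the program uses, so unused builtins are never loaded, whereas link-then-GlobalDCE would load every entry first.

```python
def lazily_link(program_refs, library):
    """Return the set of library symbols reachable from program_refs.

    library maps each symbol to the list of symbols it references.
    Only symbols in the returned set would need to be materialised.
    """
    needed = set()
    worklist = list(program_refs)
    while worklist:
        name = worklist.pop()
        if name in needed or name not in library:
            continue  # already loaded, or not provided by the library
        needed.add(name)
        worklist.extend(library[name])  # pull in whatever it depends on
    return needed


# Hypothetical builtin library: sin and cos share a lookup table.
library = {
    "sin": ["trig_table"],
    "cos": ["trig_table"],
    "trig_table": [],
    "sqrt": [],
    "exp": [],
}

print(sorted(lazily_link(["sin"], library)))
```

In this toy library a program that only calls `sin` touches two of the five entries; the other three are never read.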

> You mentioned overloaded functions. This is already handled by Clang's
> IR generator. Any function marked with __attribute__((overloadable))
> will have its name mangled according to the Itanium C++ ABI name
> mangling rules.
>
>
Right, the overloaded functions are mangled. What I meant is that in Clover,
some builtins are not linked in, so when the LLVM JIT refs an unknown function,
it calls an optional function resolver, which Clover also provides. I believe
that this resolver needs to understand the mangled name, rather than the
bare name. If you look at the resolver, it currently doesn't deal with the
overloaded builtins.
http://cgit.freedesktop.org/~steckdenis/clover/tree/src/core/cpu/builtins.cpp:416

In the majority of cases, the implementation would not need to provide
any of the overloaded builtins, since the bitcode file providing the
builtins can provide them. In cases where it would need to provide
them (typically, builtins which need to interact with the external
environment, such as work-item functions), the bitcode file can
contain a straightforward wrapper around another function with a known
(unmangled) name, which the JIT (or other aspect of the implementation)
would provide.

> Some targets, as part of their ABI, require a specific set of external
> symbols to be present in every object file, and those symbols must
> appear exactly once (an example being the _global_block_offset and
> other symbols used by NVIDIA's OpenCL implementation). The solution
> to this would be to provide those symbols in a separate .bc file.
> That file would serve a similar role to glibc's crt0.o, and would be
> linked into every final executable during the final link step.
>
>
Could you explain this a bit further? I understand that some targets
may need other symbols. That's ok.
I'm unclear as to what you mean by final executable. If I have a
file.cl with a number of kernels and support functions, the OpenCL
runtime needs to be able to execute the kernels. What executable
is coming into the picture? Or do you mean "program"?

I mean the final output of compiling the .cl file(s), such as a
.ptx file (so perhaps "executable" was a poor choice of word). In a
normal OpenCL compilation scenario (using clCreateProgramWithSource
and clBuildProgram) there would be only one .cl file per output file,
but with libclc I also intend to support separate compilation of .cl
files and linking of intermediate object files (which may in fact
be .bc files) into a single output file (this has the side effect
of forcing us to think about linkage very carefully). The builtins
would be linked into every object file at compile time, but clrt0.bc
(or whatever) would be linked once at link time.
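
Why the linkage choice matters under separate compilation can be shown with another toy model in Python. Again this is not LLVM's linker: the `link` helper is hypothetical, each module is reduced to a map from symbol name to linkage, and only two linkage kinds are modelled. Duplicate linkonce_odr definitions collapse into one, an ordinary external definition takes precedence over a linkonce_odr one, and two external definitions of the same symbol are an error.

```python
def link(modules):
    """Merge modules given as {symbol: linkage} dictionaries.

    linkage is either "external" or "linkonce_odr".
    """
    linked = {}
    for module in modules:
        for symbol, linkage in module.items():
            previous = linked.get(symbol)
            if previous is None or previous == "linkonce_odr":
                # First definition, or a replacement for a mergeable copy.
                linked[symbol] = linkage
            elif linkage == "linkonce_odr":
                continue  # the existing external definition wins
            else:
                raise ValueError("duplicate definition of " + symbol)
    return linked


# Two separately compiled objects, each carrying its own copy of a builtin.
a = {"kernel_a": "external", "_Z3sinf": "linkonce_odr"}
b = {"kernel_b": "external", "_Z3sinf": "linkonce_odr"}
# Stand-in for clrt0.bc: symbols that must appear exactly once.
clrt0 = {"_global_block_offset": "external"}

print(sorted(link([a, b, clrt0])))
```

Linking `clrt0` into both `a` and `b` at compile time would make the final link fail in this model, which is the reason for adding it once at link time instead.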

> How would clients use these artifacts? Another feature of libclc
> will be that clients will not need to worry about any of this.
> The Clang driver will be taught to pass the necessary flags to the
> Clang frontend, and the intention is that a command line such as this:
>
> $ clang -target ptx32--nvidiacl -o file.ptx file.cl
>
>
So, I'm a little unclear as to what exactly this is going to produce.
file.ptx will have all of the .ptx assembly for all of the referenced builtins,
so it can be assembled into the executable referenced above?

No, file.ptx is the final output. PTX is a strange target in that
there isn't really an assembler (I know about ptxas, but that isn't
really an assembler in the traditional sense). file.ptx can be used
directly by NVIDIA's tools, which will invoke ptxas if necessary.

Thanks,