Sorry for the slow response.
> Firstly, the declarations of builtin functions. Currently these live
> in header files in libclc's include directory, with target specific
> overrides possible by arranging the order of -I flags, and I intend to
> keep it this way. Optionally, libclc may, as part of its compilation
> process, produce a precompiled header (.pch) file for each target for
> efficiency (reading one large serialised file is more efficient than
> reading and parsing several small files).
When you say "libclc...may produce a precompiled header...", do you
mean that one of the artifacts built with libclc is a .pch file?
This seems like a good idea; I haven't looked at how Clang supports
.pch files. Preliminarily, I was essentially creating a monolithic
"builtin.h" header with all of the prototypes, which got inserted before
compiling the .cl files. All of the tinkering I had done was with Clang
embedded as a library, rather than executed as a separate process.
At a glance, pocl executes clang as a separate process, yes?
It seems that way. libclc should be able to support both scenarios.
> Secondly, the implementation of builtin functions. This is a tricky
> issue, mainly because we must support a wide variety of targets,
> some of which have space restrictions and cannot support a large
> runtime library contained in each executable, and we must support
> inlining for efficiency and because many targets (especially GPUs)
> require it. Initially I thought that the solution to this would be
> to provide "static inline" function definitions in the header files.
> Unfortunately I have since realised that the situation is more
> complicated than that. Some builtin implementations must be written
> in pure LLVM IR, because Clang currently lacks support for emitting
> the necessary instructions. Some builtins use data, such as cosine
> tables, which we should not duplicate in every translation unit.
> As a consequence of this, the implementations of the builtins cannot
> live in the header file.
> Instead, the solution will be to provide a .bc file containing all
> of the builtin function implementations (similar to how you suggest
> above). Clang's frontend will be modified to include support for
> lazily linking bitcode modules (so that only used functions will be
> loaded from the .bc and linked) before performing optimisations.
> Each global in the .bc providing the builtins (this includes the
> builtins themselves, plus any data they use) will use linkonce_odr
> linkage. This linkage provides the same semantics as C++ "inline" --
> it permits inlining, and at most one copy of the global will appear
> in the final executable.
When you say Clang's frontend, does llvm-ld have support for this?
Yes, and this change would essentially be incorporating the llvm-ld
functionality into Clang. Clang wouldn't be calling out to llvm-ld,
it would be using the LLVM module linker used internally by llvm-ld
directly, saving a round trip to disk as a .bc file.
I recently implemented the frontend requirements for this in Clang (by
adding a -mlink-bitcode-file flag), and committed it today as r143314.
I'm less familiar with some of the link-time optimization things that
have been done. It seems that the bitcode modules could be linked
normally, and then a pass could be run to remove uncalled functions.
This would work right now (the name of the pass is GlobalDCE), but the
problem with this is that it would involve materialising (reading from
disk) every builtin function and then deleting the vast majority of
them (most OpenCL C programs will not use more than a few builtins).
This is undesirable from an efficiency perspective.
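The GlobalDCE approach described above can be sketched as a reachability walk over the call graph: every builtin is linked in first, everything reachable from the kernel is marked, and the rest is deleted. The module below is a made-up example:

```python
def global_dce(call_graph, roots):
    """Remove functions not reachable from roots.
    call_graph maps each function to the functions it calls."""
    reachable = set()
    worklist = list(roots)
    while worklist:
        fn = worklist.pop()
        if fn in reachable:
            continue
        reachable.add(fn)
        worklist.extend(call_graph.get(fn, []))
    return {fn: callees for fn, callees in call_graph.items() if fn in reachable}

# Hypothetical module after linking: the kernel uses cos, which uses a
# table-lookup helper; the other builtins (here just sin and tan, standing
# in for hundreds) were materialised from disk only to be thrown away.
module = {
    "kernel": ["cos"],
    "cos": ["__cos_table_lookup"],
    "__cos_table_lookup": [],
    "sin": [],
    "tan": [],
}
print(sorted(global_dce(module, ["kernel"])))
```

This works, but note that every function body had to be read before the unreachable ones could be deleted, which is exactly the inefficiency described above.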
A patch was recently proposed to the llvm-commits mailing list to add
the lazy linking functionality to LLVM's module linker (the author,
Tanya Lattner, is one of the authors of Apple's OpenCL implementation,
so it looks like Apple is solving this issue in the same or similar
way) so once it is merged, we will be able to do this.
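Lazy linking inverts that order of operations: a builtin's body is only materialised when an unresolved reference demands it, so unused builtins are never read at all. A toy sketch, where the on-disk .bc module is simulated by a dict and a counter stands in for disk reads:

```python
def lazy_link(program_calls, builtin_module):
    """Materialise only the builtins (transitively) referenced by the
    program. Returns the linked bodies and how many were 'read from disk'."""
    linked = {}
    reads = 0
    worklist = list(program_calls)
    while worklist:
        name = worklist.pop()
        if name in linked or name not in builtin_module:
            continue
        body, callees = builtin_module[name]  # materialise on demand
        reads += 1
        linked[name] = body
        worklist.extend(callees)
    return linked, reads

# Hypothetical .bc contents: (body, callees) per builtin.
builtins = {
    "cos": ("<cos body>", ["__cos_table_lookup"]),
    "__cos_table_lookup": ("<table lookup>", []),
    "sin": ("<sin body>", []),
    "tan": ("<tan body>", []),
}
linked, reads = lazy_link(["cos"], builtins)
print(sorted(linked), reads)  # only 2 of the 4 bodies are ever touched
```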
> You mentioned overloaded functions. This is already handled by Clang's
> IR generator. Any function marked with __attribute__((overloadable))
> will have its name mangled according to the Itanium C++ ABI name
> mangling rules.
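For a free (non-namespaced) function, the Itanium scheme is just `_Z`, the length-prefixed name, then one code per parameter type, so cos(float) becomes _Z3cosf. A toy mangler covering a few scalar types (Clang additionally encodes OpenCL vector types, e.g. float4 as Dv4_f, which is omitted here):

```python
# Builtin parameter-type codes from the Itanium C++ ABI (scalar subset only).
TYPE_CODES = {"float": "f", "double": "d", "int": "i", "unsigned int": "j"}

def itanium_mangle(name, param_types):
    """Mangle a free function per the Itanium C++ ABI, as Clang does
    for functions marked __attribute__((overloadable))."""
    return "_Z" + str(len(name)) + name + "".join(
        TYPE_CODES[t] for t in param_types)

print(itanium_mangle("cos", ["float"]))   # _Z3cosf
print(itanium_mangle("cos", ["double"]))  # _Z3cosd
```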
Right, the overloaded functions are mangled. What I meant is that
some builtins are not linked in, so when the LLVM JIT references an
unknown symbol it calls an optional function resolver, which Clover
also provides. I suspect that this resolver needs to understand the
mangled name, rather than the bare name. If you look at the resolver,
it currently doesn't deal with mangled names.
In the majority of cases, the implementation would not need to provide
any of the overloaded builtins, since the bitcode file providing the
builtins can provide them. In cases where it would need to provide
them (typically, builtins which need to interact with the external
environment, such as work-item functions), the bitcode file can
contain a straightforward wrapper around another function with a known
(unmangled) name, which the JIT (or another part of the implementation)
can then resolve.
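The wrapper idea can be modelled like this: the builtins .bc provides a mangled wrapper whose body simply calls a fixed, unmangled entry point, so the resolver only ever has to understand plain names. All of the names below are hypothetical:

```python
# The runtime's resolver only knows plain, unmangled names.
def resolver(plain_name):
    host_functions = {"__clc_get_global_id": lambda dim: 42}  # stub environment
    return host_functions[plain_name]

# What the builtins .bc would contain: a mangled wrapper forwarding to the
# unmangled entry point ("j" encodes an unsigned int parameter).
def make_wrapper(plain_name):
    target = resolver(plain_name)
    return lambda *args: target(*args)

symbols = {"_Z13get_global_idj": make_wrapper("__clc_get_global_id")}
print(symbols["_Z13get_global_idj"](0))  # 42
```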
> Some targets, as part of their ABI, require a specific set of external
> symbols to be present in every object file, and those symbols must
> appear exactly once (an example being the _global_block_offset and
> other symbols used by NVIDIA's OpenCL implementation). The solution
> to this would be to provide those symbols in a separate .bc file.
> That file would serve a similar role to glibc's crt0.o, and would be
> linked into every final executable during the final link step.
Could you explain this a bit further? I understand that some targets
may need other symbols. That's ok.
I'm unclear as to what you mean by final executable. If I have a
file.cl with a number of kernels and support functions, the OpenCL
runtime needs to be able to execute the kernels. What executable
is coming into the picture? Or do you mean "program"?
I mean the final output of compiling the .cl file(s), such as a
.ptx file (so perhaps "executable" was a poor choice of word). In a
normal OpenCL compilation scenario (using clCreateProgramWithSource
and clBuildProgram) there would be only one .cl file per output file,
but with libclc I also intend to support separate compilation of .cl
files and linking of intermediate object files (which may in fact
be .bc files) into a single output file (this has the side effect
of forcing us to think about linkage very carefully). The builtins
would be linked into every object file at compile time, but clrt0.bc
(or whatever) would be linked once at link time.
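The distinction above can be sketched as a tiny module linker: linkonce_odr definitions may appear in several object files but collapse to a single copy, while the clrt0-style symbols are added exactly once at the final link step. The symbol tables below are invented for illustration:

```python
def link_objects(objects, crt_module):
    """Merge per-object symbol tables. linkonce_odr definitions may be
    duplicated across objects but survive as one copy; crt_module
    (the clrt0.bc analogue) is linked in exactly once at the end."""
    output = {}
    for obj in objects:
        for name, (linkage, body) in obj.items():
            if name in output:
                if linkage == "linkonce_odr":
                    continue  # keep the single copy already present
                raise ValueError("duplicate strong symbol: " + name)
            output[name] = (linkage, body)
    output.update(crt_module)  # e.g. _global_block_offset, added once
    return output

# Two separately compiled .cl files, each with the builtins linked in.
obj1 = {"kernel_a": ("external", "..."), "cos": ("linkonce_odr", "<cos>")}
obj2 = {"kernel_b": ("external", "..."), "cos": ("linkonce_odr", "<cos>")}
crt0 = {"_global_block_offset": ("external", "<data>")}
print(sorted(link_objects([obj1, obj2], crt0)))
```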
> How would clients use these artifacts? Another feature of libclc
> will be that clients will not need to worry about any of this.
> The Clang driver will be taught to pass the necessary flags to the
> Clang frontend, and the intention is that a command line such as this:
> $ clang -target ptx32--nvidiacl -o file.ptx file.cl
So, I'm a little unclear as to what exactly this is going to produce.
file.ptx will have all of the .ptx assembly for all of the referenced
functions, so it can be assembled into the executable referenced above?
No, file.ptx is the final output. PTX is a strange target in that
there isn't really an assembler (I know about ptxas, but that isn't
really an assembler in the traditional sense). file.ptx can be used
directly by NVIDIA's tools, which will invoke ptxas if necessary.