Hi guys,
Just catching up on an interesting thread.
> I believe this may be a direction worth going, but I doubt now is the
> right moment for it. I don't share your opinion that it is easy to
> move LLVM-IR in this direction; rather, I believe this is an
> engineering project that will take several months of full-time work.
From a philosophical perspective, there can be times when it makes sense to do something short-term to gain experience, but we try not to keep that sort of thing in the tree for a whole release cycle, because then we have to stay compatible with it forever.
Also, I know you're not saying it, but the "I don't want to do the right thing, because it is too much work" sentiment grates on me: that's a perfect case for keeping a patch local and out of the llvm.org tree. Again, I know that this is not what you're trying to get at.
David wrote:
> Again, we already have many of the changes needed to make this
> possible. I hope to send them for review as we upgrade to 3.1.
A vague promise to release some code that may or may not be relevant is also not particularly useful.
> I want clang to automatically create executables that use CUDA/OpenCL
> to offload core computations (from plain C code). This should be
> implemented in an external LLVM-IR optimization pass:
>
>   clang -Xclang -load -Xclang CUDAGenerator.so file.c -O3 -mllvm -offload-cuda
>
> The very same should work for Pure, dragonegg, and basically any
> compiler based on LLVM. So I do not want to change clang at all
> (except for possibly linking to -lcuda).
Ok, that *is* an interesting use case. It would be great for LLVM to support this kind of thing. We're clearly not set up for it out of the box right now.
> In terms of complexity: the only alternative proposal I have heard of
> was making LLVM-IR multi-module aware, or adding multi-module support
> to all the LLVM-IR tools. Both of these changes are way more complex
> than the codegen intrinsic. Actually, they are so complex that I doubt
> they can be implemented any time soon. What is the simpler approach
> you are talking about?
I also don't like the intrinsic, but not because of security ;-). For me, it is because embedding arbitrary blobs of IR in an *instruction* doesn't make sense. The position of the instruction in the parent function doesn't necessarily have anything to do with the code attached, the intrinsic can be duplicated, deleted, moved around, etc. It is also poorly specified what is allowed and legal.
Unlike the related-but-different problem of "multi-versioning", it also doesn't make sense for PTX code to be functions in the same module as X86 IR functions. If your desire was for a module to have an SSE2, SSE3, and SSE4 version of the same function, then it *would* make sense for them to be in the same module... because there is linkage between them, and a runtime dispatcher. We don't have the infrastructure yet for per-function CPU flags, but this is something that we will almost certainly grow at some point (just need a clean design). This doesn't help you though.
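Just to make the multi-versioning case concrete, here's roughly what the "linkage between them, plus a runtime dispatcher" shape looks like at the source level. This is purely an illustrative sketch: the kernel names are made up, and the GCC-style __builtin_cpu_supports check stands in for whatever per-function CPU-flag infrastructure we eventually grow:

  // Three variants of the same function living in one module, plus a
  // runtime dispatcher that picks one based on the host CPU.
  static void kernel_sse2(float *d, int n) { /* SSE2 version */ }
  static void kernel_sse3(float *d, int n) { /* SSE3 version */ }
  static void kernel_sse4(float *d, int n) { /* SSE4.1 version */ }

  static void (*resolve_kernel())(float *, int) {
    __builtin_cpu_init();                       // GCC/Clang builtin
    if (__builtin_cpu_supports("sse4.1")) return kernel_sse4;
    if (__builtin_cpu_supports("sse3"))   return kernel_sse3;
    return kernel_sse2;
  }

  // The dispatcher resolves the implementation once, on first call.
  void kernel(float *d, int n) {
    static void (*impl)(float *, int) = resolve_kernel();
    impl(d, n);
  }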
The design that makes sense to me for this is the multi-module approach. The PTX and X86 code *should* be in different LLVM Modules from each other. I agree that this makes a "vectorize host code to the GPU" optimization pass different than other existing passes, but I don't think that's a bad thing. Realistically, the compiler driver that this is embedded into (clang, dragonegg, or whatever) will need to know about both targets to some extent anyway: to handle command-line options for selecting the PTX/GPU version, to decide where and how to output both chunks of code in the output file, etc.
Given that the driver has to have *some* knowledge of this anyway, it doesn't seem problematic for the second module to be passed into the pass constructor. Instead of something like:
  PM.add(new OffloadToCudaPass());
You end up doing:
  Module *CudaModule = new Module(...);
  PM.add(new OffloadToCudaPass(CudaModule));
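To flesh that out a little (purely a sketch against the current C++ pass interfaces; the candidate check and the actual extraction are hand-waved, and the names are made up):

  #include "llvm/Pass.h"
  #include "llvm/Module.h"

  using namespace llvm;

  namespace {
    // Sketch only: the device module is created and owned by the
    // driver; the pass just fills it in.
    struct OffloadToCudaPass : public ModulePass {
      static char ID;
      Module *CudaModule;

      explicit OffloadToCudaPass(Module *DeviceM)
        : ModulePass(ID), CudaModule(DeviceM) {}

      virtual bool runOnModule(Module &M) {
        bool Changed = false;
        for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F) {
          if (!isOffloadCandidate(*F))   // made-up profitability check
            continue;
          // Clone the candidate into *CudaModule as a kernel and
          // rewrite the host function to launch it; the interesting
          // work all goes here.
          Changed = true;
        }
        return Changed;
      }

      bool isOffloadCandidate(Function &F) { return false; } // placeholder
    };
  }

  char OffloadToCudaPass::ID = 0;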
This also means that the compiler driver is responsible for deciding what to do with the module after it is formed (and of course, it may be empty if nothing is offloaded). Depending on the compiler it's embedded into, it may immediately JIT to PTX and upload to a GPU, it may write the IR out to a file, it may run the PTX code generator and output the PTX to another section of the executable, or whatever. I do agree that this makes it more awkward to work with "opt" on the command line, and that clang plugins are ideally suited for this, but opt is already suboptimal for a lot of things (e.g. anything that requires target info), and we should improve clang plugins, not work around their limitations, IMO.
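To be concrete about the driver side, here's a sketch of the kind of thing I mean (again, names invented for illustration):

  #include "llvm/PassManager.h"

  // Driver side: create the device module up front, run the host
  // pipeline, then decide what to do with whatever was offloaded.
  void compileWithOffload(llvm::Module &HostModule) {
    llvm::Module *CudaModule =
        new llvm::Module("device.cuda", HostModule.getContext());

    llvm::PassManager PM;
    PM.add(new OffloadToCudaPass(CudaModule));
    PM.run(HostModule);

    if (!CudaModule->empty()) {
      // e.g. emit bitcode, run the PTX backend and stash the result
      // in a section of the object file, or JIT and upload to the GPU.
    }
    delete CudaModule;
  }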
What do you think?
-Chris