feasibility of C++ to inline-assembly using clang

Hi, I'm trying to get a rough opinion of the difficulty of the
following task from knowledgeable people. (I'm a scientific programmer
who knows about SIMD instructions but none of the other vagaries of
x86 or compiler construction.)

For a complete (C-style) C++ function that takes just scalars and
pointers as arguments and calls no other functions except ones that
are guaranteed to be inlined (so the function is pure computation on
preallocated memory), I'd like to modify clang++ to compile the
function (in the context of the appropriate headers) to full machine
code and then output a gcc-style inline assembly snippet that can be
pasted in to replace the body of the function in the source. (I do
mean assembly, because I want to keep the scheduling and register
allocation.) Presumably the main body of the generated code is not
going to be a problem (although I'm unclear about potential label
uniqueness issues), but I expect there's a trick to figuring out
which of clang's function setup and cleanup instructions (the
prologue and epilogue) are still needed in the new context. As a
quick opinion, do experts think this is a relatively small task for a
compiler novice?

In case anyone asks "why do I want to do this strange thing?", I'm
looking at the feasibility of writing compute-heavy web apps using
Google's NativeClient technology (in C++). This requires that the
final binary be compiled from source by the latest version of the
native client C++ compiler, which is a patched version of some g++
version that I don't control (unless I want to try to manually port
the particular patches to a different g++ version). Unfortunately the
code produced by the g++ optimizers/schedulers oscillates in quality
from release to release for very performance-sensitive inner loops.
Given that I'd still very much prefer to write C++ rather than raw
assembly for the "inner loop-y" functions, I'm looking at ways to
separate the generation and optimisation of this code from the need
to go through the latest native client compiler. (Clang/LLVM seem to
have a much better separation of concerns than gcc, so I'd think it
would be easier for me to modify them to output appropriate asm
constructs than to modify g++.)

Many thanks for any insight

Umm, are you planning on making LLVM support NativeClient-style code
generation? Unless I'm mistaken, general inline asm will make your
program fail validation by the NativeClient sandbox.

Also, I don't see the point; why can't you just put the
performance-sensitive function in its own file, compile it to a .s
file, and add that to your project?

-Eli

> Umm, are you planning on making LLVM support NativeClient-style code
> generation? Unless I'm mistaken, general inline asm will make your
> program fail validation by the NativeClient sandbox.

I don't understand much about general x86 assembly, so the following
reasoning may be complete and utter boll^H^H^H^Hrubbish. My
understanding was that NaCl requires three things: inserting NOP
padding at various points to ensure that you can't generate dangerous
instructions by jumping into the middle of an instruction, modifying
the calling convention, and prohibiting certain dangerous
instructions that I didn't think ordinary computation would result in
anyway. So I'd assumed that the modified g++ would insert padding
bytes into inline assembly, since missing padding was the only thing
I could see that would stop the native client compiler from
outputting a verifiable binary. But I ought to ask on the NaCl list
whether that understanding is correct.

> Also, I don't see the point; why can't you just put the
> performance-sensitive function in its own file, compile it to a .s
> file, and add that to your project?

Here again comes a possible weak link in my understanding: presumably
even a C-style function in C++ has to be name mangled (as you don't
know if there's another overload in a different translation unit),
and I wasn't sure if the details of that were guaranteed to be
compatible between nearby compiler releases. (I only really know the
rule of thumb "compile everything C++ with the same compiler version
or expect trouble", but thinking about it, the mangling scheme can't
be changing that rapidly.) Just keeping all the uncontroversial,
performance-intensive code separate from the non-performance-intensive
code that uses C++ functionality, and compiling the former to .s
files, may be the best solution.
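
One way to sidestep the mangling question for such kernels is to give them C linkage. The following is a minimal sketch under that assumption; the names dot_kernel and norm_squared and the file split in the comments are hypothetical. A function declared extern "C" is exported under its plain, unmangled name, so a .s file generated by one compiler can be linked against callers built by another as long as both follow the same C calling convention.

```cpp
#include <cassert>
#include <cstddef>

// kernel.h (hypothetical): C linkage, so the exported symbol is the plain
// name "dot_kernel" rather than a mangled C++ name.
extern "C" double dot_kernel(const double* a, const double* b, std::size_t n);

// kernel.cpp (hypothetical): the performance-sensitive part, which could be
// compiled to assembly once with, e.g.,
//   clang++ -O3 -S kernel.cpp -o kernel.s
extern "C" double dot_kernel(const double* a, const double* b, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// main.cpp (hypothetical): ordinary C++ code calls the kernel through the
// declaration only and never sees how it was generated.
double norm_squared(const double* v, std::size_t n) {
    return dot_kernel(v, v, n);
}

// Small self-check of the call path.
inline void check_kernel_call() {
    const double v[] = {1.0, 2.0, 3.0};
    assert(norm_squared(v, 3) == 14.0);
}
```

The cost is that extern "C" functions cannot be overloaded, which fits the "scalars and pointers only" style of function described earlier.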

Thanks for the thoughts,

Just to correct something I misremembered from when I read the
NativeClient paper: it seems that what happens is that certain kinds
of jumps (not the kind used in, e.g., a "do { } while ()" loop that
might occur in numeric code) are forced to target addresses that are
multiples of 32 (which presumably requires strategic insertion of
NOPs).
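
The "multiples of 32" arithmetic can be sketched as follows. This only illustrates the address calculation, assuming the 32-byte figure from the message above; the helper names are hypothetical and this is not NaCl's actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bundle size, taken from the "multiples of 32" in the text.
constexpr std::uintptr_t kBundleSize = 32;

// Round an address down to a multiple of 32 by clearing the low five bits.
constexpr std::uintptr_t mask_to_bundle(std::uintptr_t addr) {
    return addr & ~(kBundleSize - 1);
}

// True when an address is already a multiple of 32.
constexpr bool is_bundle_aligned(std::uintptr_t addr) {
    return (addr & (kBundleSize - 1)) == 0;
}

// Number of one-byte NOPs needed so that the next instruction starts on a
// multiple of 32 (zero when the address is already aligned).
constexpr std::uintptr_t padding_to_next_bundle(std::uintptr_t addr) {
    return (kBundleSize - (addr & (kBundleSize - 1))) & (kBundleSize - 1);
}

static_assert(mask_to_bundle(0x1234) == 0x1220, "low five bits cleared");
static_assert(padding_to_next_bundle(0x1234) == 12, "0x1234 + 12 == 0x1240");

// Runtime version of the same checks.
inline void check_bundle_math() {
    assert(is_bundle_aligned(0x1234 + padding_to_next_bundle(0x1234)));
}
```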