Distribution in assembler format

One issue I've been looking at with regard to using LLVM as a compiler
backend is distribution of programs, particularly on Linux where
different distributions have different binary package formats and it
is usual to ship programs as source rather than binary; specifically,
I'm looking at the general case where the end user may not have (the
correct version of) LLVM installed, so the compiler can't simply be
run on the end user's machine.

A solution that occurs to me is to compile as far as assembler on the
programmer's machine, then ship the .s file (or a small number
thereof, one per CPU architecture) and assemble it on the user's
machine (which in most cases will have the GNU assembler installed).
It seems to me that this ought to work; are there any pitfalls I
should be aware of?

Russell Wallace wrote:

One issue I've been looking at with regard to using LLVM as a compiler
backend is distribution of programs, particularly on Linux where
different distributions have different binary package formats and it
is usual to ship programs as source rather than binary; specifically,
I'm looking at the general case where the end user may not have (the
correct version of) LLVM installed, so the compiler can't simply be
run on the end user's machine.

A solution that occurs to me is to compile as far as assembler on the
programmer's machine, then ship the .s file (or a small number
thereof, one per CPU architecture) and assemble it on the user's
machine (which in most cases will have the GNU assembler installed).
It seems to me that this ought to work; are there any pitfalls I
should be aware of?

A potential problem with this approach is that different Linux systems have different versions of header files and libraries. While the assembly code will assemble correctly, it may not link (because a native code library is missing or is of an incorrect version), or the generated code will not work properly (because some structure in a header file differs between the system used for compilation and the system on which the program is installed).

Shipping assembly code is more or less equivalent to shipping a binary. If shipping native code will work, then shipping assembly code will work (in which case, why ship assembly code instead of a native binary?).

-- John T.

Ah! Thanks, that's exactly what I needed to know (albeit was hoping
not to hear :-)).

Hmm, that actually kills every possible solution I can think of, so
unless anyone else has one, the good news is I can take the problem
off my to-do list (albeit via the 'unsolvable' route rather than the
'solved' route) :-)

Hello Russell,

Major pitfall #1:
LLVM-GCC performs certain optimizations even when all optimizations are turned off. These include endian-specific ones, so to use LLVM bitcode as a cross-architecture format you'll need to wait until Clang supports C++ fully, or just stick to C programs for now.

I've been looking forward to the day that LLVM can be used for cross-architecture development, myself.

Thanks for asking,

--Sam

I don't think I quite understand this... suppose for example you're
trying to use an LLVM-based toolchain running on an x86 PC to write
code for a device that uses an ARM processor in big endian mode, so
you tell the LLVM code generator "generate code for ARM, big
endian"... are you saying the optimizer will actually assume the
target device is little endian because the development system is
little endian (which would of course break things)?

FYI,

http://llvm.org/docs/FAQ.html#platformindependent

applies to clang just as much as llvm-gcc.

Dan

Hi again,

My point is that your code could not be written in C++ at this time, because the only complete LLVM compiler for C++ is LLVM-GCC. It will perform little-endian optimizations on your x86 box and make the resulting bitcode file not work on the ARM processor. It is possible to write an endian-agnostic bitcode file, but I don't think all current LLVM compilers support it. The FAQ also describes other problems with preprocessor macros and
other evil things that make C inherently unsuitable for exporting to
platform-independent bitcode.

I'm in the process of trying to make a compiler that generates architecture-independent bitcode, but C and C++ don't have that functionality and cannot be made cross-architecture compliant. At the moment my partner is writing a customized parser generator for LLVM. Since the bitcode it generates was originally written using LLVM-GCC, we'll have to audit the LLVM assembly code for endian discrepancies by hand before we release the full version 1.0 of LLVM-PEG.

The endian problem is not with LLVM's optimizer, the problem is with the LLVM-GCC toolchain and certain aspects of the C programming language itself.

--Sam

Ah! Yes indeed that is true, and as Dan points out, the issues
described in the FAQ also apply when you're using Clang. Indeed as far
as I can see at least some such issues necessarily arise with any
toolchain that lets you write the sort of low-level/machine dependent
code you at least sometimes want for systems/embedded programming; I
don't think it's really a deficiency of LLVM.

Hello,

Samuel Crow wrote:

It is possible to write an endian-agnostic bitcode file but I don't think all modern LLVM compilers support it.
  

Which LLVM assembler instructions are not endian-agnostic?

Hello Ivan,

The problem arises in LLVM-GCC not in LLVM Assembly directly. See http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-August/024812.html for details.

--Sam

I have seen the claim on this list numerous times, by people probably much
more knowledgeable than me, that C/C++ can't be compiled to platform
independent code.

I think this misses some subtleties.

One issue is calling conventions. There are two related issues I have seen
come up on the list. The first is that LLVM doesn't implement all the ABI (in
particular aggregate returns). The reason given for this is the second issue,
that LLVM doesn't have enough information to correctly implement the ABI. The
only example that I can recall seeing is that C complex values have a
different calling convention from structures.

Now, most code doesn't use the complex data type, and so the second issue
doesn't affect it. I'd argue that for such code, LLVM IR can be made
platform independent (with regard to calling conventions), although
LLVM won't "correctly" compile it currently.

Another major issue is data type size. I'll ignore for the moment interfacing
with external code. The C data types char, short, int, and long are all platform
dependent in size. There are also the platform-independent data types (u)int?_t,
which should clearly compile to platform-independent code. All that the C
standard guarantees about the basic types is
char <= short <= int <= long
8 <= char, 16 <= short, 16 <= int, 32 <= long (in bits)
Now, if one were to fix the sizes of the types, the resulting LLVM code would
be platform independent (although not able to call external interfaces
properly). Alternatively, if LLVM IR were to be extended with platform
dependent integer types, the generated IR would be platform independent, and
still able to interface external libraries. Given the textual representation,
C code could be compiled to platform-independent IR using undefined named
types, which could be augmented with the platform-dependent definitions of the
named type later. The same issue crops up for size_t and ptrdiff_t, which
could be solved by the ptrint/intptr proposal.

The final issue is interfacing with external code. Some people don't care
about this, as they can present a platform-independent interface to the C
code which they want to compile to LLVM.

If one does care, then this is a much trickier issue, for which there are no
generic solutions. But I think that in a lot of cases, this isn't an
insurmountable issue.

For example, libjpeg uses the preprocessor to pick the types to use, and so it
generates non-portable code. There are at least two ways around this. One
could configure libjpeg to use platform-independent types; then code compiled
against it would generate LLVM IR portable to any machine with libjpeg
similarly configured. On the other hand, libjpeg by default uses (the same)
fundamental C types on any sane modern system, so if LLVM IR could represent
those types, then the IR would again be portable.

I am not intimately familiar with the Linux Standard Base, but I get the
impression that the binary part of the interface is almost completely
specified in terms of the fundamental types plus size_t and ptrdiff_t. That is,
if those types had a portable representation, then most any program
compiled against the LSB would be portable.

My summary would be that most C/C++ code can't be compiled to portable IR, but
that if one were careful, it is possible to write C/C++ code that could
be compiled to portable IR. And if LLVM were extended with a few controversial
extensions, one could carefully write C/C++ that would compile to
portable IR and correctly interface with most libraries.

  Tom Prince