LLVM IR is a compiler IR

Hi David,

I have also been trying to make a single bitcode representation that
can be lowered to efficient ABIs on all target platforms.

I came up with a new compilation strategy to achieve this goal, as follows:

C/C++ source code
------------------------------------------ using front-end compiler
Target Independent Bitcode
------------------------------------------ using translator
Target Dependent Bitcode
------------------------------------------ using opt with optimization passes
Optimized Target Dependent Bitcode
------------------------------------------ using llc
Target Assembly code

Here is a simple example using this strategy.

C/C++ source code

#include <stdio.h>

int main(void) {
    long double a = 3.14;
    long double b = 2.2;
    long double c;

    c = a + b;

    printf("c=%Lf\n", c);
    return 0;
}
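
For illustration only, here is a sketch of how the addition c = a + b
might look at the two bitcode levels. The target-independent form uses a
hypothetical abstract type ("longdouble"), since stock LLVM IR has no
such type; the target-dependent forms use real LLVM types:

Target Independent Bitcode (hypothetical notation):

%c = fadd longdouble %a, %b     ; ABI type not chosen yet

Target Dependent Bitcode (after the translator picks the ABI type):

%c = fadd x86_fp80 %a, %b       ; x86: long double is the 80-bit extended type
%c = fadd double %a, %b         ; ARM: long double is a 64-bit double

Choosing the concrete type for long double is exactly the kind of
decision the translator step above has to make.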

Hi Talin,

I'd like to add a couple of additional items to your list - first, LLVM IR isn't
stable, and it isn't backwards compatible. Bitcode is not useful as an archival
format, because a bitcode file cannot be loaded if it's even a few months out of
sync with the code that loads it. Loading a bitcode file that is years old is
hopeless.

that sounds like a bug, assuming the bitcode was produced by released versions
of LLVM (bitcode produced with some intermediate development version of LLVM may
or may not be loadable in the final release). Maybe you should open some bug
reports?

Ciao, Duncan.

Hi Chris,

Ok, I think I (now) get your drift. Let's restart the conversation, then. ;-)

If I got it right this time, from your point of view, Dan's arguments
are not accurate because the IR was never intended to be anything else
anyway. PNaCl, OpenCL, RenderScript and Java-like VMs came along over
time and tried to use the IR for what it was not designed to be. I
completely agree with you on that one.

If I'm not mistaken, the JIT was never intended to be portable, but to
test the IR at a time when back-ends were not stable enough. The IR was
never intended to cover ABI issues, a complex type system, endianness of
reads/writes, etc. In effect, the IR was never intended to be completely
language/target agnostic. Again, I completely agree on that one.

But there is a subliminal context that I don't think people
understand, and that was my point.

Communities are powerful things. They take a while to create, but once
you have one, it has a life of its own. For the community, it doesn't
matter much what the original goals of LLVM were, just what you can do
with it now. LLVM is the *only* compilation infrastructure I know of
that is flexible and extensible enough to allow people to do such
radical things in radically different ways.

In a nutshell, Chris, you are a victim of your own success. If LLVM
wasn't that flexible, people wouldn't stretch it that much, and you
wouldn't have those problems.

LLVM's community is strong, active and passionate. But the OpenCL folks
will be passionate about adding OpenCL features, and so on, and that
creates tension (and I understand the defensive position you have
always taken, protecting LLVM's integrity).

But if you read between the lines of what people are saying (and
what I heard over and over during Euro-LLVM), people are skeptical
about the portability issue. Almost everyone, including experienced
compiler engineers, comes to LLVM thinking the IR is portable. So
either we're all advertising what LLVM really is in the wrong way,
or we're not doing our jobs right.

I'm now reading David Sehr's answer (and John's reply) and I think
we're still missing the point. David listed the hacks he had to do to
make it somewhat portable. I've listed (mostly last year) the hacks we
had to do to implement the EDG bridge. Talin keeps reminding us what
he has to do to work on his compiler. James wrote his own bytecode
language on top of IR.

If LLVM IR is not portable, and never will be, that's fine. I can live
with that. But ignoring the crowd is not wise. I'm not saying you guys
should implement a higher-level IR or change the current IR to fit to
everyone's needs, that would be madness. What I'm asking is simply
that you stop ignoring the fact that there is a problem (that lots of
people want more portable IR) and that the solution is NOT to kludge
it into the current IR.

When people show hacks, don't say "excellent! problem solved". When
people ask for portability solutions, don't recommend kludges, or "do
it like clang". Encourage people to contribute to portability, and be
willing to accept such contributions even if they take a bit more work
to get through than a kludge would.

Hi Óscar,

There are places where compatibility with the native C ABI is taken too
far. For instance, some time ago I noted that what the user sets through
Module::setDataLayout is simply ignored.

it's not ignored, it's used by the IR level optimizers. That way these
optimizers can know stuff about the target without having to be linked
to a target backend.

  LLVM uses the data layout

required by the native C ABI, which is hardcoded into LLVM's source
code. So I asked: pass the value set by Module::setDataLayout to the
layers that are interested in it, as any user would expect.

There are two classes of information in the datalayout: things which
correspond to stuff hard-wired into the target processor (for example that
x86 is little endian), and stuff which is not hard-wired in (for example the
alignment of x86 long double, which is 4 or 8 bytes on x86-32 depending on
whether you are on linux, darwin or windows). Hoping to have code generators
override the hard-wired stuff if they see something different in the data
layout is just too much to ask for - e.g. the x86 code generators are never
going to produce big endian code just because you set big-endianness in the
datalayout. Even the second class of "soft" parameters is not completely
flexible: for example most processors enforce a minimum alignment for types,
and trying to reduce it by giving types a lesser alignment in the datalayout
just isn't going to work. So given that the ways in which codegen could
adapt to various datalayout settings are quite limited and constrained by
the target, does it really make sense to try to parametrize the code
generators by the datalayout at all? In any case, it might be good if the
code generators produced a warning if they see that the datalayout string
doesn't correspond to what codegen thinks it should be (I thought someone
added that already?).
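
(For illustration only, here are abbreviated datalayout strings showing
such a "soft" parameter; real target strings contain many more entries.
The f80 component gives the ABI and preferred alignment, in bits, of the
x86 80-bit long double:)

; i386 linux: x86 long double gets 4-byte alignment
target datalayout = "e-p:32:32:32-f80:32:32-n8:16:32"

; i386 darwin: the same hard-wired type gets 16-byte alignment
target datalayout = "e-p:32:32:32-f80:128:128-n8:16:32"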

Ciao, Duncan.

Hi Dan,

I read five distinct requests in your well-written remarks, which may appeal to different people:

1. How can we make LLVM more portable? As Chris later pointed out, it's hard to achieve this goal on the input side while preserving C semantics, since even C source code doesn't really have that property. On the platform front, recent discussions about "non-standard" architectures highlighted that most of the LLVM effort really revolves around x86 and ARM, and platforms that deviate from these reference points tend to be afterthoughts.

2. How can we make LLVM more stable over time? As a regular user of LLVM, I initially found the frequent changes in LLVM painful. On the other hand, that effort is not too high a price to pay if it keeps the code base fluid. It wouldn't hurt to take an approach like OpenGL, where new stuff is tested through a shared "extensions" mechanism and deprecation of old interfaces spans years. "It no longer works" is a message we see a little too often on LLVM-dev.

3. How can we clarify the specification of LLVM? In the good old Unix tradition, the source code is the documentation, and the "documentation" explains the bugs and gives simplistic examples. But standard-level specification is really hard and tends to spend an inordinate amount of time on corner cases ordinary folks don't care about. To wit: the C++ and C++ ABI standardization efforts. LLVM has the luxury of being able to just assert in the corner cases and deal with them on demand.

4. How can we address minority needs in LLVM? Being a minority here, I can only second that. I'd say that LLVM has to keep its priorities right. As someone else pointed out, one reason to pick LLVM is that it gives me interoperability with C. I'm not willing to give that up, and that means I have to learn a little bit of the C non-portable way of doing things. That being said, minorities are also the guys keeping you on your toes.

5. How can we avoid selfish kludges and self-imposed limitations in the LLVM source code base? Probably the most immediately actionable point. IMO, things tend to go in the right direction, at least in my experience. But it's always easy to lapse.

Overall, I see these not so much as technical or architectural issues. Rather, I'd say that LLVM is very "market driven", i.e. the largest communities (C and x86) tend to grab all the attention. Still, it has reached a level of maturity where even smaller teams like ours can benefit from the crumbs.

That being said, can we build a portable LLVM IR on top of the existing stuff without giving up C compatibility? I'm not sure. I would settle for a few sub-goals that may be more easily achieved, e.g. defining a subset of the IR that is exactly as portable as C, or ensuring that object layout settings default to the target but can effectively be overridden in a meaningful way (think: C++ ABI inheritance rules, HP28/HP48 internal object layout, ...).

My two bytes
Christophe

Hello Duncan.

Duncan Sands <baldrick@free.fr> writes:

There are places where compatibility with the native C ABI is taken too
far. For instance, some time ago I noted that what the user sets through
Module::setDataLayout is simply ignored.

it's not ignored, it's used by the IR level optimizers. That way these
optimizers can know stuff about the target without having to be linked
to a target backend.

Well, it is used by one layer and ignored by another. Anyway, LLVM is
not doing what the user expects.

LLVM uses the data layout
required by the native C ABI, which is hardcoded into LLVM's source
code. So I asked: pass the value set by Module::setDataLayout to the
layers that are interested in it, as any user would expect.

There are two classes of information in the datalayout: things which
correspond to stuff hard-wired into the target processor (for example that
x86 is little endian), and stuff which is not hard-wired in (for example the
alignment of x86 long double, which is 4 or 8 bytes on x86-32 depending on
whether you are on linux, darwin or windows). Hoping to have code generators
override the hard-wired stuff if they see something different in the data
layout is just too much to ask for - e.g. the x86 code generators are never
going to produce big endian code just because you set big-endianness in the
datalayout. Even the second class of "soft" parameters is not completely
flexible: for example most processors enforce a minimum alignment for types,
and trying to reduce it by giving types a lesser alignment in the datalayout
just isn't going to work. So given that the ways in which codegen could
adapt to various datalayout settings are quite limited and constrained by
the target, does it really make sense to try to parametrize the code
generators by the datalayout at all? In any case, it might be good if the
code generators produced a warning if they see that the datalayout string
doesn't correspond to what codegen thinks it should be (I thought someone
added that already?).

You focus your reasoning on possible wrong uses of the data layout
setting (endianness) when, as you say, there are other uses which are
perfectly legit (using a specific alignment within the limits allowed by
the processor.) So if I need to align my data in a different way from
what the C ABI requires, or generate code for a platform that LLVM still
does not know about, my only solution is to patch LLVM, because the value
set through one of its APIs is ignored in key places, as LLVM assumes
that everybody wants full interoperability with C. This is the kind of
logic that tells me that LLVM is a C-obsessed project: any requirement
that falls outside the needs of a C compiler writer is seen as
superfluous even if it does not conflict with the rest of LLVM.

Hi Oscar,

There are places where compatibility with the native C ABI is taken too
far. For instance, some time ago I noted that what the user sets through
Module::setDataLayout is simply ignored.

it's not ignored, it's used by the IR level optimizers. That way these
optimizers can know stuff about the target without having to be linked
to a target backend.

Well, it is used by one layer and ignored by another. Anyway, LLVM is
not doing what the user expects.

it's not doing what *you* expect: it doesn't match your mental model of what
it is for (or should be for). The question is whether LLVM should be changed
or your expectations should be changed. Just observing the mismatch between
your expectations and current reality is not in itself an argument that LLVM
should be changed.

LLVM uses the data layout
required by the native C ABI, which is hardcoded into LLVM's source
code. So I asked: pass the value set by Module::setDataLayout to the
layers that are interested in it, as any user would expect.

There are two classes of information in the datalayout: things which
correspond to stuff hard-wired into the target processor (for example that
x86 is little endian), and stuff which is not hard-wired in (for example the
alignment of x86 long double, which is 4 or 8 bytes on x86-32 depending on
whether you are on linux, darwin or windows). Hoping to have code generators
override the hard-wired stuff if they see something different in the data
layout is just too much to ask for - e.g. the x86 code generators are never
going to produce big endian code just because you set big-endianness in the
datalayout. Even the second class of "soft" parameters is not completely
flexible: for example most processors enforce a minimum alignment for types,
and trying to reduce it by giving types a lesser alignment in the datalayout
just isn't going to work. So given that the ways in which codegen could
adapt to various datalayout settings are quite limited and constrained by
the target, does it really make sense to try to parametrize the code
generators by the datalayout at all? In any case, it might be good if the
code generators produced a warning if they see that the datalayout string
doesn't correspond to what codegen thinks it should be (I thought someone
added that already?).

You focus your reasoning on possible wrong uses of the data layout
setting (endianness) when, as you say, there are other uses which are
perfectly legit (using a specific alignment within the limits allowed by
the processor.) So if I need to align my data in a different way from
what the C ABI requires, or generate code for a platform that LLVM still
does not know about, my only solution is to patch LLVM, because the value
set through one of its APIs is ignored in key places, as LLVM assumes
that everybody wants full interoperability with C. This is the kind of
logic that tells me that LLVM is a C-obsessed project: any requirement
that falls outside the needs of a C compiler writer is seen as
superfluous even if it does not conflict with the rest of LLVM.

You are talking to the wrong person: I pretty much only use Ada, not C,
so I don't think I'm C-obsessed. Yet I never had any problems using LLVM
with Ada.
LLVM gives you several mechanisms for aligning things the way you like. Are
they inadequate? Do you have a specific example of something you find
problematic?

Ciao, Duncan.

So given that the ways in which codegen could adapt to various datalayout
settings are quite limited and constrained by the target, does it really make
sense to try to parametrize the code generators by the datalayout at all?

PS: This wasn't a rhetorical question, i.e. I wasn't saying that what you are
looking for is wrong. It was a real question about the design of LLVM.

Hello Duncan.

Duncan Sands <baldrick@free.fr> writes:

it's not doing what *you* expect: it doesn't match your mental model of what
it is for (or should be for). The question is whether LLVM should be changed
or your expectations should be changed. Just observing the mismatch between
your expectations and current reality is not in itself an argument that LLVM
should be changed.

I see Module::setDataLayout, look at the documentation for the method,
see "Set the data layout" and think "gee, this is for setting the data
layout." I have no reason to think that the setting is only used in
parts of LLVM and ignored in others.

You focus your reasoning on possible wrong uses of the data layout
setting (endianness) when, as you say, there are other uses which are
perfectly legit (using a specific alignment within the limits allowed by
the processor.) So if I need to align my data in a different way from
what the C ABI requires, or generate code for a platform that LLVM still
does not know about, my only solution is to patch LLVM, because the value
set through one of its APIs is ignored in key places, as LLVM assumes
that everybody wants full interoperability with C. This is the kind of
logic that tells me that LLVM is a C-obsessed project: any requirement
that falls outside the needs of a C compiler writer is seen as
superfluous even if it does not conflict with the rest of LLVM.

You are talking to the wrong person: I pretty much only use Ada, not C,
so I don't think I'm C-obsessed. Yet I never had any problems using LLVM
with Ada.

I guess that your Ada compiler is required to be compatible with the
platform's C ABI, then.

LLVM gives you several mechanisms for aligning things the way you
like.

My problem is with the alignment of struct members. The programmer's
guide says that I can create packed structs (1-byte aligned) and

"In non-packed structs, padding between field types is inserted as
defined by the TargetData string in the module, which is required to
match what the underlying processor expects."

which is not true, unless "TargetData string" refers to what is
hard-coded into the LLVM backend, not to what is set by
Module::setDataLayout. In any case, a clarification is badly needed
both in the method's documentation and in the programmer's guide.

It is true that I could pack all structs generated by my compiler and
insert padding as necessary, but this is really undesirable.
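
(For illustration, the two struct flavours in LLVM IR; the layout
comments assume a typical datalayout that gives i32 4-byte alignment:)

; non-packed: padding is inserted per the datalayout, so this struct
; occupies 8 bytes (3 bytes of padding after the i8)
%pair = type { i8, i32 }

; packed: 1-byte aligned, no implicit padding (5 bytes); any padding
; must be added by the front-end as explicit fields
%pair.packed = type <{ i8, i32 }>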

Are
they inadequate? Do you have a specific example of something you find
problematic?

My compiler generates instructions for a virtual stack machine. It is
required to work wherever a good enough C++ compiler is available, with
no further intervention, so the arrangement of members in structures
assumes a fixed data layout which is conservative enough not to violate
any reasonable processor's rules. The next key factor is that the language
allows execution of arbitrary code at compile time (think of a Lisp-like
macro system), which includes the possibility of creating instances of
structs (not to be confused with C structs, which my implementation
accesses only through accessor functions.) Enter LLVM. As LLVM generates
code following the platform's C ABI, which is not necessarily the same as
what my compiler assumes, whenever code generated by LLVM accesses
structures instantiated by code run at compile time (which was executed
by the virtual stack machine), nasty things happen.

Dan Gohman <gohman@apple.com> writes:

In this email, I argue that LLVM IR is a poor system for building a
Platform, by which I mean any system where LLVM IR would be a
format in which programs are stored or transmitted for subsequent
use on multiple underlying architectures.

I agree with all of this. But...so what? :-) It's a compiler IR.
That's not a bad thing.

Do you want to propose some changes to make LLVM IR more target
independent? I'm sure it would be welcome, but these are not easy
problems to solve, as you know. :-)

                                -Dave

Hi Renato,

I think you're overreacting here. There is nothing about OpenCL, RenderScript,
or VMKit that requires LLVM IR to be used as a Platform, as I defined it in my
first paragraph. I'm aware that some people would like to use LLVM IR as a
Platform, and I'm saying that there are important high-level considerations
to make before doing so, and my impression is that there is little discussion of
issues I consider important.

Possibly it's too late for some though, and possibly people are getting too
caught up on the thorny ABI issues and missing my broader ideas.

Dan

Chris Lattner <clattner@apple.com> writes:

That said, I'm trying to also inject realism. C is an inherently
hostile language to try to get portability out of.

Aren't there (at least) two different aspects here? One is the source
language. C is not portable. There's nothing LLVM can do to fix that.
However, other HLLs are portable. What can LLVM do to help them?

The other aspect is the ABI. Right now LLVM pretty much assumes a
C-like ABI. That's only necessary for some C-derived languages.
Fortran, for example, has no defined ABI so the compiler is free to do
whatever it pleases. Fortran objects are not interchangeable among
compiler vendors, and users are perfectly fine with that, though we
compiler developers often are not. :-)

So to me this seems like a language issue and an ABI issue, neither one
of which need be tied directly to LLVM.

That said, I think creating a portable LLVM IR is a huge project (much
bigger than many people realize) and in the end, I don't think there's
much to be gained. A portable IR built on top of LLVM makes more sense
to me. Anyone interested in that should read up on projects like ANDF,
which produced volumes of papers on what's required.

                            -Dave

As a would-be language implementer, I see a few choices:

JVM: Severe restrictions on generated code, no structs or stack objects or pass-by-reference of stack values, heavy runtime that generated code must run inside of, strict type hierarchy, very little control over optimization.
CLR: Lags behind the JVM on non-Microsoft platforms. Also a heavy runtime. However, it has more flexibility in its generated code.
LLVM: Much, much more flexibility and optimization hooks than either JVM or CLR, but with piecemeal GC support, ABI complications, portability issues, JIT issues, and all the rest.

From what I can tell, there’s a huge gap between LLVM on the one hand and JVM/CLR on the other… and having something in that gap would allow/encourage the development of a huge array of useful languages and runtimes that don’t exist right now.

It seems far more plausible to me for LLVM to evolve into that “something” than for the JVM or CLR to do so.

Hi Dan,

I probably am. And that got in the way of highlighting the real issue.

As Kenneth just highlighted, there is a gap, and people are trying to
fill that gap. I personally prefer to use LLVM for that job, instead
of Java bytecode, and it seems many people feel the same.

So, let's separate the issues here:

1) LLVM IR is a compiler IR, nothing else, nor should it be. I agree
with David: this is not a bad thing. Case closed.

2) Has LLVM the potential to fill that gap via other routes?

Can LLVM do (one day) what you wanted it to do today? A higher level
representation for light-weight JITs, a rich type-system for complex
linkage of higher languages, special semantics for special purposes,
etc.

Today, LLVM IR is NOT portable, so why worry about the portability of
DSLs around it? OpenCL rep. can be different (in syntax and
semantics) from C++ rep., but it should be similar to RenderScript
rep. If you want to link CL and C++ reps, lower them to LLVM IR, or
use the common subset.

It seems to me like the real issue is: should LLVM IR be platform/architecture-agnostic? It seems pretty clear that it currently is not, and, as others have pointed out, I do not see this being a particular problem.

I see people comparing LLVM IR to Java/CLR bytecode, but I’m not sure that is the right comparison to be making. The way I see LLVM, it’s the lower-level, platform-specific equivalent of platform-independent Java/CLR bytecode. Some optimizations/transformations/analyses are better performed on high-level representations, and some on the low-level representations. So why must LLVM try to meet both goals? Instead, different types of front-ends can use custom intermediate representations that meet their needs, and then lower to platform-specific LLVM IR before final code emission. I’m afraid that if LLVM gets into the game of trying to be the intermediate representation for everything, then it will suffer.

There seems to be substantial confusion about this, here and elsewhere
on the internet. I personally am not proposing a new IR, or a new
project, or any new development effort here. I'm just making observations,
some of which are widely known, some of which are not, and proposing a
conclusion, for the purpose of promoting understanding.

In the paragraph where I discussed the task of an independent
implementation, I meant it as a purely hypothetical situation.

Dan

Hi Justin,

You seem to be mixing up LLVM and LLVM IR.

I think LLVM can have as many sub-projects as people want, and they
can create as many new shiny things as they want. LLVM IR, on the
other hand, has specific goals and should stick tightly to them.

As I said before, IR is what it is. But LLVM is not *just* the IR...
There is a lot more that can be done, and Polly and OpenCL are just
the beginning...

So why must LLVM try to meet both goals? Instead, different types of
front-ends can use custom intermediate representations that meet their
needs, and then lower to platform-specific LLVM IR before final code
emission. I’m afraid that if LLVM gets into the game of trying to be the
intermediate representation for everything, then it will suffer.

Hi Justin,

You seem to be mixing up LLVM and LLVM IR.

Right, sorry, I meant LLVM IR. It's not clear to me that there is any significant advantage to making LLVM IR platform/architecture-agnostic. The benefits may not outweigh the disadvantages.

I think LLVM can have as many sub-projects as people want, and they
can create as many new shiny things as they want. LLVM IR, on the
other hand, has specific goals and should stick tightly to them.

Yes, I agree 100%. I would much rather see LLVM IR stay platform-dependent, and let different higher-level representations be used for platform-agnostic work.

Now that the dust begins to settle... I'm wondering whether LLVM is for me.

I'm working on something that can be used to create software for different environments: C/C++, JVM, CLR, Parrot, etc.
I.e. one language for different environments, but not write once, run anywhere.

Now what would be the role of LLVM in such an infrastructure?
Just a backend for C/C++ linkage, and I should go and look elsewhere for JVM/CLR/whateverVM?
Should I look into LLVM subprojects? Which ones?

Regards,
Jo