Passing and returning aggregates (who is responsible for the ABI?)

Hello,

I'm trying to port the XL compiler (http://xlr.sf.net) to use the LLVM back-end. So far, little trouble doing so. But there is one aspect of the semantics of the LLVM IR that surprises me. Why are the call, declare and define "halfway through" ABI conventions?

I think it's the right thing to have a single high-level node for each call, as opposed to separate instructions for pushing individual arguments, for example. But that implies that the call semantics include a good dose of ABI and calling conventions. This is explicit in the fact that you specify the calling convention (e.g. ccc, fastcc).

But then, why refuse aggregates as input or output of a call? What is the rationale? On x86, I think it does not make any difference. But for Itanium, it's clearly broken (e.g. Itanium can return a struct of up to 4 ints in registers, and packs input parameters in a "funny" way). Languages such as Ada or XL have output parameters, and they are similarly difficult to generate code for (you have to make it look like C).

I don't think adding aggregate support would break any current IR producer, and assuming the aggregates are expanded early on, it would probably have a very localized impact on the code. Are there other good reasons not to add this capability, or would a patch adding it stand a good chance of being accepted?

Thanks
Christophe

I'm trying to port the XL compiler (http://xlr.sf.net) to use the LLVM back-end. So far, little trouble doing so. But there is one aspect of the semantics of the LLVM IR that surprises me. Why are the call, declare and define "halfway through" ABI conventions?

I think it's the right thing to have a single high-level node for each call, as opposed to separate instructions for pushing individual arguments, for example. But that implies that the call semantics include a good dose of ABI and calling conventions. This is explicit in the fact that you specify the calling convention (e.g. ccc, fastcc).

But then, why refuse aggregates as input or output of a call? What is the rationale?

Probably in good part because, in LLVM, aggregates (or derived types) exist only in memory, not in registers.
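
To make that concrete, here is a rough sketch (just one possible lowering, with made-up names) of what a front end emits today for a function that conceptually takes and returns a {double, double}: the aggregate stays in memory and only pointers cross the call.

  ; Conceptually:  {double, double} @f({double, double})
  ; What is actually emitted: the caller allocates stack slots for the
  ; result and the argument, and only pointers are passed.
  declare void @f({double, double}*, {double, double}*)

  ; Caller side (fragment): the aggregate never leaves memory.
  %res = alloca {double, double}
  %arg = alloca {double, double}
  ; ... initialize %arg with stores ...
  call void @f({double, double}* %res, {double, double}* %arg)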

On x86, I think it does not make any difference. But for Itanium, it's clearly broken (e.g. Itanium can return a struct of up to 4 ints in registers, and packs input parameters in a "funny" way). Languages such as Ada or XL have output parameters, and they are similarly difficult to generate code for (you have to make it look like C).

I don't think adding aggregate support would break any current IR producer, and assuming the aggregates are expanded early on, it would probably have a very localized impact on the code. Are there other good reasons not to add this capability, or would a patch adding it stand a good chance of being accepted?

Chris has some notes about how to do this for return values here:

http://www.nondot.org/sabre/LLVMNotes/MultipleReturnValues.txt

— Gordon

I'm trying to port the XL compiler (http://xlr.sf.net) to use the
LLVM back-end. So far, little trouble doing so. But there is one
aspect of the semantics of the LLVM IR that surprises me. Why are the
call, declare and define "halfway through" ABI conventions?

Hrm?

I think it's the right thing to have a single high-level node for
each call, as opposed to separate instructions for pushing individual
arguments, for example. But that implies that the call semantics
include a good dose of ABI and calling conventions. This is explicit
in the fact that you specify the calling convention (e.g. ccc,
fastcc).

Right.

But then, why refuse aggregates as input or output of a call? What is
the rationale?

Because LLVM has no notion of aggregates as "values" that can be passed around as atomic units. This is a very important design point, and it has many useful properties.

On x86, I think it does not make any difference. But
for Itanium, it's clearly broken (e.g. Itanium can return a struct of
up to 4 ints in registers, and packs input parameters in a "funny"
way). Languages such as Ada or XL have output parameters, and they
are similarly difficult to generate code for (you have to make it
look like C).

I don't think adding aggregate support would break any current IR
producer, and assuming the aggregates are expanded early on, it would
probably have a very localized impact on the code. Are there other good
reasons not to add this capability, or would a patch adding it stand
a good chance of being accepted?

Unfortunately, this wouldn't solve the problem that you think it does. For example, let's assume that LLVM allowed you to pass and return structs by value. Even with this, LLVM would not be able to directly implement all ABIs "naturally". For example, some ABIs specify that a _Complex double should be returned in two FP registers, but that a struct with two doubles in it should be returned in memory.

By the time you lower to LLVM, all you have is {double,double}. In fact, there is no way, in general, to retain all the high level information in LLVM without flavoring the LLVM IR with target info.
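
To make that concrete, here is a sketch of mine (hypothetical IR, assuming the by-value extension under discussion, with made-up function names): once both C types have been lowered, they collapse to the same IR type, so the code generator alone cannot tell which return convention to use.

  ; C level:
  ;   _Complex double cadd(void);            ABI: returned in two FP registers
  ;   struct { double re, im; } sadd(void);  ABI: returned in memory
  ; Hypothetical by-value IR for both is identical:
  declare {double, double} @cadd()
  declare {double, double} @sadd()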

-Chris

But then, why refuse aggregates as input or output of a call? What is
the rationale?

Because LLVM has no notion of aggregates as "values" that can be
passed around as atomic units. This is a very important design point,
and it has many useful properties.

I see. You explained one of them in a message on the XL mailing list, which I think is worth repeating here:

This doesn't fit naturally with the way that LLVM does things: In
LLVM, each instruction can produce at most one value. This means that
a pointer to the instruction is as good as a pointer to the value,
which dramatically simplifies the IR and everything that consumes or
produces it.

An additional constraint you did not mention is that all the values must be first-class. But what is "first class" actually depends on the hardware and ABI. An i64, for instance, is first class on 64-bit CPUs, but not on 32-bit CPUs. Is the following legal on a 32-bit target?

  declare i64 @foo(i128, i256)

  The "getaggregatevalue" is a localized hack to work
around this for the few cases that return multiple values.

As a matter of fact, what annoys me the most with the getaggregatevalue proposal is precisely that it does not seem too localized to me. What about:

    %Agg = call {int, float} %foo()
    %intpart = getaggregatevalue {int, float} %Agg, uint 0
    [insert 200 instructions here]
    %floatpart = getaggregatevalue {int, float} %Agg, uint 1

What about a downstream IR manipulation turning that into:

    %Agg = call {int, float} %foo()
    %intpart = getaggregatevalue {int, float} %Agg, uint 0
    br label %somewhere
somewhere:
    %floatpart = getaggregatevalue {int, float} %Agg, uint 1

I am afraid that the hack would not remain localized for too long :wink: In other words, you would probably need extra machinery to keep the call and the getaggregatevalue instructions close together.

Unfortunately, this wouldn't solve the problem that you think it
does. For example, let's assume that LLVM allowed you to pass and
return structs by value. Even with this, LLVM would not be able to
directly implement all ABIs "naturally". For example, some ABIs
specify that a _Complex double should be returned in two FP registers,
but that a struct with two doubles in it should be returned in memory.

Even today, that must be special cased, i.e. the IR needs to be distinct between the two cases. As I understand it, the following is already legal, since vectors are first class:

  declare <2 x double> @builtin_complex_add (<2 x double>, <2 x double>)

That would be the built-in complex type. The user-defined complex-in-struct type could be one of the following depending on the ABI:

  declare void @user_complex_add (double, double, double, double, {double, double} *)
  declare void @user_complex_add ({double, double} *, double, double, double, double)
  declare void @user_complex_add ({double, double} *, {double, double} *, {double, double} *)

My proposal would not invalidate any of these, but allow the following, which would immediately be expanded to the appropriate choice of the above depending on the target calling conventions:

  declare {double, double} @user_complex_add({double, double}, {double, double})

It's possible that you want to allow some parameter attributes, i.e. be able to distinguish:

  declare sret {double, double} @user_complex_add({double, double}, {double, double})
  declare inreg {double, double} @user_complex_add({double, double}, {double, double})

By the time you lower to LLVM, all you have is {double,double}. In
fact, there is no way, in general, to retain all the high level
information in LLVM without flavoring the LLVM IR with target info

Agreed.

Anyway, for the moment, I will generate what LLVM accepts as input.

Thanks
Christophe

But then, why refuse aggregates as input or output of a call? What
is the rationale?

Probably in good part because, in LLVM, aggregates (or derived types)
exist only in memory, not in registers.

Thanks, that's precisely where I see a problem. On many recent architectures (Itanium being the extreme case), small enough aggregates are passed and held in registers. Thinking or designing in terms of "aggregates == memory" is an obsolete approach :wink: I like the "call" instruction because, at least, it got rid of the "arguments == push to stack" approach you find in the Java or MSIL bytecodes...

As an aside, why do I care? I wanted XL to be efficient on modern architectures, so I got rid of "implicit memory accesses" as much as I could, e.g. no "this pointer". At one point, I compiled a simple program manipulating complex numbers to draw a Julia set. At the lowest level of optimization, the XL version was at least 70% faster than the C++ version.

Why? Because the user-defined complex operations in XL were all done in registers, whereas at that level of optimization, the C++ compiler was not doing the memory aliasing analysis required to perform "register field promotion", eliminate the "this pointer", and turn the C++ complex class into registers. In other words, a complex addition was 4 loads, 2 FP adds, and 2 stores for C++, as opposed to only the FP adds for XL. Obviously, an IR assuming that aggregates are in memory does not help here.
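
To put that in IR terms, here is a simplified sketch (mine, not actual compiler output; the pointer names are made up) of the two code shapes for a single complex addition:

  ; Memory-bound shape (roughly what the C++ came down to at the lowest
  ; optimization level): 4 loads, 2 FP adds, 2 stores.
  %a.re = load double* %a.re.ptr
  %a.im = load double* %a.im.ptr
  %b.re = load double* %b.re.ptr
  %b.im = load double* %b.im.ptr
  %r.re = add double %a.re, %b.re
  %r.im = add double %a.im, %b.im
  store double %r.re, double* %r.re.ptr
  store double %r.im, double* %r.im.ptr

  ; Register shape (what XL generated): just the two FP adds, with every
  ; operand and result living in a register.
  %sum.re = add double %x.re, %y.re
  %sum.im = add double %x.im, %y.im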

Chris has some notes about how to do this for return values here:

http://www.nondot.org/sabre/LLVMNotes/MultipleReturnValues.txt

He pointed me to this earlier, thanks.

Thanks,
Christophe

This doesn't fit naturally with the way that LLVM does things: In
LLVM, each instruction can produce at most one value. This means that
a pointer to the instruction is as good as a pointer to the value,
which dramatically simplifies the IR and everything that consumes or
produces it.

An additional constraint you did not mention is that all the values
must be first-class. But what is "first class" actually depends on
the hardware and ABI. An i64, for instance, is first class on 64-bit
CPUs, but not on 32-bit CPUs. Is the following legal on a 32-bit target?

  declare i64 @foo(i128, i256)

Yes, it is. LLVM explicitly defines what it considers to be a first-class type:
http://llvm.org/docs/LangRef.html#t_classifications

The language definition is orthogonal to how the language is mapped onto any particular hardware.
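
For instance, the following is valid IR regardless of the target word size (a trivial example of mine); the code generator is then responsible for breaking the wide operation into word-sized pieces.

  ; Legal on 32-bit and 64-bit targets alike; on a 32-bit machine the
  ; backend expands this into a chain of 32-bit adds with carries.
  define i128 @add128(i128 %a, i128 %b) {
    %sum = add i128 %a, %b
    ret i128 %sum
  }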

The "getaggregatevalue" is a localized hack to work
around this for the few cases that return multiple values.

As a matter of fact, what annoys me the most with the
getaggregatevalue proposal is precisely that it does not seem too
localized to me. What about:

   %Agg = call {int, float} %foo()
   %intpart = getaggregatevalue {int, float} %Agg, uint 0
   [insert 200 instructions here]
   %floatpart = getaggregatevalue {int, float} %Agg, uint 1

It is localized in the sense that it adds one feature to the IR, which will require very minor changes to optimizers and other pieces of the compiler. I don't mean localized in the code layout sense.

Unfortunately, this wouldn't solve the problem that you think it
does. For example, let's assume that LLVM allowed you to pass and
return structs by value. Even with this, LLVM would not be able to
directly implement all ABIs "naturally". For example, some ABIs
specify that a _Complex double should be returned in two FP registers,
but that a struct with two doubles in it should be returned in memory.

Even today, that must be special cased, i.e. the IR needs to be
distinct between the two cases. As I understand it, the following is
already legal, since vectors are first class:

  declare <2 x double> @builtin_complex_add (<2 x double>, <2 x double>)

Complex != vectors.

You are right though that the front-end has to be aware of this. I think that darwin/ppc specifies that _Complex float is returned in two integer GPRs (!!), whereas a struct of two floats is returned by hidden reference.

llvm-gcc lowers the former to return an "i64" value, which it knows gets mapped onto two GPRs. The latter is lowered to pass a pointer to the return value as the first argument.
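
In IR terms, the two lowerings look roughly like this (my reconstruction, not a verbatim llvm-gcc dump; the function names are made up):

  ; _Complex float on darwin/ppc: returned in two GPRs, so llvm-gcc models
  ; the return value as a single i64.
  declare i64 @returns_complex_float()

  ; struct of two floats: returned by hidden reference, so a pointer to the
  ; result slot is passed as the first argument.
  declare void @returns_float_struct({float, float}*)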

My point about ABIs is that there is no way to prevent the front-end from having to know this kind of magic without teaching LLVM about the full C type system (and that of every other language that targets it), something we don't want to do.

-Chris

Probably in good part because, in LLVM, aggregates (or derived types)
exist only in memory, not in registers.

Thanks, that's precisely where I see a problem. On many recent
architectures (Itanium being the extreme case), small enough
aggregates are passed and held in registers. Thinking or designing in
terms of "aggregates == memory" is an obsolete approach :wink: I like the
"call" instruction because, at least, it got rid of the "arguments ==
push to stack" approach you find in the Java or MSIL bytecodes...

Sure. However, an IR is an abstraction layer; it doesn't necessarily specify how it gets mapped onto the hardware. Also, a variety of optimizations kick in to improve the code in various ways. For example, LLVM contains a "scalar replacement of aggregates" pass, which breaks up aggregates in memory into registers when possible. This is particularly important for C++ code, which uses lots of small aggregates.
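
Here is a small sketch of what that pass does (my example, not taken from the pass itself):

  ; Before scalar replacement: a small aggregate lives in a stack slot and
  ; every field access is a load or a store.
  %c = alloca {double, double}
  %re.ptr = getelementptr {double, double}* %c, i32 0, i32 0
  store double 1.0, double* %re.ptr
  %re = load double* %re.ptr

  ; After scalar replacement (conceptually): the fields become independent
  ; scalar values, the loads and stores disappear, and %re is simply the
  ; value that was stored.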

If you have large aggregates, it is almost always better to put them in memory than in registers.

As an aside, why do I care? I wanted XL to be efficient on modern
architectures, so I got rid of "implicit memory accesses" as much as
I could, e.g. no "this pointer". At one point, I compiled a simple
program manipulating complex numbers to draw a Julia set. At the
lowest level of optimization, the XL version was at least 70% faster
than the C++ version.

Have you tried compiling the C++ version with llvm-gcc? :slight_smile: The complex number should certainly be promoted to live in FP registers.

Why? Because the user-defined complex operations in XL were all done
in registers, whereas at that level of optimization, the C++ compiler
was not doing the memory aliasing analysis required to perform
"register field promotion", elimintate the "this pointer", and turn
the C++ complex class into registers. In other words, a complex
addition was 4 loads, two fp adds, and 2 stores for C++, as opposed
to only the fp adds for XL. Obviously, an IR assuming that aggregates
are in memory does not help here.

LLVM is designed to do these sorts of things, and it is very good at it. :slight_smile:

The only significant current problem is when you have aggregates that are passed or returned through function calls. In this case (assuming the call is not inlined) the optimizer is not able to promote the value from memory into registers. This is why we want to extend LLVM to support this in a first-class way.

-Chris