64-bit MRV problem: { float, float, float } -> { double, float }

Hey everybody,

I am struggling to get rid of a problem that seems to be related to a
multi-return-value optimization:
I generate bitcode for a C++ function with llvm-g++, which is then
linked, transformed, and optimized at runtime using LLVM. The function
has quite a few parameters, including structs and struct pointers with 3
float fields.
The problem is that I require the function to preserve the number and
type of its arguments, but some optimization on 64-bit breaks up some of
the struct parameters (I suppose only those that are just loaded and/or
stored somewhere) and inserts a double and a float.
The same goes for load and store instructions of the same struct type:
in general it might be a good idea to do only two loads/stores (1
double, 1 float) instead of three (3 floats), but I don't want this to
happen.
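
To illustrate the load/store case, a minimal sketch (my own
illustration, not from the original code):

struct Test3Float { float a, b, c; };

void copy(const Test3Float* src, Test3Float* dst) {
    // Copying the 12 bytes may be lowered as one 8-byte move plus one
    // 4-byte move (1 double + 1 float) instead of three 4-byte moves.
    *dst = *src;
}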

Unfortunately, the optimization seems to happen in the front-end already
(llvm-gcc/gcc/config/i386/llvm-i386.cpp and/or
llvm-gcc/gcc/llvm-convert.cpp) - can I do something to prevent this?
One thing that works is generating code for 32-bit (-m32), but this
obviously causes other problems.

Best regards,
Ralf

Hi Ralf,

The problem is that I require the function to preserve the number and
type of its arguments, but some optimization on 64-bit breaks up some of
the struct parameters (I suppose only those that are just loaded and/or
stored somewhere) and inserts a double and a float.

this is almost certainly *not* an optimization - the llvm-g++ front-end
generates this because the x86-64 ABI requires it. There's not much that
can be done about it if you want your functions to be callable from code
compiled with a different compiler, such as gcc.

Ciao,

Duncan.

Uh, sorry, did not pay attention where I was replying ;)

Hey Duncan,

I do not understand why this behaviour is required. What is the problem
in having a function receive a single struct parameter with three floats
compared to two scalar parameters?

Source code (C++):
struct Test3Float { float a, b, c; };
void test(Test3Float param, Test3Float* result) { ... }

bitcode:
%"struct.Test3Float" = type { float, float, float }
define void @_Z4test10Test3FloatPS_(double %param.0, float %param.1, %"struct.Test3Float"* %result) { ... }

Best,
Ralf

Duncan Sands wrote:

Hi Ralf,

I do not understand why this behaviour is required. What is the problem
in having a function receive a single struct parameter with three floats
compared to two scalar parameters?

Source code (C++):
struct Test3Float { float a, b, c; };
void test(Test3Float param, Test3Float* result) { ... }

if you compile this with GCC, you will see that it too passes the first two
floats in one double register, and the remaining float in a different
register. The GCC people didn't make this behaviour up: they are following
the platform ABI specification (which you can find floating around on the
internet; the name is something like "System V Application Binary
Interface"). In order to conform to the standard, the object code produced
by LLVM also needs to pass the struct in the same way.

That said, you could imagine that in the bitcode the struct would be passed
as a struct, rather than double+float, and the code generators would take
care of squirting out the appropriate double+float machine code. Sadly this
is not the case: ABI conformance is handled in the front-end. The
fundamental reason for this is that some ABIs (e.g. the x86-64 one) specify
how parameters are passed based on type information that is available in C
but not in LLVM. For example, a complex number is passed differently to a
struct containing two floats, even though in LLVM a complex number is
exactly the same as a struct containing two floats. So if the code
generators were to take care of everything, then the entire C type system
would somehow have to be represented in LLVM bitcode, just for this.
Instead the decision was taken to require front-ends to deal with this kind
of issue. Yes, this sucks - but no-one has come up with a better solution
yet.
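
To sketch the type-information problem (an illustrative example of my
own; _Complex float is a GNU extension):

struct TwoFloats { float re, im; };

// Both parameters lower to the identical LLVM type { float, float },
// yet the C type system distinguishes them and the platform ABI may
// classify them differently, so the code generator alone cannot tell
// which passing rule to apply.
TwoFloats takes_struct(TwoFloats v) { return v; }
_Complex float takes_complex(_Complex float v) { return v; }

int main() { return 0; }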

bitcode:
%"struct.Test3Float" = type { float, float, float }
define void @_Z4test10Test3FloatPS_(double %param.0, float %param.1, %"struct.Test3Float"* %result) { ... }

Ciao,

Duncan.

Hey Duncan,

Duncan Sands wrote:

That said, you could imagine that in the bitcode the struct would be
passed as a struct, rather than double+float, and the code generators
would take care of squirting out the appropriate double+float machine
code. Sadly this is not the case: ABI conformance is handled in the
front-end.

Ah, I see.

The fundamental reason for this is that some ABIs (e.g. the x86-64 one)
specify how parameters are passed based on type information that is
available in C but not in LLVM. For example, a complex number is passed
differently to a struct containing two floats, even though in LLVM a
complex number is exactly the same as a struct containing two floats. So
if the code generators were to take care of everything, then the entire
C type system would somehow have to be represented in LLVM bitcode, just
for this. Instead the decision was taken to require front-ends to deal
with this kind of issue. Yes, this sucks - but no-one has come up with a
better solution yet.

Thanks a lot for the detailed explanation.
So I guess there is no way to tell the front-end to break ABI compliance
at exactly this point... I guess I have to figure out some other
workaround then.
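
(For illustration, one conceivable workaround - an untested sketch:
pass the struct through a pointer, since a pointer is a single argument
and is not decomposed by the ABI lowering.)

struct Test3Float { float a, b, c; };

void test(const Test3Float* param, Test3Float* result) {
    // Both parameters stay intact: two pointers in, nothing split up.
    result->a = param->a;
    result->b = param->b;
    result->c = param->c;
}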

Best,
Ralf

Hey Duncan, hey everybody else,

I just stumbled upon a problem in the latest llvm-gcc trunk that is
related to my previous problem with the 64-bit ABI and structs:

Given the following code:

struct float3 { float x, y, z; };
extern "C" void __attribute__((noinline)) test(float3 a, float3* res) {
    res->y = a.y;
}
int main(void) {
    float3 a;
    float3 res;
    test(a, &res);
}

llvm-gcc -c -emit-llvm -O3 produces this:

%struct.float3 = type { float, float, float }
define void @test(double %a.0, float %a.1, %struct.float3* nocapture %res) nounwind noinline {
entry:
  %tmp8 = bitcast double %a.0 to i64 ; <i64> [#uses=1]
  %tmp9 = zext i64 %tmp8 to i96 ; <i96> [#uses=1]
  %tmp1 = lshr i96 %tmp9, 32 ; <i96> [#uses=1]
  %tmp2 = trunc i96 %tmp1 to i32 ; <i32> [#uses=1]
  %tmp3 = bitcast i32 %tmp2 to float ; <float> [#uses=1]
  %0 = getelementptr inbounds %struct.float3* %res, i64 0, i32 1 ; <float*> [#uses=1]
  store float %tmp3, float* %0, align 4
  ret void
}
define i32 @main() nounwind {
entry:
  %res = alloca %struct.float3, align 8 ; <%struct.float3*> [#uses=1]
  call void @test(double undef, float 0.000000e+00, %struct.float3* %res) nounwind
  ret i32 0
}

The double holding the first two floats is bitcast to i64, zero-extended
to i96, shifted, truncated to i32, and bitcast back to float to recover
the former second value of the struct.
Unfortunately, in my case, LLVM seems to be unable to remove this kind
of code (in more complex functions, of course) even though it gets
inlined and optimized. I end up with functions like this one:

define void @xyz(float %aX, float %aY, float %aZ, float* noalias nocapture %resX, float* noalias nocapture %resY, float* noalias nocapture %resZ) nounwind {
entry:
  %0 = fadd float %aZ, 5.000000e-01 ; <float> [#uses=1]
  %1 = fadd float %aY, 5.000000e-01 ; <float> [#uses=1]
  %2 = fadd float %aX, 5.000000e-01 ; <float> [#uses=1]
  %tmp16.i.i = bitcast float %1 to i32 ; <i32> [#uses=1]
  %tmp17.i.i = zext i32 %tmp16.i.i to i96 ; <i96> [#uses=1]
  %tmp18.i.i = shl i96 %tmp17.i.i, 32 ; <i96> [#uses=1]
  %tmp19.i = zext i96 %tmp18.i.i to i128 ; <i128> [#uses=1]
  %tmp8.i = lshr i128 %tmp19.i, 32 ; <i128> [#uses=1]
  %tmp9.i = trunc i128 %tmp8.i to i32 ; <i32> [#uses=1]
  %tmp10.i = bitcast i32 %tmp9.i to float ; <float> [#uses=1]
  store float %2, float* %resX, align 4
  store float %tmp10.i, float* %resY, align 4
  store float %0, float* %resZ, align 4
  ret void
}

llvm-gcc 4.2-2.5 generates the following code for the same example:

define void @test(double %a.0, float %a.1, %struct.float3* nocapture %res) nounwind noinline {
entry:
  %a_addr = alloca %struct.float3, align 8 ; <%struct.float3*> [#uses=3]
  %0 = bitcast %struct.float3* %a_addr to double* ; <double*> [#uses=1]
  store double %a.0, double* %0
  %1 = getelementptr %struct.float3* %a_addr, i64 0, i32 2 ; <float*> [#uses=1]
  store float %a.1, float* %1, align 8
  %2 = getelementptr %struct.float3* %a_addr, i64 0, i32 1 ; <float*> [#uses=1]
  %3 = load float* %2, align 4 ; <float> [#uses=1]
  %4 = getelementptr %struct.float3* %res, i64 0, i32 1 ; <float*> [#uses=1]
  store float %3, float* %4, align 4
  ret void
}

Apparently, the optimizer can work better with that code, and after
inlining it all goes away as expected.

Is this change intentional?
Any ideas where that code comes from or why it cannot be removed?

Best,
Ralf

Hi Ralf,

llvm-gcc -c -emit-llvm -O3 produces this:

%struct.float3 = type { float, float, float }
define void @test(double %a.0, float %a.1, %struct.float3* nocapture %res) nounwind noinline {
entry:
  %tmp8 = bitcast double %a.0 to i64 ; <i64> [#uses=1]
  %tmp9 = zext i64 %tmp8 to i96 ; <i96> [#uses=1]
  %tmp1 = lshr i96 %tmp9, 32 ; <i96> [#uses=1]
  %tmp2 = trunc i96 %tmp1 to i32 ; <i32> [#uses=1]
  %tmp3 = bitcast i32 %tmp2 to float ; <float> [#uses=1]
  %0 = getelementptr inbounds %struct.float3* %res, i64 0, i32 1 ; <float*> [#uses=1]
  store float %tmp3, float* %0, align 4
  ret void
}

it is reasonable to expect the optimizers to turn this at least into

   %tmp8 = bitcast double %a.0 to i64
   %tmp1 = lshr i64 %tmp8, 32
   %tmp2 = trunc i64 %tmp1 to i32
   %tmp3 = bitcast i32 %tmp2 to float

define void @xyz(float %aX, float %aY, float %aZ, float* noalias nocapture %resX, float* noalias nocapture %resY, float* noalias nocapture %resZ) nounwind {
entry:
  %0 = fadd float %aZ, 5.000000e-01 ; <float> [#uses=1]
  %1 = fadd float %aY, 5.000000e-01 ; <float> [#uses=1]
  %2 = fadd float %aX, 5.000000e-01 ; <float> [#uses=1]
  %tmp16.i.i = bitcast float %1 to i32 ; <i32> [#uses=1]
  %tmp17.i.i = zext i32 %tmp16.i.i to i96 ; <i96> [#uses=1]
  %tmp18.i.i = shl i96 %tmp17.i.i, 32 ; <i96> [#uses=1]
  %tmp19.i = zext i96 %tmp18.i.i to i128 ; <i128> [#uses=1]
  %tmp8.i = lshr i128 %tmp19.i, 32 ; <i128> [#uses=1]
  %tmp9.i = trunc i128 %tmp8.i to i32 ; <i32> [#uses=1]
  %tmp10.i = bitcast i32 %tmp9.i to float ; <float> [#uses=1]
  store float %2, float* %resX, align 4
  store float %tmp10.i, float* %resY, align 4
  store float %0, float* %resZ, align 4
  ret void
}

Likewise, here it is reasonable to expect the optimizers to be able to get rid
of all the mucking around with %1, and understand that %tmp10.i is the same as
%1.
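
The equivalence is just a shl-32/lshr-32 round trip through
zero-extensions, which is the identity on the low 32 bits. A small C++
check of that reasoning (illustrative sketch; unsigned __int128 is a
GCC/Clang extension standing in for the IR's i96/i128 types):

#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
    float f = 1.0f;                      // stands in for %1
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // bitcast float to i32
    unsigned __int128 wide =
        (unsigned __int128)bits << 32;   // zext + shl ..., 32
    std::uint32_t back =
        (std::uint32_t)(wide >> 32);     // lshr ..., 32 + trunc
    assert(back == bits);                // so %tmp10.i == %1
    return 0;
}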

So I think you should open a bug report for this. Please include everything
in this email, as well as the original C for the second testcase.

Any ideas where that code comes from or why it cannot be removed?

I think it comes directly from the front-end ABI logic. The optimizers
should be able to handle this, but since they aren't handling it, they
need to be improved.

Ciao,

Duncan.

Hey Duncan,

Duncan Sands wrote:

So I think you should open a bug report for this. Please include
everything in this email, as well as the original C for the second
testcase.

I just opened a bug report (PR6194, "poor x86-64 ABI calling conv passing of struct with 3 or 4 floats")
and tried to put in some simple and reproducible test cases.
I hope this is all comprehensible enough :).

Best,
Ralf