Endianness for multi-word types

Hi,

I’m recently trying to investigate ppc_fp128 related problem. Here is a minimal C++ test case that seems wrongly compiled:

long double id(long double a) {
return a;
}
bool f(long double x) {
return id(__builtin_fabsl(x)) >= 0;
}
int main() {
if (f(-123.l)) {
return 0;
}
return 1;
}

The program compiled with command:
clang++ -static -target powerpc64le-linux-gnu bad.cc

Returns 1 on a ppc machine.

I did some investigation, it turns out in lib/CodeGen/SelectionDAG/DAGCombiner.cpp, a combination is wrongly assuming the endianness for the i128 bitcasted from a ppc_fp128 (two doubles bit-concatenated together):
// fold (bitconvert (fabs x)) → (and (bitconvert x), (not signbit))

This reveals that endianness conversion concerning ppc_fp128 or i128, and big-endian or little-endian target is too confusing to me.

I can certainly fix this case by using a “smarter signbit” (e.g. 0x00000000000000001000000000000000), but this seems more confusing. I wonder if anyone has the best background knowledge here?

Background I found: http://llvm.org/klaus/llvm/commit/1ef2cec146099f4bf46c31b913c7f1538935a867/

Thanks!

According to http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html,
"The high-order double-precision value (the one that comes first in storage) must have the larger magnitude."

So the order of the two doubles in your fp128 is not affected by the endianness; although each individual double is.
Hope that helps,
- Gao

According to http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html,
“The high-order double-precision value (the one that comes first in storage) must have the larger magnitude.”

So the order of the two doubles in your fp128 is not affected by the endianness; although each individual double is.

Of this part it’s pretty clear, so I guess my questions are, more specifically:

  1. Why

Eeh, sorry, premature email accidentally sent :frowning:

Well I’m still trying to understand the code base, so I may get this completely wrong, but here is my understanding:

Looking at the comment of TargetLowering::hasBigEndianPartOrdering, it seems that, all “internal types” (e.g. ppc_fp128, i128) are assumed to be little-endian, and for non-ppc_fp128 external values (raw register tuples or memory), the endianness follows the data layout endianness; for ppc_fp128 external values (a pair of float registers or memory chunk), it’s always “big-endian” due to the policy.

The problem is when we bitcast from ppc_fp128 to i128, do we exchange the two registers? Today we do, because ppc_fp128 requires a mandatory big-endian representation, but at the same time it’s a i128, which is interpreted as a little-endian 128-bit integer. That’s why the combiner I previously mentioned gets confused.

If we don’t exchange the two registers, combiners are happy because they get a little-endian view of the two doubles (e.g. the most significant bit is indeed the sign bit); but if someone stores this casted i128 to memory, the two words are in little-endian, which breaks the ppc_fp128 layout policy. It also breaks the bitcast semantic: “The conversion is done as if the value had been stored to memory and read back as type ty2.”

From: "Tim Shen via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Yunzhong Gao" <yunzhong_gao@playstation.sony.com>, llvm-dev@lists.llvm.org
Sent: Monday, November 30, 2015 10:14:05 PM
Subject: Re: [llvm-dev] Endianness for multi-word types

According to
http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html ,
"The high-order double-precision value (the one that comes first in
storage) must have the larger magnitude."

So the order of the two doubles in your fp128 is not affected by the
endianness; although each individual double is.

Well I'm still trying to understand the code base, so I may get this
completely wrong, but here is my understanding:

Looking at the comment of TargetLowering::hasBigEndianPartOrdering,
it seems that, all "internal types" (e.g. ppc_fp128, i128) are
assumed to be little-endian, and for non-ppc_fp128 external values
(raw register tuples or memory), the endianness follows the data
layout endianness; for ppc_fp128 external values (a pair of float
registers or memory chunk), it's always "big-endian" due to the
policy.

Right. ppc_fp128 is special in this regard. Its layout is fixed in an Endian-independent way. This can be confusing, however, because it means that casts from i128 <-> ppc_fp128 have an Endianness-dependent result. This requires a number of special cases in various places (unfortunately).

This is why we needed to disable ppc_fp128 constant folding through bitcasts, for example. From IR/ConstantFold.cpp:

  // Handle ConstantFP input: FP -> Integral.
  if (ConstantFP *FP = dyn_cast<ConstantFP>(V)) {
    // PPC_FP128 is really the sum of two consecutive doubles, where the first
    // double is always stored first in memory, regardless of the target
    // endianness. The memory layout of i128, however, depends on the target
    // endianness, and so we can't fold this without target endianness
    // information. This should instead be handled by
    // Analysis/ConstantFolding.cpp
    if (FP->getType()->isPPC_FP128Ty())
      return nullptr;

The problem is when we bitcast from ppc_fp128 to i128, do we exchange
the two registers? Today we do, because ppc_fp128 requires a
mandatory big-endian representation, but at the same time it's a
i128, which is interpreted as a little-endian 128-bit integer.
That's why the combiner I previously mentioned gets confused.
If we don't exchange the two registers, combiners are happy because
they get a little-endian view of the two doubles (e.g. the most
significant bit is indeed the sign bit); but if someone stores this
casted i128 to memory, the two words are in little-endian, which
breaks the ppc_fp128 layout policy. It also breaks the bitcast
semantic: "The conversion is done as if the value had been stored to
memory and read back as type ty2."

If that's what's happening, I agree, that seems broken. The ppc_fp128 <-> i128 bitcast should legitimately have different results on big-Endian and little-Endian systems. Swapping the registers when we do a bitcast in some attempt to remove this effect is not self consistent. However, we seem to get this right currently:

$ cat /tmp/ld.c
long foo(long double x) {
  return *(long *) &x;
  // returns f1 on both powerpc64 and powerpc64le
}

long bar(long double x) {
  __int128 y = *(__int128 *) &x;
  return (long) y;
  // returns f2 on powerpc64 and f1 on powerpc64le
}

Given that the higher-order double is defined to be the one that comes first in storage, in the first case, foo(), regardless of Endianness, should return the higher-order double as an integer. In the second case, bar(), we're returning the lower-order (integer) part of the i128 overlaying the ppc_fp128 value. On a little-Endian system, this should be the first double stored in memory, as an integer. That's the higher-order double, and so the code should be the same as for foo(). For a big-Endian system, the lower-order part of the i128 integer is stored in the second double (in layout), and so it should return the lower-order double, as an integer. This is the case that should differ from foo(), but only on big-Endian systems.

Luckily, this is what happens (and what GCC does as well).

I did some investigation, it turns out in
lib/CodeGen/SelectionDAG/DAGCombiner.cpp, a combination is wrongly
assuming the endianness for the i128 bitcasted from a ppc_fp128 (two
doubles bit-concatenated together):
// fold (bitconvert (fabs x)) -> (and (bitconvert x), (not signbit))

This reveals that endianness conversion concerning ppc_fp128 or i128,
and big-endian or little-endian target is too confusing to me.

I can certainly fix this case by using a "smarter signbit" (e.g.
0x00000000000000001000000000000000), but this seems more confusing.
I wonder if anyone has the best background knowledge here?

And, unfortunately, this is exactly what you should do. ppc_fp128 is a special case here (again).

-Hal

From: Hal Finkel [mailto:hfinkel@anl.gov]
Sent: Tuesday, December 01, 2015 1:01 AM
To: Tim Shen
Cc: Gao, Yunzhong; llvm-dev@lists.llvm.org; Kit Barton; Nemanja Ivanovic
Subject: Re: [llvm-dev] Endianness for multi-word types

> From: "Tim Shen via llvm-dev" <llvm-dev@lists.llvm.org>
> To: "Yunzhong Gao" <yunzhong_gao@playstation.sony.com>,
> llvm-dev@lists.llvm.org
> Sent: Monday, November 30, 2015 10:14:05 PM
> Subject: Re: [llvm-dev] Endianness for multi-word types
>
>
>
> According to
> http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html , "The
> high-order double-precision value (the one that comes first in
> storage) must have the larger magnitude."
>
> So the order of the two doubles in your fp128 is not affected by the
> endianness; although each individual double is.
>
> Well I'm still trying to understand the code base, so I may get this
> completely wrong, but here is my understanding:
>
> Looking at the comment of TargetLowering::hasBigEndianPartOrdering,
> it seems that, all "internal types" (e.g. ppc_fp128, i128) are assumed
> to be little-endian, and for non-ppc_fp128 external values (raw
> register tuples or memory), the endianness follows the data layout
> endianness; for ppc_fp128 external values (a pair of float registers
> or memory chunk), it's always "big-endian" due to the policy.
>

Right. ppc_fp128 is special in this regard. Its layout is fixed in an Endian-
independent way. This can be confusing, however, because it means that
casts from i128 <-> ppc_fp128 have an Endianness-dependent result. This
requires a number of special cases in various places (unfortunately).

This is why we needed to disable ppc_fp128 constant folding through bitcasts,
for example. From IR/ConstantFold.cpp:

  // Handle ConstantFP input: FP -> Integral.
  if (ConstantFP *FP = dyn_cast<ConstantFP>(V)) {
    // PPC_FP128 is really the sum of two consecutive doubles, where the first
    // double is always stored first in memory, regardless of the target
    // endianness. The memory layout of i128, however, depends on the
target
    // endianness, and so we can't fold this without target endianness
    // information. This should instead be handled by
    // Analysis/ConstantFolding.cpp
    if (FP->getType()->isPPC_FP128Ty())
      return nullptr;

> The problem is when we bitcast from ppc_fp128 to i128, do we exchange
> the two registers? Today we do, because ppc_fp128 requires a mandatory
> big-endian representation, but at the same time it's a i128, which is
> interpreted as a little-endian 128-bit integer.
> That's why the combiner I previously mentioned gets confused.
> If we don't exchange the two registers, combiners are happy because
> they get a little-endian view of the two doubles (e.g. the most
> significant bit is indeed the sign bit); but if someone stores this
> casted i128 to memory, the two words are in little-endian, which
> breaks the ppc_fp128 layout policy. It also breaks the bitcast
> semantic: "The conversion is done as if the value had been stored to
> memory and read back as type ty2."

If that's what's happening, I agree, that seems broken. The ppc_fp128 <->
i128 bitcast should legitimately have different results on big-Endian and little-
Endian systems. Swapping the registers when we do a bitcast in some
attempt to remove this effect is not self consistent. However, we seem to
get this right currently:

$ cat /tmp/ld.c
long foo(long double x) {
  return *(long *) &x;
  // returns f1 on both powerpc64 and powerpc64le }

long bar(long double x) {
  __int128 y = *(__int128 *) &x;
  return (long) y;
  // returns f2 on powerpc64 and f1 on powerpc64le }

Given that the higher-order double is defined to be the one that comes first
in storage, in the first case, foo(), regardless of Endianness, should return the
higher-order double as an integer. In the second case, bar(), we're returning
the lower-order (integer) part of the i128 overlaying the ppc_fp128 value. On
a little-Endian system, this should be the first double stored in memory, as an
integer. That's the higher-order double, and so the code should be the same
as for foo(). For a big-Endian system, the lower-order part of the i128 integer
is stored in the second double (in layout), and so it should return the lower-
order double, as an integer. This is the case that should differ from foo(), but
only on big-Endian systems.

Luckily, this is what happens (and what GCC does as well).

What Hal said is correct, when bitcasting from ppc_fp128 to i128, you need to exchange
the two registers, otherwise you will break the test case in Hal's email.

> I did some investigation, it turns out in
> lib/CodeGen/SelectionDAG/DAGCombiner.cpp, a combination is wrongly
> assuming the endianness for the i128 bitcasted from a ppc_fp128 (two
> doubles bit-concatenated together):
> // fold (bitconvert (fabs x)) -> (and (bitconvert x), (not signbit))
>
> This reveals that endianness conversion concerning ppc_fp128 or i128,
> and big-endian or little-endian target is too confusing to me.
>
> I can certainly fix this case by using a "smarter signbit" (e.g.
> 0x00000000000000001000000000000000), but this seems more confusing.
> I wonder if anyone has the best background knowledge here?

And, unfortunately, this is exactly what you should do. ppc_fp128 is a special
case here (again).

Actually, I think the smart signbit here is not correct. For the fneg case, you need to
toggle the sign bits of both components of a ppc_fp128: negate(d1 + d2) should
produce (-d1 - d2) instead of (-d1 + d2). It probably indicates that this pass is wrong
for both big-endian and little-endian ppc64 targets, but existing test cases were not strong
enough to catch such cases. For the fabs case, you can test positivity using only one
signbit, but to negate the number, you still need to toggle both sign bits.

I have a feeling that you will run into more cases in the DAG combiner where you want
to add special handling codes for ppc_fp128, and I also feel that this type is peculiar
enough that other backend owners upon trying to add more DAG combiner passes are
likely to miss adding the special handling codes as they probably should. In that case it
would be a better idea to handle the ppc_fp128 DAG combiner cases somewhere under
llvm::PPCTargetLowering::PerformDAGCombine() instead of in the target-independent
DAGCombiner.cpp,
- Gao

Actually, I think the smart signbit here is not correct. For the fneg case, you need to

toggle the sign bits of both components of a ppc_fp128: negate(d1 + d2) should
produce (-d1 - d2) instead of (-d1 + d2). It probably indicates that this pass is wrong
for both big-endian and little-endian ppc64 targets, but existing test cases were not strong
enough to catch such cases. For the fabs case, you can test positivity using only one
signbit, but to negate the number, you still need to toggle both sign bits.

Sure, I didn’t pay much attention on crafting the signbit.

I have a feeling that you will run into more cases in the DAG combiner where you want
to add special handling codes for ppc_fp128, and I also feel that this type is peculiar
enough that other backend owners upon trying to add more DAG combiner passes are
likely to miss adding the special handling codes as they probably should.

This is what I was worrying about.

In that case it
would be a better idea to handle the ppc_fp128 DAG combiner cases somewhere under
llvm::PPCTargetLowering::PerformDAGCombine() instead of in the target-independent
DAGCombiner.cpp,

It will work, but it seems doubling the maintenance for floating point combiners. I wonder if there is any better solution to make both combiners and endianness less painful - the problem sounds not as hard as what we have now.

As a simple solution, when see a LLVM IR bitcast, instead of generating (ISD::BITCAST x), can we generate (exchange_hi_lo (ISD::BITCAST x)) instead? ISD::BITCAST itself doesn’t swap the registers anymore. In such case:

(exchange_hi_lo (ISD::BITCAST (fabs x)) is transformed to:
(exchange_hi_lo (and (ISD::BITCAST x) (not signbit)), which is correct. I’m not sure if exchange_hi_lo exists or not.

Generate this for only when x is ppcf128, of course.

An LLVM bitcast is defined to be equivalent to a store/load pair.
Changing that for ISD::BITCAST would be very surprising, and I
wouldn't recommend it. It's a very useful invariant for reasoning
about what should happen.

When we had to work around similar endian nightmares in ARM I think we
ended up creating AArch64ISD::NVCAST to represent a true nop cast.

Cheers.

Tim.

This works as well, so:
(bitcast (fabs x)) → (exchange_hi_lo (and (nvcast x) (not signbit))).

Should I move AArch64ISD::NVCAST to ISD?

From: "Tim Northover" <t.p.northover@gmail.com>
To: "Tim Shen" <timshen@google.com>
Cc: "Yunzhong Gao" <yunzhong_gao@playstation.sony.com>, "Hal Finkel" <hfinkel@anl.gov>, llvm-dev@lists.llvm.org,
"Nemanja Ivanovic" <nemanja.i.ibm@gmail.com>
Sent: Tuesday, December 1, 2015 4:00:15 PM
Subject: Re: [llvm-dev] Endianness for multi-word types

> As a simple solution, when see a LLVM IR bitcast, instead of
> generating
> (ISD::BITCAST x), can we generate (exchange_hi_lo (ISD::BITCAST x))
> instead?

An LLVM bitcast is defined to be equivalent to a store/load pair.
Changing that for ISD::BITCAST would be very surprising, and I
wouldn't recommend it. It's a very useful invariant for reasoning
about what should happen.

Changing the semantics of ISD::BITCAST seems unfeasible.

I have a feeling that you will run into more cases in the DAG
combiner where you want
to add special handling codes for ppc_fp128, and I also feel that
this type is peculiar
enough that other backend owners upon trying to add more DAG combiner
passes are
likely to miss adding the special handling codes as they probably
should.

This is what I was worrying about.

This is a concern, but it is not new, nor unique to ppc_fp128. When writing code in DAGCombine (or any other part of the common code generator) you need to be aware of the requirements of parts of the IR used only by other targets. ppc_fp128 is not even the only custom FP type for which we have special handling (we have x86_fp80 too). While I agree that having some abstraction over these custom types could make things easier (if designed correctly), ppc_fp128 is part of the IR as it stands, and everyone needs to be aware of it when working with FP code that makes assumptions about the FP representation. Luckily, there are not actually a lot of such places.

-Hal

No; and how does this even help? As was pointed out, there are actually two sign bits to flip.

-Hal

Sorry, I should have read the “Extended precision” format. It turns out (bitcast (fabs x)) should translate to something like (pseudo code):

i128 x;

populate x…
u64 hi_signbit_mask = 0x8000000000000000;
x.lo ^= x.hi & hi_signbit_mask;

x.hi &= ~hi_signbit_mask;

return x;

…and (bitcast (fneg x)) should translate to (xor (bitcast x) 0x80000000000000008000000000000000).

These are quite different from the current generic combiners and probably deserve separate code.