While debugging an out-of-tree (OOT) issue with masked memory intrinsics I came across lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp, where bitcasts of the following form are introduced:
%scalar_mask = bitcast <8 x i1> %interleaved.mask to i8
That is, when emulating masked stores on a machine that lacks hardware support, the <8 x i1> mask vector is bitcast to an i8 scalar type. The problem is that this appears to yield different results for big-endian and little-endian targets.
AFAIK, in general LLVM IR vectors are laid out in memory with the first element at the lowest address (i.e. independent of endianness), but for the i1 type (and possibly all sub-byte-sized types) there seems to be a dependence on target endianness.
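To make the observed behaviour concrete, here is a small Python model (not LLVM code) of how the bitcast appears to pack the eight i1 lanes into an i8, based on the llc experiments below: on little-endian targets lane 0 lands in the least significant bit, on big-endian targets lane 0 lands in the most significant bit.

```python
def pack_mask(elems, big_endian):
    """Model of 'bitcast <8 x i1> to i8' as observed on LE/BE targets."""
    assert len(elems) == 8
    result = 0
    for i, e in enumerate(elems):
        # LE: lane i -> bit i; BE: lane i -> bit (7 - i)
        bit = (7 - i) if big_endian else i
        result |= (1 if e else 0) << bit
    return result

mask = [True] + [False] * 7  # lane 0 set, as in @foo below
print(hex(pack_mask(mask, big_endian=False)))  # 0x1  (lsb set)
print(hex(pack_mask(mask, big_endian=True)))   # 0x80 (msb set)
```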
For example:
define i8 @foo() {
entry:
%v = insertelement <8 x i1> zeroinitializer, i1 true, i8 0
%bc = bitcast <8 x i1> %v to i8
ret i8 %bc
}
$ llc -O3 bitcast.ll --mtriple arm -o - # lsb is set in scalar
$ llc -O3 bitcast.ll --mtriple armeb -o - # msb is set in scalar
with similar results for MIPS (big-endian) and amd64 (little-endian).
Now, for ScalarizeMaskedMemIntrin.cpp this must surely be a problem, since the mask gets reversed for big-endian targets. I tried addressing this by compensating for endianness when, later in the pass, checking the individual bits of the scalar. This compensation seemed to work well for our big-endian target, but rather surprisingly (to me) ARM-specific lit tests then started failing.
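The compensation I tried can be sketched roughly like this (a hypothetical Python model of the bit indexing, not the actual C++ from the pass): when testing whether lane i of the scalarized mask is set, index the bit from the opposite end on big-endian targets.

```python
def lane_is_set(scalar_mask, i, num_lanes, big_endian):
    """Test lane i of a mask that was bitcast to a scalar.

    On big-endian targets lane 0 appears to end up in the most
    significant bit, so index from the opposite end there.
    """
    bit = (num_lanes - 1 - i) if big_endian else i
    return (scalar_mask >> bit) & 1 == 1

# Lane 0 set: scalar is 0x01 on LE, 0x80 on BE.
print(lane_is_set(0x01, 0, 8, big_endian=False))  # True
print(lane_is_set(0x80, 0, 8, big_endian=True))   # True
```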
1. Is a bitcast <8 x i1> %v to i8 well defined, and if so, is the result supposed to depend on target endianness?
2. Is ScalarizeMaskedMemIntrin.cpp broken for big-endian targets?
3. If ScalarizeMaskedMemIntrin.cpp is broken for big-endian targets, then aren't the three lit tests also broken, since they break when I try to fix the alleged brokenness of ScalarizeMaskedMemIntrin.cpp?
From: llvm-dev <llvm-dev-bounces@lists.llvm.org> On Behalf Of Markus Lavin via llvm-dev
Sent: 11 January 2021 11:21
To: llvm-dev@lists.llvm.org
Subject: [llvm-dev] bitcast <8 x i1> to i8 - dependence on endianness?
It would be nice if someone from ARM could acknowledge that the codegen actually is faulty for big-endian now (all I know is that David Green has made lots of changes to those test cases in the past, according to git log, but anyone with MVE knowledge could perhaps look at it).
@markus: Could you help out locating the functions that we think are wrong in those tests? Maybe even upload your fixes in ScalarizeMaskedMemIntrin.cpp to Phabricator to show the differences, both to the LLVM code and to the new codegen for those test cases?
Sorry - yes. The ARM/MVE tests are correct as-is, in that they produce the correct output under big-endian as far as I can tell (the aligned test not being scalarized produces the same output as the unaligned case that is). When MVE is enabled, the backend assumes that low lanes end up in low bits of the predicate mask. So the two cancel each other out and we happen to end up with the correct code.
Apparently this is different from the rest of LLVM, which assumes the opposite for non-byte-sized vectors? That is surprising; we even have some instructions under MVE for storing predicates which, under big-endian, assume the low lane is in the low bits. I would not be surprised if this were causing problems somewhere under big-endian though; it does not get nearly as much use as little-endian.
Yeah, if you can upload a Phabricator review for the changes in the expansion of masked intrinsics, I can take a look into the MVE codegen and see if I can get it to store in the opposite order sensibly. I have not looked at what that would take yet, but I'm hoping it's not too difficult.
I don't know, but it seems unlikely that this pass can work correctly for general big-endian, considering:
1. LLVM IR has first-class vector types, and in LLVM IR the zero'th element of a vector resides at the lowest memory address (from https://llvm.org/docs/BigEndianNEON.html).
2. The result of the bitcast <8 x i1> to i8 experiments in my previous post.
I don't necessarily trust that overly-wordy page on NEON, but ppc64le's vector support in standard C (AltiVec) was big-endian, and was switched to little-endian in LLVM IR so that LLVM IR could be consistently little-endian. I don't think there are any big-endian vector representations (and I don't believe there are, for indeed "The books of the big-Endians have been long forbidden." - Lemuel Gulliver, Gulliver's Travels to Several Remote Nations of the World).
I have also extensively used bitcast <8 x i1> to i8 and similar bitcasts.