I've hit a pretty nasty issue on SKX with ANDs of masks <= 4 bits.
In the IR, we represent a 4b vector mask as <4 x i1>. This assumes
that the storage container for this type is also 4b, but it's not. The
smallest mask register on SKX is 8b. This also implies that the
smallest load/store moves 8b.
We run into problems when we try to optimize ANDs (full test case attached):
%r1 = and <4 x i1> %r0, <i1 -1, i1 -1, i1 -1, i1 -1>
At the IR level the all-ones mask looks like the identity for this
operation, so LLVM will remove it. But it is not the identity, since
this operation should clear the top 4 bits of the 8-bit hardware
register in play. E.g.
I began tracking down this issue and found that InstCombine will
incorrectly remove the AND. Then I noticed that the Reassociate pass
would also remove the AND if InstCombine did not. That made me
nervous. My current thinking is that this might be a larger problem
that shouldn't be patched up. Or maybe I made a faulty assumption with
the IR I chose for this operation.
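To make the concern concrete, here is a minimal Python sketch that models the 8-bit mask register as a plain integer holding the 4-lane mask in its low bits. The helper names (`lanes`, `and_all_ones`) and the dirty value 0xA5 are made up for illustration; none of this is LLVM API, and it only models the bit arithmetic, not any actual codegen.

```python
LANES = 4                      # logical width of <4 x i1>
LANE_MASK = (1 << LANES) - 1   # 0x0F, the all-ones constant for 4 lanes
REG_MASK = 0xFF                # assumed 8-bit mask register


def lanes(container: int) -> int:
    """Return the bits the IR can observe: the low 4 bits of the container."""
    return container & LANE_MASK


def and_all_ones(container: int) -> int:
    """Model a hardware AND of the container with the 4-lane all-ones constant."""
    return container & LANE_MASK & REG_MASK


r0 = 0xA5                      # arbitrary example: upper 4 bits are dirty
r1 = and_all_ones(r0)

# Lane-wise the AND is the identity, which is all the IR sees.
print(lanes(r0) == lanes(r1))
# The containers differ, because the AND also cleared bits 4..7.
print(r0 == r1)
```

Whether the second difference matters depends on whether anything downstream ever reads bits 4..7.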
I've hit a pretty nasty issue on SKX with ANDs of masks <= 4 bits.
In the IR, we represent a 4b vector mask as <4 x i1>. This assumes
that the storage container for this type is also 4b, but it's not.
The storage type is not relevant: these bits are “unreachable” from the IR point of view.
The backend is supposed to lower the operation in a safe way whenever it needs to clear these bits.
For example, if you were to perform some arithmetic operation on these, they would likely get zero-extended to 8 bits first, and this is where the upper bits would be cleared.
The smallest mask register on SKX is 8b. This also implies that the
smallest load/store moves 8b.
We run into problems when we try to optimize ANDs (full test case attached):
%r1 = and <4 x i1> %r0, <i1 -1, i1 -1, i1 -1, i1 -1>
At the IR level the all-ones mask looks like the identity for this
operation, so LLVM will remove it. But it is not the identity, since
this operation should clear the top 4 bits of the 8-bit hardware
register in play. E.g.
No, this operation alone does not need to clear the upper bits; they are undefined before and after.
I began tracking down this issue and found that InstCombine will
incorrectly remove the AND. Then I noticed that the Reassociate pass
would also remove the AND if InstCombine did not. That made me
nervous. My current thinking is that this might be a larger problem
that shouldn't be patched up. Or maybe I made a faulty assumption with
the IR I chose for this operation.
There might be a legitimate issue, but your example falls short of illustrating it right now: you’re not showing how these upper bits leak into the computation anywhere.
--
Mehdi
Yes, good point. Updated test case exhibiting the dirty bits attached.
Notice that the kortest will operate on the dirty bits that should
have been zeroed.
Perhaps the problem is that the zext of the i4 to i16 does not get
generated correctly.
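As a rough illustration of why the missing zero-extension would matter, the sketch below models only the zero-flag behaviour of kortest (ZF is set when the OR of the two mask operands is zero). The function names, the 16-bit width and the value 0xA0 are assumptions chosen for the example, not taken from the attached test case.

```python
def zext_i4(container: int) -> int:
    """Model zext i4 -> i16: keep the 4 defined bits, zero everything above."""
    return container & 0xF


def kortest_zf(k1: int, k2: int, width: int = 16) -> bool:
    """Simplified kortest: ZF is set when the OR of both masks is all zero."""
    return ((k1 | k2) & ((1 << width) - 1)) == 0


dirty = 0xA0   # all four lanes are 0, but bits 4..7 hold leftover data

# Testing the raw register reports "some lane set", which is wrong.
print(kortest_zf(dirty, dirty))
# Zero-extending from the 4 defined bits first gives the expected answer.
print(kortest_zf(zext_i4(dirty), zext_i4(dirty)))
```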
The problem with your IR is actually the load. When you load a value whose size in bits is not a multiple of 8 (like i1 or <4 x i1>), the result is undefined unless the unused bits are zero. You can see this in the debug output from llc:
I should have attached the generated asm to save some trouble.
Apologies for that and attaching now…
The problem with your IR is actually the load. When you load a value whose size in bits is not a multiple of 8 (like i1 or <4 x i1>), the result is undefined unless the unused bits are zero.
Almost: LangRef says “When loading a value of a type like i20 with a size that is not an integral number of bytes, the result is undefined if the value was not originally written using a store of the same type.”
This is even stronger than what you described: even if the bits are explicitly 0, the results can be undefined.
Also, while I guess the most straightforward lowering of the store would clear the upper bits, I don’t believe LangRef requires that (it seems legal to me to clear the bits on the load instead, for example).
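A small Python model of the two lowerings mentioned above may help; it is only a sketch, with hypothetical function names, of how "clear on store" and "clear on load" agree for a matching store/load pair but disagree once the byte was written some other way.

```python
LANE_MASK = 0x0F  # the 4 defined bits of a <4 x i1> held in one byte


def store_clearing(reg: int) -> int:
    """Lowering A: the store zeroes the padding bits before writing the byte."""
    return reg & LANE_MASK


def load_plain(byte: int) -> int:
    """Lowering A: the load trusts the byte and copies it unchanged."""
    return byte


def store_plain(reg: int) -> int:
    """Lowering B: the store writes the whole register, padding included."""
    return reg & 0xFF


def load_clearing(byte: int) -> int:
    """Lowering B: the load zeroes the padding bits after reading the byte."""
    return byte & LANE_MASK


reg = 0xA5  # arbitrary example with dirty padding bits
a = load_plain(store_clearing(reg))
b = load_clearing(store_plain(reg))
print(a == b == 0x05)

# A byte written some other way (here: a raw byte 0xA5) only comes back clean
# under lowering B, so the result depends on which lowering the backend picked.
print(load_plain(0xA5) == 0x05, load_clearing(0xA5) == 0x05)
```

Under this reading, a frontend cannot rely on either behaviour unless the value was stored with the same type it is loaded with.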