Hi all,
When trying to generate a VFCmp instruction when UnsafeFPMath is set to true I get an assert “Unexpected CondCode” on my x86 system. This also happens with UnsafeFPMath set to false and using an unordered compare. Could someone look into this?
While I’m at it, is there any reason why only the most significant bit of the return value of VFCmp is defined (according to the documentation)? Both AltiVec and SSE set the components of the result to either all 1’s or all 0’s. Having only the most significant bit doesn’t seem useful to me at all, and (arithmetic) shifting vectors to replicate the bit isn’t supported.
Thanks!
Nicolas Capens
Hi all,
When trying to generate a VFCmp instruction when UnsafeFPMath is set to true I get an assert “Unexpected CondCode” on my x86 system. This also happens with UnsafeFPMath set to false and using an unordered compare. Could someone look into this?
Have you filed a bug?
While I’m at it, is there any reason why only the most significant bit of the return value of VFCmp is defined (according to the documentation)? Both AltiVec and SSE set the components of the result to either all 1’s or all 0’s. Having only the most significant bit doesn’t seem useful to me at all, and (arithmetic) shifting vectors to replicate the bit isn’t supported.
Nate can probably explain this better than anyone.
Evan
Hi all,
When trying to generate a VFCmp instruction when UnsafeFPMath is set to true I get an assert “Unexpected CondCode” on my x86 system. This also happens with UnsafeFPMath set to false and using an unordered compare. Could someone look into this?
Please provide a testcase.
While I’m at it, is there any reason why only the most significant bit of the return value of VFCmp is defined (according to the documentation)? Both AltiVec and SSE set the components of the result to either all 1’s or all 0’s. Having only the most significant bit doesn’t seem useful to me at all, and (arithmetic) shifting vectors to replicate the bit isn’t supported.
LLVM is intended to support other vector instruction sets, including SPU, Alpha, etc. The commonality was that the MSB is set, and we would like to add support for vector shifting at some point.
-Chris
Hi Evan,
No, I haven’t filed a bug. I’d first like someone to confirm this behavior. Anyway, I’ll post some test code in a minute.
Can I contact Nate directly about the definition of VFCmp?
Thanks,
Nicolas
Hi Chris,
I’ve attached a replacement of fibonacci.cpp that reproduces the issue on x86/SSE systems.
Regarding the definition of the VFCmp instruction, I think it would really be a lot more valuable to define it as returning all 1’s or all 0’s per element. Just setting the most significant bit is pretty much worthless (someone correct me if I’m wrong). I checked and I couldn’t actually find any instruction set that only sets the MSB when comparing vectors, except by actually doing just a subtract. Since people need full masks before they can do anything useful with it (requiring a shift or conditional replace) I was thinking why not make that part of VFCmp? Note that this change in VFCmp’s definition won’t break compatibility.
Kind regards,
Nicolas
fibonacci.cpp (1.25 KB)
I’ll take a look at this, thanks!
There are other architectures which don’t do this, so defining it as such would over-constrain the problem. The bits are undefined, and may be set to any value by the target arch. The goal here is that you can essentially treat each element of the vector as a signed integer and select (or other operation) if the value is less than zero, rather than specifically equal to -1. This matches things like SSE’s blend and PPC’s fsel.
Hi Nate!
I don’t see how that would work. Select doesn’t work per element.
Say we’re trying to vectorize the following C++ code:
if(v[0] < 0) v[0] += 1.0f;
if(v[1] < 0) v[1] += 1.0f;
if(v[2] < 0) v[2] += 1.0f;
if(v[3] < 0) v[3] += 1.0f;
With SSE assembly this would be as simple as:
movaps xmm1, xmm0 // v in xmm0
cmpltps xmm1, zero // zero = {0.0f, 0.0f, 0.0f, 0.0f}
andps xmm1, one // one = {1.0f, 1.0f, 1.0f, 1.0f}
addps xmm0, xmm1
With the current definition of VFCmp this seems hard if not impossible to achieve. Vector compare instructions that return all 1’s or all 0’s per element are very common, and they are quite powerful in my opinion (effectively allowing to implement a per-element Select). It seems to me that for the few architectures that don’t have such instructions it would be useful to have LLVM generate a short sequence of instructions that does result in useful masks of all 1’s or all 0’s. Or am I missing something?
Thanks a lot,
Nicolas
Well, this is a bit of a hack, but one way to write it would be
something like the following:
define <4 x float> @a2(<4 x float> %in) nounwind {
%cmpres = vfcmp olt <4 x float> %in, zeroinitializer
%cmpresshift = udiv <4 x i32> %cmpres, <i32 2147483648, i32
2147483648, i32 2147483648, i32 2147483648>
%cmpmask = sub <4 x i32> zeroinitializer, %cmpresshift
%andmask = and <4 x i32> %cmpmask, bitcast (<4 x float> <float 1.0,
float 1.0,float 1.0,float 1.0> to <4 x i32>)
%andmask2 = bitcast <4 x i32> %andmask to <4 x float>
%result = add <4 x float> %in, %andmask2
ret <4 x float> %result
}
And ideally, that would optimize down to the code as you wrote it.
It'll take some tweaks to codegen to get that working efficiently,
though: first, the div+neg doesn't get opimized to an arithmetic
shift, and second, the combiner can't tell that the shift is a noop.
-Eli
I think you’re missing a target flag that says “vfcmp sets all bits”. Clearly if you are implementing a bit-select in terms of and/andn/or on SSE2, you want all the bits to be set, but there are also useful things like blend and maskmov which look only at the top bit. I’m pretty sure we’ll get to the code you want without constraining vfcmp to set all bits.
Nate