branch on vector compare?

Hi all, llvm newbie here.

I'm trying to branch based on a vector compare. I've found a slow way (below)
which goes through memory. Is there some idiom I'm missing so that it would use
for instance movmsk for SSE or vcmpgt & cr6 for altivec?

Or do I need to resort to calling the intrinsic directly?

Thanks,
Stephen.

  %16 = fcmp ogt <4 x float> %15, %cr
  %17 = extractelement <4 x i1> %16, i32 0
  %18 = extractelement <4 x i1> %16, i32 1
  %19 = extractelement <4 x i1> %16, i32 2
  %20 = extractelement <4 x i1> %16, i32 3
  %21 = or i1 %17, %18
  %22 = or i1 %19, %20
  %23 = or i1 %21, %22
  br i1 %23, label %true1, label %false2
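For reference, here is a self-contained version of the above that can be fed
straight to llc (the function name and arguments are made up):

  define i32 @any_gt_slow(<4 x float> %v, <4 x float> %cr) {
  entry:
    %cmp = fcmp ogt <4 x float> %v, %cr
    ; extract each i1 lane and or them together
    %e0 = extractelement <4 x i1> %cmp, i32 0
    %e1 = extractelement <4 x i1> %cmp, i32 1
    %e2 = extractelement <4 x i1> %cmp, i32 2
    %e3 = extractelement <4 x i1> %cmp, i32 3
    %or01 = or i1 %e0, %e1
    %or23 = or i1 %e2, %e3
    %any = or i1 %or01, %or23
    br i1 %any, label %true1, label %false2
  true1:
    ret i32 1
  false2:
    ret i32 0
  }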

Hi Stephen,

> Hi all, llvm newbie here.

welcome!

> I'm trying to branch based on a vector compare. I've found a slow way (below)
> which goes through memory. Is there some idiom I'm missing so that it would use
> for instance movmsk for SSE or vcmpgt & cr6 for altivec?

I don't think you are missing anything: LLVM IR has no support for horizontal
operations like or'ing the elements of a vector of booleans together. The code
generators do try to recognize a few idioms and synthesize horizontal
operations from them, but I think only addition is currently recognized, and
it expects the addition to be done (IIRC) by using shufflevector to split the
vector in two, adding the two halves, and repeating. In fact, for your case
you could do something similar:
   %lo1 = shufflevector <4 x i1> %16, <4 x i1> undef, <2 x i32> <i32 0, i32 1>
   %hi1 = shufflevector <4 x i1> %16, <4 x i1> undef, <2 x i32> <i32 2, i32 3>
   %join = or <2 x i1> %lo1, %hi1
   %lo2 = extractelement <2 x i1> %join, i32 0
   %hi2 = extractelement <2 x i1> %join, i32 1
   %final = or i1 %lo2, %hi2
Currently I would expect the code generators to produce something nasty for
this. Feel free to open a bug report requesting that the code generators do
something better.
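
For reference, wiring that reduction into a branch as a self-contained function
would look something like this (untested sketch; the function and label names
are made up):

   define i32 @any_gt(<4 x float> %v, <4 x float> %cr) {
   entry:
     %cmp = fcmp ogt <4 x float> %v, %cr
     ; split the <4 x i1> mask in half and or the halves together
     %lo1 = shufflevector <4 x i1> %cmp, <4 x i1> undef, <2 x i32> <i32 0, i32 1>
     %hi1 = shufflevector <4 x i1> %cmp, <4 x i1> undef, <2 x i32> <i32 2, i32 3>
     %join = or <2 x i1> %lo1, %hi1
     %lo2 = extractelement <2 x i1> %join, i32 0
     %hi2 = extractelement <2 x i1> %join, i32 1
     %final = or i1 %lo2, %hi2
     br i1 %final, label %true1, label %false2
   true1:
     ret i32 1
   false2:
     ret i32 0
   }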

Ciao, Duncan.

Thanks Duncan,

you're right - that does compile to a mess of spills to memory not
unlike the original.

I went to have a look at this further: it seems the existing SelectInst
is pretty close to what is needed.
  Value *IRBuilder::CreateSelect(Value *C, Value *True, Value *False,
                                 const Twine &Name)
Currently, this asserts that True & False are both vector types of
the same size as "C". I was thinking of weakening this condition so that
if True and False are both i1 types, it would be allowed and would result
in something which can be branched on.

I have quite a bit of reading ahead it seems!
Stephen.

This looks quite similar to something I filed a bug on (12312). Michael
Liao submitted fixes for this, so I think
if you change it to
  %16 = fcmp ogt <4 x float> %15, %cr
  %17 = sext <4 x i1> %16 to <4 x i32>
  %18 = bitcast <4 x i32> %17 to i128
  %19 = icmp ne i128 %18, 0
  br i1 %19, label %true1, label %false2

it should do the trick (one cmpps + one ptest + one br instruction).
This, however, requires sse41, which I don't know if you have - you say the
extractelements go through memory, which I've never seen, though then again
our code didn't try to extract the i1 directly. Even without the fixes for
ptest, the above sequence will result in only 2 extraction steps instead of
4 if you're on x64 and the cpu supports sse41; without sse41, and hence
without pextrd/q, I guess it will probably still go through memory.
On altivec this sequence might not produce anything good, though. Also, the
free sext requires llvm 2.7 on x86 to work at all (certainly not a problem
nowadays, but other backends might differ), and the ptest sequence requires
a very recent svn.
I don't think the current code can generate movmskps + test (probably the
next best thing without sse41) instead of ptest if you've only got sse,
though.
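
For completeness, the whole thing as a self-contained function would look
something like this (untested sketch; function and label names are made up):

  define i32 @any_gt_ptest(<4 x float> %v, <4 x float> %cr) {
  entry:
    %cmp = fcmp ogt <4 x float> %v, %cr
    ; widen the mask and test the whole 128-bit value for non-zero
    %mask = sext <4 x i1> %cmp to <4 x i32>
    %bits = bitcast <4 x i32> %mask to i128
    %any = icmp ne i128 %bits, 0
    br i1 %any, label %true1, label %false2
  true1:
    ret i32 1
  false2:
    ret i32 0
  }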

Roland

Hi Roland,

> This, however, requires sse41, which I don't know if you have - you say
> the extractelements go through memory, which I've never seen

maybe Stephen is targeting something generic like i386 by accident.

Ciao, Duncan.

Thanks Roland, sign extending gets me part of the way at least.
I'm on version 3.1 and, as you say in the bug report, there are a
few extraneous instructions. For the record, casting to a <4 x i8>
seems to do a better job for x86 (shuffle, movd, test, jump). Using
<4 x i32> seems to issue a pextrd for each element. For x64, it seems
to be the same for either. I suppose it's all academic seeing as the
ptest patch looks good.
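
Concretely, the <4 x i8> variant was along these lines (a sketch; the value
names are made up):

  %cmp  = fcmp ogt <4 x float> %v, %cr
  %mask = sext <4 x i1> %cmp to <4 x i8>
  %bits = bitcast <4 x i8> %mask to i32
  %any  = icmp ne i32 %bits, 0
  br i1 %any, label %true1, label %false2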

Looking at it again, I'm not sure how I saw memory spills. Certainly
I can't reproduce them without using -O0. It's possible I did that
accidentally while investigating the issue.

Thanks,
Stephen.

Yes, the <4 x i8> cast looks like a good idea. Just be careful if you also
need to target cpus without ssse3: IIRC, without pshufb this will create
some horrible code (that could have been with an older llvm version,
though). Then again, if you don't have ssse3 you also won't have pextrd,
which means more shuffling to extract the values if you sign-extend them
to <4 x i32> too (if you're targeting altivec there's probably no such
issue, as I think it doesn't have such blatantly missing shuffle
instructions).
But yes, ptest looks like the obvious winner. For cpus without sse41 (and
there are tons of them still in use, not to mention still sold) it would be
nice if llvm could come up with pmovmskb/movmskps/movmskpd + test (these
instructions look like they were intended for exactly that use case, after
all). But the <4 x i8> sign-extend solution shouldn't hurt performance too
much either, if you've got ssse3.

Roland

If all you need is to test whether all flags are the same among the elements,
we could add pseudo PTEST support on CPUs without SSE4.1, i.e.

we could replace

cmpltps %xmm0, %xmm1
ptest %xmm1, %xmm1
jz LABEL

to

cmpltps %xmm0, %xmm1
movmskps %xmm1, %r8d
test %r8d, %r8d
jz LABEL

It looks much more efficient to me and only relies on SSE. But we have to
ensure that the 2 operands to PTEST are the same and that the value is
generated from a packed CMP.

I am figuring out how to simplify the checking of these pre-conditions.

As a somewhat off-topic issue, most vector IR so far operates element-wise,
i.e. vertically. The generalized issue from here and PR12312 is that we
don't have a simple way to express horizontal operations, e.g. primitives
like

  float %s = reduce fadd <N x float> %x
  i32 %m = reduce max <N x i32> %x
  i1 %c = any <N x i1> %x   or   i1 %c = reduce or <N x i1> %x
  i1 %c = all <N x i1> %x   or   i1 %c = reduce and <N x i1> %x

One more interesting example would be scan: a horizontal operation that
still generates a vector, e.g.

  <N x i32> %s = scan add <N x i32> %x, 0 ; exclusive scan
  <N x i32> %s = scan add <N x i32> %x, 1 ; inclusive scan
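
To illustrate the intended semantics (pseudo-IR; the concrete values are just
an example):

  ; given       %x = <4 x i32> <i32 1, i32 2, i32 3, i32 4>
  ; exclusive scan add yields <4 x i32> <i32 0, i32 1, i32 3, i32 6>
  ; inclusive scan add yields <4 x i32> <i32 1, i32 3, i32 6, i32 10>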

With these primitives, some workloads could be expressed more simply in IR,
and backends (like X86) could support some of them directly.

- michael