AVX Status?

Hi,

The last time the AVX backend was mentioned on this list seems to be from November 2010, so I would like to ask about the current status. Is anybody (e.g. at Cray?) still actively working on it?

I have tried both LLVM 2.9 final and the latest trunk, and it seems like some trivial stuff is already working and produces nice code for code using <8 x float>.
Unfortunately, the backend gets confused by mask code, e.g. as produced by VCMPPS, together with mask operations (which LLVM currently requires to work on <8 x i32>) and the corresponding bitcasts.

Consider these two examples:

define <8 x float> @test1(<8 x float> %a, <8 x float> %b, <8 x i32> %m) nounwind readnone {
entry:
   %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a, <8 x float> %b, i8 1) nounwind readnone
   %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %cmp) nounwind readnone
   ret <8 x float> %res
}

This works fine and produces the expected assembly (VCMPLTPS + VBLENDVPS).

On the other hand, this does not work:

define <8 x float> @test2(<8 x float> %a, <8 x float> %b, <8 x i32> %m) nounwind readnone {
entry:
   %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a, <8 x float> %b, i8 1) nounwind readnone
   %cast = bitcast <8 x float> %cmp to <8 x i32>
   %mask = and <8 x i32> %cast, %m
   %blend_cond = bitcast <8 x i32> %mask to <8 x float>
   %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %blend_cond) nounwind readnone
   ret <8 x float> %res
}

llc (latest trunk) bails out with:

LLVM ERROR: Cannot select: 0x2510540: v8f32 = bitcast 0x2532270 [ID=16]
   0x2532270: v4i64 = and 0x2532070, 0x2532170 [ID=15]
     0x2532070: v4i64 = bitcast 0x2510740 [ID=14]
       0x2510740: v8f32 = llvm.x86.avx.cmp.ps.256 0x2510640, 0x2511340, 0x2510f40, 0x2511140 [ORD=3] [ID=12]
...

The same goes for or and xor where VXORPS etc. should be selected. There seems to be some code for this because
xor <8 x i32> %m, %m
works, probably because it can get rid of all bitcasts.
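
For reference, here is a minimal pair (hypothetical functions, distilled from test2 above rather than taken from my real code) that isolates the difference:

; The self-xor selects fine (presumably via VXORPS, since no bitcasts need to survive):
define <8 x i32> @xor_ok(<8 x i32> %m) nounwind readnone {
entry:
   %zero = xor <8 x i32> %m, %m
   ret <8 x i32> %zero
}

; A real two-operand 256-bit integer 'and' plus the surrounding bitcasts
; presumably hits the same "Cannot select" error as test2 above:
define <8 x float> @and_fails(<8 x float> %x, <8 x i32> %m) nounwind readnone {
entry:
   %xi = bitcast <8 x float> %x to <8 x i32>
   %anded = and <8 x i32> %xi, %m
   %xf = bitcast <8 x i32> %anded to <8 x float>
   ret <8 x float> %xf
}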

Ideally, I guess we would want code like this instead of the intrinsics at some point:

define <8 x float> @test3(<8 x float> %a, <8 x float> %b, <8 x i1> %m) nounwind readnone {
entry:
   %cmp = fcmp ugt <8 x float> %a, %b
   %mask = and <8 x i1> %cmp, %m
   %res = select <8 x i1> %mask, <8 x float> %a, <8 x float> %b
   ret <8 x float> %res
}

-> VCMPPS, VANDPS, BLENDVPS

Nadav Rotem sent around a patch a few weeks ago in which he implemented codegen for the select for SSE; unfortunately, I have not had time to look at it in more depth so far.

Can anybody comment on the current status of AVX?

Best,
Ralf

Hello Ralf,

Chris said the AVX backend is not yet mature.

http://www.mail-archive.com/llvmbugs@cs.uiuc.edu/msg12442.html

I am also interested in the AVX codegen backend and am trying to write
patches to fix the currently unusable AVX codegen.
I have just submitted a patch to fix fpextend (VCVTSS2SD) and
sitofp (VCVTSI2SD) codegen, and it is now in review.

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20110530/121689.html

Work on AVX would definitely be welcome, but at this time no one
is actively working on the AVX backend.

I am trying to write AVX patches as much as possible (at least enough to run
my own AVX code correctly), but my time is very limited, so I hope someone
else will actively work on the AVX backend...

Hi Ralf

Hi,

The last time the AVX backend was mentioned on this list seems to be
from November 2010, so I would like to ask about the current status. Is
anybody (e.g. at Cray?) still actively working on it?

I don't think so!

I have tried both LLVM 2.9 final and the latest trunk, and it seems like
some trivial stuff is already working and produces nice code for code
using <8 x float>.

Almost everything that could be matched in tablegen files only by
extending the 128-bit PatFrags and PatLeafs to their 256-bit
counterparts should work, but besides that (which is where the
interesting stuff happens) there's no support yet!
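
To make the distinction concrete, here is a hypothetical example (not from Ralf's code) of the kind of operation that already works because it maps directly onto an extended 128-bit pattern:

define <8 x float> @trivial(<8 x float> %a, <8 x float> %b) nounwind readnone {
entry:
   %sum = fadd <8 x float> %a, %b    ; plain 256-bit FP arithmetic -> VADDPS
   ret <8 x float> %sum
}

The interesting cases (shuffles, 256-bit integer operations, mask handling) are exactly where that simple extension is not enough.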

Unfortunately, the backend gets confused by mask code, e.g. as produced
by VCMPPS, together with mask operations (which LLVM currently requires
to work on <8 x i32>) and the corresponding bitcasts.

Consider these two examples:

define <8 x float> @test1(<8 x float> %a, <8 x float> %b, <8 x i32> %m) nounwind readnone {
entry:
   %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a, <8 x float> %b, i8 1) nounwind readnone
   %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %cmp) nounwind readnone
   ret <8 x float> %res
}

This works fine and produces the expected assembly (VCMPLTPS + VBLENDVPS).

On the other hand, this does not work:

define <8 x float> @test2(<8 x float> %a, <8 x float> %b, <8 x i32> %m) nounwind readnone {
entry:
   %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a, <8 x float> %b, i8 1) nounwind readnone
   %cast = bitcast <8 x float> %cmp to <8 x i32>
   %mask = and <8 x i32> %cast, %m
   %blend_cond = bitcast <8 x i32> %mask to <8 x float>
   %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %blend_cond) nounwind readnone
   ret <8 x float> %res
}

llc (latest trunk) bails out with:

LLVM ERROR: Cannot select: 0x2510540: v8f32 = bitcast 0x2532270 [ID=16]
   0x2532270: v4i64 = and 0x2532070, 0x2532170 [ID=15]
     0x2532070: v4i64 = bitcast 0x2510740 [ID=14]
       0x2510740: v8f32 = llvm.x86.avx.cmp.ps.256 0x2510640, 0x2511340, 0x2510f40, 0x2511140 [ORD=3] [ID=12]
...

The same goes for or and xor where VXORPS etc. should be selected.

Please file bug reports!

There seems to be some code for this because
xor <8 x i32> %m, %m
works, probably because it can get rid of all bitcasts.

Ideally, I guess we would want code like this instead of the intrinsics
at some point:

define <8 x float> @test3(<8 x float> %a, <8 x float> %b, <8 x i1> %m) nounwind readnone {
entry:
   %cmp = fcmp ugt <8 x float> %a, %b
   %mask = and <8 x i1> %cmp, %m
   %res = select <8 x i1> %mask, <8 x float> %a, <8 x float> %b
   ret <8 x float> %res
}

That would be nice indeed

-> VCMPPS, VANDPS, BLENDVPS

Nadav Rotem sent around a patch a few weeks ago in which he implemented
codegen for the select for SSE; unfortunately, I have not had time to
look at it in more depth so far.

Can anybody comment on the current status of AVX?

No codegen support yet (although some stuff works), but the assembler
support is complete!

Thanks Syoyo and Bruno for your replies.

As suggested, I filed a bug under http://llvm.org/bugs/show_bug.cgi?id=10073 .

I am not familiar with .td files and the LLVM backend infrastructure yet, but I might give it a try and solve it myself if I find the time.

Best,
Ralf

I am not familiar with .td files and the LLVM backend infrastructure yet,
but I might give it a try and solve it myself if I find the time.

Nice. Patches are always welcome! :)

Bruno Cardoso Lopes <bruno.cardoso@gmail.com> writes:

Hi Ralf

Hi,

The last time the AVX backend was mentioned on this list seems to be
from November 2010, so I would like to ask about the current status. Is
anybody (e.g. at Cray?) still actively working on it?

I don't think so!

Yes, we are! I am doing a lot of tuning work at the moment. We have
been rather swamped with work for new products and I am now just getting
out from under that. Expect to see more patches flowing in over the
next several weeks. There's a LOT left to send up.

I have tried both LLVM 2.9 final and the latest trunk, and it seems like
some trivial stuff is already working and produces nice code for code
using <8 x float>.

Almost everything that could be matched in tablegen files only by
extending the 128-bit PatFrags and PatLeafs to their 256-bit
counterparts should work, but besides that (which is where the
interesting stuff happens) there's no support yet!

Indeed. The bulk of the work is in shuffle generation.

We have a full implementation. I just have to get enough time to get it
merged. :-/

define <8 x float> @test2(<8 x float> %a, <8 x float> %b, <8 x i32> %m) nounwind readnone {
entry:
   %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a, <8 x float> %b, i8 1) nounwind readnone
   %cast = bitcast <8 x float> %cmp to <8 x i32>
   %mask = and <8 x i32> %cast, %m
   %blend_cond = bitcast <8 x i32> %mask to <8 x float>
   %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %a, <8 x float> %b, <8 x float> %blend_cond) nounwind readnone
   ret <8 x float> %res
}

llc (latest trunk) bails out with:

LLVM ERROR: Cannot select: 0x2510540: v8f32 = bitcast 0x2532270 [ID=16]
   0x2532270: v4i64 = and 0x2532070, 0x2532170 [ID=15]
     0x2532070: v4i64 = bitcast 0x2510740 [ID=14]
       0x2510740: v8f32 = llvm.x86.avx.cmp.ps.256 0x2510640, 0x2511340, 0x2510f40, 0x2511140 [ORD=3] [ID=12]
...

The same goes for or and xor where VXORPS etc. should be selected.

Please file bug reports!

It's a problem with integer code. There are no 256-bit integer bitwise
instructions in AVX. There are no 256-bit integer instructions period.
What's missing is the legalize code to handle this. I have it in our
tree.

There seems to be some code for this because
xor <8 x i32> %m, %m
works, probably because it can get rid of all bitcasts.

And it can use xorps to implement the operation.

Ideally, I guess we would want code like this instead of the intrinsics
at some point:

define <8 x float> @test3(<8 x float> %a, <8 x float> %b, <8 x i1> %m) nounwind readnone {
entry:
   %cmp = fcmp ugt <8 x float> %a, %b
   %mask = and <8 x i1> %cmp, %m
   %res = select <8 x i1> %mask, <8 x float> %a, <8 x float> %b
   ret <8 x float> %res
}

That would be nice indeed

Some lowering code would be needed to convert from i1 masks to i8 masks
(the so-called packed vs. sparse mask issue). I don't think I've added
anything to do this as our vectorizer doesn't generate code this way.
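
At the IR level, a rough sketch of what such a conversion amounts to might look like this (hypothetical code, not something from our tree: the i1 mask is widened so that each lane carries the sign bit the blend instruction tests):

declare <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>, <8 x float>, <8 x float>) nounwind readnone

define <8 x float> @select_sketch(<8 x float> %a, <8 x float> %b, <8 x i1> %m) nounwind readnone {
entry:
   %wide = sext <8 x i1> %m to <8 x i32>            ; each lane becomes all-ones or all-zeros
   %cond = bitcast <8 x i32> %wide to <8 x float>
   ; select %m, %a, %b == blendv(%b, %a, %cond): lanes whose sign bit is set take %a
   %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %b, <8 x float> %a, <8 x float> %cond) nounwind readnone
   ret <8 x float> %res
}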

-> VCMPPS, VANDPS, BLENDVPS

Nadav Rotem sent around a patch a few weeks ago in which he implemented
codegen for the select for SSE; unfortunately, I have not had time to
look at it in more depth so far.

Can anybody comment on the current status of AVX?

No codegen support yet (although some stuff works), but the assembler
support is complete!

There's some codegen support, but it's very, very, very incomplete.

                            -Dave

Hi David,

The last time the AVX backend was mentioned on this list seems to be
from November 2010, so I would like to ask about the current status. Is
anybody (e.g. at Cray?) still actively working on it?

Yes, we are! I am doing a lot of tuning work at the moment. We have
been rather swamped with work for new products and I am now just getting
out from under that. Expect to see more patches flowing in over the
next several weeks. There's a LOT left to send up.

We have a full implementation. I just have to get enough time to get it
merged. :-/

This sounds great!

For my case, I only require some basic support, so I am optimistic that your next few patches will provide everything I need.

It's a problem with integer code. There are no 256-bit integer bitwise
instructions in AVX. There are no 256-bit integer instructions period.
What's missing is the legalize code to handle this. I have it in our
tree.

There seems to be some code for this because
xor <8 x i32> %m, %m
works, probably because it can get rid of all bitcasts.

And it can use xorps to implement the operation.

Yes, that makes sense. But why does the same not work with "and" and "or" (-> VANDPS/VORPS)?
Anyway, I am looking forward to testing your patches.

Would it be possible to send around a notification when the stuff goes upstream?
Thanks a lot :).

Best,
Ralf

Ralf Karrenberg <Chareos@gmx.de> writes:

This sounds great!

For my case, I only require some basic support, so I am optimistic
that your next few patches will provide everything I need.

If my evil plan works out, within the next 10 or so patches we should be
in a place where pushing everything up goes pretty quickly. It's about
8 TableGen patches and then a patch to do ADD or some other simple thing
like that to start the so-called SIMD reorg. Basically, if I can get
the SIMD reorg patch settled, everything after that is really simple
because it all looks uniform. Of course, that reorg/ADD patch is going
to cause a lot of discussion, I suspect. ;-)

There seems to be some code for this because
xor <8 x i32> %m, %m
works, probably because it can get rid of all bitcasts.

And it can use xorps to implement the operation.

Yes, that makes sense. But why does the same not work with "and" and
"or" (-> VANDPS/VORPS)?

It can. Maybe the pattern for ANDPS isn't there yet. I'd have to dig
deeper into the failure. The fact that there are inconsistencies like
this is one of the motivations behind the SIMD reorg. There are plenty
of such inconsistencies in the existing SSE spec. Hopefully after the
reorg, implementing a pattern like VANDPS given an existing one for
VXORPS is trivial.

Anyway, I am looking forward to testing your patches.

So am I. :-)

Would it be possible to send around a notification when the stuff goes
upstream?
Thanks a lot :).

I try to put [AVX] in the subject of patch mailings (to -commits) and
commit messages. Once in a while I forget. I'll try to remember to send
something to -dev when major stuff appears.

                                 -Dave

Hello David,

I try to put [AVX] in the subject of patch mailings (to -commits) and
commit messages. Once in a while I forget. I'll try to remember to send
something to -dev when major stuff appears.

Good. I am also trying to send patches to llvm-commits.
Would it be better for me to use the [AVX] prefix in the subject so that we
can easily identify "this is an AVX patch" and avoid duplicated work?

I've sent an fpext codegen patch. Next I am working on the sitofp codegen path.

Syoyo Fujita <syoyofujita@gmail.com> writes:

Hello David,

I try to put [AVX] in the subject of patch mailings (to -commits) and
commit messages. Once in a while I forget. I'll try to remember to send
something to -dev when major stuff appears.

Good. I am also trying to send patches to llvm-commits.
Would it be better for me to use the [AVX] prefix in the subject so that we
can easily identify "this is an AVX patch" and avoid duplicated work?

Yes, that would be helpful. Thanks!

I've sent an fpext codegen patch. Next I am working on the sitofp codegen path.

Great! Glad to have the help!

                          -Dave