LLVM 2.8 and MMX

The implementation of MMX is in a state of flux: the plan is to stop generic vectors from being selected to mmx operations, and to add intrinsics for every mmx operation (including add). However, 2.8 shouldn't be broken; that would be a serious regression. Please file a bug with a small example. Thanks!

-Chris

Hi Chris,

It’s not broken, but the performance is crippled.

I noticed that the code still contains some MMX instructions, but several operations get expanded (apparently swizzling and such get expanded to a large number of byte moves).

I could use intrinsics, but they wouldn’t be optimized like other vector operations. I could use SSE operations, but they would increase SSE register pressure while MMX registers are left unused.

So ideally I would like to inform LLVM that selecting MMX instructions is fine. I’m inserting emms instructions in the right spots myself.

Thanks,

Nicolas

I think some changes related to MMX landed before 2.8 branched which
shouldn't have... please file a bug.

I think the direction going forward we're going to prefer is that
64-bit vectors get widened to 128-bit vectors, which might not be
quite ideal in some situations, but will avoid situations where MMX
instructions are incorrectly generated. That said, the work isn't
finished, so it shouldn't be in 2.8.

-Eli

Right. There should be no major change before 2.8, so if something bad happened, it needs to be fixed on the branch.

In 2.9, the only way to get MMX will be to use mmx intrinsics; generic vectors will not map onto MMX. Sorry, Nicolas. One major problem is that the optimizer introduces generic vectors (e.g. see r112696 in the SRoA pass), which use mmx where it was not previously used. This means that having your frontend introduce emms is not enough.

-Chris

Hi all,

Sorry for the late reply. I got sidetracked by other fun projects. ;-)

I found that the performance regression is caused by revisions 112804,
112805 and 112806. Those changes were made 2 days prior to the 2.8
branching, so it may not have been the intention to include them there?
Either way they make my vector-intensive code two times slower so it would
be much appreciated to revert these changes for the 2.8 release.

Thanks,

Nicolas

Interesting. These are all Bruno's patches, and I'm pretty sure they weren't intended to affect MMX. I doubt reverting them is right since the effect on SSE is presumably positive. Unfortunately Bruno is not here any more.

Hi Nicolas,

Are you able to narrow it down to one of those patches? From the comments, 112804 and 112805 seem fairly innocuous:

Hi Dale,

I suspect that these patches were intended to improve 128-bit vector
performance but caused certain 64-bit vector operations to no longer lower
to MMX instructions. Anyway, now that I've narrowed it down to these patches
I think I can narrow it down further to a specific case so I can file a
bug...

Will Bruno be back soon or is he no longer working on the project for good?

Cheers,

Nicolas

Bruno's internship is over and he's currently on vacation, but he will surely continue to work on LLVM. So we will know more once he returns from the vacation.

Sebastian

This thread confuses me. I thought Chris said that LLVM 2.8 will not
lower generic vectors to MMX because it breaks x87 code, and I didn't
see an answer to your question about a switch to tell the code
generator otherwise. However, you're complaining that MMX performance
is subpar, even though LLVM 2.8 isn't supposed to generate MMX
instructions.

Can someone clarify the situation for me?

Thanks,
Reid

LLVM isn't going to stop generating MMX instructions altogether. We can't do that. :-) If the user specifically wants MMX (by, say, using the builtins), we still have to support that. The plan to cease generating MMX for generic vectors is a work in progress right now. It's not in 2.8.

-bw

Right, early on there was speculation that the early phases of this work were causing the problem Nicolas is seeing, but it now appears that that problem is unrelated.

Hi Bill,

I'm currently focusing on 112804 and it definitely results in a performance
regression.

If I understand the comment correctly, it is intended to remove a case that
should never be hit. However, when I check for isUNPCKL_v_undef_Mask(SVOp)
below it my code does hit it. In particular it happens when I generate
"shufflevector <8 x i8> %v1, <8 x i8> %v1, undef, {0, 8, 1, 9, 2, 10, 3,
11}". This used to lower to UNPCKLBW but with revision 112804 it becomes a
bunch of byte moves.
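
To make the shuffle above concrete, here is a minimal C++ sketch that models it on plain arrays. The names shuffle and unpcklbw are made up for illustration and are not LLVM APIs; the sketch only shows that the mask {0, 8, 1, 9, 2, 10, 3, 11} interleaves the low four bytes of the two inputs, which is the operation a 64-bit UNPCKLBW performs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8 = std::array<std::uint8_t, 8>;

// Illustrative model of a two-input, 8-lane shuffle (not LLVM code):
// mask indices 0..7 select from `a`, indices 8..15 select from `b`.
V8 shuffle(const V8& a, const V8& b, const std::array<int, 8>& mask) {
    V8 out{};
    for (int i = 0; i < 8; ++i) {
        int m = mask[i];
        assert(m >= 0 && m < 16);
        out[i] = (m < 8) ? a[m] : b[m - 8];
    }
    return out;
}

// Illustrative model of UNPCKLBW on 64-bit operands:
// interleave the low four bytes of each input.
V8 unpcklbw(const V8& a, const V8& b) {
    return {a[0], b[0], a[1], b[1], a[2], b[2], a[3], b[3]};
}
```

Calling shuffle(v, v, {0, 8, 1, 9, 2, 10, 3, 11}) gives the same bytes as unpcklbw(v, v), so a single instruction is enough for this mask and the expansion into byte moves is avoidable.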

Also note that this change doesn't appear to improve SSE code generation.
I'll look at whether the other patches could also be reverted to fix MMX
performance without hurting SSE...

Cheers,

Nicolas

Hi all,

I think I figured it out:
112804 causes 64-bit UNPCKLBW to no longer be selected for certain cases.
112805 is benign.
112806 causes 64-bit UNPCKHBW to no longer be selected for certain cases.

I've attached a potential fix for the 2.8 branch.

The real problem is that the code above it which checks for
isUNPCK[L|H]_v_undef_Mask cases is only for when OptForSize is true. It
assumes that otherwise things can get lowered to PSHUFD (which is true for
v4i32 and v4f32 but nothing else - in particular MMX operations).
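
A rough C++ sketch of that guard, to show where the 64-bit cases fall through. Everything here is hypothetical and simplified: VecType, the mask helpers and the two triesUnpck functions are stand-ins written for this message, not the code in the X86 backend and not the attached patch. It also assumes that a shuffle of a vector with itself has already been rewritten into the one-input form <0, 0, 1, 1, ...> before the check runs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the vector types discussed in the thread.
enum class VecType { v8i8, v4i16, v2i32, v16i8, v8i16, v4i32, v4f32 };

// True when `mask` has the form <0, 0, 1, 1, 2, 2, ...> ("unpckl v, undef").
// A negative entry stands for an undef lane and matches anything.
bool isUnpcklVUndefMask(const std::vector<int>& mask) {
    if (mask.empty() || mask.size() % 2 != 0) return false;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        int expected = static_cast<int>(i / 2);
        if (mask[i] >= 0 && mask[i] != expected) return false;
    }
    return true;
}

// Same idea for the high half: <n/2, n/2, n/2+1, n/2+1, ...>.
bool isUnpckhVUndefMask(const std::vector<int>& mask) {
    if (mask.empty() || mask.size() % 2 != 0) return false;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        int expected = static_cast<int>(mask.size() / 2 + i / 2);
        if (mask[i] >= 0 && mask[i] != expected) return false;
    }
    return true;
}

// PSHUFD works on four 32-bit lanes, so only these types can fall back to it.
bool canUsePshufd(VecType t) {
    return t == VecType::v4i32 || t == VecType::v4f32;
}

// Guard as described above: the unpck form is only tried when optimizing for size.
bool triesUnpckOld(bool optForSize, VecType) {
    return optForSize;
}

// One possible adjustment (an assumption, not the attached patch):
// also try the unpck form whenever PSHUFD is not an option for the type.
bool triesUnpckAdjusted(bool optForSize, VecType t) {
    return optForSize || !canUsePshufd(t);
}
```

With the old guard, a v8i8 shuffle that matches isUnpcklVUndefMask is skipped unless OptForSize is set; with the adjusted guard it is still tried, while v4i32 and v4f32 keep going to PSHUFD as before.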

I'll file a bug now...

Nicolas

unpck-mmx.patch (1.26 KB)

Assign the bug to me and I'll fix it in TOT next week! Thanks for
narrowing it down!

Thanks Bruno, it's PR 8200.