How does SSEDomainFix work?

Hello. This is my first post.

I have tried the SSE execution domain fixup pass,
but I am not able to see any improvement.

I expect the example below to use MOVDQA, PAND, etc.
(On Nehalem, ANDPS is much slower than PAND.)

Please tell me if I am doing something wrong.

Thank you.
Takumi

Host: i386-mingw32
Build: trunk@103373

foo.ll:
define <4 x i32> @foo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z)
nounwind readnone {
entry:
  %0 = and <4 x i32> %x, %z
  %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = and <4 x i32> %not, %y
  %2 = xor <4 x i32> %0, %1
  ret <4 x i32> %2
}

define <2 x i64> @bar(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z)
nounwind readnone {
entry:
  %0 = and <2 x i64> %x, %z
  %not = xor <2 x i64> %z, <i64 -1, i64 -1>
  %1 = and <2 x i64> %not, %y
  %2 = xor <2 x i64> %0, %1
  ret <2 x i64> %2
}

$ llc -mcpu=nehalem -debug-pass=Structure foo.bc -o foo.s
(snip)
    Code Placement Optimizer
    SSE execution domain fixup
    Machine Natural Loop Construction
    X86 AT&T-Style Assembly Printer
    Delete Garbage Collector Information

foo.s: (edited)
_foo:
  movaps %xmm0, %xmm3
  andps %xmm2, %xmm3
  andnps %xmm1, %xmm2
  movaps %xmm2, %xmm0
  xorps %xmm3, %xmm0
  ret

_bar:
  movaps %xmm0, %xmm3
  andps %xmm2, %xmm3
  andnps %xmm1, %xmm2
  movaps %xmm2, %xmm0
  xorps %xmm3, %xmm0
  ret

> Hello. This is my first post.

Welcome!

> I have tried the SSE execution domain fixup pass,
> but I am not able to see any improvement.

Did you actually measure runtime, or did you look at assembly?

> I expect the example below to use MOVDQA, PAND, etc.
> (On Nehalem, ANDPS is much slower than PAND.)

Are you sure? The andps and pand instructions are actually the same speed, but on Nehalem there is a latency penalty for moving data between the int and float domains.

The SSE execution domain pass tries to minimize the extra latency by switching instructions.

In your examples, all the operations are available as both int and float instructions. The instruction selector chooses the float instructions because their encodings are smaller. The SSE execution domain pass does not change them because there are zero domain crossings and therefore zero extra latency. Everything takes place in the float domain, which is just as fast.

If you use operations that are only available in one domain, the SSE execution domain pass kicks in:

define <4 x i32> @intfoo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z)
nounwind readnone {
entry:
  %0 = add <4 x i32> %x, %z
  %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
  %1 = and <4 x i32> %not, %y
  %2 = xor <4 x i32> %0, %1
  ret <4 x i32> %2
}

_intfoo:
  movdqa %xmm0, %xmm3
  paddd %xmm2, %xmm3
  pandn %xmm1, %xmm2
  movdqa %xmm2, %xmm0
  pxor %xmm3, %xmm0
  ret

All the instructions moved to the int domain because the add forced them.

> Please tell me if I am doing something wrong.

You should measure whether LLVM's code is actually slower than the code you want. If it is, I would like to hear about it.

Our weakness is the shufflevector instruction. It is selected into shufps/pshufd/palignr/... purely by pattern matching; the instruction selector does not consider execution domains. This can be a problem because these instructions cannot be freely interchanged by the SSE execution domain pass.

Dear Jakob-san,

> Welcome!

:smiley:

Thank you for your reply. First, I must apologize:
I misunderstood the aim of SSEDomainFix.
Now I see what the pass does.

But the point I would like to make is throughput
rather than (inter-domain) latency.
In fact, by my measurement, FP logic ops are 3x slower than the equivalent integer ops on Nehalem.
I think integer ops should be preferred on Nehalem (and generic SSE2).
(Shorter instructions may still be chosen with -Os.)

The attachment includes a simple (but admittedly bogus) asm-and-C source and a
Win32 executable.
$ mingw32-gcc -msse2 -O4 -Wall -funroll-all-loops foo.c
It should compile on other x86 hosts as well,
but the process affinity must be constrained to a single core ; )

The counts below are cycles per million iterations on a Core i7:
982270 xorps
982231 movaps
371671 pxor
342628 movdqa

Integer logic ops can be issued 3-wide, but these FP ops only 1-wide.
(As we know, they are nearly the same on Conroe and Penryn.)
Excuse me, loads via movdqa and movaps were not measured. : (

See also:
- Intel 64 and IA-32 Architectures Optimization Reference Manual
  http://www.intel.com/assets/pdf/manual/248966.pdf
- Agner Fog's optimization resources
  "Software optimization resources: C++ and assembly. Windows, Linux, BSD, Mac OS X"

Thank you,
Takumi

xmm.zip (2.84 KB)