Some caveats: I was looking at LLVM because its JIT capability ought
to make it useful for doing GA-based stuff, rather than because of the
intrinsic quality of the LLVM scheduler. I'm more of an image
processing/scientific computing guy than a compiler guy, but I'd
expect that for common unpredictable branchy code (especially on OoO
CPUs) the difference between a good schedule and a close-to-optimal
schedule is probably in the noise. For the kind of non-branchy numeric
code below, I'd expect that IF THE STOCHASTIC SEARCH CAN BE MADE TO
ACTUALLY WORK you might be able to get 5-10 percent consistently.

I don't have access to a high-end x86 at the moment, so here are some
benchmark results on an Atom N270 that I don't completely understand
and that may well be wrong; I need to dig in further and see precisely
what's happening. The program evaluates both

uint64_t
intrinsicsFn(__m128i* __restrict in, __m128i* __restrict out)
{
    uint64_t startTSC, endTSC;
    int i = 0;
    rdtscll(startTSC);
    do {
        /* array indices are byte offsets/16 so they line up with the
           displacements in the asm version below */
        __m128i xmm0 = in[0/16];
        __m128i xmm1 = in[16/16];
        __m128i xmm2 = in[32/16];
        __m128i xmm3 = in[48/16];
        xmm0 = _mm_add_epi16(in[64/16], xmm0);
        xmm1 = _mm_add_epi16(in[80/16], xmm1);
        xmm2 = _mm_add_epi16(in[96/16], xmm2);
        xmm3 = _mm_add_epi16(in[112/16], xmm3);
        /* pairwise reduction of the four accumulators */
        xmm0 = _mm_add_epi16(xmm1, xmm0);
        xmm2 = _mm_add_epi16(xmm3, xmm2);
        xmm0 = _mm_add_epi16(xmm2, xmm0);
        out[0/16] = xmm0;
        in += 8; out += 1;
    } while (++i < LIM);
    rdtscll(endTSC);
    return endTSC - startTSC;
}
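
(Here rdtscll is the usual Linux-kernel-style timestamp-counter macro;
on a 32-bit x86 like the N270 the classic definition is along the
lines of

/* "=A" = the edx:eax register pair on i386, so val gets the full
   64-bit timestamp counter */
#define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))

so the returned numbers are TSC cycle deltas.)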

and each of the 9072 different valid reorderings of the (hopefully
equivalent) inline-assembly-style function

uint64_t f9072(__m128i* __restrict in, __m128i* __restrict out)
{
    uint64_t startTSC, endTSC;
    int i = 0;
    rdtscll(startTSC);
    do {
        __asm__ __volatile__(
            "\tmovdqa 0(%1),%%xmm0\n"
            "\tmovdqa 16(%1),%%xmm1\n"
            "\tmovdqa 32(%1),%%xmm2\n"
            "\tmovdqa 48(%1),%%xmm3\n"
            "\tpaddw 64(%1),%%xmm0\n"
            "\tpaddw 80(%1),%%xmm1\n"
            "\tpaddw 96(%1),%%xmm2\n"
            "\tpaddw 112(%1),%%xmm3\n"
            "\tpaddw %%xmm1,%%xmm0\n"
            "\tpaddw %%xmm3,%%xmm2\n"
            "\tpaddw %%xmm2,%%xmm0\n"
            "\tmovdqa %%xmm0,(%0)\n"
            : "=r" (out)
            : "r" (in)
            : "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7");
        in += 8; out += 1;
    } while (++i < LIM);
    rdtscll(endTSC);
    return endTSC - startTSC;
}
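
As a sanity check on that 9072: the dependency DAG of the loop body is
two independent load/load/add/add/add halves (each half has 6 valid
internal orderings, and the two 5-instruction halves interleave in
C(10,5) = 252 ways) followed by the forced final paddw and movdqa, so
6 * 6 * 252 = 9072 linear extensions. A throwaway counter along these
lines (not part of the benchmark; the bitmasks just encode the direct
dependencies of the 12 instructions above) reproduces the number:

#include <cstdio>

enum { N = 12 };

/* deps[i] = set of instructions that must precede instruction i */
static const unsigned deps[N] = {
    0, 0, 0, 0,                  /* movdqa 0/16/32/48 -> xmm0..xmm3 */
    1u<<0, 1u<<1, 1u<<2, 1u<<3,  /* paddw 64/80/96/112(%1)          */
    (1u<<4)|(1u<<5),             /* paddw %xmm1,%xmm0               */
    (1u<<6)|(1u<<7),             /* paddw %xmm3,%xmm2               */
    (1u<<8)|(1u<<9),             /* paddw %xmm2,%xmm0               */
    1u<<10                       /* movdqa %xmm0,(%0)               */
};

/* count the linear extensions by always appending some instruction
   whose predecessors have all been emitted */
static long count(unsigned done)
{
    if (done == (1u<<N)-1) return 1;
    long n = 0;
    for (int i = 0; i < N; ++i)
        if (!(done & (1u<<i)) && (deps[i] & done) == deps[i])
            n += count(done | (1u<<i));
    return n;
}

int main() { std::printf("%ld\n", count(0)); return 0; }  /* prints 9072 */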

Note that my inline asm isn't that good, so the array pointer
increments are forced to sit at the end of the loop body in the asm
versions, which is probably not good for speed.
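
FWIW, the standard trick would be to make both pointers read-write
("+r") operands so the increments can live inside the asm block and be
placed anywhere after the last use of each pointer; an untested sketch
of the variant loop body (the C-level in += 8; out += 1; then goes
away, and a "memory" clobber is needed since the store through %0 is
invisible to the compiler):

__asm__ __volatile__(
    "\tmovdqa 0(%1),%%xmm0\n"
    /* ... the remaining loads and paddws exactly as above ... */
    "\tadd $128,%1\n"          /* in  += 8 __m128i = 128 bytes */
    "\tmovdqa %%xmm0,(%0)\n"
    "\tadd $16,%0\n"           /* out += 1 __m128i = 16 bytes  */
    : "+r" (out), "+r" (in)
    :
    : "xmm0","xmm1","xmm2","xmm3","memory");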

I'd like to try a bigger inner loop, but the number of combinations
obviously grows too rapidly to do it this way rather than via a
stochastic sampling approach (sketched below). Running those through
g++-4.4 (which DOESN'T have an Atom pipeline model) and clang++ trunk
gives the attached graph (y is proportional to runtime, whilst x is
the scheduling instance, ordered by increasing y value).
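
For the stochastic version, the natural building block is a random
valid schedule, i.e. a random topological order of the dependency DAG:
repeatedly pick uniformly among the instructions whose predecessors
have all been emitted. (That's not a uniform distribution over linear
extensions, but as a seed/mutation step for GA-style search it
shouldn't matter much.) A sketch, reusing deps[]/N from the counting
snippet above, with measure() standing in for "emit this ordering, run
it, return the rdtsc delta":

#include <cstdint>
#include <algorithm>
#include <random>
#include <vector>

uint64_t measure(const std::vector<int>& schedule); /* assumed elsewhere */

std::vector<int> randomSchedule(const unsigned* deps, int n, std::mt19937& rng)
{
    std::vector<int> order;
    unsigned done = 0;
    while ((int)order.size() < n) {
        std::vector<int> ready; /* instructions whose deps are all emitted */
        for (int i = 0; i < n; ++i)
            if (!(done & (1u<<i)) && (deps[i] & done) == deps[i])
                ready.push_back(i);
        int pick = ready[std::uniform_int_distribution<int>(
                             0, (int)ready.size()-1)(rng)];
        order.push_back(pick);
        done |= 1u << pick;
    }
    return order;
}

/* e.g. keep the best of 10000 random schedules */
uint64_t bestOfRandom()
{
    std::mt19937 rng(42);
    uint64_t best = ~(uint64_t)0;
    for (int k = 0; k < 10000; ++k)
        best = std::min(best, measure(randomSchedule(deps, N, rng)));
    return best;
}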

Minimum over the g++ inline-asm reorderings:      73200
g++ -O3 on the intrinsics fn:                     93712
Minimum over the clang++ inline-asm reorderings:  79275
clang++ -O3 on the intrinsics fn:                 72581

(All numbers are rdtsc cycle counts.)

Speculations: the clang intrinsics build is finding some optimisation
that is being blocked in the asm version, so it's not just the SSE
scheduling that's being measured. g++ is systematically applying some
optimisation to the asm versions that clang isn't, hence the rough
"curve offset". For some reason g++ is picking a noticeably worse
ordering for the intrinsics; I need to find an Ubuntu package for
g++-4.5 and see if its Atom pipeline model improves things.

(I can make the hacked-together benchmark code available by email if
anyone wants it.)

This probably raises as many questions as it answers... but it
suggests the idea may be worthwhile for really intense inner-loop
code.

Regards,

David Steven Tweed

timeForSchedules.ps (250 KB)