What opt pass implements this optimization?

I have a very simple kernel for which the compiler is generating very bad code.

The basic kernel pseudo-code is as follows:

forloop(1 to n) {
    forloop(0 to j) {
        A
    }
    B
}
C

For our vector system, it generates very ugly and inefficient code, similar to the following pseudo-code:

if (n > 1) {
    if (j) {
        forloop(1 to n) {
            forloop(0 to j) {
                A
            }
            B
        }
        C
    } else {
        forloop(1 to n) {
            B
        }
        C
    }
} else {
    C
}

I can understand how this would be good on a scalar system like x86, but it is just bad on a vector system.

The reason this is bad is that if a branch is taken by any work-item in a hardware thread (there are 64 work-items per hardware thread), then every work-item in that hardware thread must execute that branch. In this specific example, instead of every thread executing A, B and C once, in the worst case (which also happens 100% of the time here), every thread will execute C three times, B twice and A once. This also does not take into account the cost of managing flow control on the hardware, which is relatively expensive. It gets worse the more flow control I add.
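For reference on the scalar-side trade-off: the generated code hoists the loop-invariant tests on n and j out of the loops and emits specialized copies, so each surviving loop body is branch-free. A minimal C++ sketch of that general pattern (the names flag, a and n below are made up purely for illustration):

// Before: the test on 'flag' is loop-invariant, so it is evaluated
// on every iteration even though its result never changes.
for (int i = 0; i < n; ++i) {
    if (flag)
        a[i] += 1;
    else
        a[i] -= 1;
}

// After unswitching: the branch runs once and each cloned loop body
// is branch-free, which is a win when only one path of control flow executes.
if (flag) {
    for (int i = 0; i < n; ++i)
        a[i] += 1;
} else {
    for (int i = 0; i < n; ++i)
        a[i] -= 1;
}

On a scalar CPU this trades a branch per iteration for a single branch up front; on our 64-wide SIMD hardware, the duplicated paths all still have to be walked whenever work-items disagree, which is exactly the blow-up described above.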

It's possible that there's something going on with loop unswitching here. Do you
have a testcase which demonstrates this?

Dan

Dan,
After I sent out this email, a co-worker pointed me to the LoopUnswitch
optimization as a possible culprit. We set it to zero and it no longer
generates all of this code. I've attached the optimized and unoptimized
.bc files for this specific example.
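In case it helps anyone else hitting this, here is a minimal sketch of leaving the pass out when assembling a loop pipeline by hand, assuming a legacy (2.x/3.x-era) LLVM pass manager and the pass factories declared in llvm/Transforms/Scalar.h. This only illustrates the idea, not the exact change we made in our compiler; with a stock opt, the pass's -loop-unswitch-threshold option is another knob, assuming your version exposes it.

// Sketch only: build the loop passes by hand and gate LoopUnswitch on a
// target preference, so a wide-SIMD target never gets duplicated control flow.
// Assumes legacy pass-manager headers; the other passes listed are the usual
// loop passes, not a claim about our exact pipeline.
#include "llvm/PassManager.h"
#include "llvm/Transforms/Scalar.h"

static void addLoopPasses(llvm::PassManager &PM, bool AllowUnswitch) {
    PM.add(llvm::createLoopRotatePass());
    PM.add(llvm::createLICMPass());
    if (AllowUnswitch)                    // false for a 64-wide SIMD target
        PM.add(llvm::createLoopUnswitchPass());
    PM.add(llvm::createIndVarSimplifyPass());
    PM.add(llvm::createLoopDeletionPass());
}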

Micah

AESEncryptDecrypt-opt.bc (26.1 KB)

AESEncryptDecrypt.bc (7.24 KB)