Instructions that cannot be duplicated

Is there a current way to specify that an instruction or function call cannot be duplicated and thus any optimizations that might want to duplicate this instruction would fail?

The problem deals with barrier in OpenCL 1.0. One of the conditions of using barrier is that if a barrier exists inside control flow, every thread in a work-group must execute the barrier instruction (section 6.11.9).

However, this simple CL code:
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void
KMeansMapReduceAtomic(const int num_attributes,
                      const int num_objects,
                      __global int* delta_d)
{
    __local int clusterCount[256];
    __local int sTemp[1]; // amd opencl needed this to be an array

    const unsigned int point_id = get_local_id(0);
    int index = 0;
    int i, addr;
    int xx = get_local_id(0);

    clusterCount[xx] = 0;
    if (get_local_id(0) == 0) {
        sTemp[0] = 0; // sTemp is for prefix sum
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int idWithinCluster = 300; // anything other than zero
    if (point_id < num_objects) {
        idWithinCluster = atom_add(&clusterCount[index], 1);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int numMembers = 2;
    if (idWithinCluster == 0) {
        clusterCount[index] = atom_add(&sTemp[0], numMembers); // This holds the prefix offset
    }

    delta_d[xx] = clusterCount[index];
}

produces a bitcode file that has 3 barriers.

The problem now is that the second if/barrier pair:

    if (point_id < num_objects) {
        idWithinCluster = atom_add(&clusterCount[index], 1);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

is transformed into flow control equivalent to:

    if (point_id >= num_objects) {
        barrier(CLK_LOCAL_MEM_FENCE);
    } else {
        idWithinCluster = atom_add(&clusterCount[index], 1);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

This violates the OpenCL spec and can cause undefined behavior on the underlying hardware, since each barrier is unique.

So we want to disable all optimizations around barrier instructions, but leave them enabled in other cases where no barrier instruction exists. One way to do this would be to mark an instruction as not being copyable, but is there a method of doing this in LLVM?

Also, this barrier does not map to llvm.barrier because llvm.barrier only seems to worry about memory operations and not synchronization between threads.

Thanks for any help,

Micah

If I may ask an unrelated question: can clang in trunk compile
this CL program, with the two OPENCL EXTENSION pragmas and the __kernel keyword?

thanks
shrey

> Is there a current way to specify that an instruction or function call
> cannot be duplicated and thus any optimizations that might want to duplicate
> this instruction would fail?

No. Anything can be duplicated. That could change, but you would
need to make a strong case for why other solutions won't work.

> This violates the OpenCL spec and can cause undefined behavior on the
> underlying hardware, since each barrier is unique.

If you need to maintain uniqueness, there are other ways to do that
without preventing the call from being duplicated. For example, you
could associate each barrier with a unique global by making the
intrinsic take it as a parameter.

-Eli
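A rough sketch of that idea, assuming a hypothetical lowering function and a frontend that numbers each barrier it sees; the names below are invented, and barrier() is used only as a placeholder body so the sketch is self-contained:

void lowered_barrier(int barrier_id, cl_mem_fence_flags flags)
{
    // Placeholder body so the sketch compiles; a real lowering would hand
    // barrier_id to whatever the target uses to tell barriers apart.
    barrier(flags);
}

__kernel void unique_id_example(__global int *out, int n)
{
    int id = (int)get_local_id(0);
    if (id < n)
        out[id] = id;
    // Barrier #0 in this kernel: even if an optimizer clones this call into
    // both arms of a branch, every copy still carries the same identifier.
    lowered_barrier(0, CLK_LOCAL_MEM_FENCE);
    out[id] += 1;
}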

[Villmow, Micah] Well, the problem is that the function in question
cannot be duplicated, because it has side effects such that duplicating
it causes undefined behavior on vector hardware. Also, moving the
instruction inside of flow control when it is originally outside of flow
control produces undefined behavior. There is currently no way to
specify this in LLVM that I know of. We've tried lowering it to an
intrinsic and setting MayWriteMem, and this does not solve the problem.
After looking at the LLVM IR, there is no equivalent method of
representing an instruction that is an execution barrier (not a memory
barrier, which llvm.barrier.[ss|ll|ls|sl] is). If you have any ideas,
we would be willing to give them a try.

On the unique barrier issue, even if the barrier is given a unique
global identifier, it is the function duplication that causes the
problem. A unique global identifier lets us identify that invalid
optimizations have occurred, but it does not guarantee correctness since
the barrier function is unique per function call. So any sort of
duplication is invalid.
Micah


Is the effect similar to pthread_barrier_wait(barrier_for($pc))
[http://linux.die.net/man/3/pthread_barrier_wait] where the
implementation automatically generates the barrier_for() function and
automatically calculates the number of threads to wait for?

If the barrier lowers to any sort of function call, it sounds like
you're currently looking up the PC of the caller and finding the
barrier that way. Instead, you could specify the barrier as an explicit
argument to the function when your frontend generates the call
instruction, which would free you from worrying about whether the call
winds up in multiple places in the optimized IR.

If the barrier lowers to a magic instruction on your chip, and that
instruction doesn't take an ID of any sort besides its address, you
could generate a one-instruction function for each barrier() in the
source language and allow calls to that function to be duplicated.
There may be optimizations that merge "identical" functions, but
they'll be easier to turn off than optimizations that assume they can
rearrange control flow.

If your chip doesn't support function calls, that might constitute the
strong case Eli's asking for.
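A rough sketch of the one-function-per-barrier idea above, assuming a compiler that accepts the GNU-style noinline attribute; the wrapper names are invented and this is only an illustration, not a tested implementation:

// One trivial wrapper per source-level barrier, marked noinline so each
// barrier keeps its own address even if calls to it get duplicated.
__attribute__((noinline)) void barrier_site_0(void) { barrier(CLK_LOCAL_MEM_FENCE); }
__attribute__((noinline)) void barrier_site_1(void) { barrier(CLK_LOCAL_MEM_FENCE); }

__kernel void wrapped_example(__global int *out, __local int *tmp)
{
    int id = (int)get_local_id(0);
    tmp[id] = id;
    barrier_site_0();          // first barrier() in the source
    int base = tmp[0];         // tmp[] was published before the barrier
    barrier_site_1();          // second barrier() in the source
    out[id] = tmp[id] + base;
}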


[Villmow, Micah] Jeffrey, thanks for the information on pthreads. The barrier on our hardware is a single instruction, not a function call; it takes no arguments, and everything is handled implicitly, including keeping track of the number of hardware threads that need to hit the barrier. If one of those hardware threads does not hit the barrier, it causes undefined behavior. Hence we need to guarantee that this instruction is not optimized around, moved, or duplicated, as the algorithm writer must place it following strict guidelines for it to work correctly.

So, my strong case for some sort of workaround is this:

Valid original code is:
flag = false;
if (cond) {
    flag = bar();
}
foo();
if (flag) {
    bar();
}

transforms to:

if (cond) {
    flag = bar();
    foo();
    if (flag) {
        bar();
    }
} else {
    foo();
}

Assumptions:
- foo() is the barrier
- each hardware thread is a 64-wide vector
- two hardware threads are executing
- Each vector element gets a unique id between 0 and 127
- The condition is true if id > 32
- Predication is used on control flow
What happens in the original code:
The first half of the first hardware thread predicates computation on the first condition, the second half executes bar(), and all threads in the second wavefront execute bar(). Both hardware threads hit the barrier, wait for the other hardware thread to reach that point, and then continue execution.

What happens in the optimized code:
The first half of the first hardware thread predicates computation on the first condition, while the second half executes bar() and waits at the barrier. The second hardware thread executes bar() and hits the barrier, forcing continuation of execution. Both the first and second hardware threads execute the rest of the if block. Once the block ends, the predication masks are flipped and the other half of the first hardware thread hits the barrier and blocks waiting for the second hardware thread. The second hardware thread skips execution of the else block, thus never hitting the barrier, so the first hardware thread never returns from the barrier.

We have already found two optimization passes (LoopUnswitch and SimplifyCFG) that perform these types of transforms, and we suspect there are more. We want to keep these transforms enabled for normal code, but have them respect the barrier instruction.

Thanks again,
Micah

... and even if a new barrier intrinsic that does not allow cloning
(not sure how, but anyway...) is introduced, you'll have to modify all
these optimization passes to take note of this special barrier. A new
barrier won't be respected automatically.

Have you thought about dividing the code into separate functions and
making sure the inliner does not inline them?

fn1() /* do not inline */
your barrier()
fn2() /* do not inline */
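In source form, that split might look roughly like this; the helper bodies are invented, and the GNU-style noinline attribute is assumed to be available, so treat it purely as a sketch:

// The work before and after the barrier is outlined into noinline helpers,
// so the barrier itself sits at a single fixed point in the kernel.
__attribute__((noinline)) void fn1(__local int *buf)
{
    buf[get_local_id(0)] = (int)get_local_id(0);
}

__attribute__((noinline)) void fn2(__global int *out, __local int *buf)
{
    out[get_local_id(0)] = buf[0];
}

__kernel void split_example(__global int *out, __local int *buf)
{
    fn1(buf);                        /* do not inline */
    barrier(CLK_LOCAL_MEM_FENCE);    /* the barrier stays in the kernel */
    fn2(out, buf);                   /* do not inline */
}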

IMO Jeff's solution is the cleanest, simplest way to get code that
works. Just generate a separate function for every barrier in the
program, and mark it noinline. This way the instruction pointers will
be unique to the barrier.

Reid

For these particular targets, the hardware may not support function calls. So, at some point, the compiler will need to inline the function. It would need to do something special, since it can't inline at each call site but would instead have to introduce a jump to the barrier and a jump back to the various basic blocks it came from, which could be messy. It would be nicer to avoid the transformations that copy these basic blocks.

-- Mon Ping

No, this gets rather nasty: to support an instruction like this, it
isn't legal to duplicate calls to functions containing a barrier
instruction.

Another proposal: add an executebarrier function attribute for
functions which directly or indirectly contain an execution barrier,
and adjust all the relevant transformation passes, like jump threading
and loop unswitching, to avoid duplicating calls to such functions.
This puts a slight burden on the frontend to mark functions
appropriately, but I don't see any other solution which doesn't affect
code which doesn't use execute barriers.

-Eli
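For concreteness, a frontend-level picture of the proposed executebarrier marking might look like the following; no such attribute exists today, and clang's generic annotate attribute is used here only as a stand-in:

// The frontend would tag every function that directly or indirectly reaches
// an execution barrier; passes like jump threading and loop unswitching
// would then refuse to duplicate calls to tagged functions.
__attribute__((annotate("executebarrier")))
void user_func(__local int *buf)
{
    buf[get_local_id(0)] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);    // execution barrier buried inside
}

__attribute__((annotate("executebarrier")))    // callers get tagged as well
__kernel void caller(__global int *out, __local int *buf)
{
    user_func(buf);
    out[get_local_id(0)] = buf[get_local_id(0)];
}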

Is inlining (which duplicates code) of functions containing OpenCL-style barriers legal?

Or, e.g., if you had some changed phase ordering where you had

if (cond) {
    S1;
}
call user_func()  // user_func has a barrier buried inside it.

you do tail splitting

if (cond) {
    S1;
    call user_func();
} else {
    call user_func();
}

now you inline – oops now you might have a problem

so do you want IPA to propagate the barrier bit to the call sites?

you could do inlining before tail splitting

sounds messy…

Vinod

The requirement in OpenCL is that all threads (work-items) hit the same barrier. If one does what you have shown above, it is not legal, because some threads may go through the block with S1 and other threads will go the other way. On some hardware, such a program will cause a hardware stall. If one is inlining, it is preferable to inline early, assuming the rest of the transformations don't mess with the barrier. Eli is correct that you can't duplicate calls to a function containing these kinds of barriers, for the same reasons. From the discussion so far, it would be nice to have a concept where you don't want to modify the control flow around a basic block containing such an execution barrier, or around a function containing such a barrier. This requires that all phases that do such optimizations be made aware of it. Such a concept may also be useful for other things, like inline assembly, where one may not want to duplicate a block.

– Mon Ping

It's probably worth noting that I wasn't proposing a general
prohibition of duplication; it would be okay for inlining or loop
unrolling to duplicate a call to a function marked executebarrier.
It's not the same sort of prohibition that one might want for inline
assembly.

-Eli

Point taken :->. Inlining of functions containing these barriers is required on some platforms. The only restriction is that any control-flow optimization must preserve the property that all threads will hit the same barrier.

-- Mon Ping

Are the platforms with no function calls the same ones that have
optimization-hostile barrier instructions? If the two sets of
platforms are disjoint, OpenCL implementers can use my or Devang's
noinline-function technique on the optimization-hostile platforms, and
inject a unique argument into the barrier() call in the frontend on
the no-function platforms.

Yes, this is the case. The platforms I'm thinking of don't support function calls and have the optimization-hostile barrier instructions.

   -- Mon Ping

Vinod,

Depends on your reading of the spec. It states that if a work-item goes down a conditional path then all work-items in a work-group must also go down the conditional path. So in my interpretation, the call to user_func() in the true branch produces a different barrier during execution than the call to user_func() in the false branch, even though they both exist on the same line of source.

How about doing two-stage inlining for such platforms?

1. Inline everything except barrier functions
2. Apply usual transformations
3. Inline barrier functions

If the call to the noinline function were duplicated, then when you did
the final inlining into multiple locations, you would have duplicated
the original barrier instruction.

Reid