[CUDA/NVPTX] is inlining __syncthreads allowed?

Hi Justin,

Is a compiler allowed to inline a function that calls __syncthreads()? I saw nvcc do that, but I’m not sure it’s valid. For example,

__device__ void foo() {
  __syncthreads();
}

// ... inside a __global__ kernel:
if (threadIdx.x % 2 == 0) {
  foo();
} else {
  foo();
}

Before inlining, all threads meet at one __syncthreads(). After inlining:

if (threadIdx.x % 2 == 0) {
  __syncthreads();
} else {
  __syncthreads();
}

The __syncthreads() call is duplicated, and it’s no longer guaranteed that all threads meet at the same __syncthreads().
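For contrast, here is a sketch of my own (not the attached example) of the shape that is clearly fine: the branches rejoin first, and the barrier sits at their common post-dominator, so every thread reaches the same call site.

__global__ void safe() { // hypothetical kernel
  if (threadIdx.x % 2 == 0) {
    // even-thread work
  } else {
    // odd-thread work
  }
  __syncthreads(); // a single call site, reached by all threads after the branches rejoin
}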

Any thoughts?

Jingyue

That’s an interesting case. AFAIK, inlining should be restricted here for the reason you mention: it is only valid if it doesn’t duplicate the barrier. Which nvcc version shows this behavior?

Adding Vinod and Yuan for comment.

Perhaps it is semantics-preserving so long as the __syncthreads callsite is marked noduplicate?

https://github.com/llvm-mirror/llvm/blob/896f064a4900458e3fb245ad3f6fc9e7a3d8c8cd/lib/Analysis/InlineCost.cpp#L1284

I’m using CUDA 7.0. I’m attaching the reduced example.

nvcc sync.cu -arch=sm_35 -ptx

gives

// .globl _Z3foov
.visible .entry _Z3foov()
{
    .reg .pred %p<2>;
    .reg .s32  %r<3>;

    mov.u32     %r1, %tid.x;
    and.b32     %r2, %r1, 1;
    setp.eq.b32 %p1, %r2, 1;
    @!%p1 bra   BB7_2;
    bra.uni     BB7_1;

BB7_1:
    bar.sync 0;
    bra.uni  BB7_3;

BB7_2:
    bar.sync 0;

BB7_3:
    ret;
}

As you can see, bar.sync is duplicated.

sync.cu (273 Bytes)

Hi David,

Just to make sure we are on the same page: we are talking about nvcc’s behavior here. As far as I can tell, LLVM does the right thing (i.e., it does not duplicate the barrier) on this example. I have no idea how nvcc is implemented.

Looking at this section in the PTX ISA, there’s a sentence saying:

In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge).

Does that mean __syncthreads should only be called uniformly, when no threads diverge? If so, my sync.cu example is undefined. The reason is that, although every thread reaches a __syncthreads(), they reach them divergently:

  1. threads diverge at the “if” statement
  2. the warp runs __syncthreads() with half of the threads enabled
  3. the warp jumps to the “else” branch
  4. the warp runs __syncthreads() with the other half of the threads enabled

If my understanding is correct (__syncthreads() can only be called when the warp doesn’t diverge), then unrolling a loop that contains a __syncthreads() and inlining a function that may call __syncthreads() are both fine. Am I right?
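To illustrate the unrolling half of that claim, here is a sketch (the kernel and its names are made up): as long as the trip count is uniform across the block, every copy of the barrier produced by unrolling is still executed by all threads together.

__global__ void step(float *buf, int n) { // hypothetical kernel
  // n is a kernel parameter, so it is uniform across the block:
  // every thread executes exactly the same iterations.
  for (int i = 0; i < n; ++i) {
    buf[threadIdx.x] += 1.0f;
    __syncthreads(); // each unrolled copy is still reached by all threads together
  }
}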

Jingyue

Hi Justin, Yuan and Vinod,

It seems that what __syncthreads() requires in CUDA C++ (as opposed to bar.sync in PTX) is to be executed uniformly across all threads in the block, not just the warp. If so, it would be helpful to have a precise statement of when a statement is considered to be executed uniformly in CUDA C++. Is there such a statement somewhere from NVIDIA? I haven’t found one so far.
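To make the distinction concrete, here is a sketch of my own (names are made up): a condition that is uniform within every warp but not across the block. The PTX wording quoted above seems to permit this, while a block-level uniformity requirement would forbid it.

__global__ void per_warp() { // hypothetical kernel
  // threadIdx.x / warpSize is constant within a warp, so no warp diverges,
  // but different warps take different sides of the branch.
  if (threadIdx.x / warpSize == 0) {
    __syncthreads(); // uniform within each warp, not uniform across the block
  } else {
    __syncthreads();
  }
}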

In particular, it’s not clear to me at what point diverging threads are considered to have joined up again in CUDA C++. My best guess is that this is at the immediate post-dominator of the statement that starts the divergence, with the caveat that there is an implicit shared CFG node following each return statement in a function.
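To make the return-statement caveat concrete, here is a sketch (my own example, not from any NVIDIA document). The divergence that starts at the “if” only rejoins at the function’s implicit exit node, so under the post-dominator reading the barrier below executes while the threads are still diverged:

__device__ void f(bool p) { // hypothetical helper; p may differ per thread
  if (p)
    return;        // threads with p == true branch to the implicit exit node
  __syncthreads(); // the remaining threads hit the barrier before the
                   // immediate post-dominator (the exit node), i.e. divergently
}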

Bjarke