Proposal: pragma for branch divergence

Jingyue_Wu1 · January 24, 2015, 6:29am

Hi,

I am considering a language extension to Clang for optimizing GPU programs. This extension will allow the compiler to use different optimization strategies for divergent and non-divergent branches (to be explained below). We have observed significant performance gain by leveraging this proposed extension, so I want to discuss it here to see how the community likes/dislikes the idea. I will focus on the CUDA language and the PTX backend for now, but I believe this addition will benefit OpenCL and its backends too.

Background on branch divergence

CUDA programs have a very different execution model for code with branches. A CUDA program is executed by an array of threads broken into groups called warps. A warp typically contains 32 threads, and all the threads in a warp execute instructions in lock-step, i.e., executing the same instruction at any given time. Therefore, if the code contains divergent branches (i.e., threads in a warp do not agree on which path of the branch to take), the warp has to execute all the paths from that branch with different subsets of threads enabled until they converge at a post-dominating BB of the paths. For example,

// threadIdx.x returns the index of a thread in the warp

if (threadIdx.x == 0) {

foo();

} else {

bar();

}

The warp that contains thread 0-31 needs to execute foo() with only thread 0 enabled and then bar() with the other 31 threads enabled. Therefore, the run time of the above code will be the run time of foo() + the run time of bar().

More details about branch divergence can be found in the CUDA C programming guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture

How branch divergence affects compiler optimizations

Due to CUDA’s different execution model, some optimizations in LLVM, such as jump threading, can be unfortunately harmful.

The above figure illustrates jump threading. In the original CFG (on the left), the first condition “if foo == bar” implies the second condition “if foo <= bar”. Therefore, jump threading redirects BB1 directly to BB2 so that the transformed code needn’t compute the second condition when the first condition is true.

One important complication here is that BB1 does not directly point to the second condition. Instead, the code needs to call baz() before computing “if foo <= bar”. Therefore, jump threading has to duplicate the function call to baz() to match the semantics that the code runs baz() regardless of “if foo == bar”.

For CPU programs, jump threading likely increases execution speed, because it makes some paths shorter. However, for CUDA programs, jump threading on divergent branches is almost certainly a bad idea. Suppose both conditions in the above example are divergent within a warp. To synchronize execution of all threads in a warp, the warp has to sequentially execute all basic blocks in the jump-threaded CFG. With baz() duplicated in the jump-threaded CFG, the warp needs to execute more code than for the original CFG. We have observed that jump threading incurs ~50% slowdown for some benchmarks.

Note that jump threading is not the only optimization that can hurt the performance of CUDA programs due to branch divergence. Loop unswitching on divergent branches can also hurt performance because it may duplicate code too.

Annotations for branch divergence

Ideally, we want the compiler to automatically figure out which branches are divergent or not. However, doing that precisely is extremely hard and can be expensive. Therefore, I am proposing a compromise to have programmers provide some optimization hints.

The annotation in my mind is in the format of “#pragma clang branch non_divergence”. Programmers can add this annotation right before a control statement (such as if, for, and while), indicating the branch derived from the control statement is not divergent.

For example,

#pragma clang branch non_divergence

if (a > 0) {

…

}

indicates the condition (a > 0) is uniform across all threads in a warp.

The optimizer can then enable certain optimizations such as jump threading and loop unswitching only on non-divergent branches. In longer term, the optimizer can even adopt some cheap data-flow analysis to conservatively compute whether a branch is non-divergent. For example, if a condition is not derived from blockIdx or threadIdx, it is guaranteed to hold the same value for all threads in a warp.

How the compiler can leverage these annotations

Similar to the annotations for loop optimizations (http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations), clang can attach metadata to the branch instructions following “#pragma clang non_divergent”. For example, the source code snippet in the previous section will be translated to:

%cond = icmp sgt i32 %a, 0

br i1 %cond, label %then, label %else, !llvm.branch !0
…
!0 = !{!0, !1}
!1 = !{!“llvm.branch.non_divergent”}

The llvm.branch metadata indicates %cond computes the same value for all threads in a warp.

This metadata can be leveraged by the IR optimizer and the NVPTX backend for better optimization. Besides the opportunities of tuning certain IR optimizations aforementioned, I also noticed the NVPTX backend could emit more efficient PTX instructions (such as bra.uni and ret.uni) for non-divergent branches.

Thanks for reading! Any feedbacks are welcomed.

Jingyue

resistor · January 24, 2015, 8:09am

Hi Jingyue,

Have you considered using dynamic uniformity checks? In my experience you can obtain most of the benefit you describe without the need for static information simply by inserting branch-if-none instructions that jump over the bodies of conditional regions.

This technique is described under Runtime Branch Uniformity Optimization in this paper, though I’m pretty confident it had been in use much longer than that: http://www.eecs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf

-Owen

V_Grover · January 24, 2015, 5:06pm

In our experience, as Owen also suggests, a pragma or a language extension can be avoided by a combination of static and dynamic analysis. We prefer this approach in our compiler

Regards,
Vinod

resistor · January 24, 2015, 5:45pm

Additionally, it is worth pointing out that it is possible for the compiler to improve the effectiveness of dynamic uniformity checks by enforcing greater “structure”, generally at the cost of code duplication. Unfortunately, I’m not aware of any published descriptions of how to do this.

-Owen

Jingyue_Wu1 · January 25, 2015, 8:31pm

Hi Owen and Vinod,

Thanks for sharing the paper! I like the idea a lot. Regarding the paper itself, Vinod, are the consensual branches (e.g., cbranch.ifnone) you mentioned in the paper publicly available in PTX ISA?

Owen, could you explain more on the approach of using branch-if-none instructions in your mind? I believe you have lots of great insights, but I don’t see how cbranch.ifnone instructions directly solve my issue. The issue I am trying to solve is that certain CFG optimizations transform the CFG into a “bad” structure which hurts the performance of the compiled code in the presence of divergent branches. On the other hand, I don’t want to disable jump threading all together because it is still beneficial for non-divergent branches. As far as I can understand, cbranch.ifnone provides a fast path so that a warp can jump over the region that no threads in the warp ever execute. However, it doesn’t help the case where the branches are indeed divergent.

I can vaguely imagine consensual branches may help with my issue by speculative optimization. Given a code region that contains branches that may or may not be divergent, the compiler first emit two versions of it: the original version and the version with jump threading performed. Then, the compiler uses a set of consensual branches as a runtime switch that leads the execution to the jump-threaded version only when none of the branches in the original code region are divergent, i.e.,

if (branches in the code region are divergent) {

the original code region
} else {
the jump-threaded code region
}

Jingyue

resistor · January 26, 2015, 3:53am

I don’t have any specific advice to offer you about JumpThreading. My experience has been that it generally not worth using for GPU targets. I suspect that an enhanced model that allowed the target to assign a cost-per-duplication of instructions might make it more profitable.

—Owen

resistor · January 26, 2015, 3:55am

Also, if you’re interested in optimizations that reduce branch divergence, you might take a look at the following papers. I have no hands-on experience with them, but they seem like they could be useful, particularly to more general compute applications:

http://www.eecg.toronto.edu/~han/papers/brdiv-gpgpu11.pdf
http://www.eecg.toronto.edu/~han/papers/lm-gpgpu13.pdf

—Owen

Topic		Replies	Views
Parallelism TS implementation and feasibility of GPU execution policies Clang Frontend	7	90	April 10, 2015
LLVM reordering blocks breaks ptxas divergence analysis IR & Optimizations cuda , ptx , gpu	31	761	June 13, 2023
Questions about jump threading optimization and what we can do LLVM Dev List Archives	15	82	February 4, 2020
[libc++] Working on the parallel STL algorithms Clang Frontend	10	121	May 26, 2017
nested parallelism in libomptarget-nvptx OpenMP	17	213	September 10, 2018

Proposal: pragma for branch divergence

Related topics