LLVM reordering blocks breaks ptxas divergence analysis

I’m having some issues with LLVM reordering basic blocks in a way that results in invalid code for Pascal (and earlier) CUDA GPUs. I’ll demonstrate with a simple example:

function kernel(input::T, output, n) where T
    i = threadIdx().x
    temp = StaticSharedArray(T, 1)

    # thread 1 initializes the shared memory
    if i == 1
        1 <= n || throw_oob()
        temp[1] = input
    end

    sync_threads()

    # other threads read from shared memory
    1 <= n || throw_oob()
    output[i] = temp[1]

    return
end

Note that the code is simplified to the point that it looks a little nonsensical.
I generate the following optimized LLVM IR for this snippet (again, simplified for clarity):

; ModuleID = 'start'
source_filename = "start"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare void @llvm.nvvm.barrier0() #0
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

declare fastcc void @throw_oob()

@shmem = internal unnamed_addr addrspace(3) global i8 0, align 32

define ptx_kernel void @kernel(i8 signext %0, i64 zeroext %1, i64 signext %2) {
conversion:
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %.not = icmp eq i32 %3, 0
  br i1 %.not, label %maybe_initialize, label %cont1

; start of a divergent region

maybe_initialize:
  %4 = icmp slt i64 %2, 1
  br i1 %4, label %oob1, label %actually_initialize

actually_initialize:
  store i8 %0, i8 addrspace(3)* @shmem, align 32
  br label %cont1

oob1:
  call fastcc void @throw_oob()
  unreachable

; end of a divergent region

cont1:
  call void @llvm.nvvm.barrier0()
  %5 = icmp slt i64 %2, 1
  br i1 %5, label %oob2, label %read

read:
  %6 = load i8, i8 addrspace(3)* @shmem, align 32
  %7 = zext i32 %3 to i64
  %8 = inttoptr i64 %1 to i8*
  %9 = getelementptr inbounds i8, i8* %8, i64 %7
  store i8 %6, i8* %9, align 1
  ret void

oob2:
  call fastcc void @throw_oob()
  unreachable
}

attributes #0 = { convergent }

Note how I annotated a region of code where execution diverges, i.e., where not all threads execute the same code.

NVPTX produces the following PTX code:

.visible .entry kernel(
	.param .u8 kernel_param_0,
	.param .u64 kernel_param_1,
	.param .u64 kernel_param_2
)
{
	.reg .pred 	%p<5>;
	.reg .b16 	%rs<3>;
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<5>;
	// demoted variable
	.shared .align 32 .u8 shmem;
// %bb.0:                               // %conversion
	ld.param.u64 	%rd2, [kernel_param_2];
	mov.u32 	%r1, %tid.x;
	setp.ne.s32 	%p1, %r1, 0;
	setp.lt.s64 	%p4, %rd2, 1;
	@%p1 bra 	$L__BB0_3;
// %bb.1:                               // %maybe_initialize
	@%p4 bra 	$L__BB0_6;
// %bb.2:                               // %actually_initialize
	ld.param.s8 	%rs1, [kernel_param_0];
	st.shared.u8 	[shmem], %rs1;
$L__BB0_3:                              // %cont1
	bar.sync 	0;
	@%p4 bra 	$L__BB0_5;
// %bb.4:                               // %read
	ld.param.u64 	%rd1, [kernel_param_1];
	ld.shared.u8 	%rs2, [shmem];
	cvt.u64.u32 	%rd3, %r1;
	add.s64 	%rd4, %rd1, %rd3;
	st.u8 	[%rd4], %rs2;
	ret;
$L__BB0_5:                              // %oob2
	{ // callseq 0, 0
	.reg .b32 temp_param_reg;
	call.uni
	throw_oob,
	(
	);
	} // callseq 0
$L__BB0_6:                              // %oob1
	{ // callseq 1, 0
	.reg .b32 temp_param_reg;
	call.uni
	throw_oob,
	(
	);
	} // callseq 1
                                        // -- End function
}

Here, the noreturn calls to throw_oob have been moved to the end of the function, which results in ptxas not being able to mark the code I annotated above as a single divergent region. Because of that, the barrier0 is executed divergently, which is undefined behavior (and causes miscomputations).

What can I do to avoid this? I’ve seen machine IR being reordered by the branch-folder and block-placement passes; I guess I could try disabling those, but I’m hoping for a more principled solution so that this issue doesn’t come up again in the future when another pass starts reordering IR or machine IR.

Can you share the actual CUDA code (or some representative compilable reproducer)? Does the reordering happen in the backend, or earlier? Your question implies the former but I want to make sure.

The front-end is Julia with CUDA.jl, so I don’t have a CUDA C reproducer right now. The bitcode module I posted should reproduce it, though; I just used llc (from an LLVM 15 build I had available) to generate the PTX code shown below it. -print-after-all shows that branch-fold and block-placement are to blame, and indeed passing --disable-* for both keeps the calls in their original places.

I missed the IR. What happens if you replace the unreachable with ret void? throw_oob is not marked noreturn, so the calls should survive into the PTX. The blocks are still moved, but that is legal anyway; it might help ptxas though.
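
On the module above, that would amount to something like this (a sketch; same for both oob blocks):

oob1:
  call fastcc void @throw_oob()
  ret void    ; was: unreachable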

With ret void, the blocks are not moved to the end of the function, which makes the problem disappear, as expected. “Hiding” the unreachable information like that isn’t a great solution though; it’s what we do right now, but every so often LLVM still recovers an unreachable from some IR operation, and the above pattern resurfaces.

The blocks aren’t moved by any back-end pass if I use ret void. And isn’t the problem that moving blocks like that is illegal, at least in the sense that it may break ptxas’ divergence analysis and turn otherwise valid IR into code that performs undefined operations?

To me, it very much looks like they are moved: Compiler Explorer

For context, this looks like another instance of
https://bugs.llvm.org/show_bug.cgi?id=27738

Julia happened to have the most concise reproducer for the issue:
https://github.com/JuliaGPU/CUDAnative.jl/issues/4

AFAICT we do not have a reliable solution. Tactical fixes for previous occurrences of this problem either implemented proper handling of convergent or disabled certain optimizations that un-structured the CFG enough to trigger the issue. We may need to do something similar here, too.

Yes, I probably should have added some more context. Shared memory + multiple function exits cause invalid results · Issue #1746 · JuliaGPU/CUDA.jl · GitHub is the latest iteration of that issue (on our new repository), as it seems to have resurfaced after one of our LLVM upgrades.

You’re right, I guess I must have been looking at the wrong output. That’s too bad, as it means that even without unreachable these blocks are put in “invalid” places.

What would a proper fix for this look like? Changes in block placement don’t alter the CFG, so AFAIU normal structurization passes won’t help (and enabling the structurizer pass indeed doesn’t do much for this example). I guess divergence analysis should be used to annotate branches or their destinations such that blocks are laid out in a way that minimizes the divergent region, but I’m not sure how best to approach that.

In the meantime, is there a way to disable the branch-folding and block-placement passes for just my target? I know you can do this globally using the -disable-branch-fold etc. command-line options, but I’m targeting NVPTX as part of a JIT session that also generates host code, so I don’t want to disable them globally.

Just a thought: If ptxas does inlining, or even if not, you could outline the divergent region and thereby help ptxas not to screw up.
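
For illustration, an IR-level sketch of that idea on the module above (the function name and the exact split are made up; whether ptxas actually keeps the divergent region confined to the callee would need testing):

define internal fastcc void @init_shmem(i8 %input, i64 %n, i32 %tid) noinline {
top:
  %is_first = icmp eq i32 %tid, 0
  br i1 %is_first, label %check, label %done

check:
  %out_of_bounds = icmp slt i64 %n, 1
  br i1 %out_of_bounds, label %oob, label %initialize

initialize:
  store i8 %input, i8 addrspace(3)* @shmem, align 32
  br label %done

; the unreachable now lives in its own function, so block placement in the
; kernel itself cannot sink this block past the barrier
oob:
  call fastcc void @throw_oob()
  unreachable

done:
  ret void
}

The kernel would then perform a uniform call to @init_shmem right before the barrier.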

That’s a good idea, but sounds very expensive, no? In Julia, we generate this kind of code pattern (unlikely branch to an unreachable block that reports an error) quite often, easily multiple times per function. It may be worth a try though, if only for some insights.

Some more details from NVIDIA: Apparently ptxas does use control-flow analysis to determine divergence regions, so moving basic blocks around shouldn’t be a problem. However, it apparently does not correctly model the trap instruction I was dealing with in my original example:

entry:
    @%p0 bra cont;
    // start of divergent region

    @%p1 bra throw;
    bra.uni cont;

// ptxas incorrectly models `trap`, and thinks this block will fall through
// to `cont`. Because of that, if the block is placed here, it still falls
// within our divergent region, making it possible to emit the SSY/SYNC
// instructions in the expected spots.
throw:
    trap;

cont:
    // end of divergent region

    // do something that requires uniform execution
    bar.sync 0;

    bra.uni exit;

// If we place the `trap` block here instead, its fallthrough successor
// falls outside of our divergent region. This results in a much larger
// divergent region, which can introduce undefined behavior, such as
// the barrier instruction being executed divergently.
throw:
    trap;

exit:
    ret;

With this insight, there’s an obvious workaround that seems to work: Adding a fake branch that jumps back to the continuation block within the divergent region:

throw:
    trap;
    bra.uni cont;

EDIT: I guess this also applies to code that doesn’t trap, but performs a call followed by unreachable. Once that block is moved to the end of the function, ptxas will think it falls through to whatever comes next, outside of the original divergent region, even though execution would never dynamically reach that point. This seems like something the structurizer should handle?

Very cool. Any chance you want to “fix” our PTX backend?

Does anyone know what’s the current state of CFG structurization in LLVM in general these days?

We have the IR-level structurizer StructurizeCFG.cpp, but it appears that it’s not always sufficient. We’ve run into a handful of cases where other transforms broke re-convergence.

Then we have MI-level structurizers in some other back-ends (e.g., AMDGPU’s).

I do not think NVPTX has any MI-level structurizer; we seem to rely on the IR-level structurizer + ptxas, which is not always up to the job. Considering that we have no robust testing for how ptxas handles different CFGs, I’d be very cautious about introducing a full MI-level structurizer, as there’s a good chance we’ll find new ptxas quirks. We may implement a small pass dealing with this specific corner case for now, and longer-term implement a proper MI-level structurizer which we can test on a large code base and eventually phase in.

Yeah, I’ll have a look. I’m currently prototyping a function pass in Julia, over at PTX: Structurize unreachable control flow by maleadt · Pull Request #467 · JuliaGPU/GPUCompiler.jl · GitHub (using the C API, so we’re a bit limited).

Do you think this is best suited as an NVPTX-specific pass, or could it be added to the CFG structurizer? It increasingly looks like a legitimate LLVM issue, where back-ends that don’t have a notion of unreachable end up with a different CFG after block placement/folding. Just to summarize my current understanding (in pseudo-PTX code):

entry:
  // start divergence
  @%p0 bra cont;
  @%p1 bra throw;
  bra.uni cont;
throw:
  call throw()
  //unreachable;
cont:
  // end divergence
  bar.sync 0; // cannot be divergent
exit:
  ret;

becomes

entry:
  // start divergence
  @%p0 bra cont;
  @%p1 bra throw;
  bra.uni cont;
cont:
  bar.sync 0; // oops
throw:
  call throw()
  //unreachable;
exit:
  // end divergence
  ret;

i.e. the barrier is executed divergently because of block layout changes, as PTX doesn’t retain the notion of unreachable and the throw block now falls through to exit.

The disadvantage of adding this to the structurizer (which AFAIU runs earlier than back-end-specific passes would) is that changing e.g. trap/call @noreturn; unreachable into trap/call; br quickly gets “optimized” back to trap/call; unreachable.

Right. I think it is unwise to fight IR canonicalization; instead, we should fix it up late, only for ptxas.

I guess another alternative is to add some notion of “supports unreachable control flow” (or a more general property) to the target machine, and gate specific optimizations on that feature. Though I’m not familiar enough with LLVM’s set of machine IR passes to know which ones would require that, apart from branch-folding and block-placement, which I encountered while debugging this issue.

In any case, I’ll continue prototyping the IR pass I linked above, and if/when I get something that works robustly I’ll have a look at upstreaming it.
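
For reference, on the module from my original post the rewrite boils down to something like this (a sketch of the intent, not necessarily exactly what the linked PR does; it has to run at the very end of the IR pipeline, or the branch just gets folded back into an unreachable):

oob1:
  call fastcc void @throw_oob()
  ; was: unreachable
  br label %cont1    ; never taken at run time, but gives the block an
                     ; explicit successor inside the divergent region

oob2:
  call fastcc void @throw_oob()
  ; was: unreachable
  br label %read     ; likewise, a fake edge back to the continuation code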

Hasn’t really changed. We would like a machine version now that machine uniformity analysis exists. It somewhat depends on pulling in the rest of the convergence tokens work (and really on switching to GlobalISel).

I feel it is unwise to lift a ptxas bug into our target machine / machine IR. All that code should not need to deal with the brokenness of one target toolchain. At the PTX backend level we know about the unreachable terminators (I would assume), and we can patch them up with fake branches without disturbing anything else.