I watched the talk on the design, and it all made sense to me. I can't claim to have a deep knowledge of the requirements of GPU architectures, but I can say that this is basically the kind of stuff we had in mind when the token type was designed. What you are saying about modelling these special GPU operations as accessing inaccessible memory makes sense to me, but again, I am not an expert.
One of the challenges we've faced when trying to create regions for unpacked call sequences is that unreachable-code elimination can often "break" a region by deleting the token consumer. It's not clear whether your proposal suffers from this problem, but it's worth keeping in mind: optimizers can discover that a plain store writes to a null pointer, turn it into unreachable code, and delete the entire rest of the function, leaving you with a half-open region. I can't imagine a plausible series of transforms on a reasonable GPU program ending up in this situation, though.
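To illustrate the failure mode I mean, here is a sketch with a hypothetical pair of region intrinsics (`@llvm.region.start`/`@llvm.region.end` are placeholders for this example, not real LLVM intrinsics):

```llvm
define void @f() {
  %tok = call token @llvm.region.start()
  store i32 0, ptr null    ; optimizer proves this stores to null...
  unreachable              ; ...and replaces the rest with unreachable.
  ; The token consumer that closed the region is now gone:
  ; call void @llvm.region.end(token %tok)
  ; ret void
}
```

Once the consumer is deleted, the region has a start but no end, and any pass that relied on the paired structure is in trouble.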
Thanks for the feedback, Reid. I try not to make assumptions about what counts as a "reasonable" GPU program.
One advantage we have here is that there is no need to explicitly "end" a token region: there is a single token producer and any number of consumers, including zero. Deleting a consumer makes no difference to the semantic guarantees for the code that remains. At least, I've thought about this a fair bit, including some admittedly incomplete proof sketches, and I'm fairly confident in it at this point.
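To make that concrete, here is a small sketch using the proposed intrinsics (the convergent callees `@op_a` and `@op_b` are made up for illustration):

```llvm
%tok = call token @llvm.experimental.convergence.anchor()
; Two independent consumers of the same token:
call void @op_a() [ "convergencectrl"(token %tok) ]
call void @op_b() [ "convergencectrl"(token %tok) ]
; If @op_a turns out to be dead and is deleted, the guarantee for
; @op_b is unchanged: it still executes with the set of threads
; that converged at the anchor. Nothing "closes" the region, so
; there is nothing for dead-code elimination to break.
```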
For what it's worth, `unreachable` and multiple return statements are things we see in GPU code in practice, and simple support of dead-code elimination was an explicit goal.
One interesting thing can happen and needs to be kept in mind when the `anchor` intrinsic is used, because its behavior is largely implementation-defined. If you have:
```llvm
%tok = call token @llvm.experimental.convergence.anchor()
... lots of code with control flow ...
call void @convergent_op() [ "convergencectrl"(token %tok) ]
%tok2 = call token @llvm.experimental.convergence.anchor()
call void @second_op() [ "convergencectrl"(token %tok2) ]
```
Deleting `@convergent_op` could cause a different set of threads to arrive together at the second anchor, and therefore to execute `@second_op` together, in practice. Whether the deletion makes an observable difference depends on the details of the underlying divergence/reconvergence mechanisms.
That's okay, though, because the result is just a different refinement of the original IR semantics. Allowing this kind of freedom (including non-determinism) is exactly the point of the `anchor` intrinsic; if the programmer or frontend doesn't want it, some alternative structure (typically based on `entry`) has to be used instead.
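For completeness, here is a sketch of what the `entry`-based alternative looks like (the convergent callees are again made up). The `entry` token pins the convergent calls to the set of threads that entered the function together, so deleting one call cannot change which threads execute the other:

```llvm
define void @kernel() convergent {
  ; The entry intrinsic must appear at the start of the function.
  %tok = call token @llvm.experimental.convergence.entry()
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  call void @second_op() [ "convergencectrl"(token %tok) ]
  ret void
}
```

With this structure there is no anchor-style freedom to refine: both calls are tied to the threads that converged at function entry.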