Recently, while investigating an issue reported for HIP (ROCm-Developer-Tools/HIP issue #2474, "Question: Do warp cross-lane functions work in branching code at all?"), we found a problem with how SimplifyCFG handles cross-lane functions.
SimplifyCFG assumes that if(x) y=f(a) else y=f(b) is equivalent to y=f(x?a:b) and performs this transformation whenever it finds such a pattern. The equivalence generally holds for functions that do not operate across lanes, but it does not hold for cross-lane functions.
On a GPU, multiple threads (lanes) execute in lock-step as a wavefront. Most functions do not depend on values from other lanes; cross-lane functions do. E.g., __any(x) returns true if x is true in any active lane.
Consider if(x) y=f(a) else y=f(b). Let’s assume that before this statement executes, all lanes are active and x has different values in different lanes. Some lanes will execute y=f(a) and the other lanes will execute y=f(b). In either case, f is executed with only part of the lanes active.
If the statement is transformed to ‘y=f(x?a:b)’, f is executed with all lanes active, so its result can differ from the result it produces when executed with only part of the lanes active.
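To make the difference concrete, here is a minimal host-side C++ sketch that models a 4-lane wavefront on the CPU. It is not HIP code and not what the hardware or compiler actually does; the names kLanes, any_active, branched, merged and check_example are hypothetical and exist only for this illustration. any_active stands in for a cross-lane function such as __any, looking only at lanes that are active in the given mask.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tiny wavefront; real wavefronts have 32 or 64 lanes.
constexpr std::size_t kLanes = 4;
using Lanes = std::array<int, kLanes>;   // one value per lane
using Mask  = std::array<bool, kLanes>;  // which lanes are active

// Model of a cross-lane "any": true if v is nonzero in any ACTIVE lane.
bool any_active(const Lanes& v, const Mask& active) {
    for (std::size_t i = 0; i < kLanes; ++i)
        if (active[i] && v[i] != 0) return true;
    return false;
}

// Original form: if (x) y = f(a); else y = f(b);
// Each side of the branch runs f with only its own lanes active.
Lanes branched(const Lanes& x, const Lanes& a, const Lanes& b) {
    Mask then_mask{}, else_mask{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        then_mask[i] = (x[i] != 0);
        else_mask[i] = !then_mask[i];
    }
    const bool then_result = any_active(a, then_mask);
    const bool else_result = any_active(b, else_mask);
    Lanes y{};
    for (std::size_t i = 0; i < kLanes; ++i)
        y[i] = then_mask[i] ? then_result : else_result;
    return y;
}

// Transformed form: y = f(x ? a : b); f runs once with all lanes active.
Lanes merged(const Lanes& x, const Lanes& a, const Lanes& b) {
    Lanes selected{};
    Mask all{};
    all.fill(true);
    for (std::size_t i = 0; i < kLanes; ++i)
        selected[i] = (x[i] != 0) ? a[i] : b[i];
    Lanes y{};
    y.fill(any_active(selected, all));
    return y;
}

// Usage example: lanes 0 and 1 take the "then" side, lanes 2 and 3 the "else" side.
void check_example() {
    const Lanes x{1, 1, 0, 0}, a{1, 1, 0, 0}, b{0, 0, 0, 0};
    assert((branched(x, a, b) == Lanes{1, 1, 0, 0}));  // else-lanes see no true value
    assert((merged(x, a, b)   == Lanes{1, 1, 1, 1}));  // else-lanes now see a's true values
}
```

In this model, lanes 2 and 3 get a different y after the transformation, because the merged call also sees the values contributed by lanes 0 and 1.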
To fix this issue, I suggest introducing a function attribute ‘cross-lane’ to mark cross-lane functions and intrinsics, and preventing SimplifyCFG from merging calls to such functions.
Any comments are welcome. Thanks.