Thank you very much for the quick answer. I now have a better understanding of why it was designed this way.
My real concern with this behavior (and what is costing me performance in my particular use case, which has the same CFG as my toy example) is that it significantly increases the size of the outer loop and prevents it from being fully unrolled when the loop unrolling pass evaluates it immediately afterwards. Even after the transformed loop nest is optimized, it is still too large to be fully unrolled in my case. When no peeling happens and the outer loop can be unrolled, the copies of the inner loop later become fully unrollable as well. The if is also removed completely, thanks to a combination of knowing the cond value (due to inlining) and peeling the first iteration, as showcased above. My actual loop nest mainly performs FP additions and multiplications, and I am targeting AMDGPUs, so fully unrolling the loop nest and removing all control flow is extremely beneficial in my case.
A simple workaround for my problem so far has been to increase the outer loop’s unrolling threshold in my target’s TTI whenever I can identify an inner loop whose trip count would become runtime-independent once the outer loop is fully unrolled. This works fine, but I am wondering whether there is a more systematic way to handle this, perhaps a new flag in TargetTransformInfo::PeelingPreferences that makes the “peel to eliminate compares” logic return a non-zero value only if it is able to fully determine all conditions, as you suggest. What do you think?
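To make the idea a bit more concrete, here is a rough standalone sketch of the behavior I have in mind. It is only a model and not LLVM code: PeelingPrefsModel, CompareModel, desiredPeelCountForCompares and the PeelOnlyIfAllComparesResolved flag are all hypothetical names that I made up for illustration, and the real peeling logic is certainly more involved than this.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for TargetTransformInfo::PeelingPreferences.
// This is NOT the real LLVM struct, just a model for discussion.
struct PeelingPrefsModel {
  bool AllowPeeling = true;
  // Hypothetical flag (does not exist in LLVM): only peel to eliminate
  // compares when every compare in the loop body becomes known.
  bool PeelOnlyIfAllComparesResolved = false;
};

// Models one compare in the loop body. PeelsToResolve is the number of
// iterations that must be peeled before the compare has a known result
// in the remaining loop; a negative value means peeling cannot resolve it.
struct CompareModel {
  int PeelsToResolve;
};

// Returns the peel count the "peel to eliminate compares" logic would ask
// for in this model, or 0 when it should not peel.
unsigned desiredPeelCountForCompares(const std::vector<CompareModel> &Compares,
                                     const PeelingPrefsModel &PP,
                                     unsigned MaxPeelCount) {
  if (!PP.AllowPeeling)
    return 0;

  unsigned Desired = 0;
  bool AllResolved = true;
  for (const CompareModel &C : Compares) {
    // A compare is unresolved if peeling never helps or would need more
    // iterations than the budget allows.
    if (C.PeelsToResolve < 0 ||
        static_cast<unsigned>(C.PeelsToResolve) > MaxPeelCount) {
      AllResolved = false;
      continue;
    }
    Desired = std::max(Desired, static_cast<unsigned>(C.PeelsToResolve));
  }

  // With the hypothetical flag set, partial elimination is not worth the
  // code growth, so report 0 and leave the loop for the unroller.
  if (PP.PeelOnlyIfAllComparesResolved && !AllResolved)
    return 0;
  return Desired;
}
```

With the flag off, a loop with one resolvable and one unresolvable compare would still be peeled in this model; with the flag on, the function returns 0 and the outer loop keeps its original size for the unrolling pass.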