Question about optimizing mem in loop

Is there a strong reason why this simple code:

for (rnd = 0; rnd < Nrnd - 1; ++rnd) {
    // round(inv_rnd, b1, b0, kp);

    for (iter = 0; iter < 4; ++iter) {
        round_i(inv_rnd, b1, b0, kp, iter);
    }
    l_copy(b0, b1); kp -= nc;
}

produces the complicated control flow in the attached CFG?

If I unroll the inner loop by hand, the crazy control flow goes away. It seems that instead of calculating the GEPs one at a time inside the loop, the compiler is pulling all 4 out of the inner loop into the outer loop header and then branching inside the inner loop depending on which iteration it is in. I can't think of a good reason to do it this way, but I'm sure there is one, so I was hoping someone could explain why this is occurring (instead of simply looping over the round and calculating each GEP in each iteration from the index).
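To make the two shapes concrete, here is a rough source-level sketch of what I mean. Everything in it is made up for illustration: `mix_word`, `step_indexed`, `step_hoisted`, the 4-word block and the arithmetic are hypothetical stand-ins, not the real round_i macro and not the compiler's actual output.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the work done on one column. */
static uint32_t mix_word(uint32_t x, uint32_t k)
{
    return (x ^ k) * 2654435761u;
}

/* What I expected: one address computation per iteration, from the index. */
static void step_indexed(uint32_t *out, const uint32_t *in, const uint32_t *key)
{
    for (int iter = 0; iter < 4; ++iter)
        out[iter] = mix_word(in[iter], key[iter]);
}

/* Roughly the shape I see in the CFG: all four addresses are computed
 * before the loop, and the loop body branches on the iteration number
 * to pick one of them. */
static void step_hoisted(uint32_t *out, const uint32_t *in, const uint32_t *key)
{
    uint32_t *o0 = &out[0], *o1 = &out[1], *o2 = &out[2], *o3 = &out[3];

    for (int iter = 0; iter < 4; ++iter) {
        uint32_t v = mix_word(in[iter], key[iter]);
        switch (iter) {
        case 0:  *o0 = v; break;
        case 1:  *o1 = v; break;
        case 2:  *o2 = v; break;
        default: assert(iter == 3); *o3 = v; break;
        }
    }
}
```

Both functions compute the same thing; the second just carries the extra switch, which is the part I don't understand the benefit of.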

Second, what's the best way to convince the compiler not to generate this code/logic bloat? It's unclear to me how it saves any cycles.
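For reference, these are the two workarounds I know of, as a minimal sketch. `col_step`, `round_manual` and `round_hinted` are hypothetical names and the body is a placeholder, not the real round_i. The `UNROLL_FULL` macro is my own wrapper around clang's `#pragma clang loop unroll(full)`; it is only a hint, and I have not confirmed that it removes the switch in my real code.

```c
#include <assert.h>
#include <stdint.h>

/* Only clang understands this pragma, so make it a no-op elsewhere. */
#if defined(__clang__)
#define UNROLL_FULL _Pragma("clang loop unroll(full)")
#else
#define UNROLL_FULL
#endif

/* Hypothetical per-column work; c is the column index (0..3). */
static void col_step(uint32_t *out, const uint32_t *in,
                     const uint32_t *key, int c)
{
    assert(c >= 0 && c < 4);
    out[c] = in[c] ^ in[(c + 1) & 3] ^ key[c];
}

/* Workaround 1: unroll by hand so every index is a literal constant. */
static void round_manual(uint32_t *out, const uint32_t *in, const uint32_t *key)
{
    col_step(out, in, key, 0);
    col_step(out, in, key, 1);
    col_step(out, in, key, 2);
    col_step(out, in, key, 3);
}

/* Workaround 2: keep the loop in the source and ask for full unrolling. */
static void round_hinted(uint32_t *out, const uint32_t *in, const uint32_t *key)
{
    UNROLL_FULL
    for (int c = 0; c < 4; ++c)
        col_step(out, in, key, c);
}
```

I'd prefer something like the second form, since it keeps the source as a loop, but I don't know whether relying on the hint is the recommended approach.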

If you look at the O2 CFG (the O3 one is the same), it creates a switch whose every case ends up at the same BB, whose predecessor branches both to itself and to the BB the switch points to, and the switch BB contains nothing but the switch itself? The other CFG is from running clang with default options (i.e. clang source_file), and all that does is lower the switch (the same problem still exists).

Seems confusing to me?

P.S. The round itself has no control flow logic in it, and when the loop is unrolled there is no control flow at all (unconditional branching, etc.). The logic gets even more convoluted when simplifycfg and other opts are applied.

To clarify, I understand "what" is going on, but I'd like to know why. Why is it seemingly pre-computing the GEPs and then creating the control flow, rather than computing them per iteration? It seems like a lot of code bloat for the exact same core operations.