Loop unrolling analysis next steps

So, I’ve spent some considerable time cleaning up and looking at the loop unrolling analysis that you started Michael, but I don’t have time to look at it further so I’m writing down the current state and what I think should happen next.

Currently, in tree, we have a reasonably clean framework now that visits each instruction for each iteration and tries to simplify it. Currently, the only meaningful simplification it does is that of folding loaded constants. There is a patch out by Michael that tries to also do DCE, but I think it is fundamentally the wrong approach.

The primary problem is that the current system is trying to use an optimization technique that relies on forward simulation of each iteration, but then applying backwards analyses to fold specific instructions through complex transforms. This fundamentally doesn’t scale and I suspect is the reason for the caching layers that were added on top of SCEV. I suspect those caching layers are more than just a minor code smell – they point to the flawed design.

I think the correct design needs to build a reasonably fast way of taking an instruction for which we have a SCEV model and constant folding it for a specific iteration of the loop. Then the visitor needs to be cast in terms of this: it should first try basic constant folding of operands, and then fall back on querying for a SCEV model and constant folding that if it has one for a particular iteration. The result should fold the GEPs that are currently analyzed ahead of time into constantexpr’s. Finally, the load folding needs to be expressed in terms of folding the load through a constantexpr GEP into a constant array. If the existing constant folding isn’t enough for this, I’d be rather surprised.

Without restructuring in this way, we’ll always be patching things up. For example, I added logic to skip dead control flows when analyzing the loop for unrolling and it is completely ineffective because we don’t analyze ‘icmp’ instructions against the induction variable. For cases where the dead control flow is predicated on the GEPs themselves (for example, any STL-style algorithm using pointers as iterators!) we can’t get an accurate analysis without applying the GEP simplification independently of the load simplification.

Anyways, I’m not a SCEV expert, and so I’m not the right person to re-cast the current visitor to directly build on top of a SCEV model and do substitution of the current iteration. But I think that is the design I’d like to see before any other work goes into the code here. Happy to review when there is a patch to this effect.

-Chandler

Given how far you pushed this analysis pass I’m surprised by some of your concerns. Please see below.

So, I've spent some considerable time cleaning up and looking at the loop unrolling analysis that you started Michael, but I don't have time to look at it further so I'm writing down the current state and what I think should happen next.

Currently, in tree, we have a reasonably clean framework now that visits each instruction for each iteration and tries to simplify it. Currently, the *only* meaningful simplification it does is that of folding loaded constants. There is a patch out by Michael that tries to also do DCE, but I think it is fundamentally the wrong approach.

DCE should not be thought as part of the framework. And it has to be a backward pass to account for (iteration-specific) dead instructions. Take for example: a = <some>; b = a * c[i]; When c[i] turns out to be zero zero, a becomes dead (assuming only one use of a).

The primary problem is that the current system is trying to use an optimization technique that relies on *forward simulation* of each iteration, but then applying backwards analyses to fold specific instructions through complex transforms. This fundamentally doesn't scale and I suspect is the reason for the caching layers that were added on top of SCEV. I suspect those caching layers are more than just a minor code smell -- they point to the flawed design.

I don’t see the scalability issue with DCE.

Why can’t the current design integrate GEP naturally w/o the need for the SCEV cache? I think the code could do away with the extra cache. In that case the cache is a current implementation artifact rather than a flawed design signal.

I think the correct design needs to build a reasonably fast way of taking an instruction for which we have a SCEV model and constant folding it for a specific iteration of the loop. Then the visitor needs to be cast in terms of this: it should first try basic constant folding of operands, and then fall back on querying for a SCEV model and constant folding that if it has one for a particular iteration. The result should fold the GEPs that are currently analyzed ahead of time into constantexpr's. Finally, the load folding needs to be expressed in terms of folding the load through a constantexpr GEP into a constant array. If the existing constant folding isn't enough for this, I'd be rather surprised.

Without restructuring in this way, we'll always be patching things up. For example, I added logic to skip dead control flows when analyzing the loop for unrolling and it is completely ineffective because we don't analyze 'icmp' instructions against the induction variable. For cases where the dead control flow is predicated on the GEPs themselves (for example, any STL-style algorithm using pointers as iterators!) we can't get an accurate analysis without applying the GEP simplification independently of the load simplification.

Could you share your test case? I would be surprised if the current design couldn’t handle it by adding a visitICmp, but, yes, it requires additional logic to handle dead conditional control flow.

So, I've spent some considerable time cleaning up and looking at the loop unrolling analysis that you started Michael, but I don't have time to look at it further so I'm writing down the current state and what I think should happen next.

Currently, in tree, we have a reasonably clean framework now that visits each instruction for each iteration and tries to simplify it. Currently, the *only* meaningful simplification it does is that of folding loaded constants. There is a patch out by Michael that tries to also do DCE, but I think it is fundamentally the wrong approach.

You’re right that it’s the wrong approach, but that doesn’t mean it is the wrong way to go.

The right approach for this (and many other transformations) is to actually do the transformation, do a suite of simplifications, then see if it was worth it. There are many compilers that take that approach (because that is the only and best way to get an accurate cost model for a transformation like this) but it is extremely expensive in compile time.

The primary problem is that the current system is trying to use an optimization technique that relies on *forward simulation* of each iteration, but then applying backwards analyses to fold specific instructions through complex transforms. This fundamentally doesn't scale and I suspect is the reason for the caching layers that were added on top of SCEV. I suspect those caching layers are more than just a minor code smell -- they point to the flawed design.

As the guy who added the caching scheme to SCEV, I can assure you that it had nothing to do with loop unrolling. I’m not arguing that caching ended up being a great thing in the end, just that it doesn’t have anything to do with unrolling.

Anyways, I'm not a SCEV expert, and so I'm not the right person to re-cast the current visitor to directly build on top of a SCEV model and do substitution of the current iteration. But I think that is the design I'd like to see before any other work goes into the code here. Happy to review when there is a patch to this effect.

The problem here is that SCEV isn’t enough. SCEV doesn’t handle things like DCE, doesn’t handle store->load forwarding, and doesn’t handle calls. SCEV is only really appropriate for simple chains of integer operations, and real loops get more complicated than that. For example, one of the major benefits of unrolling:

for (i) {
  a[i] = a[i-1]+1
}

is that you get trivial store->load forwarding without depending on our crazy memdep “PRE” stuff. Unless we add something like MemorySSA and teach SCEV about it, I don’t see how this can be painted as a SCEV problem.

-Chris