LSR

Hi,

I find that LSR is not helping enough on avoiding unfoldable offsets for SystemZ. When the loop has three stores with unfoldable offsets, LSR rewrites the IV in a good way. However, if adding another store with a foldable offset that fits already, LSR fails to rewrite the three stores.

And if I happen to add a too big *positive* offset (the first three were negative) instead of a foldable one, only the positive gets transformed.

* LSR is not rewriting the IV to have three foldable offsets rather than one.

* It would actually be preferred in this case to use a second address register for the offset that is too far away from the others.

Has anyone any idea on how to best handle this? Can LSR "split" an IV to use an extra register? Or would this need to be done in a target specific pass?

For a reduced test case for this problem, see https://bugs.llvm.org//show_bug.cgi?id=32548.

Thanks,

Jonas

Hi,

I find that LSR is not helping enough on avoiding unfoldable offsets for SystemZ. When the loop has three stores with unfoldable offsets, LSR rewrites the IV in a good way. However, if adding another store with a foldable offset that fits already, LSR fails to rewrite the three stores.

And if I happen to add a too big *positive* offset (the first three were negative) instead of a foldable one, only the positive gets transformed.

* LSR is not rewriting the IV to have three foldable offsets rather than one.

* It would actually be preferred in this case to use a second address register for the offset that is too far away from the others.

Has anyone any idea on how to best handle this? Can LSR "split" an IV to use an extra register? Or would this need to be done in a target specific pass?

When you say "an extra address register" would this imply LSR adding an additional PHI?

  -Hal

Has anyone any idea on how to best handle this? Can LSR "split" an IV to use an extra register? Or would this need to be done in a target specific pass?

When you say "an extra address register" would this imply LSR adding an additional PHI?

-Hal

Yes, that would have worked well at least in this type of loop. Can LSR do this?

I experimented with adding a check for 12 bit offsets distance in isProfitableIncrement() (checking against all members of the chain), which resulted in several chains being produced by LSR, instead of just one. The chains that now formed now had immediate offsets that were close to each other, so that they should result in addresses with 12 bit offsets. But, to my disappointment, LSR did not handle these different chains by generating new PHI-nodes for each of them (or by skipping those that ended up with just one store in the chain), but instead it still output the stores in the same way as before.

I also see that LSR is thinking in terms of increments between the memory accesses. In the loop I am working with it's disappointing to see that before each memory access, the base address is loaded into register, and then the offset is added, and then the access, which is 3 instructions. It should have been just an add/sub after the previous access before the memory access, per LSRs intentions. I wonder where this is supposed to be handled: In some sort of target pre-isel pass that chains the GEPs? Or is this just folded more often on other targets?

/Jonas

Has anyone any idea on how to best handle this? Can LSR "split" an IV to use an extra register? Or would this need to be done in a target specific pass?

When you say "an extra address register" would this imply LSR adding an additional PHI?

-Hal

Yes, that would have worked well at least in this type of loop. Can LSR do this?

No, LSR won't add new PHIs. This is a long-standing deficiency (and also, in part, prevents it from properly handling pre/post-inc addressing modes, which motives some target-specific passes such as lib/Target/PowerPC/PPCLoopPreIncPrep.cpp).

I experimented with adding a check for 12 bit offsets distance in isProfitableIncrement() (checking against all members of the chain), which resulted in several chains being produced by LSR, instead of just one. The chains that now formed now had immediate offsets that were close to each other, so that they should result in addresses with 12 bit offsets. But, to my disappointment, LSR did not handle these different chains by generating new PHI-nodes for each of them (or by skipping those that ended up with just one store in the chain), but instead it still output the stores in the same way as before.

I also see that LSR is thinking in terms of increments between the memory accesses. In the loop I am working with it's disappointing to see that before each memory access, the base address is loaded into register, and then the offset is added, and then the access, which is 3 instructions. It should have been just an add/sub after the previous access before the memory access, per LSRs intentions. I wonder where this is supposed to be handled: In some sort of target pre-isel pass that chains the GEPs? Or is this just folded more often on other targets?

As I recall, it does not do this now (although this is also needed for handling pre/post-inc addressing modes properly).

  -Hal

Hi Hal,

No, LSR won't add new PHIs. This is a long-standing deficiency (and also, in part, prevents it from properly handling pre/post-inc addressing modes, which motives some target-specific passes such as lib/Target/PowerPC/PPCLoopPreIncPrep.cpp).

I do find in this regression that gcc manages to use four address registers, while llvm uses just one, with a lot of extra address building instructions as a result, where the gcc loop looks very clean. I suspect that this could be a needed feature. Would you recommend doing this (splitting PHIs) as a "LSRPrep" pass, perhaps in the target, or would you try to extend LSR itself?

I also see that LSR is thinking in terms of increments between the memory accesses. In the loop I am working with it's disappointing to see that before each memory access, the base address is loaded into register, and then the offset is added, and then the access, which is 3 instructions. It should have been just an add/sub after the previous access before the memory access, per LSRs intentions. I wonder where this is supposed to be handled: In some sort of target pre-isel pass that chains the GEPs? Or is this just folded more often on other targets?

As I recall, it does not do this now (although this is also needed for handling pre/post-inc addressing modes properly).

Same question - It might be simpler to do a separate post-LSR GEP handling (in CodeGenPrepare, perhaps?), but I suspect it would also be possible to extend LSR to do this instead?

/Jonas

Hi Hal,

No, LSR won't add new PHIs. This is a long-standing deficiency (and also, in part, prevents it from properly handling pre/post-inc addressing modes, which motives some target-specific passes such as lib/Target/PowerPC/PPCLoopPreIncPrep.cpp).

I do find in this regression that gcc manages to use four address registers, while llvm uses just one, with a lot of extra address building instructions as a result, where the gcc loop looks very clean. I suspect that this could be a needed feature. Would you recommend doing this (splitting PHIs) as a "LSRPrep" pass, perhaps in the target, or would you try to extend LSR itself?

I recall thinking that the "right" way to do this is to extend LSR itself. This way the cost of adding extra PHIs could be weighed with addressing modes, register pressure, and other factors. I have not, however, looked enough into the details to say exactly how this would work - if I had, I probably would have done it myself :wink: -- cc'ing Andy in case he'd like to chime in.

I also see that LSR is thinking in terms of increments between the memory accesses. In the loop I am working with it's disappointing to see that before each memory access, the base address is loaded into register, and then the offset is added, and then the access, which is 3 instructions. It should have been just an add/sub after the previous access before the memory access, per LSRs intentions. I wonder where this is supposed to be handled: In some sort of target pre-isel pass that chains the GEPs? Or is this just folded more often on other targets?

As I recall, it does not do this now (although this is also needed for handling pre/post-inc addressing modes properly).

Same question - It might be simpler to do a separate post-LSR GEP handling (in CodeGenPrepare, perhaps?), but I suspect it would also be possible to extend LSR to do this instead?

Splitting is what, in practice, we have now. Targets have "fixup/prep" passes to account for the fact that LSR won't add new PHIs. This seems simpler, but it is probably also suboptimal (i.e. it works reasonably for targets with simpler addressing modes, like PPC, although there are still issues, but I don't see that it would work well for targets with complicated ones).

  -Hal