Loop Distribution pass

Hi,

With the help of the optimization remarks, I found a loop that could not be vectorized, but that might be if loop distribution were enabled. Enabling it did in fact vectorize the loop, with a very significant benchmark improvement (~25%).
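(For readers without the remarks output: the shape in question is roughly the textbook case sketched below. This is an illustration, not the actual benchmark loop.)

```cpp
#include <cstddef>

// Fused form: S1 has a loop-carried dependence (a[i+1] reads a[i] from the
// previous iteration), so the vectorizer gives up on the whole loop.
void fused(float *a, const float *b, float *c, const float *d, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    a[i + 1] = a[i] + b[i]; // S1: not vectorizable
    c[i] = c[i] + d[i];     // S2: vectorizable on its own
  }
}

// What loop distribution produces: S1 and S2 in separate loops (legal here
// because there are no dependences between the two statements). The second
// loop now vectorizes.
void distributed(float *a, const float *b, float *c, const float *d,
                 std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    a[i + 1] = a[i] + b[i];
  for (std::size_t i = 0; i < n; ++i)
    c[i] = c[i] + d[i];
}
```

Both forms compute the same values; only the distributed form exposes a vectorizable loop.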

I tried enabling this pass (on SystemZ) and found that it only affected a handful of files on SPEC. This means I could enable it without worrying about regressions, at least on SystemZ currently.

I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results; is that right? Is it triggering much more on other targets, and if so, why?

/Jonas

> With the help of the optimization remarks, I found a loop that could not be vectorized, but that might be if loop distribution were enabled. Enabling it did in fact vectorize the loop, with a very significant benchmark improvement (~25%).

Hi Jonas,

That's not surprising, given that LD only tries to enable
vectorisation. Performance improvements of course depend on the
target and the quality of LLVM's lowering and further vectorisation.

> I tried enabling this pass (on SystemZ) and found that it only affected a handful of files on SPEC. This means I could enable it without worrying about regressions, at least on SystemZ currently.

IIUC, it's all about compile time. Loop distribution analysis is not
terribly complex, but does have a cost (see [1]).

I don't think it will cause many regressions because it's *very*
conservative (see [2]), perhaps too much so. It shouldn't be much of a
problem for SystemZ, but I'd wait for others closer to the LD pass to
chime in before taking any decision. :-)

> I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results; is that right? Is it triggering much more on other targets, and if so, why?

I think it's mostly about the success rate, given it's too
conservative. But in the past 2 years, improvements in (and around)
the LV have slowed down a bit due to the move to VPlan.

Actually, I imagine LD would be a great candidate to be a
VPlan-to-VPlan pass, so that it can be combined with others in the
cost analysis, given that it's mostly meant to enable loop
vectorisation.

Adding some VPlan folks in CC.

Jonas/Renato,

> I think it's mostly about the success rate, given it's too conservative. But in the past 2 years, improvements in (and around) the LV have slowed down a bit due to the move to VPlan.

It wasn't our intention to slow down LV improvements, but if the project ended up causing other developers to take a wait-and-see stance, that's an inevitable side effect of any infrastructure-level work. We welcome others to work with us to move things faster. I hope everyone will see that the end result is well worth the pain it has caused.

> Actually, I imagine LD would be a great candidate to be a VPlan-to-VPlan pass, so that it can be combined with others in the cost analysis, given that it's mostly meant to enable loop vectorisation.

There are other reasons why LD is good on its own, but I certainly agree that LD shines most when it enables vectorization. In my view, however, there is value in standalone LD, and in many cases vectorization-oriented LD can still happen there. Performing LD as a VPlan-to-VPlan transform would improve the precision of the cost modeling, but given that the vectorizer's cost model is ballpark-based to begin with (we have a lot of optimizers running downstream!), the extra precision is only worth so much. I have been thinking about moving the vectorizer's analysis part (all the way to the cost model) into Analysis. When extra precision is desired, we can utilize such a (heavier-weight) Analysis.

In short, my preference is to make the vectorizer's analysis more usable by other xforms rather than making more and more loop xforms happen inside the LV.

In the meantime, if those who are working on LD need our input in tuning the LD cost model, I'm more than happy to pitch in. We can also discuss which parts of the vectorizer's analysis would be helpful in LD at the same time.

Thanks,
Hideki

Agreed! I wasn't alluding to moving the analysis inside LV, just using
it inside a V2V transform to find more options to vectorise.

--renato

Sorry for jumping in from
  http://lists.llvm.org/pipermail/llvm-dev/2018-September/125853.html
but this is relevant. Sorry also for not responding to that thread sooner; I was thinking about a longer reply, and time flew by too quickly.

> But, as I said back then, before we do so, we need to understand
> exactly where to put it. That will depend on what other passes will
> actually use it and if we want it to be a utility class or an analysis
> pass, or both.
>
> Have you compiled a list of passes that could benefit from such a move?

Many loop optimizers (Transforms) can benefit from knowing whether a loop is legal to vectorize (or will vectorize). Distribution and fusion are clear examples. The vectorizer has a lot in common with unroll-and-jam, and we should definitely share a lot. Where to place these analyses is debatable, but my preference is having them under the Analysis tree, since they are indeed analyses and in principle they shouldn't depend on Transforms. I think we should start with a utility, but implement it in such a way that it is easy to convert into an analysis pass.
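To make that layering concrete, here is a standalone mock of the "utility first, analysis pass later" shape. Every name below is made up for illustration; none of this is the actual LLVM API.

```cpp
// Hypothetical sketch: the legality logic lives in a plain class that any
// transform (LD, fusion, unroll-and-jam, ...) can construct directly, and an
// analysis-pass shim would merely own caching/invalidation on top of it.
struct LoopMock {
  bool HasUnknownDeps; // stand-in for real dependence-analysis state
};

// The reusable utility: no dependence on any pass infrastructure.
class VectorizationLegalityMock {
  bool Legal;
public:
  explicit VectorizationLegalityMock(const LoopMock &L)
      : Legal(!L.HasUnknownDeps) {}
  bool canVectorize() const { return Legal; }
};

// The thin wrapper that would later become the analysis pass: it contains
// no legality logic of its own, only (eventually) per-loop caching.
class VectorizationLegalityAnalysisMock {
public:
  VectorizationLegalityMock run(const LoopMock &L) const {
    return VectorizationLegalityMock(L); // a real pass would cache this
  }
};
```

Because the logic sits in the utility, converting the wrapper into a real analysis pass later would not disturb any transform already using the utility directly.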

Thanks,
Hideki

Sure, I agree with you on that. I'm just curious as to which concrete
passes would benefit sooner.

--renato

> I'm just curious as to which concrete passes would benefit sooner.

This all depends on those who are working on other loop xforms, since we currently don't have the bandwidth to drive that kind of change into other loop xforms. That's why, when this line of questions pops up, I offer to work together. Short of that, the best we can proactively do is to make the vectorizer's analyses available outside of the vectorizer (and easy to find). In some sense, this is a chicken-and-egg problem. Once the VPlan-based LV is in good enough shape, and if this problem still remains, we could expand into working on vectorization-enabling transformations, but I really hope there are others who can work in that area before us.

Hideki

I understand. We are working on more fundamental levels (register
allocator, pipelining) before looking at the vectoriser, so it may
take a while, too. Once we start looking at that, I'll let you know.

cheers,
--renato

> This all depends on those who are working on other loop xforms, since we currently don't have the bandwidth to drive that kind of change into other loop xforms. That's why, when this line of questions pops up, I offer to work together. Short of that, the best we can proactively do is to make the vectorizer's analyses available outside of the vectorizer (and easy to find). In some sense, this is a chicken-and-egg problem. Once the VPlan-based LV is in good enough shape, and if this problem still remains, we could expand into working on vectorization-enabling transformations, but I really hope there are others who can work in that area before us.

> I understand. We are working on more fundamental levels (register
> allocator, pipelining) before looking at the vectoriser, so it may
> take a while, too. Once we start looking at that, I'll let you know.

We should probably plan to chat about this at the dev meeting next
month. We'll have a few talks dealing with the future of our
loop-optimization infrastructure, and the transformations therein, and
this is also something that feeds into that discussion.

-Hal

> We should probably plan to chat about this at the dev meeting next month.

I second the idea.

"Saito, Hideki via llvm-dev" <llvm-dev@lists.llvm.org> writes:

>> We should probably plan to chat about this at the dev meeting next month.
>
> I second the idea.

Count me in for that. Is one of the BoFs amenable to this, or should we
schedule some ad hoc thing?

                            -David

Hi,

> With the help of the optimization remarks, I found a loop that could not be vectorized, but that might be if loop distribution were enabled. Enabling it did in fact vectorize the loop, with a very significant benchmark improvement (~25%).
>
> I tried enabling this pass (on SystemZ) and found that it only affected a handful of files on SPEC. This means I could enable it without worrying about regressions, at least on SystemZ currently.
>
> I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results; is that right? Is it triggering much more on other targets, and if so, why?

The main thing missing from the pass right now is a serious analysis of profitability as it affects instruction- and memory-level parallelism. The easiest way to see this is that LD is the reverse transformation of loop fusion, so where LF helps, LD may regress. MLP is the big one in my opinion, as losing it could completely reverse any gains from vectorization.

We would probably have to do something similar to the SW prefetch insertion pass in order to analyze accesses that are likely to be skipped by the HW prefetcher. Needless to say, this is a very micro-architecture-specific analysis/cost model. If we can establish that ILP/MLP is unaffected, even just in the simplest cases, and vectorization is enabled, we could enable the transformation by default (in addition to the pragma-driven approach we have now).
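Concretely, the pragma-driven form is the per-loop opt-in spelled `#pragma clang loop distribute(enable)`; the kernel below is an illustrative sketch (the arrays and function are made up, and the pragma is a hint that does not change semantics, so it compiles as an unknown pragma elsewhere).

```cpp
#include <cstddef>

// Opt-in loop distribution: the pragma asks clang to run the transformation
// on this loop even when the pass is off by default. The loop body is the
// usual dependence-chain-plus-independent-statement shape.
void kernel(float *a, const float *b, float *c, const float *d,
            std::size_t n) {
#pragma clang loop distribute(enable)
  for (std::size_t i = 0; i < n; ++i) {
    a[i + 1] = a[i] + b[i]; // dependence chain blocks vectorization...
    c[i] = c[i] + d[i];     // ...but this statement vectorizes once split off
  }
}
```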

Adam

Hi Adam,

>> Hi,
>>
>> With the help of the optimization remarks, I found a loop that could not be vectorized, but that might be if loop distribution were enabled. Enabling it did in fact vectorize the loop, with a very significant benchmark improvement (~25%).
>>
>> I tried enabling this pass (on SystemZ) and found that it only affected a handful of files on SPEC. This means I could enable it without worrying about regressions, at least on SystemZ currently.
>>
>> I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results; is that right? Is it triggering much more on other targets, and if so, why?
>
> The main thing missing from the pass right now is a serious analysis of profitability as it affects instruction- and memory-level parallelism. The easiest way to see this is that LD is the reverse transformation of loop fusion, so where LF helps, LD may regress. MLP is the big one in my opinion, as losing it could completely reverse any gains from vectorization.
>
> We would probably have to do something similar to the SW prefetch insertion pass in order to analyze accesses that are likely to be skipped by the HW prefetcher. Needless to say, this is a very micro-architecture-specific analysis/cost model. If we can establish that ILP/MLP is unaffected, even just in the simplest cases, and vectorization is enabled, we could enable the transformation by default (in addition to the pragma-driven approach we have now).

Thanks for the reply.

Since this is currently extremely conservative and nearly never triggers, at least on SystemZ, while still being very beneficial when it does, it seems that it could be used as-is on SystemZ now, with a new TTI hook to enable it selectively per target.

The question is whether this is a wise idea. Do you think the Loop Distribution pass will change significantly in the direction of triggering much more often, which might then cause regressions on SystemZ? If so, perhaps the idea for now is that nobody enables it by default until some initial reasonable cost modeling has been done?
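To illustrate what I mean by a hook, it might look roughly like the standalone mock below. The class and method names here are hypothetical, not the actual LLVM TTI interface; in-tree this would be a TargetTransformInfo hook that the pass (or pass builder) queries before running LoopDistribute.

```cpp
// Hypothetical sketch of a per-target opt-in for loop distribution.
struct TargetTransformInfoMock {
  virtual ~TargetTransformInfoMock() = default;
  // Default: keep the pass off, as today.
  virtual bool enableLoopDistribution() const { return false; }
};

struct SystemZTTIMock : TargetTransformInfoMock {
  // SystemZ opts in, based on the SPEC measurements discussed above.
  bool enableLoopDistribution() const override { return true; }
};

// What the pass-building logic would check: either the existing
// command-line/forced enabling, or the target's opt-in.
bool shouldRunLoopDistribute(const TargetTransformInfoMock &TTI,
                             bool ForcedByFlag) {
  return ForcedByFlag || TTI.enableLoopDistribution();
}
```

The point of the hook shape is that other targets see no behavior change until they override the default.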

/Jonas

Having a buildbot publishing benchmark results to LNT would go a long
way toward helping you track regressions, but if the cost model is not
there yet, you may find yourself going round in circles...

Hi,

> Hi Adam,
>
>>> Hi,
>>>
>>> With the help of the optimization remarks, I found a loop that could not be vectorized, but that might be if loop distribution were enabled. Enabling it did in fact vectorize the loop, with a very significant benchmark improvement (~25%).
>>>
>>> I tried enabling this pass (on SystemZ) and found that it only affected a handful of files on SPEC. This means I could enable it without worrying about regressions, at least on SystemZ currently.
>>>
>>> I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results; is that right? Is it triggering much more on other targets, and if so, why?
>>
>> The main thing missing from the pass right now is a serious analysis of profitability as it affects instruction- and memory-level parallelism. The easiest way to see this is that LD is the reverse transformation of loop fusion, so where LF helps, LD may regress. MLP is the big one in my opinion, as losing it could completely reverse any gains from vectorization.
>>
>> We would probably have to do something similar to the SW prefetch insertion pass in order to analyze accesses that are likely to be skipped by the HW prefetcher. Needless to say, this is a very micro-architecture-specific analysis/cost model. If we can establish that ILP/MLP is unaffected, even just in the simplest cases, and vectorization is enabled, we could enable the transformation by default (in addition to the pragma-driven approach we have now).
>
> Thanks for the reply.
>
> Since this is currently extremely conservative and nearly never triggers, at least on SystemZ, while still being very beneficial when it does, it seems that it could be used as-is on SystemZ now, with a new TTI hook to enable it selectively per target.
>
> The question is whether this is a wise idea. Do you think the Loop Distribution pass will change significantly in the direction of triggering much more often, which might then cause regressions on SystemZ? If so, perhaps the idea for now is that nobody enables it by default until some initial reasonable cost modeling has been done?

I think the loop interchange pass is in a similar situation: it gives substantial speedups on a few benchmarks without regressions (at least once the patch to turn it into a loop pass lands, and for the benchmarks I run). It would definitely benefit from having a better way to check whether we could vectorize if we interchanged the loops.
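For reference, the interchange case typically looks like the sketch below (illustrative, not taken from the benchmarks mentioned): before interchange the inner loop strides by N through a row-major array, which defeats inner-loop vectorization; after interchange it is stride-1.

```cpp
#include <cstddef>

enum { N = 64 };

// Column-major traversal of a row-major array: the inner loop strides by N,
// so the vectorizer sees non-unit-stride accesses.
void before(float (&a)[N][N], const float (&b)[N][N]) {
  for (std::size_t j = 0; j < N; ++j)
    for (std::size_t i = 0; i < N; ++i)
      a[i][j] = a[i][j] + b[i][j];
}

// After interchange (legal here: no loop-carried dependences), the inner
// loop is stride-1 and vectorizes.
void after(float (&a)[N][N], const float (&b)[N][N]) {
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      a[i][j] = a[i][j] + b[i][j];
}
```

Both orders compute the same result, which is exactly why a shared "would this vectorize after the transform?" query would help.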

Cheers,
Florian

> Hi Adam,
>
>>> Hi,
>>>
>>> With the help of the optimization remarks, I found a loop that could not be vectorized, but that might be if loop distribution were enabled. Enabling it did in fact vectorize the loop, with a very significant benchmark improvement (~25%).
>>>
>>> I tried enabling this pass (on SystemZ) and found that it only affected a handful of files on SPEC. This means I could enable it without worrying about regressions, at least on SystemZ currently.
>>>
>>> I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results; is that right? Is it triggering much more on other targets, and if so, why?
>>
>> The main thing missing from the pass right now is a serious analysis of profitability as it affects instruction- and memory-level parallelism. The easiest way to see this is that LD is the reverse transformation of loop fusion, so where LF helps, LD may regress. MLP is the big one in my opinion, as losing it could completely reverse any gains from vectorization.
>>
>> We would probably have to do something similar to the SW prefetch insertion pass in order to analyze accesses that are likely to be skipped by the HW prefetcher. Needless to say, this is a very micro-architecture-specific analysis/cost model. If we can establish that ILP/MLP is unaffected, even just in the simplest cases, and vectorization is enabled, we could enable the transformation by default (in addition to the pragma-driven approach we have now).
>
> Thanks for the reply.
>
> Since this is currently extremely conservative and nearly never triggers, at least on SystemZ, while still being very beneficial when it does, it seems that it could be used as-is on SystemZ now, with a new TTI hook to enable it selectively per target.
>
> The question is whether this is a wise idea. Do you think the Loop Distribution pass will change significantly in the direction of triggering much more often, which might then cause regressions on SystemZ? If so, perhaps the idea for now is that nobody enables it by default until some initial reasonable cost modeling has been done?

It seems this hasn’t been answered; apologies for my late jump-in. It is true that the major reason LD is not on by default is that it lacks the cost modeling to prevent regressions. I think this would hold for any hardware platform.