LoopVectorizer: Should the cost-model be used for legalisation?

Hi,

Last year we added the InstructionCost class which adds the ability to
represent that an operation cannot be costed, i.e. operations that cannot
be expanded by the code-generator will have an invalid cost.

We started using this information in the Loop Vectorizer for scalable
auto-vectorization. The LV has a legality stage and a cost-model stage,
which are conceptually separate and serve different purposes. But with the
introduction of valid/invalid costs it becomes tempting to use the
cost-model as 'legalisation', which leads us to the following question:

   Should we be using the cost-model to do legalisation?

'Legalisation' in this context means asking beforehand whether the
code-generator can handle the IR emitted by the LV. Examples of
operations that need such legalisation are predicated divides (at least
until we can use the llvm.vp intrinsics), or intrinsic calls that have no
scalable-vector equivalent. For fixed-width vectors this legalisation issue
is mostly moot, since operations on fixed-width vectors can be scalarised.
For scalable vectors this is neither supported nor feasible [1].
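
For readers following along, the valid/invalid distinction can be illustrated
with a small stand-alone model (a simplified sketch for illustration only,
not the actual llvm::InstructionCost class):

```cpp
#include <cassert>

// Simplified model of llvm::InstructionCost: a cost value plus a validity
// state. Invalid means "the code-generator cannot possibly handle this
// operation", not merely "this operation is expensive".
class Cost {
  long Value = 0;
  bool Valid = true;

public:
  Cost(long V) : Value(V) {}

  static Cost getInvalid() {
    Cost C(0);
    C.Valid = false;
    return C;
  }

  bool isValid() const { return Valid; }

  long getValue() const {
    assert(Valid && "querying the value of an invalid cost");
    return Value;
  }
};
```

In this model a client must check isValid() before using the value, which is
exactly the property the LV can exploit to reject a VF.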

This means there's the option to do one of two things:

[Option 1]

Add checks to the LV legalisation to see if scalable-vectorisation is
feasible. If so, assert the cost must be valid. Otherwise discard scalable
VFs as possible candidates.
* This has the benefit that the compiler can avoid
   calculating/considering VPlans that we know cannot be costed.
* Legalisation and cost-model keep each other in check. If something
   cannot be costed then either the cost-model or legalisation was
   incomplete.

[Option 2]

Leave the question about legalisation to the CostModel, i.e. if the
CostModel says that <operation> for `VF=vscale x N` is Invalid, then avoid
selecting that VF.
* This has the benefit that we don't need to do work up-front to
   discard scalable VFs, keeping the LV design simpler.
* This makes gaps in the cost-model more difficult to spot.

Note that it's not useful to combine Option 1 and Option 2: having both
mechanisms available removes the need to do legalisation beforehand, so
combining them is effectively a choice for Option 2.
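
To make the two options concrete, here is a rough stand-alone sketch of the
control flow (all names, types, and numbers below are made up for
illustration; this is not actual LoopVectorizer code):

```cpp
#include <cassert>
#include <vector>

struct VF { unsigned N; bool Scalable; };
struct Cost { long Value; bool Valid; };

// Stand-in for "can the code-generator handle this VF?" -- invented rule.
static bool isFeasible(const VF &V) { return !V.Scalable || V.N <= 4; }

// Stand-in cost query: Invalid exactly when infeasible -- also invented.
static Cost cost(const VF &V) {
  return isFeasible(V) ? Cost{static_cast<long>(10 * V.N), true}
                       : Cost{0, false};
}

// Option 1: legalise up-front; any VF that survives must cost valid.
std::vector<VF> selectOption1(const std::vector<VF> &Candidates) {
  std::vector<VF> Out;
  for (const VF &V : Candidates) {
    if (!isFeasible(V))
      continue; // discarded before costing
    assert(cost(V).Valid && "legalisation and cost-model disagree");
    Out.push_back(V);
  }
  return Out;
}

// Option 2: cost everything; drop VFs whose cost comes back Invalid.
std::vector<VF> selectOption2(const std::vector<VF> &Candidates) {
  std::vector<VF> Out;
  for (const VF &V : Candidates)
    if (cost(V).Valid)
      Out.push_back(V);
  return Out;
}
```

Both selections end up with the same candidate set; the difference is that
Option 1 filters before costing and can assert that the two models agree,
while Option 2 filters purely on the cost result.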

Both approaches lead to the same end-result, but we currently have a few
patches in flight that have taken Option 1, and this led to some questions
about the approach from both Florian and David Green. So we're looking to
reach a consensus and a decision on the way forward.

I've tentatively added this as a topic to the agenda of the upcoming LLVM
SVE/Scalable Vector Sync-up meeting next Tuesday (June 15th, [2]) as an
opportunity to discuss this more freely if we can get enough people who
actively work on the LV together in that meeting (like Florian and David,
although please forward to anyone else who might have input on this).

Thanks,

Sander

[1] Expanding the vector operation into a scalarisation loop is currently
    not supported. It could be done, but we have experimented extensively
    with loops that handle each element of a scalable vector sequentially,
    and this has never proved beneficial, even when using special
    instructions to efficiently increment the predicate vector. I doubt
    this will be any different for other scalable-vector architectures,
    because of the loop-control overhead. Also, inserting/extracting
    elements into/from a scalable vector is unlikely to be as cheap as for
    fixed-width vectors.

[2] LLVM SVE2 and Scalable Vector Sync-up Meetings - Google Docs

Please correct me if I am wrong, but I thought this discussion was brought up by a temporary workaround in the cost-model, working around current codegen limitations that need fixing.
I am asking because Option 1 is what we currently have, and I don’t see a reason to depart from this general idea, even if the cost-model can return Invalid due to a workaround that will hopefully disappear soon. That would mean the assert that the legalisation and cost-model are in sync would need to be skipped, and while that is not ideal, I don’t see it as a big problem or as a total departure from Option 1, especially if this is all temporary.

And does this discussion disappear if the codegen issues are fixed? I don’t know the scale of the problem/work, but is it not easier to fix that, avoiding this cost-model vs. legalisation discussion altogether?

Hi Sjoerd,

The question actually came up on three patches (including the one you refer to).

The reason for focusing attention on this is that it is a principled/design question of how we deal with legalisation: if it is decided that we don't want to add more `TTI->is<...>Legal()` checks to avoid vectorising, and instead implement legalisation by returning InstructionCost::getInvalid(), then we'll want to change our in-flight patches to align with that approach. It also sets the precedent for how future legalisation issues are fixed.

If for Option 1 we relax the restriction that the cost must be valid once the legaliser has said it is okay to vectorise with a given VF (thus allowing Invalid costs to prevent vectorisation with that VF), then that undermines the need for any legalisation functions whatsoever, because returning Invalid has the same effect. It is therefore a de facto choice for Option 2.

That's the reason the specific patch you referred to uses some high-cost number but keeps the assertion in place. This patch is not really representative of the design question; it is purely a temporary workaround assuming we stick with Option 1. The alternative would have been to add a proper legalisation interface that we know would be removed again once we have full CodeGen coverage for <vscale x 1 x eltty> types.

Thanks,

Sander

Hi Sander,

Okay, that’s fair enough, and thanks for explaining that. About the in-flight patches:

but we currently have a few patches in flight that have taken Option 1, and this led to some questions
about the approach from both Florian and David Green.

which patches are those exactly? (so that I can have a look at them for completeness)
For the cost-model patch that I looked at, it may look like Dave and I were arguing for Option 2, but I was trying to say that I still consider it Option 1 for the reasons I mentioned. So now I am curious whether there are different opinions on the other patches.

Cheers,
Sjoerd.

Hi Sjoerd,

There was some discussion on these two patches:
* D102253 [LV] Prevent vectorization with unsupported element types.
* D102437 [LV] NFC: Decouple foldTailByMasking from isScalarWithPredication. (This patch is not really related, but one of my comments there sparked similar questions and discussion.)

There are also two other patches related to legalisation:
* D101916 [LoopVectorize] Fix crash for predicated instruction with scalable VF (avoid using scalable VFs if predicated instructions require scalarisation)
* D102394 [LoopVectorize] Don't attempt to widen certain calls for scalable vectors (don't attempt to widen certain intrinsic calls for scalable vectors)

These patches together already cover a majority of the cases that explicitly require legalisation.

Thanks,

Sander

Hi,

IIUC, for option 1, unless the legalizer and the cost-model follow the same logic to determine the feasibility of scalable-vectorization, it seems inaccurate to assert that if something is legal it must have a valid cost. But if that assertion is relaxed, then, as you mentioned in another reply, it would be a de facto choice for option 2. If we don't relax the assertion, then for it to be accurate the legalizer and cost-model will both do redundant work. Also, IIRC, an invalid cost also models a too-expensive-to-be-useful operation, but that doesn't really imply an illegal operation.

I am not sure if this makes sense, but with option 2, since VPlans are not discarded upfront, there is a chance that a transformation on a VPlan might actually make it feasible.

I didn't think deeply about this so I might be missing some obvious and key facts. Please correct me if I am wrong.

Best,

Vineet

Hi Vineet,

IIUC, for option 1, unless the legalizer and the cost-model follow the same logic to determine the feasibility of scalable-vectorization, it seems inaccurate to assert that if something is legal it must have a valid cost.

I actually consider that one of the main benefits, because this way the legaliser and cost-model keep each other in check. I've already identified several gaps in the Loop Vectorizer and in the CostModel because of this, which we wouldn't have found as easily if the LV silently ignored VFs with Invalid costs. We're now forced to fix these issues early, instead of going through each loop and trying to work out whether the cost is Invalid because it was a case we purposely wanted to avoid vectorising, or whether it's a genuine gap in the cost-model/LV that we never got around to fixing.

But if that assertion is relaxed, then as you mentioned in another reply, it would be a de facto choice for option 2. If we don't relax this assertion then for it to be accurate legalizer and cost-model will both do redundant work.

While there needs to be agreement between both models on what is legal/valid, at runtime no redundant work is done. If the legalisation phase discards scalable auto-vectorisation as a legal/feasible option, then the LV will not go on to cost those VFs (so work done upfront prevents redundant work later).

Also, IIRC, an invalid cost also models a too-expensive-to-be-useful operation, but that doesn't really imply an illegal operation.

I'd say that is a general misconception about the Invalid state of InstructionCost, which has persisted a bit since the early stages, when it wasn't yet clear how to interpret the meaning of Invalid. The state Invalid means "cannot possibly handle this operation". If it would have been possible to code-generate the operation (albeit expensively), the cost-model should use a high-but-valid cost instead. I agree this isn't well described in the class's Doxygen comments, so I'll create a patch to clarify the wording.

I am not sure if this makes sense, but with option 2, since VPlans are not discarded upfront there is a chance that a transformation on VPlan might actually make it feasible.

That's an interesting point; David Green mentioned VPlan-to-VPlan transforms in this context recently as well. Do you have any examples of illegal VPlans that could be transformed into legal ones? It may be a lack of imagination on my part, but I can't quickly think of any transformation that would ever turn an illegal VPlan with scalable vectors into a valid one.

Thanks,

Sander

Hi Vineet,

IIUC, for option 1, unless the legalizer and the cost-model follow the same logic to determine the feasibility of scalable-vectorization, it seems inaccurate to assert that if something is legal it must have a valid cost.

I actually consider that one of the main benefits, because this way the legaliser and cost-model keep each other in check. I've already identified several gaps in the Loop Vectorizer and in the CostModel because of this, which we wouldn't have found as easily if the LV silently ignored VFs with Invalid costs. We're now forced to fix these issues early, instead of going through each loop and trying to work out whether the cost is Invalid because it was a case we purposely wanted to avoid vectorising, or whether it's a genuine gap in the cost-model/LV that we never got around to fixing.

While I agree with your point from an implementation perspective, my comment was more from a theoretical POV, with the assumption that the cost-model and legalizer are already correct.

But if that assertion is relaxed, then as you mentioned in another reply, it would be a de facto choice for option 2. If we don't relax this assertion then for it to be accurate legalizer and cost-model will both do redundant work.

While there needs to be agreement between both models on what is legal/valid, at runtime no redundant work is done. If the legalisation phase discards scalable auto-vectorisation as a legal/feasible option, then the LV will not go on to cost those VFs (so work done upfront prevents redundant work later).

Perhaps the assertion we want in the legalizer is that if an instruction is illegal then it cannot have a valid cost. A legal instruction might still have an invalid/prohibitively-expensive cost. This way we get a model where the legalizer does a quick weed-out of illegal instructions and the cost-model can do a more fine-grained analysis.

When you say "at runtime no redundant work is being done", does it mean that at the implementation level there might be some duplication of logic?

Also, IIRC, an invalid cost also models a too-expensive-to-be-useful operation, but that doesn't really imply an illegal operation.

I'd say that is a general misconception about the Invalid state of InstructionCost, which has persisted a bit since the early stages, when it wasn't yet clear how to interpret the meaning of Invalid. The state Invalid means "cannot possibly handle this operation". If it would have been possible to code-generate the operation (albeit expensively), the cost-model should use a high-but-valid cost instead. I agree this isn't well described in the class's Doxygen comments, so I'll create a patch to clarify the wording.

Looking at the code, the `InstructionCost` class does not yet model a separate prohibitively-high-but-valid cost, right? IIRC, some places use a hard-coded value to represent a high cost.

I am not sure if this makes sense, but with option 2, since VPlans are not discarded upfront there is a chance that a transformation on VPlan might actually make it feasible.

That's an interesting point; David Green mentioned VPlan-to-VPlan transforms in this context recently as well. Do you have any examples of illegal VPlans that could be transformed into legal ones? It may be a lack of imagination on my part, but I can't quickly think of any transformation that would ever turn an illegal VPlan with scalable vectors into a valid one.

Hmmm, to be honest I can't really imagine a concrete example here. Just thinking out loud: I wonder if there could be a case where a non-predicated instruction is invalid but a VP version of it is valid, in which case the VPlan could be transformed to use the VP version. Maybe a far-fetched example would be a VP instruction with a fixed EVL that makes it feasible to compute its cost.

Hi,

Thanks for bringing this up!

Hi,

Last year we added the InstructionCost class which adds the ability to
represent that an operation cannot be costed, i.e. operations that cannot
be expanded by the code-generator will have an invalid cost.

We started using this information in the Loop Vectorizer for scalable
auto-vectorization. The LV has a legality stage and a cost-model stage, which
are conceptually separate and serve different purposes. But with the
introduction of valid/invalid costs it becomes tempting to use the
cost-model as ‘legalisation’, which leads us to the following question:

Should we be using the cost-model to do legalisation?

‘Legalisation’ in this context means asking beforehand whether the
code-generator can handle the IR emitted by the LV. Examples of
operations that need such legalisation are predicated divides (at least
until we can use the llvm.vp intrinsics), or intrinsic calls that have no
scalable-vector equivalent. For fixed-width vectors this legalisation issue
is mostly moot, since operations on fixed-width vectors can be scalarised.
For scalable vectors this is neither supported nor feasible [1].

I think this is one of the key points. LoopVectorLegality at the moment is mostly concerned with whether the loop is unvectorizable due to conceptual issues (not sure what the best term would be) preventing vectorization (like unvectorizable dependences or FP constraints), not target-specific instruction legality constraints like whether there is a vector version of a given instruction. It is also independent of any concrete VF. IIUC this is also not limited to predication, e.g. we also need to check whether intrinsic calls are supported natively. Are there any others?

As you mentioned, predicated instructions that are not supported by a target can be scalarized and predicated. Even for fixed vectors this is quite expensive in general, but it allows LoopVectorLegality to work mostly independently of other target constraints and focus on general legality checks. IIUC, conceptually there’s nothing preventing us from scalarizing operations on scalable vectors in a similar fashion, but it requires an explicit loop dependent on the vector width at runtime, and deciding the cost depends on an unknown (the target vector width).

I don’t have any insight into hardware-specific costs for scalable vector operations, so I can’t comment on any specifics there. I am wondering if the following hypothetical scenario is feasible/realistic: consider a loop where all operations except one can be widened directly and a single operation needs scalarizing. I would expect there to be a point where the number of widenable operations gets large enough to offset the cost of emitting a loop to scalarize the single operation needing predication. Granted, depending on the maximum vector width of the hardware this number might be quite large, but I could imagine a scenario where we want to optimize given a tighter upper bound on the maximum vector width than allowed by the hardware spec. E.g. considering an upper bound of 256 for the vector width, we’d only need to execute 2 iterations of such a loop on AArch64. It would also be interesting to get additional perspective on the cost question from people familiar with other hardware supporting scalable vectors.

As a side-note, there is a use of TTI in LVL, but it is rather problematic. At the moment, LVL considers loops with non-temporal memory operations as unvectorizable up-front if the vector version for an arbitrary VF (in that case 2 was chosen) is illegal on the target. This has the unfortunate side effect that using non-temporal stores flat out blocks vectorization if there’s no legal non-temporal load/store for VF = 2, which can be very surprising to our users, especially on AArch64, where a single-element non-temporal memory operation may not be legal to start with and non-temporal stores may be legal at higher VFs (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp#L788). It’s not a perfect example, but it illustrates one of the issues of bailing out too early.

This means there’s the option to do one of two things:

[Option 1]

Add checks to the LV legalisation to see if scalable-vectorisation is
feasible. If so, assert the cost must be valid. Otherwise discard scalable
VFs as possible candidates.

  • This has the benefit that the compiler can avoid
    calculating/considering VPlans that we know cannot be costed.
  • Legalisation and cost-model keep each other in check. If something
    cannot be costed then either the cost-model or legalisation was
    incomplete.

[Option 2]

Leave the question about legalisation to the CostModel, i.e. if the
CostModel says that <operation> for VF=vscale x N is Invalid, then avoid
selecting that VF.

  • This has the benefit that we don’t need to do work up-front to
    discard scalable VFs, keeping the LV design simpler.
  • This makes gaps in the cost-model more difficult to spot.

I think if we supported scalarization for scalable vectors via a loop, then the cost of predicating any given instruction would not be Invalid, just possibly quite high (depending on the upper bound for the vector width), right? So we should still be able to assert that each result != Invalid.

Note that it’s not useful to combine Option 1 and Option 2: having both
mechanisms available removes the need to do legalisation beforehand, so
combining them is effectively a choice for Option 2.

Both approaches lead to the same end-result, but we currently have a few
patches in flight that have taken Option 1, and this led to some questions
about the approach from both Florian and David Green. So we’re looking to
reach a consensus and a decision on the way forward.

I think one concern that came up during the discussion is that option 1 means we need to add multiple isLegalToXXX helpers to TTI, which need to be kept in sync with the already-existing cost functions. It also couples legality checks and cost-modeling more closely. I’m not sure if/how much native predication support will complicate things, and there may be more places that need to be taught about native predication support. I’m not saying this is a blocker, just a trade-off to consider. Also, if we missed a case, or add a new case that may require scalarization, we would need to update/add additional checks.

Cheers,
Florian

Hi Florian, Vineet,

Thanks both for your input!

From the comments here and from conversations off-list, it seems there is no objection (and perhaps even a slight preference) to using the 'Invalid' state of InstructionCost as a feature. That's probably also the easiest way forward for now, since it avoids adding all sorts of TTI methods for legalisation that we may need to delete again later once we have a scalarization mechanism for scalable VFs.

Assuming there is no further objection to this, we'll change our patches that are currently on Phabricator to pursue this approach instead.

[just sharing my thoughts/understanding here]

* Calculating a more accurate scalarization cost probably requires extending `TTI::getScalarizationOverhead` with an extra operand that tells it whether the calling pass can create a scalarisation loop. If the flag is false the cost would be Invalid; otherwise it could return some high-but-valid cost. That's because the cost-model itself doesn't know whether a scalarisation loop can/will be created.

* Having a scalarization mechanism is orthogonal to the question how we interpret Invalid costs, although it would significantly reduce the cases where an Invalid cost occurs. If we could emit an opt-remark each time an Invalid cost prevents vectorization, that would help identify any such cases.

* If we interpret Invalid as 'infinitely expensive', we could remove the calls to TTI->isLegalNTLoad/Store from LVL in favour of those operations having Invalid costs. Otherwise, we'd still need to fix/improve the case for the NT loads/stores. We come across something similar for the VFDatabase, where a vectorised function may or may not be available for a given VF, and so we'd otherwise need to filter out and maintain a set/bitmask of feasible VFs.
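
To illustrate the first bullet, a hypothetical shape for such an extended
query might look like this (the extra parameter, its name, and all the
numbers are invented for the sketch; this is not the actual TTI signature):

```cpp
#include <cassert>

struct Cost { long Value; bool Valid; };

// Hypothetical extension of a scalarization-overhead query: the caller
// states whether it is able to emit a scalarisation loop for scalable
// vectors, since the cost-model itself cannot know this.
Cost getScalarizationOverhead(unsigned NumElts, bool ScalableVF,
                              bool CallerCanEmitScalarizationLoop) {
  if (ScalableVF && !CallerCanEmitScalarizationLoop)
    return {0, false}; // Invalid: nothing can lower this operation.
  // High-but-valid: expensive, yet code-generable (per-element costs
  // are made-up placeholders).
  long PerElement = ScalableVF ? 100 : 10;
  return {PerElement * static_cast<long>(NumElts), true};
}
```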

Florian Hahn wrote:

I don’t have any insight into hardware-specific costs for scalable vector operations, so I can’t comment on any specifics there. I am wondering if the following hypothetical scenario is feasible/realistic: consider a loop where all operations except one can be widened directly and a single operation needs scalarizing. I would expect there to be a point where the number of widenable operations gets large enough to offset the cost of emitting a loop to scalarize the single operation needing predication. Granted, depending on the maximum vector width of the hardware this number might be quite large, but I could imagine a scenario where we want to optimize given a tighter upper bound on the maximum vector width than allowed by the hardware spec. E.g. considering an upper bound of 256 for the vector width, we’d only need to execute 2 iterations of such a loop on AArch64. It would also be interesting to get additional perspective on the cost question from people familiar with other hardware supporting scalable vectors.

You're right that there could be loops big enough for the scalarised sub-loop to be negligible. Our experiments haven't really found any where scalarisation was beneficial, so this seems very much a corner case. This may be different for other architectures, although I expect extracting/inserting arbitrary elements from/to scalable vectors to be more expensive than for fixed-width vectors.

When we have knowledge about the bounds of vscale, we should indeed be able to calculate a more accurate cost. The min_vscale/max_vscale IR attributes can be used for this.

Vineet Kumar wrote:

Perhaps, the assertion we want in the legalizer is that if an instruction is illegal then it cannot have a valid cost. A legal instruction might still have an invalid/prohibitively expensive cost. This way we get a model where the legalizer does a quick weed-out of illegal instructions and cost-model can do a more fine-grained analysis.

I'm not sure such an assert would add much value, because the cases we try to guard against are those where the LV lets something through (Legal == true) but the code-generator can't handle it (Cost == Invalid), since operations are considered legal unless explicitly stated otherwise.

When you say "at runtime no redundant work is being done", does it mean that at the implementation level there might be some duplication of logic?

Yes.

Looking at the code, the `InstructionCost` class does not yet model a separate prohibitively-high-but-valid cost, right? IIRC, some places use a hard-coded value to represent a high cost.

That's right. That would require saturation support in InstructionCost (i.e. infinitely-expensive + 10 = (still) infinitely-expensive). This is something we haven't started on yet, but it is on our wish-list.
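
For illustration, the saturating behaviour we'd want could be modelled like
this (a stand-alone sketch of the wished-for semantics, not existing
InstructionCost code; it assumes a non-negative addend):

```cpp
#include <cassert>
#include <limits>

// Sketch of a saturating cost: once a cost reaches the "infinitely
// expensive" ceiling, adding to it keeps it at the ceiling instead of
// overflowing and wrapping around.
struct SatCost {
  static constexpr long Max = std::numeric_limits<long>::max();
  long Value;

  SatCost operator+(long Rhs) const {
    if (Value == Max || Rhs > Max - Value)
      return {Max}; // infinitely-expensive + 10 = still infinitely-expensive
    return {Value + Rhs};
  }
};
```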

Hmmm, to be honest I can't really imagine a concrete example here. Just thinking out loud: I wonder if there could be a case where a non-predicated instruction is invalid but a VP version of it is valid, in which case the VPlan could be transformed to use the VP version. Maybe a far-fetched example would be a VP instruction with a fixed EVL that makes it feasible to compute its cost.

I think the other way around would be more problematic, because you could generate an 'all true' predicate for the predicated instruction. When the loop requires predication and only an unpredicated instruction is available, that might actually prevent vectorization. The VFDatabase case I mentioned above is such a case: if there are function calls in the scalar loop and the VFDatabase has corresponding vector variants, but only unpredicated ones, then the predicated style of vectorization would lead to an Invalid cost, whereas an unpredicated vector body would result in a valid cost.

Thanks,

Sander

Sounds good to me, thanks! Is this approach suitable for all current patches in-flight you mentioned that deal with constructs/types that are not legal with scalable vectors on AArch64?

AFAICT those in-flight patches include some of the patches below?

https://reviews.llvm.org/D102253
https://reviews.llvm.org/D102394
https://reviews.llvm.org/D101916

Cheers,
Florian

Hi Florian,

Sounds good to me, thanks! Is this approach suitable for all current patches in-flight you mentioned that deal with constructs/types that are not legal with scalable vectors on AArch64?

AFAICT those in-flight patches include some of the patches below?

* D102253 [LV] Prevent vectorization with unsupported element types.
* D102394 [LoopVectorize] Don't attempt to widen certain calls for scalable vectors
* D101916 [LoopVectorize] Fix crash for predicated instruction with scalable VF

The approach is suitable for at least D102394 and probably D101916 as well (so those patches will need to be changed).

It is still useful to have a separate TTI interface to discard certain vector types early (D102253), because otherwise it requires some bigger changes to propagate InstructionCost in `struct RegisterUsage` and its uses (because TargetTransformInfo::getRegUsageForType may return Invalid).

Hope that makes sense.

Thanks,

Sander