RFC: [LV] any objections in moving isLegalMasked* check from Legal to CostModel? (Cleaning up LoopVectorizationLegality)

All,

I'm trying to refactor LoopVectorize such that it has better conformance to VPlan vision going forward
(http://www.llvm.org/docs/Proposals/VectorizationPlan.html). All VP*Recipe class definitions are now
moved to VPlan.h, and I have a patch under review to move LoopVectorizationPlanner class out of
LoopVectorize.cpp (⚙ D41420 [LV][VPlan] NFC patch to move LoopVectorizationPlanner class out of LoopVectorize.cpp).

Next thing I'm working on is LoopVectorizationLegality, and I noticed that it has a component of
CostModel and optimization, which doesn't seem right from vectorizer's architectural perspective.
It appears that we are currently abusing Legal as the attic to throw a lot of things in order to avoid passing
many pointers around. From vectorizer's architectural point of view, we should distinguish Legal from
"Vectorization Context Information" (I'd call it LoopVectorizationAnalysisInfo), some of which (such as
induction, reduction, etc.) are populated during the Legal step. InterleaveInfo shouldn't even be
a member of Legal. Nothing to do with Legality. It would be a good member of LoopVectorizationAnalysisInfo.
Eventually, I'd like to see these under Analysis subtree (instead of Transform), since they are indeed Analysis.

As a first step of this LoopVectorizationLegality cleanup, I propose to move the following checks
(and member functions) to LoopVectorizationCostModel.
  isLegalMaskedStore
   isLegalMaskedLoad
  isLegalMaskedScatter
   isLegalMaskedGather
My assumption is that all SIMD architectures should support serialization of those operations
at some cost (e.g., lowering in CG prepare) and thus failing to vectorize due to "false" return values
of those calls is incorrect behavior. I'll make sure to use a very high initial cost such that this cleanup is
NFC for all practical purposes ---- and leave the tuning work as TODO.

The down side I can think of is that this will end up running more parts of vectorizer for those kind
of loops ---- can expose pre-existing bugs and compile time would be a bit longer since we are bailing
out later. Upside is that we can tune the cost model ---- if other parts of the loop has enough
speedup, we don't have to give up entire vectorization simply because masked load/store/gather/scatter
aren't supported on the target.

If anyone still thinks "early bailout" is valuable, splitting into a separate HWLegal class would be
a cleaner approach than what we have today. We should be able to disable/enable it under an option.

Let me know what you think.

Thanks,
Hideki Saito

I support this direction, and agree in principal that these are strictly cost model queries, but are there actually any real world workloads where gather/scatter scalarisation would be worth it, on any micro-architecture? If we don’t have examples and the compile time cost is non-negligible then I think we’d still like to keep the early bailouts in some form.’

Amara

Amara,

I support this direction

Thanks for the support.

but are there actually any real world workloads where gather/scatter scalarisation would be worth it, on any micro-architecture? If we don’t have examples and the compile time cost is non-negligible then I think we’d still like to keep the early >bailouts in some form.’

It's not like I have specific application code in mind. This all depends on what your code has and how often the emulated code has to
kick-in. I'm more than happy to implement HWLegal class, and have early bailout available under the switch. I'm also fine having early
bail out as the default if that helps sweeten the deal. :slight_smile:

My intention is to make LV components more modular, reusable, and easier to maintain. Moving things where they actually belong
is part of that picture. That makes it a lot easier for us to place those things within VPlan infrastructure and explain why that's
the best place.

Thanks,
Hideki

Amara,

I support this direction

Thanks for the support.

but are there actually any real world workloads where gather/scatter scalarisation would be worth it, on any micro-architecture? If we don’t have examples and the compile time cost is non-negligible then I think we’d still like to keep the early >bailouts in some form.’

It's not like I have specific application code in mind. This all depends on what your code has and how often the emulated code has to
kick-in. I'm more than happy to implement HWLegal class, and have early bailout available under the switch. I'm also fine having early
bail out as the default if that helps sweeten the deal. :slight_smile:

My intention is to make LV components more modular, reusable, and easier to maintain. Moving things where they actually belong
is part of that picture. That makes it a lot easier for us to place those things within VPlan infrastructure and explain why that's
the best place.

I think that we should proceed on this basis. It is not clear to me that the extra compile time will be significant, and we shouldn't prematurely optimize (although we should check for any significant impact).

The intrinsics should be scalarized on all targets without support (by lib/CodeGen/ScalarizeMaskedMemIntrin.cpp which is scheduled by default in TargetPassConfig::addIRPasses).

  -Hal

Thanks, Hal.

I plan to post a patch w/o HW Legality early bailout first. That should enable further discussion on where the initial very high cost for "illegal masked load/store/gather/scatter" should be coming from --- like should LoopVectorize provide it? Or should it be provided by TTI? I prefer the latter (TTI) but the first revision of the patch will intentionally do the former (LoopVectorize hacked cost) such that I'll be sure to have enough "explicit" voice behind me to help me push for the TTI approach if others feel the same way (you can also reply to this message saying "go with TTI" without waiting for the code). If we are to go with TTI, we certainly can do this in two steps (i.e., LoopVectorize hacked cost first and then transition to TTI), and that may be actually desirable.

While the cost discussion is going on, I can think a bit more about how best to do HW Legality early bailout. If we are doing it, I don't think it's appropriate to limit it to masked load/store/gather/scatter. Would need some extensibility, and I think that should be discussed as a separate review.

Stay Tuned.
Hideki

Thanks, Hal.

I plan to post a patch w/o HW Legality early bailout first. That should enable further discussion on where the initial very high cost for "illegal masked load/store/gather/scatter" should be coming from --- like should LoopVectorize provide it? Or should it be provided by TTI? I prefer the latter (TTI) but the first revision of the patch will intentionally do the former (LoopVectorize hacked cost) such that I'll be sure to have enough "explicit" voice behind me to help me push for the TTI approach if others feel the same way (you can also reply to this message saying "go with TTI" without waiting for the code). If we are to go with TTI, we certainly can do this in two steps (i.e., LoopVectorize hacked cost first and then transition to TTI), and that may be actually desirable.

I'd prefer the TTI cost approach, it's cleaner, but what you propose sounds good because we'll be able to measure it both ways.

While the cost discussion is going on, I can think a bit more about how best to do HW Legality early bailout. If we are doing it, I don't think it's appropriate to limit it to masked load/store/gather/scatter. Would need some extensibility, and I think that should be discussed as a separate review.

I agree. We could, for example, have a general target "don't bother with this" hook of some kind if necessary.

  -Hal