[RFC] Goals for VPlan-based cost modelling

Hi all,

I am looking into the benefits of a VPlan-based cost model, and into how such a cost model should be implemented to make the most out of these benefits.

Over the last year I have been working with LLVM, mainly focused on the ARM backend, in the course of a one-year internship at Arm Ltd. My main project from December 2019 to August 2020 was to introduce gather/scatters for MVE auto-vectorization. One of the recurring challenges during this work was to get things right in the cost model.
For example, gathers can extend the data they load, while scatters can truncate their input, meaning that an extend following the load, or a truncate preceding the store, comes for free if certain type conditions are met. As the current cost model is not designed for this kind of context-based analysis, this was a pain to model.

I have done some research and found out that one of the proposed benefits of VPlan is that a new cost model making use of it would be able to better support context-dependent decisions like this.
However, there is not much of a specification yet of what such a cost model should look like.

Also, I have read through the relevant code to understand how loop vectorization is currently done and how far the introduction of VPlan has progressed. I have realised that the only recipe that actually handles more than one instruction from the input IR is the one for interleaved groups. When the VPlan is generated on the VPlan-native path, every IR instruction is considered and transformed into a recipe separately, ignoring its context (for a code location, I am looking at VPlanTransforms::VPInstructionsToVPRecipes).
Also, some architectures may not have the same vector instructions for certain cases, so a pattern that works great for one could be useless for others. So I am wondering: is there any plan to have target-dependent flavours of recipes, or how will such cases be handled?
Right now it makes sense that nothing like this has been implemented yet, as there is no cost model that could guide the transformation. But if recipes are a general thing, should the cost model be the component actually providing the target-specific pattern for a recipe, together with its cost?

I am considering choosing a topic related to VPlan, possibly cost modelling, for my Master's thesis, with the goal of presenting a solution and implementing a prototype.

What I would like to ask the community is

(1) what goals there are or should be for VPlan-based cost modelling,
(2) whether there have been any discussions on target-dependent patterns yet, and
(3) which examples of inefficiencies and shortcomings of the current cost model you have come across.

I am looking forward to your feedback.

Many thanks,
Anna Welker

Hi,

One goal is a more modular approach to cost-modeling and transformations. The current cost-model is quite monolithic and keeps some context information in various tables, which is hard to reason about and also hard to change and extend. One idea with VPlan is that cost decisions could be materialized in the VPlans directly by transformations working in concert to find the best overall strategy.

I think the example with gathers that extend the loaded value illustrates that quite well. In the legacy cost model, you could maintain a set of values that are free to extend and add the loaded values to it. When visiting a zext, you would then return a cost of zero for those.
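In rough code, the bookkeeping could look something like this (all names here are made up for illustration; the real LoopVectorizationCostModel organises this differently):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

struct LegacyCostModelSketch {
  // Values whose extension the target can fold into the producing load.
  SmallPtrSet<const Instruction *, 8> FreeExtends;

  // Stand-in for a real TTI query ("does the target have an extending
  // gather for this load?"); always true here to keep the sketch short.
  bool targetHasExtendingGather(const LoadInst *) const { return true; }

  unsigned costOfGather(const LoadInst *Load) {
    // Remember that extending the result of this load is free.
    if (targetHasExtendingGather(Load))
      FreeExtends.insert(Load);
    return 2; // placeholder cost of the gather itself
  }

  unsigned costOfZExt(const ZExtInst *ZExt) {
    // The extend was already paid for as part of the load.
    if (const auto *Src = dyn_cast<Instruction>(ZExt->getOperand(0)))
      if (FreeExtends.count(Src))
        return 0;
    return 1; // placeholder cost of a standalone extend
  }
};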

With VPlan, instead you could add a new VPInstruction opcode ZExtLoad and have a transformation that replaces all zext (load) instruction pairs with the new ZExtLoad one. The cost-model needs to be taught about this special instruction and how much it costs. Then you could apply this transformation to an existing plan and check if the overall cost improved.
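Very roughly, and with all the VPlan types replaced by placeholders because the real classes are still in flux, the transformation could look like this (the ZExtLoad opcode and everything else here is invented for the sketch, it is not the real API):

#include <vector>

// Placeholder stand-ins, just so the sketch has something to hang off; these
// are not the real VPlan classes.
struct VPInstrSketch {
  enum OpcodeTy { Load, ZExt, ZExtLoad, Other } Opcode = Other;
  VPInstrSketch *Src = nullptr; // single operand, for simplicity
  bool Dead = false;            // marks the instruction for removal
};

struct VPlanSketch {
  std::vector<VPInstrSketch *> Body;
};

// Wherever a ZExt consumes a Load, turn the Load into a combined ZExtLoad and
// mark the separate extend as dead (its users would be rewired to the fused
// instruction). The cost model then only needs a cost for one extra opcode
// instead of having to reason about pairs of instructions.
void fuseZExtLoads(VPlanSketch &Plan) {
  for (VPInstrSketch *I : Plan.Body) {
    if (I->Opcode != VPInstrSketch::ZExt || !I->Src ||
        I->Src->Opcode != VPInstrSketch::Load)
      continue;
    I->Src->Opcode = VPInstrSketch::ZExtLoad;
    I->Dead = true;
  }
}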

Of course there’s also the issue of how to generalize the TTI APIs to allow for computing the cost of instructions with more context.

I think we probably want to keep things as generic as possible, try to model generic concepts and use TTI to decide whether to use them or not (e.g. see how masked loads/stores are handled).

I am not sure what patterns specifically you are thinking about, but I think the cost model should just evaluate the cost of a given plan and not provide anything beyond that. Of course, this still can mean that there might be certain recipes/instructions that are not available/profitable on some targets and we decide to never generate them, based on the cost-information.

I am hoping to make some progress on this in the next months (hopefully the work on modeling the def-use chains between recipes in VPlan will be wrapped up soon) and I expect there to be a few moving parts. Not sure what that means for a master thesis in this area.

Cheers,
Florian

Hello Anna,

There was a VPlan round-table at the US LLVM dev conference just a few weeks ago, and I have linked in the people who were involved, as cost-modeling was one of the topics discussed there. You may have seen this already, but Dave started doing some work in this area recently:

https://reviews.llvm.org/D89322
https://reviews.llvm.org/D89323

Just some first high-level remarks on your proposal: I think work in this area would be very valuable and would solve a real problem that we currently have. If you plan to choose this as the subject of a master's thesis, you probably need to think about how big a task it is, and having some motivating examples would be good too (but you mentioned some already). To me it sounds like a big task that needs a lot of plumbing in VPlan, but that's why having this discussion here is interesting, and hopefully people with more VPlan experience can provide insights.

Cheers,
Sjoerd.

Hi Anna,

This sounds like a great piece of work for a Master's!

The main difference with VPlan (versus the original vectoriser) is that we can have multiple paths in parallel and compare them to each other via some cost function. The problem, as you have stated, is that the cost model is really naive and has a lot of heuristics that were fudged to suit the old vectoriser. Transitioning to the new model will likely produce a cost model that has nothing to do with the previous one. At all.

I see the problem from two angles. Both need to be thought about in parallel, as their designs will need to be in sync.

  1. How to build the cost model

The cost model structure needs to be target independent but with heavy target-specific knowledge. The kind of specialisation we created for the previous vectoriser is too naive, and assumes the concepts are the same, just not necessarily the costs. That’s not true at all for different hardware.

Another big flaw is that the cost is per instruction, not per pattern. So casts have zero cost because we expect the targets to optimise them away, but in some cases they don't, so the heuristic is really an average of how efficient the back-end is at getting rid of them. When you're choosing between two VPlan paths, this will generate noise beyond usefulness.

So we need to find the cost of the instructions that are actually generated, not of those that aren't. One idea is to ignore most instructions and focus on key ones (memory, arithmetic, etc.), then follow the use-def chains to see the patterns, and then, per target, know what the cost is for the pattern, not for each individual instruction. Different targets can have different key instructions, so the walk through the basic block can be target-dependent (ex: for (auto inst : TTI->AllKeyInsts(bb))). When you take the cost of a key instruction, you follow all of the others that were ignored, to see if they make a difference to the pattern, stopping at other key instructions, for example, to avoid counting the same cost twice.
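To make that a little more concrete, here is a rough sketch of what such a walk could look like. isKeyInstruction, patternCost and the numbers in them are placeholders, and there is no AllKeyInsts hook in TTI today:

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// A target would decide what counts as a "key" instruction; memory accesses
// and arithmetic are reasonable defaults.
static bool isKeyInstruction(const Instruction &I) {
  return I.mayReadOrWriteMemory() || I.isBinaryOp();
}

// Price one key instruction together with the non-key instructions feeding
// it, stopping at other key instructions so nothing is counted twice.
static unsigned patternCost(const Instruction &Key,
                            SmallPtrSetImpl<const Instruction *> &Visited) {
  unsigned Cost = 1; // placeholder cost of the key instruction itself
  for (const Value *Op : Key.operands()) {
    const auto *OpI = dyn_cast<Instruction>(Op);
    if (!OpI || isKeyInstruction(*OpI) || !Visited.insert(OpI).second)
      continue;
    // A real implementation would ask the target what this operand adds to
    // the pattern; here it is assumed to be folded away, like an extend
    // feeding an extending gather.
    Cost += 0;
  }
  return Cost;
}

unsigned blockCost(const BasicBlock &BB) {
  unsigned Cost = 0;
  SmallPtrSet<const Instruction *, 16> Visited;
  for (const Instruction &I : BB)
    if (isKeyInstruction(I))
      Cost += patternCost(I, Visited);
  return Cost;
}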

There are probably many other ways to do this, I’m just thinking out loud to give an idea of the problems we faced earlier. I’m sure you’ll find better ways to do this yourself once you start looking at it more.

Regardless of how we do it, having the ability to say “this shuffle is free on this instruction” but “not with that other one” will make the cost model a lot more precise and need a lot less fudged heuristics.

  2. How to use the cost model

Once we have costs for a VPlan, we need to traverse the problem space efficiently. It would be awesome if we could use some smart traversal (like RLO), but that also brings highly uncertain execution times and the need for training and generalisation, so it is not generally feasible. But we can do something simpler while still having the same idea. What we really don’t want is to be as greedy as the original vectoriser. We must have a way to occasionally increase costs without giving up, but for that we need a policy that tells us that it’s ok to go one way and not some other (less optimal) way.

This policy has to be target-specific. Given its more deterministic nature (more than RLO at least), this will probably be the place where we fudge most of our heuristics. Things like “keep adding shuffles because it’s likely they’ll disappear”: even if they don’t always disappear, it’s worth pushing a bit more in case they do. So we’ll also need to have hard limits, possibly per target, and possibly benchmark-driven heuristics.

Once we have the policy and the cost, we can traverse one or many paths (depending on the budget we give the compiler, which could be a command line option), and then push as hard as we can through all paths within the budget and when we stop, we take the lowest cost and apply the series. How we traverse this will depend on the implementation of the cost model and the policy, but some simple heuristics search would be fine as a first approach.
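As a sketch of the kind of search I mean (CandidatePlan, cost() and mutations() are made-up stand-ins for a VPlan plus its cost model and legal rewrites):

#include <vector>

// Placeholder for a candidate vectorisation plan; cost() stands in for the
// VPlan cost model and mutations() for target-guided rewrites of the plan.
struct CandidatePlan {
  unsigned PlanCost = 10;
  unsigned cost() const { return PlanCost; }
  std::vector<CandidatePlan> mutations() const { return {}; }
};

// Explore rewrites of the initial plan until the budget (e.g. set by a
// command line option) runs out, and return the cheapest plan seen. Unlike a
// greedy search, a candidate stays on the worklist even if its cost is
// temporarily higher, so a later rewrite can still pay off.
CandidatePlan searchBestPlan(const CandidatePlan &Initial, unsigned Budget) {
  CandidatePlan Best = Initial;
  unsigned BestCost = Initial.cost();
  std::vector<CandidatePlan> Worklist{Initial};

  while (!Worklist.empty() && Budget-- > 0) {
    CandidatePlan Cur = Worklist.back();
    Worklist.pop_back();
    for (const CandidatePlan &Next : Cur.mutations()) {
      if (Next.cost() < BestCost) {
        BestCost = Next.cost();
        Best = Next;
      }
      Worklist.push_back(Next);
    }
  }
  return Best;
}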

Folks working on VPlan are creating a really nice structure to work with the different concepts in vectorisation strategies (predication, scalable vectors, use-def chain costs), so I’m sure you’ll have at least a few good ways to proceed, and hopefully help define the interfaces between the different parts of the vectoriser.

cheers,
–renato

Hi Florian,

With VPlan, instead you could add a new VPInstruction opcode ZExtLoad and have a transformation that replaces all zext (load) instruction pairs with the new ZExtLoad one. The cost-model needs to be taught about this special instruction and how much it costs. Then you could apply this transformation to an existing plan and check if the overall cost improved.

Of course there’s also the issue of how to generalize the TTI APIs to allow for computing the cost of instructions with more context.

I think we probably want to keep things as generic as possible, try to model generic concepts and use TTI to decide whether to use them or not (e.g. see how masked loads/stores are handled).

Hm, so what you are saying is that for any instruction that could be merged into another one on some architecture, there should be a new VPInstruction (naturally based on the ‘plain’ VPInstruction for that instruction) that symbolises this and would be assigned a cost of zero, but only by the TTI (or whatever a new cost model would use) of the respective architecture. I had to think about this a bit, but now it makes sense to me. Handling it in a smart way, such that TTIs that do not know about this special case fall back to what they do for the normal case, would also save us from generating additional VPlans (I have tried to sketch below what I mean by falling back). And with the load being the ingredient of the extend, and thus accessible, everything is available to do some type checking and ensure the merge is doable on the architecture.
This really helped me to understand how things are supposed to work, thanks a lot for explaining!
I do still have a few doubts as to whether this approach might bear some risk of over-generating (see my reply to Renato’s email, which I’ll link you to), but those might simply originate from me being too cautious, because I do not have a good overview yet of how many (different) examples like the one I mentioned there are, i.e. how many special cases would have to get their own VPInstruction flavour.
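To show what I mean by falling back, a tiny sketch (the class and the numbers are just made up for the example; the point is only that a target that never overrides anything keeps getting the load-plus-extend cost):

// Generic cost model: targets that know nothing about the fused opcode price
// it as its parts, so nothing breaks if a target never overrides it.
struct GenericVPlanCostSketch {
  virtual ~GenericVPlanCostSketch() = default;
  virtual unsigned loadCost() const { return 2; } // placeholder numbers
  virtual unsigned zextCost() const { return 1; }
  // Default: no fused instruction available, so ZExtLoad = load + extend.
  virtual unsigned zextLoadCost() const { return loadCost() + zextCost(); }
};

// A target with extending gathers/loads folds the extend into the load.
struct MVELikeCostSketch : GenericVPlanCostSketch {
  unsigned zextLoadCost() const override { return loadCost(); }
};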

I am not sure what patterns specifically you are thinking about, but I think the cost model should just evaluate the cost of a given plan and not provide anything beyond that. Of course, this still can mean that there might be certain recipes/instructions that are not available/profitable on some targets and we decide to never generate them, based on the cost-information.

I was wondering how one would merge the extend from the example above into the load without any changes to the VPlan structure/classes, so basically without the trick of adding special instructions that you mentioned above. But that does not seem like a nice way to do it.

I am hoping to make some progress on this in the next months (hopefully the work on modeling the def-use chains between recipes in VPlan will be wrapped up soon) and I expect there to be a few moving parts. Not sure what that means for a master thesis in this area.

I guess the extent of the impact that has on a thesis strongly depends on what exactly ‘this’ entails - could you elaborate a bit, please? Are you planning to do work on a cost model for VPlan, or to change how VPInstructions and VPRecipes work, or something else?
If I were to start working on a thesis named ‘Cost modelling with VPlan’ or similar, the first thing to do would be a lot of thinking and drafting to come up with a good and solid concept. I don’t think that would depend too much on the concrete implementation details of VPlan, as long as the core of the individual components stays the same…

Best,
Anna

Hi Renato,

The cost model structure needs to be target independent but with heavy target-specific knowledge. The kind of specialisation we created for the previous vectoriser is too naive, and assumes the concepts are the same, just not necessarily the costs. That's not true at all for different hardware.

Yes, that's a big issue to be solved. I do like the idea from Florian's reply: basically encoding an instruction's environment in the VPInstruction that is generated, and making it such that target-specific cost components would override the cost calculations for the general class if the architecture can treat the pattern in a special way. Though currently the gather that extends and the scatter whose input is truncated are my only examples of this kind of thing, which is why I am hoping that the community has encountered more examples like these and is willing to share. Maybe there are cases that are more complex, that e.g. merge or depend on far more than two instructions, in which case the generation of the right VPInstructions might become tricky - especially if there are multiple possible 'merge patterns' (for one or different architectures) that overlap, which would require generating alternative VPlans, whose number for long loops with many special cases might grow exponentially to cover all possible combinations.

There are probably many other ways to do this, I'm just thinking out loud to give an idea of the problems we faced earlier. I'm sure you'll find better ways to do this yourself once you start looking at it more.

Regardless of how we do it, having the ability to say "this shuffle is free on this instruction" but "not with that other one" will make the cost model a lot more precise and need a lot less fudged heuristics.

Well thank you for thinking out loud, that is exactly what I was hoping to get out of this RFC!
Yes, there needs to be a good way to say things like that, and preferably 'good' should not only mean efficient, but also easy to understand, maintain and extend - but that's just me dreaming :slight_smile:
Are you thinking of a concrete example when mentioning those shuffles? Because concrete examples would really help me a lot when thinking further about this - if I decide to do this as a thesis, I might have to constrain the actual implementation to (a subset of instructions for) a single architecture, but for thinking about the issue itself any concrete examples for any architecture would be a great help!

2. How to use the cost model

Once we have costs for a VPlan, we need to traverse the problem space efficiently. It would be awesome if we could use some smart traversal (like RLO), but that also brings highly uncertain execution times and the need for training and generalisation, so it is not generally feasible. But we can do something simpler while still having the same idea. What we really don't want is to be as greedy as the original vectoriser. We must have a way to occasionally increase costs without giving up, but for that we need a policy that tells us that it's ok to go one way and not some other (less optimal) way.

This policy has to be target-specific. Given its more deterministic nature (more than RLO at least), this will probably be the place where we fudge most of our heuristics. Things like "keep adding shuffles because it's likely they'll disappear": even if they don't always disappear, it's worth pushing a bit more in case they do. So we'll also need to have hard limits, possibly per target, and possibly benchmark-driven heuristics.

Once we have the policy and the cost, we can traverse one or many paths (depending on the budget we give the compiler, which could be a command line option), and then push as hard as we can through all paths within the budget and when we stop, we take the lowest cost and apply the series. How we traverse this will depend on the implementation of the cost model and the policy, but some simple heuristics search would be fine as a first approach.

Folks working on VPlan are creating a really nice structure to work with the different concepts in vectorisation strategies (predication, scalable vectors, use-def chain costs), so I'm sure you'll have at least a few good ways to proceed, and hopefully help define the interfaces between the different parts of the vectoriser.

Hm, I hadn't yet gotten as far as how to use it, but you're right of course - that will be another challenge. I know that there are many things being planned and on the way for VPlan, but I'll willingly admit that I am currently still having some trouble gaining an overview of what is there, what is not, and what might be on its way. It's simply hard and takes a lot of time to find documentation for all the new things, but I'm sure I'll get there.

Thanks for your detailed reply and all the ideas, they do help a lot.

Best,
Anna

Yes, that’s a big issue to be solved. I do like the idea from Florian’s
reply: basically encoding an instruction’s environment in the
VPInstruction that is generated, and making it such that target-specific
cost components would override the cost calculations for the general
class if the architecture can treat the pattern in a special way.

Merging instructions into a single VPInst is a good target-specific optimisation to do to simplify the cost model, especially if the target can lower them into a single instruction. But I worry that we’ll have to create too many VPInst variants for each platform, when for the cost model it doesn’t usually matter what instructions are inside, other than their cost.

In the example of a “zext (load)”, instead of having two VPInsts you have one, and the target will have to provide a cost for zext, load and zext-load. This may be enough, or it may lead to a combinatorial explosion. If it does, something leaner might work better: add a check on the zext’s operands, see that the source is a load, and return zero. This also keeps the logic close to the instruction itself, so the code is easy to understand.
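Something along these lines, just to show the shape of it (getZExtCost is a made-up name, and a real version would sit behind a TTI hook so that only targets with extending loads return zero):

#include "llvm/IR/Instructions.h"

using namespace llvm;

unsigned getZExtCost(const ZExtInst &ZExt) {
  // If the value being extended comes straight from a load, assume the target
  // folds the pair into an extending load and charge nothing for the extend.
  if (isa<LoadInst>(ZExt.getOperand(0)))
    return 0;
  return 1; // placeholder cost of a standalone vector extend
}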

I’m honestly not sure which approach is best. Perhaps a mix of both? Florian is working on this, so I’d trust his judgement more than mine.

But I’d keep in mind that the goals of the cost model need to account for patterns that we don’t necessarily know about now.

Maybe there are cases that are more
complex, that e.g. merge or depend on far more than two instructions, in
which case the generation of the right VPInstructions might become
tricky - especially if there are multiple possible ‘merge patterns’ (for
one or different architectures) that overlap, which would require
generating alternative VPlans, whose number for long loops with
many special cases might grow exponentially to cover all possible
combinations.

Exactly my thinking…

Yes, there needs to be a good way to say things like that, and
preferably ‘good’ should not only mean efficient, but also easy to
understand, maintain and extend - but that’s just me dreaming :slight_smile:

Agreed!

Are you thinking of a concrete example when mentioning those shuffles?
Because concrete examples would really help me a lot when thinking
further about this - if I decide to do this as a thesis, I might have to
constrain the actual implementation to (a subset of instructions for) a
single architecture, but for thinking about the issue itself any
concrete examples for any architecture would be a great help!

Unfortunately, the last time I worked on this was many years ago. There should be examples on the vectoriser tests for each platform.

I’m thinking about special instructions in each arch that resolve into a complicated set of LLVM IR. For example strided loads/stores in Arm (LDn/STn), gather-scatter in AVX/SVE, vector predication, reduction, etc.

Some of those things have gained intrinsics since the last time I worked with the vectoriser (ex. reduction, which was particularly messy), and that would make the cost model a lot simpler to implement.

Hm, I hadn’t yet gotten as far as how to use it, but you’re right of
course - that will be another challenge.

Indeed. One problem at a time. :slight_smile:

But as you help design the cost model, we need to make sure its structure is the most sensible (efficient, simple to read, easy to understand and extend) so that others can make use of it (and extend later), as they implement the VPlan transformations.

I didn’t mean you’d have to do it all, just to have it in mind how it could be used, perhaps even write (in pseudo-code) a few examples from each use-case in VPlan, to guide you on the cost model design. This would also look nice in your thesis introduction, to guide the reader into why we need your work in the first place and why your design is what it is.

I know that there are many
things being planned and on the way for VPlan, but I’ll willingly admit
that I am currently still having some trouble gaining an overview of
what is there, what is not, and what might be on its way. It’s simply
hard and takes a lot of time to find documentation for all the new
things, but I’m sure I’ll get there.

It’s still very much work in progress, and a lot of work and a lot of progress, so it’s hard to keep up with what’s done or in flight.

Initially it looks daunting, but soon enough you’ll get the hang of it, don’t worry. When in doubt, ask the list (or IRC). There are enough people that can help you around.

cheers,
–renato

Hello

The grammar idea sounds interesting. My brain tends to just work in terms of code, but it might make a great master's thesis :slight_smile: The pattern matchers were modelled after the existing ones in LLVM that are used all over the place, especially in instcombine. It was meant to feel familiar, given what already exists.

I imagine a lot of people must have thought about the same kinds of ideas and there must be some research out there on it. Whether it worked well or not, it would be good to look into what has already been tried. I imagine there will be cases where it works well and times where it doesn't. The same concept of specifying patterns in some way that isn't C++ code could be applied to instcombine or DAG combining too. The fact that no-one has ever done it is perhaps a bad sign? I know there are some tablegen-defined combines for GlobalISel now; they might give some ideas.

Combining a few instructions into a vmulh is simple enough, and I was hoping that there would not be a huge number of combines needed. There are bigger transforms that perhaps wouldn't fit that very well. Imagine trying to take code that does:
x, y = vld2
x *= -1
vst2 x, y

and turn that into
x = load
x *= [-1,1]
str x

That is like "de-interleaving" an interleaving group, much like SLP vectorization. Plus it would need to make that transform generic across all the different operations that could sit between the loads/stores, and across the sizes of the interleaving groups. It, of course, could get quite complex.

Looking forward to seeing what you come up with.
Dave