RFC: Model register pressure on a per-VPlan basis in the LoopVectorizer

We want to improve the way register pressure is modelled in the loop vectorizer so as to avoid pruning beneficial VFs.

At present, the legacy cost model for the vectorizer will examine the types present in the loop (as used by memory operations and phi nodes) and constrain the VF based on the supported vector register size and the size of the largest scalar type in the loop. For example, if the largest type used in a load/store/phi instruction is 32-bit, and the size of a vector register is 128-bit, then the VF will be capped at 4.

However, the legacy cost model then attempts to calculate the register pressure in the loop based on those types, without any context for how they are actually used. It simply asks the target how many registers are required to hold a given type, compares that against the known maximum number of registers for that class, and rejects a VF if the required number exceeds the maximum. For example, the AArch64 backend will report that 4 vector registers are required to hold a <16 x i32>, so if the extended types are actually materialized at a VF of 16 they will be quite expensive.
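To make the arithmetic in the last two paragraphs concrete, the queries amount to something like the following sketch (the function names here are ours, purely for illustration, not the actual cost-model API):

unsigned regsForType(unsigned VF, unsigned ScalarBits, unsigned VecRegBits) {
  // Registers needed to hold VF elements of the given scalar width:
  // a <16 x i32> is 16 * 32 = 512 bits, i.e. 4 128-bit NEON registers.
  return (VF * ScalarBits + VecRegBits - 1) / VecRegBits;
}

unsigned maxVFForType(unsigned LargestScalarBits, unsigned VecRegBits) {
  // The VF cap derived from the largest scalar type in the loop:
  // 32-bit elements in 128-bit registers cap the VF at 128 / 32 = 4.
  return VecRegBits / LargestScalarBits;
}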

If the target has instructions that can hide extensions and limit the effective width, then the legacy cost model keeps lists of types and instructions to ignore entirely when determining the maximum VF. The AArch64 dot product instructions are one example: they take 2 vectors of <16 x i8>, perform 4 simultaneous 4-element dot product calculations with built-in extensions, and produce a <4 x i32> output. These lists, however, are independent of the candidate VFs, so types and instructions have to be ignored for all of the VFs or none of them. This doesn’t accurately mirror what the backend is capable of lowering to, where each VF could have different optimisation potential.
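For reference, here is a scalar C model of what a single 128-bit sdot computes (sdot_model is our name for it, just for illustration):

#include <stdint.h>

// Scalar model of a 128-bit AArch64 sdot: two <16 x i8> inputs feed four
// simultaneous 4-element dot products, accumulated into a <4 x i32>. The
// i8 -> i32 extensions happen inside the instruction and never occupy
// vector registers of their own.
void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
  for (int lane = 0; lane < 4; lane++)
    for (int e = 0; e < 4; e++)
      acc[lane] += (int32_t)a[lane * 4 + e] * (int32_t)b[lane * 4 + e];
}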

We believe a better approach would be to avoid calculating register pressure in the legacy cost model and instead calculate it on the individual VPlans, where partial reductions will present phi nodes with the narrower accumulator type. That way we both avoid excluding VFs based on register counts for types that never actually materialize, and get a more accurate figure for register usage, since we won’t be completely ignoring the accumulators anymore.

As an example, take a C loop with a single accumulator:

uint32_t mulacc(int8_t *a, int8_t *b, int n) {
  uint32_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

At a VF of 16, the legacy cost model would see loads of <16 x i8> accumulating into a <16 x i32>, and determine that at most 6 vector registers are required (on AArch64: 4 for the <16 x i32> accumulator, 2 for the two <16 x i8> loads). This is comfortably within the 32 vector registers available with AArch64 NEON, so the vectorizer could work with VFs up to 16.
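To show why the accumulator need not be <16 x i32> at all, here is roughly the shape this loop takes at VF 16 once the dot product is used (reusing the sdot_model sketch from above, with a simplified scalar tail):

uint32_t mulacc_vf16(const int8_t *a, const int8_t *b, int n) {
  int32_t acc[4] = {0, 0, 0, 0};    // the accumulator phi is only <4 x i32>
  int i = 0;
  for (; i + 16 <= n; i += 16)
    sdot_model(acc, &a[i], &b[i]);  // the extends are hidden inside the sdot
  uint32_t sum = (uint32_t)(acc[0] + acc[1] + acc[2] + acc[3]);
  for (; i < n; i++)                // scalar tail for the remainder
    sum += a[i] * b[i];
  return sum;
}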
If instead someone were calculating multiple independent sums from one shared term (similar to some industry-leading benchmarks):

void multi_mulacc(uint8_t *a, int8_t *b, int *restrict sum, int n) {
  for (int i = 0; i < n; i++) {
    uint8_t a_val = a[i];
    sum[0] += a_val * b[i * 8];
    sum[1] += a_val * b[i * 8 + 1];
    sum[2] += a_val * b[i * 8 + 2];
    sum[3] += a_val * b[i * 8 + 3];
    sum[4] += a_val * b[i * 8 + 4];
    sum[5] += a_val * b[i * 8 + 5];
    sum[6] += a_val * b[i * 8 + 6];
    sum[7] += a_val * b[i * 8 + 7];
  }
}

Then the cost model would come to a different conclusion. The registers required for the loads are only needed until they are combined into their respective accumulator phis, but the phis themselves become a problem. At a VF of 16, we would have 8 <16 x i32> phis, taking up all 32 registers, which would leave none for the loads from memory and would require spilling and filling if actually materialized. So the legacy cost model would restrict the maximum VF to 8, costing us potential performance because of an inaccurate estimate.

Doing the calculation after VPlan creation at VF 16, however, will show that the accumulator phis only need to be <4 x i32>, taking up a single register each. That gives a maximum vector register pressure of 8 + 2 (for the loads from a and b) at any one point in the loop, expanding the range of possible VFs to choose from.
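In terms of the regsForType sketch from earlier, the two tallies for multi_mulacc at VF 16 work out as follows (illustrative arithmetic only):

#include <stdio.h>

void tally_vf16(void) {
  // Legacy model: 8 accumulator phis of <16 x i32>, 4 registers each.
  unsigned legacyPhis = 8 * regsForType(16, 32, 128);  // 8 * 4 = 32: every NEON register
  // Per-VPlan model: partial reductions shrink each phi to <4 x i32>,
  // and the two live <16 x i8> loads take one register each.
  unsigned vplanPeak = 8 * regsForType(4, 32, 128)     // 8 * 1 = 8
                     + 2 * regsForType(16, 8, 128);    // + 2 = 10, well under 32
  printf("legacy phis alone: %u, per-VPlan peak: %u\n", legacyPhis, vplanPeak);
}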

Comments are very welcome on the merits of this idea and the best approach to take, thank you.

That sounds great to me!

I think for partial reductions there was a workaround put into place (calculateRegisterUsage first calls collectInLoopReductions to find out which ones will be scalar), and I am excited to see that no new workarounds will be needed. We should now have everything that’s needed in VPlan to make such a move.

IIUC that would mean potentially constructing more (potentially unprofitable) VPlans. This should fit well into the design and roadmap, i.e. using VPlans to model multiple options and then pruning unprofitable ones.

One thing we might need to be a bit careful about is to not increase compile-time noticeably, but hopefully the overhead will not be too bad.

As a first step, it might be good to port the existing calculateRegisterUsage and use it when picking the interleave factor, which happens after the best VPlan has already been selected. I had some older WIP patches lying around; I put them up as “[LV] Compute register usage for interleaving on VPlan” (llvm/llvm-project#126437) as I thought they may be of help.

One thing we might need to be a bit careful about is to not increase compile-time noticeably…

I second this. In fact, it would be important to understand the impact on some of the key industry-leading benchmarks.

I understand that it is not easy to model all sorts of interactions between instructions in order to estimate the register pressure accurately, but this is a step in the right direction.

I am curious. For the motivating example in your post, do you see a performance uplift at a higher VF?

Thanks for the feedback, both. I’ve had a look at your calculateRegisterUsage patch, Florian, and it doesn’t look like it would be too hard to also use it at the point of choosing between the plans (based on register usage). Perhaps some caching of the results would be a good idea, so they could also be used for the interleave factor without being re-computed.

When it comes to the compile-time overhead, that’s definitely a concern. Once I’ve put some work into this I’ll see how it affects the compile time of certain benchmarks. My hope is that not many VFs are being prematurely discarded at the moment, so that few extra VPlans would need to be constructed and the overhead won’t be that high.

I haven’t got any performance numbers from a modified compiler that allows a higher VF yet, but a higher VF in the manually-unrolled example above would allow dot products on AArch64, which come with a performance benefit over the individual extends, muls and adds. I’ll get some concrete numbers if you’d like to see them.