We want to improve the way register pressure is modelled in the loop vectorizer so as to avoid pruning beneficial VFs.
At present, the legacy cost model for the vectorizer will examine the types present in the loop (as used by memory operations and phi nodes) and constrain the VF based on the supported vector register size and the size of the largest scalar type in the loop. For example, if the largest type used in a load/store/phi instruction is 32-bit, and the size of a vector register is 128-bit, then the VF will be capped at 4.
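As a concrete illustration of that cap (a minimal hypothetical loop with a function name of our own choosing), the widest type reaching memory below is 32-bit, so with 128-bit (e.g. NEON) vector registers the legacy model would not consider a VF above 128 / 32 = 4:

#include <stdint.h>

int32_t sum_i32(int32_t *a, int n) {
  int32_t s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];          /* the widest load/store/phi type here is 32-bit */
  return s;
}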
However, the legacy cost model also attempts to calculate the register pressure in the loop based on those types, without any context for how they are actually used. It simply asks the target how many registers are required to hold a value of a given type, compares that against the known maximum number of registers for that class, and rejects a VF if the required number exceeds the maximum. The AArch64 backend, for example, will report that 4 vector registers are required to hold a <16 x i32>, so if values extended to that type really are materialized at a VF of 16 they are quite expensive in register terms.
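The bookkeeping behind that number is simply the total bit width of the type divided by the 128-bit NEON register width; a rough sketch (our own helper, not actual cost-model code):

unsigned regs_needed(unsigned lanes, unsigned elt_bits) {
  /* <16 x i32>: 16 x 32 = 512 bits -> 512 / 128 = 4 Q registers */
  /* <16 x i8> : 16 x 8  = 128 bits -> 128 / 128 = 1 Q register  */
  return (lanes * elt_bits + 127) / 128;   /* round up to whole 128-bit registers */
}
/* A VF is rejected when the registers needed for the simultaneously live
   values exceed the 32 vector registers the target reports. */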
Some targets have instructions that can hide those extensions and limit the effective width; the AArch64 dot product instructions, for example, take two <16 x i8> vectors, perform four simultaneous 4-element dot products with the extensions built in, and produce a <4 x i32> result. To account for this, the legacy cost model keeps lists of types and instructions to ignore entirely when determining the maximum VF. These lists, however, are independent of the candidate VFs, so a type or instruction has to be ignored either for every VF or for none of them. That does not accurately mirror what the backend is capable of lowering, where each VF could have different optimisation potential.
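To make that concrete, here is a hand-written sketch of the kind of lowering the dot product instructions allow, using the ACLE intrinsics (this assumes a CPU with the dot-product extension, e.g. compiled with -march=armv8.2-a+dotprod, and the function name is ours):

#include <arm_neon.h>
#include <stdint.h>

uint32_t dot_sketch(int8_t *a, int8_t *b, int n) {
  int32x4_t acc = vdupq_n_s32(0);      /* the accumulator never grows past <4 x i32> */
  int i = 0;
  for (; i + 16 <= n; i += 16) {
    int8x16_t va = vld1q_s8(a + i);    /* one <16 x i8> load, one register */
    int8x16_t vb = vld1q_s8(b + i);    /* one <16 x i8> load, one register */
    acc = vdotq_s32(acc, va, vb);      /* 16 multiplies plus widening adds in a single sdot */
  }
  int32_t sum = vaddvq_s32(acc);       /* horizontal add of the 4 partial sums */
  for (; i < n; i++)                   /* scalar tail */
    sum += a[i] * b[i];
  return (uint32_t)sum;
}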
We believe a better approach would be to stop calculating register pressure in the legacy cost model and instead calculate it on the individual VPlans, since partial reductions there present their phi nodes with the narrower accumulator type. That way we would both avoid excluding VFs based on register counts for types that never actually materialize, and get a more accurate figure for register usage, since we would no longer be ignoring the accumulators completely.
As an example, consider a C loop with a single accumulator:
uint32_t mulacc(int8_t *a, int8_t *b, int n) {
  uint32_t sum = 0;
  for (int i = 0; i < n; i += 1)
    sum += a[i] * b[i];
  return sum;
}
At a VF of 16, the legacy cost model would see loads of <16 x i8> accumulating into a <16 x i32>, and determine that at most 6 vector registers are required (on AArch64: 4 for the <16 x i32> accumulator and 2 for the two <16 x i8> loads). That is comfortably within the 32 vector registers available with AArch64 NEON, so the vectorizer could work with VFs up to 16.
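For comparison with the dot-product sketch earlier, a hand-vectorized version of that VF-16 body with the i8-to-i32 extensions actually materialized (roughly the shape the legacy model is costing; the intrinsic selection and names are ours) makes those 6 registers visible: two for the loads and four holding the <16 x i32> accumulator, with the widened products only live briefly in between.

#include <arm_neon.h>
#include <stdint.h>

uint32_t mulacc_widened(int8_t *a, int8_t *b, int n) {
  /* Four <4 x i32> registers together hold the <16 x i32> accumulator. */
  int32x4_t acc0 = vdupq_n_s32(0), acc1 = vdupq_n_s32(0);
  int32x4_t acc2 = vdupq_n_s32(0), acc3 = vdupq_n_s32(0);
  int i = 0;
  for (; i + 16 <= n; i += 16) {
    int8x16_t va = vld1q_s8(a + i);
    int8x16_t vb = vld1q_s8(b + i);
    int16x8_t prod_lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));   /* i8 x i8 -> i16 */
    int16x8_t prod_hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
    acc0 = vaddw_s16(acc0, vget_low_s16(prod_lo));   /* widen to i32 and accumulate */
    acc1 = vaddw_s16(acc1, vget_high_s16(prod_lo));
    acc2 = vaddw_s16(acc2, vget_low_s16(prod_hi));
    acc3 = vaddw_s16(acc3, vget_high_s16(prod_hi));
  }
  int32_t sum = vaddvq_s32(acc0) + vaddvq_s32(acc1) + vaddvq_s32(acc2) + vaddvq_s32(acc3);
  for (; i < n; i++)   /* scalar tail */
    sum += a[i] * b[i];
  return (uint32_t)sum;
}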
If instead someone were calculating multiple independent sums from one shared term (similar to some industry-leading benchmarks):
void multi_mulacc(uint8_t *a, int8_t *b, int * restrict sum, int n) {
  for (int i = 0; i < n; i++) {
    uint8_t a_val = a[i];
    sum[0] += a_val * b[i * 8];
    sum[1] += a_val * b[i * 8 + 1];
    sum[2] += a_val * b[i * 8 + 2];
    sum[3] += a_val * b[i * 8 + 3];
    sum[4] += a_val * b[i * 8 + 4];
    sum[5] += a_val * b[i * 8 + 5];
    sum[6] += a_val * b[i * 8 + 6];
    sum[7] += a_val * b[i * 8 + 7];
  }
}
Then the cost model would come to a different conclusion. The registers required for the loads are only needed until they are combined into their respective accumulator phis, but the phis themselves become a problem. At a VF of 16 we would have 8 <16 x i32> phis, taking up all 32 registers and leaving none for the loads from memory, which would mean spilling and filling if those phis were actually materialized. So the legacy cost model would restrict the maximum VF to 8, limiting our potential performance on the basis of inaccurate data.
Doing the calculation after VPlan creation at VF 16, however, shows that the accumulator phis only need to be <4 x i32>, taking up a single register each. That gives a maximum vector register pressure of 8 + 2 (the accumulators plus the loads from a and b) at any one point in the loop, expanding the range of possible VFs to choose from.
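To summarise the arithmetic for multi_mulacc at VF 16 (our own back-of-the-envelope numbers, following the reasoning above):

legacy model:  8 accumulator phis x <16 x i32> = 8 x 4 registers = 32 registers
               -> nothing left for the loads, so the maximum VF is capped at 8
VPlan-based:   8 accumulator phis x <4 x i32>  = 8 x 1 register  =  8 registers
               + 2 registers for the loads from a and b          = 10 registers
               -> well within the 32 NEON registers, so VF 16 remains viable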
Comments on the merit of this idea, and on the best way to approach it, are very welcome. Thank you.