As we know, the loop vectorizer (LV) only targets loop bodies, while the SLP vectorizer targets all code, including loop bodies. So SLP may prevent LV from finding a better vectorization, because the SLP pass is placed before the LV pass.
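To make the distinction concrete, here is a minimal sketch of the two kinds of code involved. The function names `add_loop` and `add4` are made up for illustration, and whether either one is actually vectorized depends on the target, the cost model, and the optimization level.

```cpp
#include <cstddef>

// Typical LV candidate: the iterations are independent, so several
// consecutive iterations can be combined into one vector operation.
void add_loop(float *a, const float *b, const float *c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}

// Typical SLP candidate: four isomorphic scalar additions on adjacent
// elements in straight-line code, with no loop involved.
void add4(float *a, const float *b, const float *c) {
  // Do all loads and additions first, then all stores, so the result
  // does not depend on whether `a` overlaps `b` or `c`.
  const float s0 = b[0] + c[0];
  const float s1 = b[1] + c[1];
  const float s2 = b[2] + c[2];
  const float s3 = b[3] + c[3];
  a[0] = s0;
  a[1] = s1;
  a[2] = s2;
  a[3] = s3;
}
```

If the loop in `add_loop` were manually unrolled by four, its body would look like `add4`, which is the case where the two vectorizers can compete for the same code.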
In particular, some current architectures support SVE, which usually offers wider parallelism than SLP with its fixed vector width, so it seems LV will usually perform better than SLP.
Now, in order to limit the scope of “unusual” SLP where the codegen ends up being quite poor, we set a conservative cost for vector instructions on AArch64 (D155459, "[AArch64] Change the cost of vector insert/extract to 2"). So if LV were placed before SLP, might we no longer need such a workaround?
Looking at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassBuilderPipelines.cpp#L1134, LV should run before SLP in the default pipelines. Are you using a custom pipeline?