I am currently working on optimizing LLVM for the ARM A64FX processor, focusing mainly on loop vectorization. For auto-vectorization of outer loops, I have been working on VPlan for the past couple of months. I have made a few code changes so that the VPlan-native path is used automatically, without the external compilation flag (-enable-vplan-native-path=true) and without compiler hints (pragmas) in the user's source code.
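For context, here is a minimal sketch of the kind of manual setup that the changes above aim to make unnecessary. It assumes a row-major square matrix multiplication; the function name matmul_outer_hint and the width of 4 are illustrative choices of mine, not taken from any patch. As far as I understand, the VPlan-native path currently only considers outer loops that carry an explicit annotation like the one below, and compilers that do not recognise the pragma simply ignore it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix multiply: c = a * b.
// matmul_outer_hint is a hypothetical name used only for this sketch.
void matmul_outer_hint(const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& c,
                       std::size_t n) {
  assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
  for (std::size_t i = 0; i < n; ++i) {
    // The hint sits on the j loop, which is an OUTER loop relative to the
    // k reduction. Vectorizing over j gives each lane its own column of b
    // and its own accumulator. The width of 4 is an arbitrary assumption.
#pragma clang loop vectorize(enable) vectorize_width(4)
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < n; ++k) {
        acc += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = acc;
    }
  }
}
```

With an unmodified compiler, I would expect a build along the lines of clang++ -O3 -mllvm -enable-vplan-native-path=true to be needed for the hint to take the outer-loop path; the exact invocation may differ between LLVM versions.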
I came across the patch developed by @iamlouk (D157371 [VPlan] Support interleaving for outer-loop vectorization), which adds support for interleaving loop iterations. I integrated the patch and observed a performance improvement of nearly 2x for a matrix multiplication program when using an interleave count of 2, as suggested by @iamlouk. I then experimented with other interleave counts and found that, for matrix multiplication, the optimal interleave count is 4, which gives a further improvement of roughly 20-25% over an interleave count of 2. I am currently working on automating this patch and am also validating the results with other benchmarks.
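To illustrate what the interleave count means here, the following is a hand-written scalar sketch, not the code the patch generates. It assumes the same row-major matrix multiplication as the benchmark; the name matmul_interleaved and the template parameter IC are hypothetical. Each trip of the j loop handles IC neighbouring columns with independent accumulators, which is roughly the effect interleaving has on the vectorized outer loop: more independent work per iteration to hide latency, at the cost of more live registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix multiply with the j loop manually interleaved.
// IC plays the role of the interleave count (hypothetical parameter).
template <std::size_t IC>
void matmul_interleaved(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::vector<float>& c,
                        std::size_t n) {
  static_assert(IC > 0, "interleave count must be positive");
  assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t j = 0;
    // Main loop: IC columns per trip, each with its own accumulator.
    for (; j + IC <= n; j += IC) {
      float acc[IC] = {};
      for (std::size_t k = 0; k < n; ++k) {
        const float aik = a[i * n + k];  // shared by all IC copies
        for (std::size_t u = 0; u < IC; ++u) {
          acc[u] += aik * b[k * n + j + u];
        }
      }
      for (std::size_t u = 0; u < IC; ++u) {
        c[i * n + j + u] = acc[u];
      }
    }
    // Remainder loop for the columns left over when n is not a multiple of IC.
    for (; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < n; ++k) {
        acc += a[i * n + k] * b[k * n + j];
      }
      c[i * n + j] = acc;
    }
  }
}
```

Calling matmul_interleaved<2> and matmul_interleaved<4> on the same inputs is a simple way to compare the two interleave counts by hand, although the numbers will not match what the vectorizer produces.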
Meanwhile, I also tried to integrate the SVE patch by @iamlouk (D157484 [VPlan] Support scalable vectors in outer-loop vectorization), but I have not been successful so far because of some issues, which I am trying to fix with @iamlouk's help.
In the future, I plan to work on LLVM's vectorization cost model, which plays a crucial role in the vectorization of outer loops.