Hi. My team at Arm has found that Loop Access Analysis (LAA) is somewhat conservative when detecting store-to-load forwarding conflicts between dependent memory accesses (at least when targeting modern hardware with SVE). Specifically, we have seen a pessimization in the NAS Parallel Benchmarks because LAA prevents vectorization. My observation is that when the loop bounds are unknown, LAA cannot prove that the dependence distance is safe, so it falls back to checking for potential store-to-load forwarding conflicts. This topic has been brought up before: Loop Vectorization and Store-Load Forwarding issue.
Here is an example:
double A[N][N];

void unknown_loop_bounds(int x, int y) {
  for (int i = 0; i < x; i++)
    for (int j = 0; j < y; j++)
      A[i+1][j] = A[i][j];
}
Compiling the above for N=37 yields:
clang -target aarch64-linux-gnu -march=armv8-a+sve -O3 -fno-unroll-loops -S -emit-llvm -DN=37 laa.c -mllvm -debug-only=loop-accesses -o - |& less
LAA: Src Scev: {{@A,+,296}<nuw><%for.cond1.preheader.us>,+,8}<nuw><%for.body4.us>Sink Scev: {{(296 + @A)<nuw>,+,296}<nuw><%for.cond1.preheader.us>,+,8}<nuw><%for.body4.us>(Induction step: 1)
LAA: Distance for %0 = load double, ptr %arrayidx6.us, align 8, !tbaa !6 to store double %0, ptr %arrayidx10.us, align 8, !tbaa !6: 296
LAA: Distance 296 that could cause a store-load forwarding conflict
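The distance of 296 bytes is one row of A (37 doubles of 8 bytes each). To illustrate why that value trips the check, here is a minimal sketch of the heuristic as I understand it. It is a simplified model, not the LLVM code: the function name, the constant MAX_VECTOR_WIDTH (assumed to be 64 elements), and the factor of 8 iterations per byte of element size are my assumptions for illustration, and the model ignores any cap from the maximum safe dependence distance.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound on the number of elements per vector. */
#define MAX_VECTOR_WIDTH 64u

/* Simplified model (hypothetical helper, not the LLVM API): returns true
   if a dependence distance in bytes could prevent store-to-load
   forwarding for elements of type_bytes bytes. */
static bool could_prevent_store_load_forward(uint64_t distance,
                                             uint64_t type_bytes) {
  /* Assumed threshold: below this many vector iterations between the
     store and the load, the value is expected to still be in flight. */
  const uint64_t iters_through_memory = 8 * type_bytes;
  uint64_t max_vf_bytes = MAX_VECTOR_WIDTH * type_bytes;

  /* Try power-of-two vector widths in bytes, starting at two elements. */
  for (uint64_t vf = 2 * type_bytes; vf <= max_vf_bytes; vf *= 2) {
    /* A misaligned distance that is only a few vector iterations away
       means the load would partially overlap a recent store. */
    if (distance % vf != 0 && distance / vf < iters_through_memory) {
      max_vf_bytes = vf >> 1;
      break;
    }
  }

  /* Not even a two-element vector is free of the potential conflict. */
  return max_vf_bytes < 2 * type_bytes;
}
```

Under this model, could_prevent_store_load_forward(296, 8) returns true because 296 is not a multiple of 16 and 296 / 16 = 18 is below the threshold of 64, whereas a distance of 512 bytes (N=64) returns false because it is a multiple of every width tried.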
I appreciate that this would be a target-independent change, so it could potentially affect other targets. Do others have similar observations when targeting, e.g., x86?