[RFC][LAA] Should we disable the store-to-load-forwarding-conflict-detection?

labrinea · November 29, 2023, 4:34pm

Hi. My team in Arm have found that Loop Access Analysis is being a little conservative when it comes to detecting store-to-load forwarding conflicts between memory access dependencies (at least when targeting modern hardware like SVE). Specifically we have seen a pessimization in the NAS Parallel Benchmarks because of LAA preventing vectorization. My observations are that in case of unknown loop bounds LAA cannot reason whether the dependence distance is safe and so it checks for potential store-to-load forwarding conflicts. This topic has been brought up before: Loop Vectorization and Store-Load Forwarding issue.

Here is an example:

double A[N][N];

void unknown_loop_bounds(int x, int y) {
  for (int i = 0; i < x; i++)
    for (int j = 0; j < y; j++)
      A[i+1][j] = A[i][j];
}

Compiling the above for N=37 yields:

clang -target aarch64-linux-gnu -march=armv8-a+sve -O3 -fno-unroll-loops -S -emit-llvm -DN=37 laa.c -mllvm -debug-only=loop-accesses -o - |& less

LAA: Src Scev: {{@A,+,296}<nuw><%for.cond1.preheader.us>,+,8}<nuw><%for.body4.us>Sink Scev: {{(296 + @A)<nuw>,+,296}<nuw><%for.cond1.preheader.us>,+,8}<nuw><%for.body4.us>(Induction step: 1)
LAA: Distance for   %0 = load double, ptr %arrayidx6.us, align 8, !tbaa !6 to   store double %0, ptr %arrayidx10.us, align 8, !tbaa !6: 296
LAA: Distance 296 that could cause a store-load forwarding conflict

I appreciate this would be a target independent change so it could potentially affect others. Do others have similar observations when targeting i.e x86 etc?

jdoerfert · November 29, 2023, 5:16pm

I’m confused. If I call this with x=1 and y=N+1=38, wouldn’t there be a loop carried dependence?

Mitch_Alsup · December 1, 2023, 3:15am

Consider the case where parameter y arrives with the value of 1492 !!

labrinea · December 1, 2023, 11:39am

I think we are missing the point. Both these examples exhibit undefined behaviour. The way I understand it, the store-to-load forwarding check is not about validity of vectorization, but rather a performance related check which seems somewhat conservative for todays standards. If we disable the check, then Loop Access Analysis will say “Positive distance 296 with max VF = 37”, even if the bounds are known:

double A[N][N];

void known_loop_bounds() {
  for (int i = 0; i < N-1; i++)
    for (int j = 0; j < N+1; j++)
      A[i+1][j] = A[i][j];
}

The compiler generates both vectorized and scalar code for the above example, which seems right to me:

; Function Attrs: nofree norecurse nosync nounwind memory(readwrite, argmem: none, inaccessiblemem: none) uwtable vscale_range(1,16)
define dso_local void @loop_bounds_known() local_unnamed_addr #0 {
entry:
  %0 = tail call i64 @llvm.vscale.i64()
  %.neg = mul nuw nsw i64 %0, 62
  %n.vec = and i64 %.neg, 38
  %1 = tail call i64 @llvm.vscale.i64()
  %2 = shl nuw nsw i64 %1, 1
  %cmp.n = icmp eq i64 %n.vec, 38
  br label %for.cond1.preheader

for.cond1.preheader:                              ; preds = %entry, %for.cond.cleanup3
  %indvars.iv24 = phi i64 [ 0, %entry ], [ %indvars.iv.next25, %for.cond.cleanup3 ]
  %indvars.iv.next25 = add nuw nsw i64 %indvars.iv24, 1
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %for.cond1.preheader
  %index = phi i64 [ 0, %for.cond1.preheader ], [ %index.next, %vector.body ]
  %3 = getelementptr inbounds [37 x [37 x double]], ptr @A, i64 0, i64 %indvars.iv24, i64 %index
  %wide.load = load <vscale x 2 x double>, ptr %3, align 8, !tbaa !6
  %4 = getelementptr inbounds [37 x [37 x double]], ptr @A, i64 0, i64 %indvars.iv.next25, i64 %index
  store <vscale x 2 x double> %wide.load, ptr %4, align 8, !tbaa !6
  %index.next = add nuw i64 %index, %2
  %5 = icmp eq i64 %index.next, %n.vec
  br i1 %5, label %middle.block, label %vector.body, !llvm.loop !10

middle.block:                                     ; preds = %vector.body
  br i1 %cmp.n, label %for.cond.cleanup3, label %for.body4

for.cond.cleanup:                                 ; preds = %for.cond.cleanup3
  ret void

for.cond.cleanup3:                                ; preds = %for.body4, %middle.block
  %exitcond27.not = icmp eq i64 %indvars.iv.next25, 36
  br i1 %exitcond27.not, label %for.cond.cleanup, label %for.cond1.preheader, !llvm.loop !15

for.body4:                                        ; preds = %middle.block, %for.body4
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body4 ], [ %n.vec, %middle.block ]
  %arrayidx6 = getelementptr inbounds [37 x [37 x double]], ptr @A, i64 0, i64 %indvars.iv24, i64 %indvars.iv
  %6 = load double, ptr %arrayidx6, align 8, !tbaa !6
  %arrayidx10 = getelementptr inbounds [37 x [37 x double]], ptr @A, i64 0, i64 %indvars.iv.next25, i64 %indvars.iv
  store double %6, ptr %arrayidx10, align 8, !tbaa !6
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 38
  br i1 %exitcond.not, label %for.cond.cleanup3, label %for.body4, !llvm.loop !16
}

labrinea · December 12, 2023, 2:18pm

Adding a few remarks regarding the above two examples (when targeting SVE):

If N is a multiple of 2 then the compiler will vectorize unknown_loop_bounds() with VF = 2, unless N is a multiple of 32 in which case it vectorizes with VF = vscale x 2.
For every other N the compiler will not vectorize unknown_loop_bounds().
The function known_loop_bounds() is always vectorized with VF = vscale x 2 regardless of the value of N.

labrinea · January 5, 2024, 12:32pm

I am adding just a summary of my findings on this issue. When we call unknown_loop_bounds with an upper bound for the inner loop which is exceeding the number of array elements (like Johannes’s example), then loop access analysis finds a positive dependence distance but allows a maximum VF equal to the array dimension. In this case the compiler generates code for checking how many times the VF fits in the loop bound and generates vector code followed by scalar code for the remaining iterations. Forcing the vectorizer to use a larger VF than the array dimension yields loop access analysis to disallow vectorization because of positive dependence distance. The verdict is that lifting the store-to-load forwarding check is safe it terms of correctness.

Topic		Replies	Views
Loop Vectorization and Store-Load Forwarding issue LLVM Dev List Archives	4	125	June 13, 2015
Is it feasible for LV to vectorize a loop accessing A[i] using VF>2 when A has only 2 elements? LLVM Dev List Archives	4	89	December 7, 2020
[RFC] Supporting more early-exit loops Loop Optimizations rfc , llvm	8	674	May 21, 2025
loop vectorizer LLVM Dev List Archives	34	153	November 6, 2013
RFC: Loop versioning for LICM LLVM Dev List Archives	19	90	March 27, 2015

[RFC][LAA] Should we disable the store-to-load-forwarding-conflict-detection?

Related topics